Should RL rewards diminish over time?


Should a reward be cumulative or diminish over time?

For example, say an agent performed a good action at time $t$ and received a positive reward $R$. If the reward is cumulative, $R$ is carried through for the rest of the episode and summed with any future rewards. However, if $R$ were to diminish over time (say with some scaling $\frac{R}{\sqrt{t}}$), wouldn't that encourage the agent to keep taking actions to increase its reward?

With cumulative rewards, the reward can both increase and decrease depending on the agent's actions. But if the agent receives one good reward $R$ and then does nothing for a long time, it still has the original reward it received (encouraging it to do less?). However, if rewards diminish over time, in theory that would encourage the agent to keep taking actions to maximise its reward.

I found that, for certain applications and certain hyperparameters, if the reward is cumulative, the agent simply takes a good action at the beginning of the episode and is then happy to do nothing for the rest of the episode (because it still has a reward of $R$).
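To make the comparison concrete, here is a minimal sketch of the two schemes I have in mind (plain Python with made-up reward values, not code from my actual environment):

    import numpy as np

    def cumulative_signal(step_rewards):
        # "Cumulative" scheme: a good reward R earned early is carried forward,
        # so the agent keeps seeing the running total it has built up.
        return np.cumsum(step_rewards)

    def diminishing_signal(step_rewards):
        # Alternative scheme: whatever has been earned is re-scaled by 1/sqrt(t),
        # so it fades away unless new rewards keep arriving.
        t = np.arange(1, len(step_rewards) + 1)
        return np.cumsum(step_rewards) / np.sqrt(t)

    # One good action (R = 1) at the first step, then the agent does nothing.
    rewards = np.array([1.0] + [0.0] * 9)
    print(cumulative_signal(rewards))    # stays at 1.0 for the whole episode
    print(diminishing_signal(rewards))   # decays towards 0 over time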










reinforcement-learning rewards

asked Aug 11 at 7:35 by PyRsquared


1 Answer
RL agents - implemented correctly - do not take previous rewards into account when making decisions. For instance, value functions only assess potential future reward. The state value, or expected return (aka utility), $G$ from a starting state $s$ may be defined like this:

$$v(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

where $R_t$ is the reward distribution at time $t$, and $\mathbb{E}_{\pi}$ stands for the expected value when following the policy $\pi$ for action selection.

There are a few variations of this, depending on the setting and which value function you are interested in. However, all value functions used in RL look at future sums of reward from the decision point when the action is taken. Past rewards are not taken into account.
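As a small numeric sketch of that definition (plain Python, with made-up reward values), the return at the decision point is a discounted sum over future rewards only:

    def discounted_return(future_rewards, gamma):
        # G_t = sum_k gamma^k * R_{t+k+1}, truncated to the rewards supplied.
        # The sum starts at the current decision point; past rewards never enter it.
        return sum(gamma ** k * r for k, r in enumerate(future_rewards))

    # Rewards the agent will receive *after* time t.
    print(discounted_return([1.0, 0.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9**3 * 2.0 = 2.458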



An agent may still choose to take an early high reward over a longer-term reward, if:

• The choice between the two rewards is exclusive.

• The return is higher for the early reward. This may depend on the discount factor, $\gamma$, where low values make the agent prefer more immediate rewards.

If your problem is that an agent selects a low early reward when it could ignore it in favour of something larger later, then you should check the discount factor you are using. If you want an RL agent to take a long-term view, then the discount factor needs to be close to $1.0$.
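For example, here is a rough comparison (hypothetical numbers) of a reward of $1$ available immediately versus a reward of $10$ arriving five steps later, under different discount factors:

    def discounted_return(future_rewards, gamma):
        return sum(gamma ** k * r for k, r in enumerate(future_rewards))

    early = [1.0, 0, 0, 0, 0, 0]    # small reward now
    late  = [0, 0, 0, 0, 0, 10.0]   # larger reward five steps later

    for gamma in (0.3, 0.9, 0.99):
        print(gamma, discounted_return(early, gamma), discounted_return(late, gamma))
    # gamma = 0.3  -> 1.0 vs ~0.02 (a myopic agent takes the early reward)
    # gamma = 0.9  -> 1.0 vs ~5.9  (a far-sighted agent prefers to wait)
    # gamma = 0.99 -> 1.0 vs ~9.5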



The premise of your question, however, is that somehow an RL agent would become "lazy" or "complacent" because it already had enough reward. That is not an issue that occurs in RL, due to the way it is formulated. Not only are past rewards not accounted for when calculating return values from states, but there is also no notion in RL of an agent receiving "enough" total reward, like a creature satisfying its hunger - the maximisation is applied always, in all states.



There is no need to somehow decay past rewards in any memory structure, and in fact no real way to do this, as no data structure used by any RL agent accumulates past rewards. You may still collect this information for displaying results or analysing performance, but the agent never uses $r_t$ to figure out what $a_t$ should be.
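For instance, a tabular Q-learning update (sketched here generically, not tied to any particular library) only ever touches the reward just observed and a bootstrap estimate of future value; there is nowhere for a running total of past rewards to live:

    from collections import defaultdict

    # Q[s][a] is an estimate of future return; nothing here stores past rewards.
    Q = defaultdict(lambda: defaultdict(float))

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # TD target = immediate reward + discounted estimate of the best future value.
        best_next = max(Q[s_next].values(), default=0.0)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

    q_learning_update(Q, s="s0", a="right", r=1.0, s_next="s1")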




"I found that for certain applications and certain hyperparameters, if reward is cumulative, the agent simply takes a good action at the beginning of the episode, and then is happy to do nothing for the rest of the episode (because it still has a reward of $R$)."

You have probably formulated the reward function incorrectly for your problem in that case. A cumulative reward scheme (where an agent receives reward $a$ at $t=1$, then $a+b$ at $t=2$, then $a+b+c$ at $t=3$, etc.) would be quite specialist, and you have likely misunderstood how to represent the agent's goals. I suggest asking a separate question about your specific environment and your proposed reward scheme if you cannot resolve this.
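To make the suspected mistake concrete (this is only a guess at what the environment might be doing, with made-up names): the reward passed to the agent at each step should be the increment earned on that step, not the running episode total.

    # Likely bug: emitting the running episode total as the per-step reward,
    # so the agent is "re-paid" for old successes even when it does nothing.
    def step_reward_buggy(episode_total):
        return episode_total

    # Usual formulation: emit only the change in score since the previous step,
    # so doing nothing yields 0 rather than R.
    def step_reward(score_now, score_prev):
        return score_now - score_prev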






answered Aug 11 at 8:02 (edited Aug 11 at 8:17) by Neil Slater
• Although the user tries to portray a forward view of the environment, saying that the success of an agent depends upon the entire reward the agent sees (like diamonds, collected and evaluated at the end), isn't the logic of the OP flawed? In RL we are trying to find a maximum, which is unbounded, so the agent always tries to perform better whether or not there exists a pressuring function (as long as we include costs for things like wandering off, i.e. a step cost). – DuttaA, Aug 11 at 8:11

• By this I mean that even though you make the agent see a lesser reward, it hardly matters, because it is relative: if you scale, the maximum reward will be less; if you don't scale, the max reward will be more. The only goal of the agent is to reach the max reward (unless trade-off scenarios exist, like steps taken to reach the terminal state vs the step cost). Still, it will maximize reward + cost for the agent, but we humans might not be able to see it, since we are more interested in rewards. – DuttaA, Aug 11 at 8:13

• @DuttaA: Yes, I think you are correct. I suspect the OP has misunderstood how to set up a reward function for their problem. I have added a short section at the end about that. – Neil Slater, Aug 11 at 8:15

• Thanks for the explanation. I think I may have just been misunderstanding the reward function and how the agent interacts with it. – PyRsquared, Aug 11 at 9:04











