Should RL rewards diminish over time?


Should a reward be cumulative or diminish over time?

For example, say an agent performed a good action at time $t$ and received a positive reward $R$. If the reward is cumulative, $R$ is carried through for the rest of the episode and summed with any future rewards. However, if $R$ were to diminish over time (say with some scaling $\frac{R}{\sqrt{t}}$), wouldn't that encourage the agent to keep taking actions to increase its reward?

With cumulative rewards, the reward can both increase and decrease depending on the agent's actions. But if the agent receives one good reward $R$ and then does nothing for a long time, it still has the original reward it received (encouraging it to do less?). However, if rewards diminish over time, in theory that would encourage the agent to keep taking actions to maximise its reward.

I found that, for certain applications and certain hyperparameters, if the reward is cumulative, the agent simply takes a good action at the beginning of the episode and is then happy to do nothing for the rest of the episode (because it still has a reward of $R$).
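To make the comparison concrete, here is a minimal sketch of the two schemes I have in mind (plain Python with made-up reward values, not code from my actual environment):

    import numpy as np

    def cumulative_signal(step_rewards):
        # "Cumulative" scheme: a good reward R earned early is carried forward,
        # so the agent keeps seeing the running total it has built up.
        return np.cumsum(step_rewards)

    def diminishing_signal(step_rewards):
        # Alternative scheme: whatever has been earned is re-scaled by 1/sqrt(t),
        # so it fades away unless new rewards keep arriving.
        t = np.arange(1, len(step_rewards) + 1)
        return np.cumsum(step_rewards) / np.sqrt(t)

    # One good action (R = 1) at the first step, then the agent does nothing.
    rewards = np.array([1.0] + [0.0] * 9)
    print(cumulative_signal(rewards))    # stays at 1.0 for the whole episode
    print(diminishing_signal(rewards))   # decays towards 0 over time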










reinforcement-learning rewards

asked Aug 11 at 7:35 by PyRsquared


1 Answer
RL agents - implemented correctly - do not take previous rewards into account when making decisions. For instance, value functions only assess potential future reward. The state value, or expected return (aka utility), $G$ from a starting state $s$ may be defined like this:

$$v(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

where $R_t$ is the reward distribution at time $t$, and $\mathbb{E}_{\pi}$ stands for the expected value when following the policy $\pi$ for action selection.

There are a few variations of this, depending on the setting and which value function you are interested in. However, all value functions used in RL look at future sums of reward from the decision point when the action is taken. Past rewards are not taken into account.
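As a small numeric sketch of that definition (plain Python, with made-up reward values), the return at the decision point is a discounted sum over future rewards only:

    def discounted_return(future_rewards, gamma):
        # G_t = sum_k gamma^k * R_{t+k+1}, truncated to the rewards supplied.
        # The sum starts at the current decision point; past rewards never enter it.
        return sum(gamma ** k * r for k, r in enumerate(future_rewards))

    # Rewards the agent will receive *after* time t.
    print(discounted_return([1.0, 0.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9**3 * 2.0 = 2.458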



An agent may still choose to take an early high reward over a longer-term reward, if:

• The choice between the two rewards is exclusive.

• The return is higher for the early reward. This may depend on the discount factor, $\gamma$, where low values make the agent prefer more immediate rewards.

If your problem is that an agent selects a low early reward when it could ignore it in favour of something larger later, then you should check the discount factor you are using. If you want an RL agent to take a long-term view, then the discount factor needs to be close to $1.0$.
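For example, here is a rough comparison (hypothetical numbers) of a reward of $1$ available immediately versus a reward of $10$ arriving five steps later, under different discount factors:

    def discounted_return(future_rewards, gamma):
        return sum(gamma ** k * r for k, r in enumerate(future_rewards))

    early = [1.0, 0, 0, 0, 0, 0]    # small reward now
    late  = [0, 0, 0, 0, 0, 10.0]   # larger reward five steps later

    for gamma in (0.3, 0.9, 0.99):
        print(gamma, discounted_return(early, gamma), discounted_return(late, gamma))
    # gamma = 0.3  -> 1.0 vs ~0.02 (a myopic agent takes the early reward)
    # gamma = 0.9  -> 1.0 vs ~5.9  (a far-sighted agent prefers to wait)
    # gamma = 0.99 -> 1.0 vs ~9.5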



The premise of your question, however, is that somehow an RL agent would become "lazy" or "complacent" because it already had enough reward. That is not an issue that occurs in RL, due to the way it is formulated. Not only are past rewards not accounted for when calculating return values from states, but there is also no notion in RL of an agent receiving "enough" total reward, like a creature satisfying its hunger - the maximisation is applied always, in all states.



There is no need to somehow decay past rewards in any memory structure, and in fact no real way to do this, as no data structure used by any RL agent accumulates past rewards. You may still collect this information for displaying results or analysing performance, but the agent never uses $r_t$ to figure out what $a_t$ should be.
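For instance, a tabular Q-learning update (sketched here generically, not tied to any particular library) only ever touches the reward just observed and a bootstrap estimate of future value; there is nowhere for a running total of past rewards to live:

    from collections import defaultdict

    # Q[s][a] is an estimate of future return; nothing here stores past rewards.
    Q = defaultdict(lambda: defaultdict(float))

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # TD target = immediate reward + discounted estimate of the best future value.
        best_next = max(Q[s_next].values(), default=0.0)
        Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

    q_learning_update(Q, s="s0", a="right", r=1.0, s_next="s1")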




"I found that for certain applications and certain hyperparameters, if reward is cumulative, the agent simply takes a good action at the beginning of the episode, and then is happy to do nothing for the rest of the episode (because it still has a reward of $R$)."

You have probably formulated the reward function incorrectly for your problem in that case. A cumulative reward scheme (where an agent receives reward $a$ at $t=1$, then $a+b$ at $t=2$, then $a+b+c$ at $t=3$, etc.) would be quite specialist, and you have likely misunderstood how to represent the agent's goals. I suggest asking a separate question about your specific environment and your proposed reward scheme if you cannot resolve this.
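To make the suspected mistake concrete (this is only a guess at what the environment might be doing, with made-up names): the reward passed to the agent at each step should be the increment earned on that step, not the running episode total.

    # Likely bug: emitting the running episode total as the per-step reward,
    # so the agent is "re-paid" for old successes even when it does nothing.
    def step_reward_buggy(episode_total):
        return episode_total

    # Usual formulation: emit only the change in score since the previous step,
    # so doing nothing yields 0 rather than R.
    def step_reward(score_now, score_prev):
        return score_now - score_prev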






answered Aug 11 at 8:02 (edited Aug 11 at 8:17) by Neil Slater
• Although the user tries to portray a forward view of the environment, saying that the success of an agent depends upon the entire reward the agent sees (like diamonds, collected and evaluated at the end), isn't the logic of the OP flawed? In RL we are trying to find a maximum, which is unbounded, so the agent always tries to perform better whether or not there exists a pressuring function (as long as we include costs for things like wandering off, i.e. a step cost). – DuttaA, Aug 11 at 8:11

• By this I mean that even though you make the agent see a lesser reward, it hardly matters, because it is relative: if you scale, the maximum reward will be less; if you don't scale, the max reward will be more. The only goal of the agent is to reach the max reward (unless trade-off scenarios exist, like steps taken to reach the terminal state vs the step cost). Still, it will maximize reward + cost for the agent, but we humans might not be able to see it, since we are more interested in rewards. – DuttaA, Aug 11 at 8:13

• @DuttaA: Yes, I think you are correct. I suspect the OP has misunderstood how to set up a reward function for their problem. I have added a short section at the end about that. – Neil Slater, Aug 11 at 8:15

• Thanks for the explanation. I think I may have just been misunderstanding the reward function and how the agent interacts with it. – PyRsquared, Aug 11 at 9:04











