Applying reinforcement learning theory to reduce felt temporal distance

Computer science for mind hackers: it is a basic principle of reinforcement learning to distinguish between reward and value, where the reward of a state is the immediate, intrinsic desirability of the state, whereas the value of the state is proportional to the rewards of the other states that you can reach from that state.

For example, suppose that I’m playing a competitive game of chess, and in addition to winning I happen to like capturing my opponent’s pieces, even when it doesn’t contribute to winning. I assign a reward of 10 points to winning, -10 to losing, 0 to a stalemate, and 1 point to each piece that I capture in the game. Now my opponent offers me a chance to capture one of his pawns, an action that would give me one point worth of reward. But when I look at the situation more closely, I see that it’s a trap: if I did capture the piece, I would be forced into a set of moves that would inevitably result in my defeat. So the value, or long-term reward, of that state is actually something close to -9.

Once I realize this, I also realize that making that move is almost exactly equivalent to agreeing to resign in exchange for my opponent letting me capture one of his pieces. My defeat won’t be instant, but by making that move, I would nonetheless be choosing to lose.

Now consider a dilemma that I might be faced with when coming home late some evening. I have no food at home, but I’m feeling exhausted and don’t want to bother with going to the store, and I’ve already eaten today anyway. But I also know that if I wake up with no food in the house, then I will quickly end up with low energy, which makes it harder to go to the store, which means my energy levels will drop further, and so on until I’ll finally get something to eat much later, after wasting a long time in an uncomfortable state.

Typically, temporal discounting means that I’m aware of this in the evening, but nonetheless skip the visit to the store. The penalty from not going feels remote, whereas the discomfort of going feels close, and that ends up dominating my decision-making. Besides, I can always hope that the next morning will be an exception, and I’ll actually get myself to go to the store right from the moment when I wake up!

And I haven’t tried this out for very long, but it feels like explicitly framing the different actions in terms of reward and value could be useful in reducing the impact of that experienced distance. I skip the visit to the store because being hungry in the morning is something that seems remote. But if I think that skipping the visit is exactly the same thing as choosing to be hungry in the morning, and that the value of skipping the visit is not the momentary relief of being home earlier but rather the inevitable consequence of the causal chain that it sets in motion – culminating in hours of hunger and low energy – then that feels a lot different.

And of course, I can propagate the consequences earlier back in time as well: if I think that I simply won’t have the energy to get food when I finally come home, then I should realize that I need to go buy the food before setting out on that trip. Otherwise I’ll again set in motion a causal chain whose end result is being hungry. So then not going shopping before I leave becomes exactly the same thing as being hungry next morning.

More examples of the same:

  • Slightly earlier I considered taking a shower, and realized that if I’d take a shower in my current state of mind I’d inevitably make it into a bath as well. So I wasn’t really just considering whether to take a shower, but whether to take a shower *and* a bath. That said, I wasn’t in a hurry anywhere and there didn’t seem to be a big harm in also taking the bath, so I decided to go ahead with it.
  • While in the shower/bath, I started thinking about this post, and decided that I wanted to get it written. But I also wanted to enjoy my hot bath for a while longer. Considering it, I realized that staying in the bath for too long might cause me to lose my motivation for writing this, so there was a chance that staying in the bath would become the same thing as choosing not to get this written. I decided that the risk wasn’t worth it, and got up.
  • If I’m going somewhere and I choose a route that causes me to walk past a fast-food place selling something that I know I shouldn’t eat, and I know that the sight of that fast-food place is very likely to tempt me to eat there anyway, then choosing that particular route is the same thing as choosing to go eat something that I know I shouldn’t.

Cross-posted to Less Wrong. Related post: Applied cognitive science – learning from a faux pas.

No comments

Trackbacks/Pingbacks

  1. Don’t trust people, trust their components | Kaj Sotala - […] help themselves” – though we might also hold that knowing their own poor self-control, they shouldn’t have gotten themselves…

Leave a Reply