The Bellman Equation

I’ve been taking classes in AI and machine learning, and I’ve already bumped into this one on a few separate occasions. In my experience, it usually shows up in the context of Markov Decision Processes. Here it is:

 V^\pi(s) = R(s,\pi(s)) + \gamma \sum_{s'} P(s'|s,\pi(s)) V^\pi(s')

This guy is pretty nasty-looking, but he’s actually not so bad once you get to know him. First, let me list out what everything in there is supposed to represent…

  • V^\pi(s): Think of this as the value of being in a given state of the world (s), given that you tend to behave in a certain way (\pi).
  • \pi: In machine learning we call this the policy, but that’s just another way of saying an agent’s behavior. It maps the current state of the world to an action: \pi(s)=a. Think of it this way: let’s say my state is s=”hungry.” This state tends to be mapped to the action a=”go eat a sandwich.”
  • R(s,\pi(s)): Now we’re looking at the right-hand side of the equation. This is what’s called the reward function. It’s just a number in arbitrary units that tells us how good (or bad) it feels to be in state s and perform action \pi(s)=a.
    • Note: If we stopped here the equation above would look kind of stupid, right? The value of being in a given state is equal to the reward for acting in that state? Seems kind of redundant. Luckily the equation has more to say!
  • \gamma: The discount rate, where 0 \leq \gamma < 1. More on this later.
  • P(s'|s,\pi(s)): Think of s' as any state of the world you might land in after performing your action \pi(s)=a. Say you finish your sandwich; your new state could be “I’m not hungry” or “I’m now dating Selena Gomez.” P(s'|s,\pi(s)) represents the probability of transitioning to that new state, given your current state and action. (All of these pieces are sketched in code below.)
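
To make these pieces concrete, here’s a tiny toy MDP written out in Python, built around the sandwich example. Every state name, reward value, and probability below is invented purely for illustration:

    # A toy MDP for the sandwich example. All names and numbers are made up.

    gamma = 0.9  # discount rate, 0 <= gamma < 1

    # The policy pi: maps each state s to an action a = pi(s).
    policy = {
        "hungry": "eat sandwich",
        "not hungry": "relax",
    }

    # The reward function R(s, pi(s)): how good it feels to take that
    # action in that state (arbitrary units).
    reward = {
        ("hungry", "eat sandwich"): 10.0,
        ("not hungry", "relax"): 1.0,
    }

    # The transition probabilities P(s' | s, pi(s)): each (s, a) pair maps
    # to a distribution over possible next states s'.
    transitions = {
        ("hungry", "eat sandwich"): {"not hungry": 0.9, "hungry": 0.1},
        ("not hungry", "relax"): {"not hungry": 0.7, "hungry": 0.3},
    }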

With the meaning of these variables defined, we can think of the second term in the equation as an expected value. Specifically, it’s the expected value of our future returns given our present behavior:

 V^\pi(s) = R(s,\pi(s)) + \gamma \mathbb{E}[V^\pi(S)|s,\pi]
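
To see the equation actually do some work, here’s a minimal sketch of iterative policy evaluation: we repeatedly apply the right-hand side of the Bellman Equation to every state until V^\pi stops changing. It reuses the toy policy, reward, transitions, and gamma from the sketch above; the tolerance tol is just an arbitrary stopping criterion.

    # Iterative policy evaluation: sweep the right-hand side of the
    # Bellman equation over every state until V stops changing. With
    # gamma < 1 this update is a contraction, so it converges.
    def evaluate_policy(policy, reward, transitions, gamma, tol=1e-8):
        V = {s: 0.0 for s in policy}  # initial guess: V(s) = 0 everywhere
        while True:
            delta = 0.0
            for s, a in policy.items():
                # V(s) <- R(s, a) + gamma * sum over s' of P(s'|s, a) * V(s')
                new_v = reward[(s, a)] + gamma * sum(
                    p * V[s_next] for s_next, p in transitions[(s, a)].items()
                )
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                return V

    print(evaluate_policy(policy, reward, transitions, gamma))
    # prints roughly {'hungry': 38.2, 'not hungry': 30.6}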

Even though you won’t typically see the Bellman Equation unless you take some very specialized coursework in machine learning, I couldn’t help but feel a sense of familiarity with this one the first time I saw it. Then it hit me: my economics classes! Every business student is familiar with discounted cash flow. In this case we’re not trying to calculate the net present value of an investment; we’re trying to calculate the net present value of an agent’s behavior.

The discount rate \gamma tells us to what degree we should value the outcomes of our future behavior. With \gamma close to 0, an agent becomes hedonistic, concerned only with the current state of the world. With \gamma close to 1, the agent is willing to forgo present reward for future gain, but perhaps to a fault: who cares if I have a guaranteed way of becoming a billionaire 200 years from now? The value of \gamma is up to us; it’s a matter of striking the right balance between the present and the future.
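
We can put rough numbers on that intuition. The sketch below (treating one time step as one year, an assumption just for this example) asks what a guaranteed billion dollars 200 steps from now is worth today under different discount rates:

    # Present value of a reward that arrives t steps in the future:
    # gamma**t * reward. Here: a billion dollars, 200 steps away.
    billion = 1e9
    for gamma in (0.1, 0.9, 0.99):
        print(gamma, gamma ** 200 * billion)
    # gamma = 0.1  -> about 1e-191 (the hedonist ignores it entirely)
    # gamma = 0.9  -> about 0.7    (still basically worthless today)
    # gamma = 0.99 -> about 1.3e8  (a very patient agent cares a lot)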
