Dynamic Programming


Dynamic programming (DP) refers to a collection of methods in both mathematical optimization and computer programming. What these methods have in common is that they simplify a complicated problem by breaking it down into simpler sub-problems in a recursive manner. The focus of this article is the role of DP in mathematical optimization, in particular for solving finite Markov Decision Processes (MDPs). A later section covers DP in computer programming.

Bellman equations are the key components of DP in mathematical optimization. To read more about them and MDPs in general, I refer you to my previous article Markov Decision Processes. As a reminder, the Bellman equations for the state-value function $v_{\pi}$ and action-value function $q_{\pi}$ are:

$$ v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{\pi}(s^\prime) \big) $$
$$ q_{\pi}(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) q_{\pi}(s^\prime,a^\prime) \big) $$

The Bellman optimality equations are the Bellman equations for the optimal state-value function $v_*$ and optimal action-value function $q_*$:

$$ v_*(s) = \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) $$
$$ q_*(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \max_{a^\prime} q_*(s^\prime,a^\prime) \big) $$

The equations are going to be used as update rules in iterative algorithms. Remember that it’s trivial to find an optimal policy when you know $v_*$ or $q_*$.
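As a reminder, given $v_*$, a deterministic optimal policy can be read off with a one-step lookahead (and given $q_*$, it is simply $\pi_*(s) = \underset{a}{\operatorname{argmax}}\, q_*(s,a)$):

$$ \pi_*(s) = \underset{a}{\operatorname{argmax}} \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) $$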

Note that DP algorithms have limited utility in reinforcement learning (RL), since they assume perfect knowledge of the dynamics of the MDP and are computationally expensive. However, DP is still useful for many other problems and provides a good foundation for understanding many RL algorithms.

The aim of this article is to be a short introduction to DP with a focus on solving MDPs. To read more about this subject, I recommend Reinforcement Learning: An Introduction (Sutton, Barto, 2018). The source code can be found at https://github.com/CarlFredriksson/dynamic_programming.

Policy Evaluation

Policy evaluation is about computing estimates of value functions, for example the state-value function $v_{\pi}$ for a given policy $\pi$. To compute the exact solution, one could theoretically solve the system of linear Bellman equations. However, the computational complexity grows quickly with the size of the MDP and iterative methods are often more efficient. We can use the Bellman equation for $v_{\pi}$ as an update rule

$$ v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{k}(s^\prime) \big) $$

Applied iteratively, this update rule brings $v_{k}$ closer to $v_{\pi}$, with convergence guaranteed as $k \to \infty$. We of course don’t have infinite time, so in practice we accept an approximation and stop when the updates become satisfactorily small. The updates can be done by remembering all the old values $v_{k}$, or by sweeping over the state space and updating values in place. When sweeping, the updates use a mix of new and old values. This is not strictly the update rule as defined above, but $v_{k+1}$ still converges to $v_{\pi}$. Sweeping uses less memory and usually converges faster, so it’s normally the version used in DP.
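As a minimal, self-contained illustration (a hypothetical three-state chain, not from the book): state 2 is terminal, each step moves one state to the right with reward $-1$, and $\gamma = 1$, so $v_{\pi}(0) = -2$ and $v_{\pi}(1) = -1$.

```python
import numpy as np

# state -> (next_state, reward); state 2 is terminal
transitions = {0: (1, -1), 1: (2, -1)}

v = np.zeros(3)
for _ in range(10):  # a fixed number of in-place sweeps for illustration
    for s in (0, 1):
        s_next, r = transitions[s]
        # In-place update: gamma = 1, and v[s_next] may already hold
        # this sweep's new value
        v[s] = r + v[s_next]

# v is now [-2, -1, 0], the true state values under the only policy
```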

The idea of updating estimates using other estimates is called bootstrapping and is used extensively in RL.

Policy Improvement

To solve an MDP, our goal is to find an optimal policy. The next step after approximating $v_{\pi}$ using policy evaluation is to improve the current policy $\pi$. To create a new, improved policy $\pi^\prime$, we go over each state and select the action that maximizes expected return, assuming $\pi$ is followed afterwards. In other words, we select actions greedily with respect to $q_\pi$:

$$ \begin{aligned} \pi^\prime(s) &= \underset{a}{\operatorname{argmax}} q_\pi(s,a) \\ &= \underset{a}{\operatorname{argmax}} E[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=a] \\ &= \underset{a}{\operatorname{argmax}} \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_\pi(s^\prime) \big) \end{aligned} $$

When selecting actions this way, we know that the new policy $\pi^\prime$ is going to be as good as, or better than, the old policy $\pi$. If $\pi^\prime$ is as good as, but not better than, $\pi$, it means that $\pi$ and $\pi^\prime$ are both optimal. Like policy evaluation, policy improvement is usually implemented as a sweep.

Policy Iteration

Now that we have tools for evaluating and improving policies we simply have to put them together. Policy iteration refers to alternating policy evaluation and policy improvement for multiple iterations until a satisfactory policy is found. Unless an optimal policy has been found, each iteration will produce a strictly better policy. Policy iteration will converge to an optimal policy in finite time since finite MDPs have a finite number of possible deterministic policies. To speed up convergence, policy evaluation will start with the value estimates of the previous policy.
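The full loop can be sketched as follows. This is an illustrative sketch on a hypothetical two-state MDP (not from the book): action 0 stays in the non-terminal state 0 with reward $-1$, action 1 moves to the terminal state 1 with reward 0, and $\gamma = 0.9$; the helper names are my own.

```python
import numpy as np

# dynamics[(s, a)] -> (s', r); state 1 is terminal (value fixed at 0)
dynamics = {(0, 0): (0, -1.0), (0, 1): (1, 0.0)}
gamma = 0.9

def evaluate(policy, v, theta=1e-8):
    """In-place iterative policy evaluation until updates are below theta."""
    while True:
        delta = 0.0
        for s in (0,):  # only the non-terminal state is updated
            s_next, r = dynamics[(s, policy[s])]
            new_v = r + gamma * v[s_next]
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return

def improve(v):
    """Return the deterministic policy that is greedy with respect to v."""
    return {0: max((0, 1),
                   key=lambda a: dynamics[(0, a)][1] + gamma * v[dynamics[(0, a)][0]])}

v = np.zeros(2)
policy = {0: 0}  # start with the suboptimal "stay" action
while True:
    evaluate(policy, v)  # reuse previous estimates as the starting point
    new_policy = improve(v)
    if new_policy == policy:  # policy stable: it is optimal
        break
    policy = new_policy
```

The loop terminates with the optimal "leave" action in state 0 and a value estimate of 0 for that state.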

Value Iteration

There is a drawback to policy iteration in that the policy evaluation step can be computationally expensive. When doing policy evaluation iteratively, $v_{\pi}$ converges only in the limit and we have to accept approximations. But the ultimate goal is not to compute $v_{\pi}$ as accurately as possible in each policy iteration, it is to find an optimal policy. How close to the real state-value function do we actually need to get? For the gridworld in example 4.1 from (Sutton, Barto, 2018), only three iterations of policy evaluation on the random policy were needed before the result of the corresponding policy improvement stopped changing. This can be seen in figure 4.1 from (Sutton, Barto, 2018).


This was only an example, but it holds in general that the policy evaluation step can be truncated without removing the convergence guarantees of policy iteration. An important special case is stopping policy evaluation after a single iteration, which is called value iteration. Instead of having two separate steps, they can be combined into a single update rule

$$ v_{k+1}(s) = \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_k(s^\prime) \big) $$

This is almost the same update rule as in policy evaluation, but using the Bellman optimality equation for $v_*$ instead. Value iteration is usually implemented as a sweep, with every sweep effectively being a combination of one policy evaluation sweep and one policy improvement sweep. That said, doing a few policy evaluation sweeps for every policy improvement sweep often leads to faster convergence. All policy iteration algorithms with a truncated policy evaluation step can be implemented as a sequence of sweeps, with some sweeps using policy evaluation updates and some using value iteration updates.
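A minimal, self-contained sketch of this combined update on a hypothetical two-state MDP (state 0 non-terminal; action 0 stays with reward $-1$, action 1 moves to the terminal state 1 with reward 0):

```python
import numpy as np

# dynamics[(s, a)] -> (s', r); state 1 is terminal
dynamics = {(0, 0): (0, -1.0), (0, 1): (1, 0.0)}
gamma = 0.9

v = np.zeros(2)
while True:
    # Combined sweep: a max over actions replaces the policy average
    new_v = max(r + gamma * v[s_next]
                for s_next, r in (dynamics[(0, a)] for a in (0, 1)))
    if abs(new_v - v[0]) < 1e-8:
        break
    v[0] = new_v

# Reading off the greedy action afterwards yields an optimal policy
best_action = max((0, 1),
                  key=lambda a: dynamics[(0, a)][1] + gamma * v[dynamics[(0, a)][0]])
```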

Generalized Policy Iteration

Policy iteration consists of policy evaluation and policy improvement steps that are alternated. As mentioned, the number of iterations of policy evaluation before doing policy improvement can be varied. There are other methods of finding optimal, or approximately optimal, policies that similarly alternate policy evaluation and improvement but vary other aspects. One such aspect is the granularity of the updates, which means that some states may receive multiple updates before others get a single one. This is often the case for RL methods, where the agent doesn’t have complete knowledge of the dynamics function $p$ and instead relies on exploration and sample returns.

The term generalized policy iteration (GPI) refers to the general idea of alternating policy evaluation and policy improvement. It is an important idea since almost all RL methods are a form of GPI.

Gridworld Example


We will use example 4.1 from (Sutton, Barto, 2018). States $1,2,…,14$ are non-terminal and state $15$ is terminal. The terminal state is drawn in both the top-left and the bottom-right corners, but it is a single state. The actions “up”, “down”, “right”, and “left” are available in every state. Actions deterministically move the agent in their respective direction, except when the action would move the agent out of the gridworld, in which case the state remains unchanged. Every action taken in a non-terminal state gives the agent a reward of $-1$. It is an undiscounted episodic task.

Since the dynamics are completely deterministic, we can implement them as a lookup table $(s,a) \to (s^\prime,r)$. The actions $(0,1,2,3)$ correspond to “up”, “down”, “right”, and “left” respectively. The state numbers have all been reduced by one so that they start at zero, which makes the policy evaluation and value iteration functions cleaner.

dynamics = {
    (0, 0): (0, -1),
    (0, 1): (4, -1),
    (0, 2): (1, -1),
    (0, 3): (14, -1),
    (1, 0): (1, -1),
    (1, 1): (5, -1),
    (1, 2): (2, -1),
    (1, 3): (0, -1),
    # ... entries for states 2 through 13 omitted ...
    (14, 0): (14, 0),
    (14, 1): (14, 0),
    (14, 2): (14, 0),
    (14, 3): (14, 0),
}
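The full 60-entry table can also be generated programmatically rather than written out by hand. The sketch below assumes the grid layout implied by the description (a 4x4 grid whose single terminal state occupies the top-left and bottom-right cells); `make_dynamics` is an illustrative helper, not part of the original source.

```python
def make_dynamics():
    """Generate the (state, action) -> (next_state, reward) lookup table."""
    terminal = 14
    # Zero-based state numbers laid out on the 4x4 grid; the single
    # terminal state occupies both corner cells.
    cells = [
        [terminal, 0, 1, 2],
        [3, 4, 5, 6],
        [7, 8, 9, 10],
        [11, 12, 13, terminal],
    ]
    moves = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # up, down, right, left
    dynamics = {}
    for row in range(4):
        for col in range(4):
            state = cells[row][col]
            for action, (dr, dc) in enumerate(moves):
                if state == terminal:
                    dynamics[(state, action)] = (state, 0)
                else:
                    # Moves off the grid leave the state unchanged
                    r2 = min(max(row + dr, 0), 3)
                    c2 = min(max(col + dc, 0), 3)
                    dynamics[(state, action)] = (cells[r2][c2], -1)
    return dynamics
```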

A policy evaluation sweep can be implemented as follows:

def policy_evaluation_sweep(dynamics, policy, state_values):
    num_states = len(state_values)
    for state in range(num_states):
        value = 0
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            # The task is undiscounted, so gamma = 1 is implicit
            value += policy[state, action] * (reward + state_values[new_state])
        # In-place update: later states in this sweep see the new value
        state_values[state] = value

I also implemented a non-sweeping version for comparison with figure 4.1 from (Sutton, Barto, 2018), since they used a non-sweeping version when producing the results in that figure.

import numpy as np

def policy_evaluation(dynamics, policy, state_values):
    num_states = len(state_values)
    new_state_values = np.zeros(num_states)
    for state in range(num_states):
        value = 0
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            value += policy[state, action] * (reward + state_values[new_state])
        new_state_values[state] = value
    return new_state_values

Policy improvement can be implemented as follows:

def policy_improvement_sweep(dynamics, policy, state_values):
    num_states = len(state_values)
    for state in range(num_states):
        max_value = -np.inf
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            value = reward + state_values[new_state]
            if value > max_value:
                max_value = value
                # Make the policy deterministic in this state: put all
                # probability on the best action found so far
                policy[state] = np.eye(num_actions)[action]

The policy was initialized to the random policy. When running some iterations of non-sweeping policy evaluation the results matched those in figure 4.1. Policy improvement applied to the state-values after at least three iterations of non-sweeping policy evaluation resulted in an optimal policy. The sweeping policy evaluation needed a few more iterations before policy improvement would find an optimal policy.

DP in Computer Programming

As mentioned in the introduction, DP refers to a collection of methods in both mathematical optimization and computer programming. For DP to be applicable to a programming problem, the problem needs to have optimal substructure and overlapping sub-problems.

Optimal substructure means that the solution to the problem can be constructed from optimal solutions to its sub-problems. Optimal substructures are usually described by recursion. An example of this is shortest path problems. If there is a path $A \to B \to C \to D$ that is the shortest path from $A$ to $D$, then $B \to C \to D$ is the shortest path from $B$ to $D$. In other words, the shortest path problem from $B$ to $D$ is nested inside the $A$ to $D$ problem and can be called a sub-problem.

Overlapping sub-problems means that a simple recursive algorithm will solve the same sub-problems over and over. If a problem can be solved by combining optimal solutions to non-overlapping sub-problems, DP is not applicable and the divide-and-conquer strategy might be used instead. Merge sort and quicksort are examples of divide-and-conquer algorithms.

DP has two approaches for avoiding solving the same sub-problems over and over: the top-down approach and the bottom-up approach. The top-down approach does recursion in the order the problem is formulated, but stores the result of each solved sub-problem so that it doesn’t need to be computed more than once. Storing solutions to sub-problems is called memoization. The bottom-up approach solves the smallest sub-problems first and uses those solutions to build solutions to bigger sub-problems.

Fibonacci Sequence Example

Computing the $n$th member of the Fibonacci sequence is a good example of a problem that has both optimal substructure and overlapping sub-problems. The Fibonacci sequence $F_n$ is defined as:

$$ \begin{aligned} F_0 &= 0 \\ F_1 &= 1 \\ F_n &= F_{n-1} + F_{n-2} \end{aligned} $$

The following is a naive implementation:

def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

The naive implementation is computationally very inefficient since it computes the solutions to the same sub-problems over and over. For example, if we call fib(5) we produce the following call tree:

  1. fib(5)
  2. fib(4) + fib(3)
  3. (fib(3) + fib(2)) + (fib(2) + fib(1))
  4. ((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
  5. (((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))

You can see that fib(2) was computed from scratch three times. This implementation runs in exponential time: O(2^n) is a simple upper bound, and the tight bound is O(φ^n), where φ ≈ 1.618 is the golden ratio.

A DP implementation with the top-down approach:

memory = {0: 0, 1: 1}
def fib_memoization(n):
    if n not in memory:
        memory[n] = fib_memoization(n - 1) + fib_memoization(n - 2)
    return memory[n]

This version uses memoization and computes solutions to sub-problems only if they aren’t in the memory yet. This implementation runs in linear time (O(n)) but requires O(n) storage space.
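As an aside, Python’s standard library provides memoization out of the box via functools.lru_cache, so the top-down version can also be written without an explicit dictionary:

```python
from functools import lru_cache

# Same top-down approach, using the built-in memoization decorator
@lru_cache(maxsize=None)
def fib_cached(n):
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)
```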

A DP implementation with the bottom-up approach:

def fib_bottom_up(n):
    if n == 0:
        return 0

    previous_fib = 0
    current_fib = 1
    for _ in range(n - 1):
        new_fib = previous_fib + current_fib
        previous_fib = current_fib
        current_fib = new_fib
    return current_fib

This version computes sub-problems from the bottom and builds upwards. The smaller values are computed first and are used to compute larger values. This implementation also runs in linear time but only requires constant (O(1)) storage space.

Both DP implementations actually take O(n^2) time for large n, since F_n grows to O(n) bits and adding two O(n)-bit integers takes O(n) time.


This has been a short introduction to dynamic programming. Splitting up a complex problem (for example computing an optimal policy in an MDP) into simpler, interdependent sub-problems (for example computing state-values) is a powerful idea that can be applied to a variety of problems. Understanding DP and the idea of GPI provides a good base for understanding many reinforcement learning algorithms.

Thank you for reading and feel free to send me any questions.