Bellman Operators are Contractions

Introduction

I’m studying Reinforcement Learning (RL) again, primarily using the books:

  1. Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition)
  2. Mathematical Foundations of Reinforcement Learning by Zhao.

I’ve read book 1, often referred to as the RL textbook or even the RL bible, once before, but I had forgotten most of it and wanted to study it again in a slower and more careful manner, with the aim of finishing it with a deeper understanding than I did the first time. At the time of writing this I’m still in the early parts of the book. What prompted this post was to document what I learned while trying to answer the following questions, which I struggled with for a while:

  1. How do we know that the Bellman equations and Bellman optimality equations have unique solutions?
  2. Why can’t iterative policy evaluation or policy/value iteration get stuck in local optima?

To answer these questions I had to improve my mathematical foundations, and I’m grateful to have found book 2, which has been a big help along with various other online resources. With this post I hope to preserve what I learned and to help anyone with similar questions who happens to find it. If you have any questions or feedback, please feel free to reach out to me at c@cfml.se.

Contraction mappings and the Banach fixed-point theorem

From the Wikipedia article Banach fixed-point theorem:

Definition. Let $(X,d)$ be a complete metric space. Then a map $T:X \to X$ is called a contraction mapping on $X$ if there exists $q \in [0,1)$ such that

$$ d\big(T(x),T(y)\big) \leq qd(x,y) $$

for all $x,y \in X$.

Banach Fixed Point Theorem. Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $T:X\to X$. Then $T$ admits a unique fixed-point $x^*$ in $X$ (i.e. $T(x^*) = x^*$). Furthermore, $x^*$ can be found as follows: start with an arbitrary element $x_0 \in X$ and define a sequence $(x_n)_{n \in \mathbb{N}}$ by $x_n = T(x_{n-1})$ for $n \geq 1$. Then $\lim_{n \to \infty} x_n = x^*$.
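
As a quick illustration of the theorem in action, here is a minimal Python sketch (my own, not from either book) of the fixed-point iteration it describes, using the contraction $T(x) = \tfrac{1}{2}x + 1$ on $(\mathbb{R}, |\cdot|)$ with $q = \tfrac{1}{2}$ and unique fixed point $x^* = 2$:

```python
# Fixed-point iteration for the contraction T(x) = 0.5 * x + 1 on (R, |.|).
# The unique fixed point is x* = 2, since T(2) = 2.
def T(x):
    return 0.5 * x + 1

x = 100.0  # arbitrary starting point x_0
for _ in range(50):
    x = T(x)  # x_n = T(x_{n-1})

print(x)  # ~2.0, regardless of the starting point
```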

Thus, to answer both questions stated in the introduction, we need to prove that the operators defined by the right-hand sides of the Bellman equations and Bellman optimality equations are contractions.

You can see a proof of the theorem in the Wikipedia article Banach fixed-point theorem. The rough idea is that the sequence $(x_n)_{n \in \mathbb{N}}$ defined by $x_n = T(x_{n-1})$ for $n \geq 1$ is shown to be a Cauchy sequence (a sequence whose elements become arbitrarily close to each other as the sequence progresses). Since the space is complete, the sequence converges to some $x^* \in X$, which has to be a fixed point of $T$. It has to be the only fixed point of $T$, since any pair of distinct fixed points $p_1$ and $p_2$ would contradict $T$ being a contraction:

$$ d\big(T(p_1), T(p_2)\big) = d(p_1, p_2) > q d(p_1, p_2) $$

Bellman equations

Let’s consider any finite Markov decision process (MDP), where $\mathcal{S}$ denotes the finite set of all states, $\mathcal{R} \subset \mathbb{R}$ the finite set of all rewards, and $\mathcal{A}(s)$ the finite set of available actions $a$ in state $s$. Let $r_\pi(s)$ denote the expected immediate reward and $p_\pi(s^\prime | s)$ the probability of transitioning to state $s^\prime$ when following policy $\pi$ from state $s$. Let $\gamma \in (0, 1)$ denote the discount rate for future rewards. I’m assuming some knowledge about MDPs and won’t explain all of the notation here; feel free to check my older post Markov Decision Processes for an introduction.

The Bellman equation for the state-value function for policy $\pi$ can be defined as follows:

$$ \begin{aligned} v_{\pi}(s) &= \mathbb{E}_\pi \big[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s \big] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a | s) \bigg[\sum_{r \in \mathcal{R}} p(r | s, a) r + \gamma \sum_{s^\prime \in \mathcal{S}} p(s^\prime | s, a) v_{\pi}(s^\prime) \bigg] \\ &= r_\pi(s) + \gamma \sum_{s^\prime \in \mathcal{S}} p_\pi(s^\prime | s) v_{\pi}(s^\prime) \end{aligned} $$

for all $s \in \mathcal{S}$. Letting $n = |\mathcal{S}|$, we can write the equation in matrix form:

$$ \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix}= \begin{bmatrix} r_\pi(1) \\ \vdots \\ r_\pi(n) \end{bmatrix} +\gamma \begin{bmatrix} p_\pi(1 | 1) & \dots & p_\pi(n | 1) \\ \vdots & \ddots & \vdots\\ p_\pi(1 | n) & \dots & p_\pi(n | n) \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} $$

More compactly:

$$ v_\pi = r_\pi + \gamma P_\pi v_\pi $$

We can define an (expected) Bellman operator $\mathcal{T}^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as:

$$ \mathcal{T}^\pi(v) = r_\pi + \gamma P_\pi v $$

for any $v \in \mathbb{R}^n$.
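
To make this concrete, here is a minimal NumPy sketch (my own, not from the books) with a randomly generated reward vector $r_\pi$ and row-stochastic matrix $P_\pi$. Iterating $\mathcal{T}^\pi$ from an arbitrary starting point converges to the same $v_\pi$ as solving the linear system $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 4, 0.9                        # number of states and discount rate
r_pi = rng.normal(size=n)                # expected immediate rewards r_pi(s)
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)  # make each row a probability distribution

def T_pi(v):
    """Expected Bellman operator: T_pi(v) = r_pi + gamma * P_pi v."""
    return r_pi + gamma * P_pi @ v

# Iteratively apply the operator from an arbitrary starting point ...
v = np.zeros(n)
for _ in range(1000):
    v = T_pi(v)

# ... and compare with the closed-form solution v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(np.max(np.abs(v - v_exact)))  # ~0, both agree
```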

Let $||\cdot||$ be a norm on $\mathbb{R}^n$. If $||\mathcal{T}^\pi(v_1) - \mathcal{T}^\pi(v_2)|| \leq \gamma ||v_1 - v_2||$ for all $v_1, v_2 \in \mathbb{R}^n$, then, since $\gamma \in (0, 1)$, $\mathcal{T}^\pi$ is a contraction mapping. Note that the definition of a contraction uses a distance function $d$ for generality (every norm induces a distance function via $d(x, y) = ||x - y||$, but not all distance functions have a corresponding norm). In all proofs in this post $|\cdot|$ and $\leq$ are applied elementwise to vectors, and the norm used is the max norm $||x||_\infty = \max(|x|) = \max(|x_1|, \dots, |x_n|)$, where $\max(\cdot) : \mathbb{R}^n \to \mathbb{R}$ chooses the largest element of a vector.

$$ \begin{aligned} ||\mathcal{T}^\pi(v_1) - \mathcal{T}^\pi(v_2)||_\infty &= \max \big(|r_\pi + \gamma P_\pi v_1 - (r_\pi + \gamma P_\pi v_2)| \big) \\ &= \gamma \max \big(|P_\pi(v_1 - v_2)| \big) \\ &\leq \gamma \max \big(P_\pi|v_1 - v_2| \big) \\ &\leq \gamma \max \big(|v_1 - v_2| \big) \\ &= \gamma ||v_1 - v_2||_\infty \end{aligned} $$

Thus $\mathcal{T}^\pi$ is a contraction. The first inequality follows from the triangle inequality together with the non-negativity of the elements of $P_\pi$. The last inequality is due to the rows of $P_\pi$ containing only non-negative elements that sum to 1, so each element of $P_\pi|v_1 - v_2|$ is a convex combination of the elements of $|v_1 - v_2|$ and therefore cannot exceed the largest of them.
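
Continuing the sketch above (reusing `T_pi`, `gamma`, `n` and `rng`), we can also check the contraction inequality numerically for a few random pairs of value vectors:

```python
# Numerical check of ||T_pi(v1) - T_pi(v2)||_inf <= gamma * ||v1 - v2||_inf,
# reusing T_pi, gamma, n and rng from the previous sketch.
for _ in range(5):
    v1, v2 = rng.normal(size=n), rng.normal(size=n)
    lhs = np.max(np.abs(T_pi(v1) - T_pi(v2)))  # ||T_pi(v1) - T_pi(v2)||_inf
    rhs = gamma * np.max(np.abs(v1 - v2))      # gamma * ||v1 - v2||_inf
    assert lhs <= rhs + 1e-12
```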

Let $r(s, a)$ denote the expected immediate reward when selecting action $a$ in state $s$, and $p_\pi(s^\prime, a^\prime | s, a)$ the probability of transitioning to state $s^\prime$ and selecting action $a^\prime$ when selecting action $a$ in state $s$ and thereafter following policy $\pi$. The Bellman equation for the action-value function for policy $\pi$ can be defined as follows:

$$ \begin{aligned} q_{\pi}(s, a) &= \mathbb{E}_\pi \big[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a \big] \\ &= \sum_{r \in \mathcal{R}} p(r | s, a) r + \gamma \sum_{s^\prime \in \mathcal{S}} p(s^\prime | s, a) \sum_{a^\prime \in \mathcal{A}(s^\prime)} \pi(a^\prime | s^\prime) q_{\pi}(s^\prime, a^\prime) \\ &= r(s, a) + \gamma \sum_{s^\prime \in \mathcal{S}} \sum_{a^\prime \in \mathcal{A}(s^\prime)} p_\pi(s^\prime, a^\prime | s, a) q_{\pi}(s^\prime, a^\prime) \end{aligned} $$

for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$. Letting $n = |\mathcal{S}|$ and $n_s = |\mathcal{A}(s)|$, we can write the equation in matrix form:

$$ \begin{bmatrix} q_\pi(1, 1) \\ q_\pi(1, 2) \\ \vdots \\ q_\pi(n, n_n) \end{bmatrix}= \begin{bmatrix} r_\pi(1, 1) \\ r_\pi(1, 2) \\ \vdots \\ r_\pi(n, n_n) \end{bmatrix} +\gamma \begin{bmatrix} p_\pi(1, 1 | 1, 1) & p_\pi(1, 2 | 1, 1) & \dots & p_\pi(n, n_n | 1, 1) \\ p_\pi(1, 1 | 1, 2) & p_\pi(1, 2 | 1, 2) & \dots & p_\pi(n, n_n | 1, 2) \\ \vdots & \vdots & \ddots & \vdots \\ p_\pi(1, 1 | n, n_n) & p_\pi(1, 2 | n, n_n) & \dots & p_\pi(n, n_n | n, n_n) \end{bmatrix} \begin{bmatrix} q_\pi(1, 1) \\ q_\pi(1, 2) \\ \vdots \\ q_\pi(n, n_n) \end{bmatrix} $$

More compactly:

$$ q_\pi = r_\pi + \gamma P_\pi q_\pi $$

The only difference compared to the equation for the state-value function is that the vectors and matrices are larger. Thus, we can define an expected Bellman operator in identical fashion (with $n$ denoting the number of state-action pairs rather than the number of states) and the proof will be identical (the rows of $P_\pi$ still contain only non-negative elements that sum to 1).
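
For illustration, here is a minimal sketch (my own construction, assuming for simplicity that every state has the same number of actions) of how the state-action transition matrix $P_\pi$ could be assembled from $p(s^\prime | s, a)$ and $\pi(a^\prime | s^\prime)$:

```python
import numpy as np

def build_state_action_matrix(p, pi):
    """Assemble P_pi[(s, a), (s', a')] = p(s' | s, a) * pi(a' | s').

    p:  array of shape (n_states, n_actions, n_states) with p[s, a, s2] = p(s2 | s, a)
    pi: array of shape (n_states, n_actions) with pi[s, a] = pi(a | s)
    Assumes every state has the same number of actions, for simplicity.
    """
    n_states, n_actions, _ = p.shape
    P = np.zeros((n_states * n_actions, n_states * n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            for s2 in range(n_states):
                for a2 in range(n_actions):
                    P[s * n_actions + a, s2 * n_actions + a2] = p[s, a, s2] * pi[s2, a2]
    return P  # each row is non-negative and sums to 1
```

Each row sums to 1 since $\sum_{s^\prime} p(s^\prime | s, a) = 1$ and $\sum_{a^\prime} \pi(a^\prime | s^\prime) = 1$, which is exactly the property the contraction proof relies on.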

Bellman optimality equations

For brevity, I will go directly to the compact matrix form of the Bellman optimality equation for the optimal state-value function:

$$ v_* = \max_\pi(r_\pi + \gamma P_\pi v_*) $$

We can define an (expected) Bellman optimality operator $\mathcal{T}^* : \mathbb{R}^n \to \mathbb{R}^n$ as:

$$ \mathcal{T}^*(v) = \max_\pi(r_\pi + \gamma P_\pi v) $$

Consider any two vectors $v_1, v_2 \in \mathbb{R}^n$, and let $\pi_1^* = \text{argmax}_\pi(r_\pi + \gamma P_\pi v_1)$ and $\pi_2^* = \text{argmax}_\pi(r_\pi + \gamma P_\pi v_2)$. Then we have:

$$ \mathcal{T}^*(v_1) = \max_\pi(r_\pi + \gamma P_\pi v_1) = r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 \geq r_{\pi_2^*} + \gamma P_{\pi_2^*} v_1 \\ \mathcal{T}^*(v_2) = \max_\pi(r_\pi + \gamma P_\pi v_2) = r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2 \geq r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2 $$

and

$$ \begin{aligned} \mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) &= r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2) \\ &\leq r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2) \\ &= \gamma P_{\pi_1^*} (v_1 - v_2) \end{aligned} $$

Similarly we have $\mathcal{T}^*(v_2) - \mathcal{T}^*(v_1) \leq \gamma P_{\pi_2^*} (v_2 - v_1)$, which implies that $\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) \geq \gamma P_{\pi_2^*} (v_1 - v_2)$, and thus:

$$ \gamma P_{\pi_2^*} (v_1 - v_2) \leq \mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) \leq \gamma P_{\pi_1^*} (v_1 - v_2) $$

Letting $\max\{\cdot, \cdot \} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ denote the elementwise maximum of two vectors, we have:

$$ \begin{aligned} |\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)| &\leq \max \{|\gamma P_{\pi_2^*} (v_1 - v_2)|, |\gamma P_{\pi_1^*} (v_1 - v_2)| \} \\ &\leq \gamma \max \{P_{\pi_2^*} |v_1 - v_2|, P_{\pi_1^*} |v_1 - v_2| \} \end{aligned} $$

With $\max(\cdot) : \mathbb{R}^n \to \mathbb{R}$ choosing the largest element of a vector (as previously defined), we have:

$$ \begin{aligned} ||\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)||_\infty &= \max \big(|\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)| \big) \\ &\leq \gamma \max \big(\max \{P_{\pi_2^*} |v_1 - v_2|, P_{\pi_1^*} |v_1 - v_2| \} \big) \\ &\leq \gamma \max \big(|v_1 - v_2| \big) \\ &= \gamma ||v_1 - v_2||_\infty \end{aligned} $$

Thus $\mathcal{T}^*$ is a contraction. Note that the last inequality is due to the rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$ containing only non-negative elements that sum to 1. As with the Bellman equations, an almost identical proof can be written for action-values, with the only difference being that we consider state-action pairs rather than states.
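
Here is another sketch of my own (not from the books): value iteration on a small randomly generated finite MDP. For a finite MDP the maximum over policies in $\mathcal{T}^*$ reduces to an elementwise maximum over actions, which is what the code computes:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
r = rng.normal(size=(n_states, n_actions))       # r(s, a)
p = rng.random((n_states, n_actions, n_states))  # p(s' | s, a)
p /= p.sum(axis=2, keepdims=True)                # normalize to probabilities

def T_star(v):
    """Bellman optimality operator: elementwise max over actions."""
    return np.max(r + gamma * p @ v, axis=1)     # p @ v has shape (n_states, n_actions)

v = np.zeros(n_states)                           # arbitrary starting point
for _ in range(1000):
    v = T_star(v)                                # iterate towards the unique fixed point

print(v)                                         # approximates v_*
print(np.max(np.abs(T_star(v) - v)))             # ~0: v is (numerically) a fixed point
```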

Conclusion

We’re now ready to answer the questions from the introduction.

How do we know that the Bellman equations and Bellman optimality equations have unique solutions?

We can define the right-hand sides of the equations as operators and show that they are contractions, which, by the Banach fixed-point theorem, means that they have unique fixed-points that can be found by starting with an arbitrary point and iteratively applying the operator. Any solution to one of the equations must be a fixed-point of its corresponding operator, thus these fixed-points are the unique solutions to the equations.

Why can’t iterative policy evaluation or policy/value iteration get stuck in local optima?

Iterative policy evaluation and value iteration simply apply the Bellman operator and the Bellman optimality operator iteratively, respectively. As explained above, we know that the operators have unique fixed-points that can be solved for in this manner. In other words, there are no local optima. However, note that convergence is only guaranteed in the limit, and one might have to settle for approximations, depending on the complexity of the MDP. Policy iteration is not as straightforward, but it’s closely connected to value iteration, and I find it quite intuitive that once we know that value iteration converges to the optimal state-value/action-value function, policy iteration will also do so. From Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition):

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. In general, the entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and some of which use value iteration updates. Because the max operation in (4.10) is the only difference between these updates this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.

Truncated policy iteration is the general method, with value iteration at one end of the spectrum, where only one iteration of policy evaluation is done before updating the policy greedily, and policy iteration at the other end, where policy evaluation runs until convergence before updating the policy greedily. We know that value iteration only has one fixed-point, so we only have to consider what happens when we add more iterations of policy evaluation before doing one step of value iteration (applying the Bellman optimality operator). One way to think about it is that we simply update the policy using more accurate state-values or action-values, and it’s hard to see how this would interfere with convergence.

Another way to think about it is that each step will either improve on or be as good as the previous policy due to the policy improvement theorem (see chapters 3 and 4 in Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition)). And since each iteration essentially ends with a step of value iteration, we know that the algorithm can only get stuck in one fixed-point, the same as in value iteration.
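
To make the spectrum concrete, here is a minimal sketch (my own, reusing the `r`, `p` and `gamma` arrays from the value iteration sketch above) of truncated policy iteration: with `k = 1` evaluation sweep per improvement step it reduces to value iteration, and with a large `k` it approximates policy iteration.

```python
import numpy as np

def truncated_policy_iteration(r, p, gamma, k, n_outer=1000):
    """Truncated policy iteration on a finite MDP.

    r: (n_states, n_actions) expected rewards r(s, a)
    p: (n_states, n_actions, n_states) transition probabilities p(s' | s, a)
    k: number of policy evaluation sweeps per greedy improvement step
       (k = 1 gives value iteration; a large k approximates policy iteration)
    """
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    for _ in range(n_outer):
        q = r + gamma * p @ v              # q[s, a] under the current value estimate
        pi = np.argmax(q, axis=1)          # greedy (deterministic) policy improvement
        r_pi = r[np.arange(n_states), pi]  # rewards under the greedy policy
        P_pi = p[np.arange(n_states), pi]  # (n_states, n_states) transition matrix
        for _ in range(k):
            v = r_pi + gamma * P_pi @ v    # k sweeps of policy evaluation
    return v, pi
```

Running this with, say, `k = 1` and `k = 100` should converge to the same value function, matching the quote above that all of these algorithms converge to an optimal policy for discounted finite MDPs.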

Beyond intuitive explanations, there are of course formal proofs. From chapter 4 in Mathematical Foundations of Reinforcement Learning by Zhao (the proof is in the book):

The idea of the proof is to show that policy iteration converges faster than value iteration. Since value iteration has been proven to be convergent, the convergence of policy iteration immediately follows.

Note that faster here does not mean less compute time or a smaller total number of iterations, since the policy evaluation step potentially requires an infinite number of iterations, but rather that policy iteration requires equally many or fewer policy updates before convergence.

That’s it for this post. I hope you enjoyed the read and please feel free to reach out to me at c@cfml.se with any questions or feedback!