Monte Carlo Methods



Monte Carlo (MC) methods are a broad class of algorithms that uses repeated random sampling to compute numerical results. The aim of this article is to be a short introduction to MC methods in reinforcement learning (RL). To read more about this subject, I recommend Reinforcement Learning: An Introduction (Sutton, Barto, 2018).

MC methods have three main advantages over Dynamic Programming (DP):

Like DP, MC methods are a form of generalized policy iteration (GPI), which means that they alternate policy evaluation (estimation of value functions) and policy improvement (using value estimates to improve a policy). Unlike DP, MC methods do not use bootstrapping. In other words, value estimates are not computed using other value estimates.

We will be primarily interested in estimating the action-value function $q_\pi(s,a)$ since we want to be able to pick the actions that maximize expected return. We denote the estimates by $Q_\pi(s,a)$ and compute them by averaging sample returns in each state-action pair similarily to contextual Multi-armed Bandits. The difference is that the underlying model is an MDP, which means that the states are interrelated.

This article is going to focus on MC methods for episodic tasks. That is, tasks where the agent eventually reach a terminal state. After each episode is finished, the action-value estimates and the policy is updated. The action-value estimate of a state-action pair is updated such that it’s the return following the first visit, or every visit to the pair, averaged over all episodes. A visit to a state action pair $(s,a)$ means that the agent was in state $s$ and took the action $a$. Both the first-visit method and the every-visit method converge to the actual action values as the number of visits to each state-action pair approaches infinity. Like DP, we don’t need perfect estimates and can truncate the value estimation before the policy is improved. Usually we will update the policy for a state immediately after an action-value has been updated in that state.

Later in the article there is going to be an implemented example of a Monte Carlo method that solves a simple version of blackjack. The source code can be found at

On-policy Methods

The objective of all RL methods is to find policies that maximize expected return. To achieve this, agents have to balance exploitation of current knowledge (selecting actions that maximizes $Q_\pi$) with exploration of other actions that have lower action-value estimates in order to learn better estimates. This is called the exploration versus exploitation tradeoff.

There are two approaches for ensuring exploration, on-policy and off-policy methods. On-policy methods use a single policy $\pi$ that is used both for selecting actions and for evaluation and improvement. Off-policy methods have two policies, one that is evaluated and improved and one that is used for action selection. In this section we are going to look at on-policy methods.

Exploring Starts

To be able to estimate the action-value of a state-action pair, the pair needs to be visited. On way to make sure that the agent visits all state-action pairs is to do exploring starts. This is done by starting the episodes in a random state and selecting a random starting action such that every state-action pair has a probability greater than zero of being selected. Exploring starts is useful for some problems but cannot be relied upon in general, particularly when learning from interaction with an actual environment.

Pseudocode for MC with exploring starts:

On-policy MC with exploring starts
$\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$N(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
Loop for multiple episodes:
Select $(S_0,A_0)$ randomly such that all pairs have probability > 0
Generate an episode from $(S_0,A_0)$, following $\pi: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:
$G \leftarrow \gamma G + R_{t+1}$
If $(S_t,A_t) \notin \{(S_0,A_0),(S_1,A_1),\ldots,(S_{t-1},A_{t-1})\}$:
$N(S_t,A_t) \leftarrow N(S_t,A_t) + 1$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)} [G - Q(S_t,A_t)]$
$\pi(S_t) \leftarrow argmax_a Q(S_t,a)$ (ties broken arbitrarily)

If you want discounting, set $0 < \gamma < 1$. If you don’t want discounting, which is often the case for episodic tasks, set $\gamma = 1$. The algorithm above uses the first-visit method. If you want to use the every-visit method instead, simply remove the if-statement that checks if a state-action pair has been visited in an earlier time step.

Soft Policies

A common way to make sure that the agent explores is to use a soft policy. Soft policies satisfy $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. The probabilities are often shifted closer and closer to a deterministic policy.

$\epsilon$-greedy policies are soft policies that select the action that maximizes $Q_\pi$ with probability $1 - \epsilon$, and a random action with probability $\epsilon$.

Pseudocode for on-policy MC with $\epsilon$-greedy policy:

On-policy MC with $\epsilon$-greedy policy
$\pi(a|s)$ to an arbitrary soft policy
$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$N(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
Loop for multiple episodes:
Generate an episode following $\pi: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:
$G \leftarrow \gamma G + R_{t+1}$
If $(S_t,A_t) \notin \{(S_0,A_0),(S_1,A_1),\ldots,(S_{t-1},A_{t-1})\}$:
$N(S_t,A_t) \leftarrow N(S_t,A_t) + 1$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)} [G - Q(S_t,A_t)]$
$A^* \leftarrow argmax_a Q(S_t,a)$ (ties broken arbitrarily)
$\pi(A^*|S_t) \leftarrow 1 - \epsilon + \epsilon / |\mathcal{A}(S_t)|$
For all $a \neq A^* \in \mathcal{A}(S_t)$:
$\pi(a|S_t) \leftarrow \epsilon / |\mathcal{A}(S_t)|$

Off-policy Methods

Off-policy methods have two policies, the target policy $\pi$ and the behavior policy $b$. The policy that is evaluated and improved is $\pi$ while $b$ is used for action selection. $\pi$ is often a greedy deterministic policy while $b$ has to be a soft policy in order for the agent to explore sufficiently.

Importance Sampling

If action-value estimates are updated in the same way as on-policy methods while following $b$, they would be estimates of $q_b$ rather than the $q_\pi$ we are interested in. To correctly estimate $q_\pi$, we need to use importance sampling. When using importance sampling, returns are weighted by the relative probability of their trajectories occuring under the target and behavior policies. This relative probability is called the importance-sampling ratio.

Given a starting state-action pair $S_t,A_t$, the probability of the subsequent state-action trajectory, $S_{t+1},A_{t+1},S_{t+2},A_{t+2},\ldots,S_T$, occuring while following any policy $\pi$ is

$$ \begin{aligned} Pr&\{S_{t+1},A_{t+1},S_{t+2},A_{t+2},\ldots,S_T | S_t,A_t,A_{t+1:T-1} \sim \pi \} \\ &= p(S_{t+1} | S_t,A_t) \pi(A_{t+1} | S_{t+1}) p(S_{t+2} | S_{t+1},A_{t+1}) \pi(A_{t+2} | S_{t+2}) \ldots p(S_T | S_{T-1},A_{T-1}) \\ &= p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} \pi(A_k | S_k) p(S_{k+1} | S_k,A_k) \end{aligned} $$


$$ p(S_{t+1} | S_t,A_t) = \sum_r p(S_{t+1},R_{t+1} = r | S_t,A_t) $$

Thus, the importance-sampling ratio is

$$ \begin{aligned} \rho_{t:T-1} &= \frac{p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} \pi(A_k | S_k) p(S_{k+1} | S_k,A_k)}{p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} b(A_k | S_k) p(S_{k+1} | S_k,A_k)} \\ &= \frac{\prod_{k=t+1}^{T-1} \pi(A_k | S_k)}{\prod_{k=t+1}^{T-1} b(A_k | S_k)} \end{aligned} $$

The state-transition probabilities cancel and we are left with a definition that doesn’t require any knowledge about the MDP dynamics.

There are two types of importance sampling, ordinary importance sampling that uses a simple average of the weighted returns and weighted importance sampling that uses a weighted average. Let the time step $t$ span over episodes in such a way that if the first episode is 100 steps long, the next episode will start at $t=101$. Let $\tau(s,a)$ be the set of time steps where the state-action pair $(s,a)$ were first visited if using the first-visit method and the set of time steps for all visits if using the every-visit method. Let $T(t)$ be the first termination time step after $t$. We can then estimate $q_\pi$ using ordinary importance sampling by

$$ Q(s,a) = \frac{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1} G_t}{|\tau(s,a)|} $$

and using weighted importance sampling by

$$ Q(s,a) = \frac{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1}} $$

Ordinary importance sampling produces unbiased estimates, but has larger, possibly infinite variance. Weighted importance sampling produces biased estimates but has finite variance, which often makes it preffered in practice.

Pseudocode for off-policy MC with weighted importance sampling:

Off-policy MC with weighted importance sampling
$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$C(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$\pi(s) \leftarrow argmax_a Q(s,a)$ (ties broken consistently)
Loop for multiple episodes:
$b \leftarrow$ any soft policy
Generate an episode following $b: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
$W \leftarrow 1$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:
$G \leftarrow \gamma G + R_{t+1}$
$C(S_t,A_t) \leftarrow C(S_t,A_t) + W$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{W}{C(S_t,A_t)} [G - Q(S_t,A_t)]$
$\pi(s) \leftarrow argmax_a Q(S_t,a)$ (ties broken consistently)
If $A_t \neq \pi(S_t)$: exit inner loop (proceed to next episode)
$W \leftarrow W \frac{1}{b(A_t|S_t)}$

The above algorithm uses the every-visit method. Note that $W$ and $C$ are used to implement weighted importance sampling incrementally.

Blackjack Example

The objective of blackjack is to obtain cards whose numerical sum is as big as possible without exceeding 21. All face cards count as 10 and aces count as 1 or 11. The game begins with two cards dealt to both dealer and player. The dealer has one of his cards face down. If the player has 21 immediately he wins, unless the dealer also has 21, in which case the game is a draw. If the player doesn’t have 21, he can choose between the actions hit and stick. If he chooses hit he draws a new card and if the sum is greater than 21 he goes bust and immediately loses. If it isn’t, he can once again choose between hit and stick. If he chooses stick, the action goes over to the dealer who plays the fixed strategy of sticking on any sum 17 or greater and hitting otherwise. If the dealer goes bust, the player wins. Otherwise, win, lose, or draw is determined by whose final sum is greater. This simplified version of blackjack has no splitting and cards are assumed to be picked from an infinite number or decks (or picked with replacement).

Playing blackjack is naturally formulated as a finite MDP where each game is an episode. Rewards are 0 everwhere except when reaching terminal states, where +1 is given for a win, 0 for a draw, and -1 for a loss. No discounting is done, thus the terminal rewards are also the returns. The state is a combination of the player sum, if the player has a usable ace, and the dealer’s showing card. If the player sum is 11 or less it is always best to hit, thus the values for player sum we are interested in is 12-21. The player has a usable ace if he can count an ace as 11 without going bust. The dealer shows one card (ace-10). Thus there are 200 possible states. The available actions are hit or stick in every non-terminal state.

Note how much easier it is to generate sample transitions than to fully define the dynamics function $p$. Below is an implementation of an MC agent that interacts with the blackjack environment to learn an optimal policy. The agent uses on-policy exploring starts with the every visit method. The final policy, which is probably optimal or close to optimal, is plotted after training.

import numpy as np
from enum import Enum
from copy import copy
import matplotlib as mpl
import matplotlib.pyplot as plt

class Action(Enum):
    HIT = 0
    STICK = 1

class State:
    def __init__(self, player_sum, usable_ace, dealer_showing):
        self.player_sum = player_sum
        self.usable_ace = usable_ace
        self.dealer_showing = dealer_showing
    def to_string(self):
        return "player_sum: {}, usable_ace: {}, dealer_showing: {}".format(
            self.player_sum, self.usable_ace, self.dealer_showing)

class Environment:
    def __init__(self, starting_state):
        self.state = starting_state

    def draw_card(self):
        cards = np.arange(10) + 1
        probabilities = np.append(np.ones(9)/13, 4/13)
        return np.random.choice(cards, p=probabilities)

    def take_action(self, action):
        if action == Action.HIT:
            # Get new card
            card = self.draw_card()
            if card == 1: # Ace worth 1 or 11
                if self.state.player_sum <= 10:
                    self.state.player_sum += 11
                    self.state.usable_ace = True
                    self.state.player_sum += 1
                self.state.player_sum += card

            # Check if player busted or needs to use an ace as 1
            if self.state.player_sum > 21:
                if self.state.usable_ace:
                    self.state.player_sum -= 10
                    self.state.usable_ace = False
                else: # Player busted
                    return -1
            return None
        # Player sticks, dealer now acts
        dealer_usable_ace = self.state.dealer_showing == 1
        dealer_sum = 11 if dealer_usable_ace else self.state.dealer_showing
        while dealer_sum < 17:
            # Dealer get new card
            card = self.draw_card()
            if card == 1: # Ace worth 1 or 11
                if dealer_sum <= 10:
                    dealer_sum += 11
                    dealer_usable_ace = True
                    dealer_sum += 1
                dealer_sum += card

            # Check if dealer busted or needs to use an ace as 1
            if dealer_sum > 21:
                if dealer_usable_ace:
                    dealer_sum -= 10
                    dealer_usable_ace = False
                else: # Dealer busted
                    return 1
        # Round is over
        if dealer_sum > self.state.player_sum:
            return -1
        elif dealer_sum < self.state.player_sum:
            return 1
        return 0

class Agent:
    # The policy is a deterministic mapping from state to action: 0 = HIT, 1 = STICK
    # Init policy to only stick on 20 and 21
    policy = np.zeros((2, 10, 10))
    policy[:, 8:10, :] = 1
    action_values = np.zeros((2, 10, 10, 2))
    num_visits = np.zeros(np.shape(action_values))

    def run_episode(self, environment, starting_action):
        history = []
        reward = None

        # Generate episode
        history.append((copy(environment.state), starting_action))
        reward = environment.take_action(starting_action)
        while reward is None:
            i = 0 if environment.state.usable_ace else 1
            j = environment.state.player_sum - 12
            k = environment.state.dealer_showing - 1
            action = Action(self.policy[i, j, k])
            history.append((copy(environment.state), action))
            reward = environment.take_action(action)

        # Use episode history to update state-values
        for state, action in history:
            i = 0 if state.usable_ace else 1
            j = state.player_sum - 12
            k = state.dealer_showing - 1
            l = action.value
            self.num_visits[i, j, k, l] += 1
            self.action_values[i, j, k, l] += (reward - self.action_values[i, j, k, l]) / self.num_visits[i, j, k, l]
            self.policy[i, j, k] = 0 if self.action_values[i, j, k, 0] > self.action_values[i, j, k, 1] else 1

if __name__ == "__main__":
    NUM_EPISODES = 10000000
    agent = Agent()

    # Agent learning policy
    for episode in range(NUM_EPISODES):
        if episode % 10000 == 0:
            print("STARTING EPISODE:", str(episode))
        random_starting_state = State(
            np.random.randint(12, high=22),
            True if np.random.randint(2) == 0 else False,
            np.random.randint(1, high=11)
        environment = Environment(random_starting_state)
        random_starting_action = Action(np.random.randint(2))
        agent.run_episode(environment, random_starting_action)

    # Plot final policy
    ticks = np.arange(10)
    xtick_labels = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
    ytick_labels = range(12, 22)
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.imshow(agent.policy[0], cmap=mpl.colors.ListedColormap(["red", "blue"]), origin="lower")
    ax1.set_xlabel("Dealer showing")
    ax1.set_ylabel("Player sum")
    ax1.set_title("Usable ace")
    im2 = ax2.imshow(agent.policy[1], cmap=mpl.colors.ListedColormap(["red", "blue"]), origin="lower")
    ax2.set_title("No usable ace")
    cbar_ax = fig.add_axes([0.9, 0.4, 0.02, 0.2])
    cbar = fig.colorbar(im2, cax=cbar_ax, ticks=[0, 1])["Hit", "Stick"])



This has been a short introduction to Monte Carlo methods in reinforcement learning. On-policy and off-policy methods were discussed and a simple on-policy MC agent was implemented. The blackjack example is a good starting point for implementing MC methods because of its simplicity.

Thank you for reading and feel free to send me any questions.