Deep Q-Networks

Introduction

Deep Q-Networks (DQN) are artificial neural networks (ANNs) whose parameters are updated using a variant of Q-learning. DQN was introduced in the paper Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013).

The aim of this article is to describe the core concepts of DQN and how it can be applied to the OpenAI Gym CartPole problem. The source code can be found at https://github.com/CarlFredriksson/openai_gym_problems.

Approximate RL Methods

In my previous post Temporal Difference Learning, I described the basic tabular version of Q-learning. Tabular reinforcement learning (RL) methods are great for problems with a small state space, but are not suited for problems where the state space is large. This is because of the memory required to store the table of action-values and the time it would take to fill the table.

Many RL methods can be extended from tabular to approximate by using function approximators, such as linear models or ANNs. This makes them better suited for problems with large state spaces. Tabular methods define the action-value estimates $Q(s,a)$ as a function of state $s$ and action $a$. Approximate methods define them as $Q(s,a,w)$, where $w$ denotes the model parameters of the function approximator, such as the weights and biases of a neural network. Instead of updating the action-values themselves, approximate methods update the model parameters $w$, usually with some form of gradient descent.

Assume that we know the true action-value $q(s,a)$ and want our estimate $Q(s,a,w)$ to move towards it. A natural loss function $\mathcal{L}(w)$ for this is the squared error (scaled by $\frac{1}{2}$ for convenience, since the factor cancels when differentiating)

$$ \mathcal{L}(w) = \frac{1}{2} \big[q(s,a) - Q(s,a,w) \big]^2 $$

Let $\alpha$ be the learning rate. We can now do gradient descent on $\mathcal{L}(w)$:

$$ \begin{aligned} w_{t+1} &= w_t - \alpha \nabla \mathcal{L}(w) \\ &= w_t - \frac{1}{2} \alpha \nabla \big[q(s,a) - Q(s,a,w) \big]^2 \\ &= w_t + \alpha \big[q(s,a) - Q(s,a,w) \big] \nabla Q(s,a,w) \end{aligned} $$

Of course, if we knew the true $q(s,a)$, the problem would already be solved. Let us denote the target we want our action-value estimates to move towards by $Q_{target}$. Substituting $Q_{target}$ for $q(s,a)$ in the update rule above gives us

$$ w_{t+1} = w_t + \alpha \big[Q_{target} - Q(s,a,w) \big] \nabla Q(s,a,w) $$

Let us now define $Q_{target}$ to be the target in Q-learning

$$ Q_{target} = R + \gamma \operatorname{max}_a Q(S^\prime,a,w) $$

This gives us

$$ w_{t+1} = w_t + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(s,a,w) \big] \nabla Q(s,a,w) $$

This is the update rule for approximate Q-learning. Note that $Q_{target}$ also depends on $w$, but it is treated as a constant when computing the gradient, which is why this kind of update is called a semi-gradient method.

Approximate version of Q-learning
Algorithm parameters: $\alpha \in (0,1]$, $\gamma \in (0,1]$, small $\epsilon > 0$
Initialize model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R$, $S^\prime$
        If $S^\prime$ is terminal:
            $w \leftarrow w + \alpha \big[R - Q(S,A,w) \big] \nabla Q(S,A,w)$
        Else:
            $w \leftarrow w + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(S,A,w) \big] \nabla Q(S,A,w)$
        $S \leftarrow S^\prime$
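
To make the update concrete, here is a minimal sketch of the inner update from the pseudocode above, using a linear function approximator $Q(s,a,w) = w_a \cdot \phi(s)$, for which the gradient with respect to $w_a$ is simply the feature vector $\phi(s)$. The feature vectors, function names, and hyperparameter values are illustrative and not part of the CartPole implementation later in the article.

import numpy as np

# Sketch of the semi-gradient Q-learning update for a linear approximator,
# Q(s, a, w) = w[a] . phi(s). All names and values are illustrative.
def q_values(w, phi_s):
    return w @ phi_s  # one Q-value per action

def q_learning_update(w, phi_s, a, r, phi_s_next, terminal, alpha=0.01, gamma=0.95):
    target = r if terminal else r + gamma * np.max(q_values(w, phi_s_next))
    td_error = target - q_values(w, phi_s)[a]
    w[a] += alpha * td_error * phi_s  # gradient of Q(s, a, w) w.r.t. w[a] is phi(s)
    return w

# Example usage with random features for a problem with 2 actions
w = np.zeros((2, 4))
phi_s, phi_s_next = np.random.rand(4), np.random.rand(4)
w = q_learning_update(w, phi_s, a=1, r=1.0, phi_s_next=phi_s_next, terminal=False)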

Experience Replay

DQN uses the same update rule as approximate Q-learning above. The difference is that $S$, $A$, $R$, and $S^\prime$ are added to a memory instead of being used immediately for a parameter update. The parameters are updated using randomly sampled batches of $(S, A, R, S^\prime)$ experiences from the memory. This is called experience replay.
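
A replay memory can be as simple as a fixed-size buffer that supports appending experiences and sampling them uniformly at random. A minimal sketch (the class name and default sizes are illustrative; the CartPole implementation below uses the same deque and random.sample approach directly inside the agent):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        # Fixed-size buffer; the oldest experiences are dropped when it is full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, state_next, done):
        self.buffer.append((state, action, reward, state_next, done))

    def sample(self, batch_size=20):
        # Uniformly sample a batch of stored experiences
        return random.sample(self.buffer, batch_size)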

Experience replay has two major advantages. It makes more efficient use of previous experiences, since each one can be used for learning multiple times. It also leads to better convergence behavior when training a function approximator. One reason for this is that the experiences in a single trajectory are often highly correlated, but sampling randomly from the memory makes the training data more like independent and identically distributed (IID) data. This is important since most supervised learning convergence results assume IID data.

One disadvantage of experience replay is that it is harder to combine with multi-step learning algorithms, such as $Q(\lambda)$.

DQN
Algorithm parameters: $\alpha, \gamma, \epsilon_{max}, \epsilon_{min}, \epsilon_{decay} \in (0,1]$, ${batch\_size} \geq 1$
Initialize ANN model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Initialize $\epsilon \leftarrow \epsilon_{max}$
Initialize memory
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R$, $S^\prime$
        Add $(S,A,R,S^\prime)$ to memory
        Sample random batch from memory
        Loop for each $(S,A,R,S^\prime)$ in batch:
            If $S^\prime$ is terminal:
                $w \leftarrow w + \alpha \big[R - Q(S,A,w) \big] \nabla Q(S,A,w)$
            Else:
                $w \leftarrow w + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(S,A,w) \big] \nabla Q(S,A,w)$
        $\epsilon \leftarrow \operatorname{max}(\epsilon_{min}, \epsilon \times \epsilon_{decay})$
        $S \leftarrow S^\prime$

OpenAI CartPole Problem

OpenAI Gym has many problems to try algorithms on. Let us apply DQN to the simple CartPole problem. Description from https://gym.openai.com/envs/CartPole-v1/ :

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
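
Before training, it can be useful to take a quick look at the environment's state and action spaces. A minimal sketch, assuming the same classic gym API as the code below (where reset() returns the state and step() returns four values):

import gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box with 4 values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push the cart left or right

# Run one episode with random actions
state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())
env.close()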

import random
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

ENV_NAME = "CartPole-v1"
NUM_EPISODES = 100
MEMORY_SIZE = NUM_EPISODES * 1000
BATCH_SIZE = 20
GAMMA = 0.95
LEARNING_RATE = 0.001
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995

class DQNAgent():
    def __init__(self, state_space_size, action_space_size):
        self.action_space_size = action_space_size
        self.exploration_rate = EXPLORATION_MAX
        self.memory = deque(maxlen=MEMORY_SIZE)

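        # ANN that takes the state as input and outputs one Q-value per action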
        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(state_space_size,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(action_space_size, activation="linear"))
        self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))

    def select_action(self, state):
        # Select action epsilon-greedily
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space_size)
        Q_values = self.model.predict(state)
        return np.argmax(Q_values[0])

    def remember(self, state, action, reward, state_next, done):
        self.memory.append((state, action, reward, state_next, done))

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return

        # Update model parameters using random samples from memory
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, done in batch:
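            # Q-learning target: reward if terminal, else reward + GAMMA * max_a Q(state_next, a, w).
            # Only the output for the chosen action is changed; the other outputs keep their
            # predicted values and therefore contribute zero error to the loss.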
            Q_update = reward if done else (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            Q_values = self.model.predict(state)
            Q_values[0][action] = Q_update
            self.model.fit(state, Q_values, verbose=0)

    def update_exploration_rate(self):
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)

def cartpole():
    # Create environment and agent
    env = gym.make(ENV_NAME)
    state_space_size = env.observation_space.shape[0]
    action_space_size = env.action_space.n
    dqn_agent = DQNAgent(state_space_size, action_space_size)

    # Simulate episodes
    scores = np.zeros(NUM_EPISODES)
    for episode in range(NUM_EPISODES):
        state = env.reset()
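        # Add a batch dimension since the Keras model expects batches of states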
        state = np.expand_dims(state, axis=0)
        done = False
        while not done:
            #env.render()
            action = dqn_agent.select_action(state)
            state_next, reward, done, info = env.step(action)
            scores[episode] += reward
            state_next = np.expand_dims(state_next, axis=0)
            dqn_agent.remember(state, action, reward, state_next, done)
            dqn_agent.experience_replay()
            dqn_agent.update_exploration_rate()
            state = state_next
        print("Episode: {}, Score: {}".format(episode, scores[episode]))

    # Plot score
    plt.plot(np.arange(NUM_EPISODES), scores)
    plt.xlabel("Episode")
    plt.ylabel("Score")
    plt.show()

if __name__ == "__main__":
    cartpole()

[Figure: plot of the score per episode produced by the code above]

Conclusion

DQN extends Q-learning to problems with large state spaces by combining a neural network function approximator with experience replay, which is enough to get started on problems like CartPole where tabular methods are impractical. Thank you for reading and feel free to send me any questions.