Deep Q-Networks
Introduction
Deep Q-Networks (DQN) are artificial neural networks (ANNs) that use a variant of Q-learning to update the parameters of the network. DQN was introduced in Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013).
The aim of this article is to describe the core concepts of DQN and how it can be applied to the OpenAI Gym CartPole problem. The source code can be found at https://github.com/CarlFredriksson/openai_gym_problems.
Approximate RL Methods
In my previous post Temporal Difference Learning, I described the basic tabular version of Q-learning. Tabular reinforcement learning (RL) methods are great for problems with a small state space, but are not suited for problems where the state space is large. This is because of the memory required to store the table of action-values and the time it would take to fill the table.
Many RL methods can be extended from tabular to approximate by using function approximators, such as linear models or ANNs. This makes them better suited for problems with large state spaces. Tabular methods define the action-value estimates $Q(s,a)$ as a function of state $s$ and action $a$. Approximate methods define them as $Q(s,a,w)$, where $w$ contains the model parameters of the function approximator, such as the weights and biases of a neural network. Instead of updating the action-values themselves, approximate methods update the model parameters $w$. This is usually done using some form of gradient descent.
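To make the distinction concrete, here is a minimal sketch (not from the original derivation) that contrasts the two representations. It uses a hypothetical linear function approximator with one weight vector per action; the post later uses an ANN instead, but the idea is the same.

import numpy as np

num_states, num_actions, state_dim = 100, 2, 4

# Tabular: one stored estimate per (state, action) pair
Q_table = np.zeros((num_states, num_actions))
table_value = Q_table[3, 1]  # look up Q(s=3, a=1) directly

# Approximate: Q(s, a, w) is computed from the parameters w
w = np.zeros((num_actions, state_dim))  # one weight vector per action

def Q(state, action, w):
    # Linear action-value estimate: Q(s, a, w) = w_a . s
    return w[action] @ state

state = np.array([0.1, -0.2, 0.05, 0.3])
approx_value = Q(state, 1, w)  # computed from w, nothing stored per state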
Assume that we know the true action-value function $q(s,a)$ and want our estimate $Q(s,a,w)$ to move towards it. A natural loss function $\mathcal{L}(w)$ to facilitate this is the squared error, with a factor of $\frac{1}{2}$ that cancels when differentiating
$$ \mathcal{L}(w) = \frac{1}{2} \big[q(s,a) - Q(s,a,w) \big]^2 $$
Let $\alpha$ be the learning rate. We can now do gradient descent on $\mathcal{L}(w)$

$$ w \leftarrow w - \alpha \nabla_w \mathcal{L}(w) = w + \alpha \big[q(s,a) - Q(s,a,w) \big] \nabla_w Q(s,a,w) $$

Of course, if we knew the true $q(s,a)$, the problem would already be solved. Let us denote the target we want our action-value estimates to move towards with $Q_{target}$. Replacing $q(s,a)$ with $Q_{target}$ in the update rule above gives us

$$ w \leftarrow w + \alpha \big[Q_{target} - Q(s,a,w) \big] \nabla_w Q(s,a,w) $$

Let us now define $Q_{target}$ to be the target in Q-learning

$$ Q_{target} = r + \gamma \max_{a^\prime} Q(s^\prime,a^\prime,w) $$

This gives us

$$ w \leftarrow w + \alpha \big[r + \gamma \max_{a^\prime} Q(s^\prime,a^\prime,w) - Q(s,a,w) \big] \nabla_w Q(s,a,w) $$

which is the update rule for approximate Q-learning. Note that the target is treated as a constant when computing the gradient, which makes this a semi-gradient method.
Pseudocode for episodic approximate Q-learning with $\epsilon$-greedy action selection:

Initialize model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Select $A$ $\epsilon$-greedily from $Q(S,\cdot,w)$
        Take action $A$, observe $R$, $S^\prime$
        If $S^\prime$ is terminal:
            $w \leftarrow w + \alpha \big[R - Q(S,A,w)\big] \nabla_w Q(S,A,w)$
        Else:
            $w \leftarrow w + \alpha \big[R + \gamma \max_a Q(S^\prime,a,w) - Q(S,A,w)\big] \nabla_w Q(S,A,w)$
        $S \leftarrow S^\prime$
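As a rough sketch of what this update looks like in code, here is a single update step for the hypothetical linear model introduced above. For a linear model the gradient of $Q(s,a,w)$ with respect to $w_a$ is simply the state vector; with an ANN, the deep learning framework computes the gradient for us.

import numpy as np

ALPHA = 0.01   # learning rate (alpha)
GAMMA = 0.95   # discount factor (gamma)
num_actions, state_dim = 2, 4
w = np.zeros((num_actions, state_dim))  # one weight vector per action

def Q(state, action, w):
    # Linear action-value estimate: Q(s, a, w) = w_a . s
    return w[action] @ state

def q_learning_update(w, s, a, r, s_next, done):
    # One approximate Q-learning update for the linear model.
    if done:
        target = r
    else:
        target = r + GAMMA * max(Q(s_next, a2, w) for a2 in range(num_actions))
    # grad_w Q(s, a, w) with respect to w[a] is just s for a linear model
    w[a] += ALPHA * (target - Q(s, a, w)) * s
    return w

# Example of a single update from one observed transition
s = np.array([0.1, -0.2, 0.05, 0.3])
s_next = np.array([0.12, -0.25, 0.04, 0.28])
w = q_learning_update(w, s, a=0, r=1.0, s_next=s_next, done=False)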
Experience Replay
DQN uses the same update rule as the one described in approximate Q-learning above. The difference is that $S$, $A$, $R$, and $S^\prime$ are not used immediately for a parameter update, but are instead added to a memory. The parameters are then updated using randomly sampled batches of $(S,A,R,S^\prime)$ experiences from the memory. This is called experience replay.
Experience replay has two major advantages. It makes more efficient use of previous experiences, since each experience can be used for learning multiple times. It also leads to better convergence behavior when training a function approximator. One reason for this is that the experiences within a single trajectory are often strongly correlated, while randomly sampling from the memory makes the training data more like independent and identically distributed (IID) data. This is important since most supervised learning convergence results assume IID data.
One disadvantage of experience replay is that it is harder to use with multi-step learning algorithms, such as $Q(\lambda)$.
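A replay memory can be as simple as a bounded double-ended queue that transitions are pushed onto and sampled uniformly from, which is also how the CartPole implementation below stores its experiences. A minimal sketch (the capacity and batch size are illustrative):

import random
from collections import deque

MEMORY_SIZE = 10000  # illustrative capacity
BATCH_SIZE = 20

memory = deque(maxlen=MEMORY_SIZE)  # oldest experiences are discarded automatically

def remember(state, action, reward, state_next, done):
    memory.append((state, action, reward, state_next, done))

def sample_batch():
    # Uniform random sampling breaks up the correlations within a trajectory
    return random.sample(memory, BATCH_SIZE)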
Pseudocode for DQN with experience replay:

Initialize ANN model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Initialize $\epsilon \leftarrow \epsilon_{max}$
Initialize memory
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Select $A$ $\epsilon$-greedily from $Q(S,\cdot,w)$
        Take action $A$, observe $R$, $S^\prime$
        Add $(S,A,R,S^\prime)$ to memory
        Sample random batch from memory
        Loop for each $(S,A,R,S^\prime)$ in batch:
            If $S^\prime$ is terminal:
                $Q_{target} \leftarrow R$
            Else:
                $Q_{target} \leftarrow R + \gamma \max_a Q(S^\prime,a,w)$
            $w \leftarrow w + \alpha \big[Q_{target} - Q(S,A,w)\big] \nabla_w Q(S,A,w)$
        $\epsilon \leftarrow \max(\epsilon_{min}, \epsilon \cdot \epsilon_{decay})$
        $S \leftarrow S^\prime$
OpenAI CartPole Problem
OpenAI Gym has many problems to try algorithms on. Let us apply DQN to the simple CartPole problem. Description from https://gym.openai.com/envs/CartPole-v1/:
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.
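Before training an agent, a quick way to get a feel for the environment is to run it with random actions. The sketch below assumes the same classic Gym API as the full listing that follows (env.reset() returning the state and env.step() returning a 4-tuple). A random policy usually fails quickly, which gives a baseline to compare the DQN agent against.

import gym

env = gym.make("CartPole-v1")
state = env.reset()
done = False
score = 0
while not done:
    action = env.action_space.sample()  # push the cart left or right at random
    state, reward, done, info = env.step(action)
    score += reward
print("Random policy score:", score)

The full DQN implementation is shown below.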
import random
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
ENV_NAME = "CartPole-v1"
NUM_EPISODES = 100
MEMORY_SIZE = NUM_EPISODES * 1000
BATCH_SIZE = 20
GAMMA = 0.95
LEARNING_RATE = 0.001
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995
class DQNAgent():
def __init__(self, state_space_size, action_space_size):
self.action_space_size = action_space_size
self.exploration_rate = EXPLORATION_MAX
self.memory = deque(maxlen=MEMORY_SIZE)
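        # Build the Q-network: a small fully connected ANN mapping a state to one Q-value per action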
self.model = Sequential()
self.model.add(Dense(24, input_shape=(state_space_size,), activation="relu"))
self.model.add(Dense(24, activation="relu"))
self.model.add(Dense(action_space_size, activation="linear"))
self.model.compile(loss="mse", optimizer=Adam(lr=LEARNING_RATE))
def select_action(self, state):
# Select action epsilon-greedily
if np.random.rand() < self.exploration_rate:
return random.randrange(self.action_space_size)
Q_values = self.model.predict(state)
return np.argmax(Q_values[0])
def remember(self, state, action, reward, state_next, done):
self.memory.append((state, action, reward, state_next, done))
def experience_replay(self):
if len(self.memory) < BATCH_SIZE:
return
# Update model parameters using random samples from memory
batch = random.sample(self.memory, BATCH_SIZE)
for state, action, reward, state_next, done in batch:
Q_update = reward if done else (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
Q_values = self.model.predict(state)
Q_values[0][action] = Q_update
self.model.fit(state, Q_values, verbose=0)
def update_exploration_rate(self):
self.exploration_rate *= EXPLORATION_DECAY
self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)
def cartpole():
# Create environment and agent
env = gym.make(ENV_NAME)
state_space_size = env.observation_space.shape[0]
action_space_size = env.action_space.n
dqn_agent = DQNAgent(state_space_size, action_space_size)
# Simulate episodes
scores = np.zeros(NUM_EPISODES)
for episode in range(NUM_EPISODES):
state = env.reset()
state = np.expand_dims(state, axis=0)
done = False
while not done:
#env.render()
action = dqn_agent.select_action(state)
state_next, reward, done, info = env.step(action)
scores[episode] += reward
state_next = np.expand_dims(state_next, axis=0)
dqn_agent.remember(state, action, reward, state_next, done)
dqn_agent.experience_replay()
dqn_agent.update_exploration_rate()
state = state_next
print("Episode: {}, Score: {}".format(str(episode), str(scores[episode])))
# Plot score
plt.plot(np.arange(NUM_EPISODES), scores)
plt.xlabel("Episode")
plt.ylabel("Score")
plt.show()
if __name__ == "__main__":
cartpole()
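After training, the learned policy can be checked by acting greedily, i.e. with exploration turned off. The function below is a sketch rather than part of the listing above; it assumes cartpole() is modified to return dqn_agent so that a trained agent can be passed in.

def evaluate(agent, num_episodes=10):
    # Run the greedy policy (exploration disabled) and report the average score.
    # Assumes `agent` is a trained DQNAgent from the listing above.
    env = gym.make(ENV_NAME)
    agent.exploration_rate = 0.0  # always pick argmax_a Q(s, a, w)
    total_score = 0.0
    for _ in range(num_episodes):
        state = np.expand_dims(env.reset(), axis=0)
        done = False
        while not done:
            action = agent.select_action(state)
            state, reward, done, info = env.step(action)
            state = np.expand_dims(state, axis=0)
            total_score += reward
    print("Average score over {} episodes: {}".format(num_episodes, total_score / num_episodes))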
Conclusion
In this post, we extended tabular Q-learning to an approximate version that uses a neural network as a function approximator, stabilized training with experience replay, and applied the resulting DQN agent to the OpenAI Gym CartPole problem. Thank you for reading and feel free to send me any questions.