Categories on Carl Fredriksson

Causal Inference Basics

Tue, 30 Jan 2024 00:00:00 +0000

RL Gymnasium Training

Thu, 09 Nov 2023 00:00:00 +0000

After my second, significantly more thorough reading of Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition), I wanted to get my hands dirty and apply some of what I’d learned. I created the repository rl-gymnasium-training to work on my Reinforcement Learning (RL) skills using Gymnasium environments (https://gymnasium.farama.org/). There is no substitute for trying to implement and apply algorithms yourself. In my experience it improves not only your practical skills in a field but also your understanding of the theory. At the time of writing I had solved three simple Gymnasium environments; below are some GIF examples:

(GIF comparison: Random Agent vs. Trained Agent)

Check out the repository if you’re interested, and feel free to reach out with any questions!

Bellman Operators are Contractions

Mon, 07 Nov 2022 00:00:00 +0000

Introduction

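I’m studying Reinforcement Learning (RL) again, primarily using the books:

1. Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition)
2. Mathematical Foundations of Reinforcement Learning by Zhao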

I’ve read book 1 once before, often referred to as the RL textbook or even the RL bible, but I had forgotten most of it and wanted to study the book in a slower and more careful manner, with the aim of finishing it with a deeper understanding than I did the first time. At the time of writing this I’m still in the early parts of the book. What prompted this post was to document what I learned while trying to answer the following questions that I struggled with for a while:

How do we know that the Bellman equations and Bellman optimality equations have unique solutions?
Why can’t iterative policy evaluation or policy/value iteration get stuck in local optima?

To answer these questions I had to improve my mathematical foundations, and I’m grateful to have found book 2 which has been a big help along with various other online resources. With this post I hope to preserve my knowledge and to help anyone with similar questions that happen to find it. If you have any questions or feedback, please feel free to reach out to me at c@cfml.se.

Contraction mappings and the Banach fixed-point theorem

From the Wikipedia article Banach fixed-point theorem:

Definition. Let $(X,d)$ be a complete metric space. Then a map $T:X \to X$ is called a contraction mapping on $X$ if there exists $q \in [0,1)$ such that

$$ d\big(T(x),T(y)\big) \leq qd(x,y) $$

for all $x,y \in X$.

Banach Fixed Point Theorem. Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $T:X\to X$. Then $T$ admits a unique fixed-point $x^*$ in $X$ (i.e. $T(x^*) = x^*$). Furthermore, $x^*$ can be found as follows: start with an arbitrary element $x_0 \in X$ and define a sequence $(x_n)_{n \in \mathbb{N}}$ by $x_n = T(x_{n-1})$ for $n \geq 1$. Then $\lim_{n \to \infty} x_n = x^*$.

Thus, we need to prove that the right hand side of the Bellman equations and Bellman optimality equations are contractions to answer both questions stated in the introduction.

You can see a proof of the theorem in the Wikipedia article Banach fixed-point theorem. The rough idea is that the sequence $(x_n)_{n \in \mathbb{N}}$ defined by $x_n = T(x_{n-1})$ for $n \geq 1$ is shown to be a Cauchy sequence (a sequence whose elements become arbitrarily close to each other as the sequence progresses). Since $X$ is complete, the sequence converges to some $x^* \in X$, which has to be a fixed point of $T$ because $T$ is continuous. It has to be the only fixed point of $T$ since any pair of distinct fixed points $p_1$ and $p_2$ would contradict $T$ being a contraction:

$$ d\big(T(p_1), T(p_2)\big) = d(p_1, p_2) > q d(p_1, p_2) $$
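To make the theorem concrete, here is a minimal sketch of the iteration $x_n = T(x_{n-1})$ in Python. The map $T(x) = 0.5x + 1$ is a hypothetical example chosen for illustration (it is a contraction on $\mathbb{R}$ with $q = 0.5$ and fixed point $x^* = 2$), and the helper name fixed_point_iteration is my own.

```python
def fixed_point_iteration(T, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_n = T(x_{n-1}) until successive iterates differ by less than tol."""
    x = x0
    for n in range(1, max_iter + 1):
        x_next = T(x)
        if abs(x_next - x) < tol:
            return x_next, n
        x = x_next
    raise RuntimeError("no convergence within max_iter iterations")


def T(x):
    # Hypothetical contraction with q = 0.5: |T(x) - T(y)| = 0.5 * |x - y|.
    return 0.5 * x + 1.0


# Every starting point should end up at the same fixed point x* = 2.
results = {x0: fixed_point_iteration(T, x0) for x0 in (-100.0, 0.0, 37.5)}
for x0, (x_star, n_steps) in results.items():
    print(f"x0 = {x0:7.1f} -> x* = {x_star:.10f} after {n_steps} steps")
```

The starting point only affects how many steps are needed, not where the iteration ends up, which is exactly the property we want for the Bellman operators.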

Bellman equations

Let’s consider any finite Markov decision process (MDP) where $\mathcal{S}$ denotes the finite set of all states, $\mathcal{R} \subset \mathbb{R}$ the finite set of all rewards, and $\mathcal{A}(s)$ the finite set of all available actions $a$ in state $s$. Let $r_\pi(s)$ denote the expected immediate reward and $p_\pi(s^\prime | s)$ the probability of transitioning to state $s^\prime$ when following policy $\pi$ from state $s$. Let $\gamma \in (0, 1)$ denote the discount rate for future rewards. I’m assuming some knowledge about MDPs and won’t explain all of the notation here. Feel free to check my older post Markov Decision Processes for an introduction.

The Bellman equation for the state-value function for policy $\pi$ can be defined as follows:

$$ \begin{aligned} v_{\pi}(s) &= \mathbb{E}_\pi \big[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s \big] \\ &= \sum_{a \in \mathcal{A}(s)} \pi(a | s) \bigg[\sum_{r \in \mathcal{R}} p(r | s, a) r + \gamma \sum_{s^\prime \in \mathcal{S}} p(s^\prime | s, a) v_{\pi}(s^\prime) \bigg] \\ &= r_\pi(s) + \gamma \sum_{s^\prime \in \mathcal{S}} p_\pi(s^\prime | s) v_{\pi}(s^\prime) \end{aligned} $$

for all $s \in \mathcal{S}$. Letting $n = |\mathcal{S}|$, we can write the equation in matrix form:

$$ \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix}= \begin{bmatrix} r_\pi(1) \\ \vdots \\ r_\pi(n) \end{bmatrix} +\gamma \begin{bmatrix} p_\pi(1 | 1) & \dots & p_\pi(n | 1) \\ \vdots & \ddots & \vdots\\ p_\pi(1 | n) & \dots & p_\pi(n | n) \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} $$

More compactly:

$$ v_\pi = r_\pi + \gamma P_\pi v_\pi $$

We can define an (expected) Bellman operator $\mathcal{T}^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as:

$$ \mathcal{T}^\pi(v) = r_\pi + \gamma P_\pi v $$

for any $v \in \mathbb{R}^n$.

Let $||\cdot||$ be a norm on $\mathbb{R}^n$. If $||\mathcal{T}^\pi(v_1) - \mathcal{T}^\pi(v_2)|| \leq \gamma ||v_1 - v_2||$ for all $v_1, v_2 \in \mathbb{R}^n$, then, since $\gamma \in (0, 1)$, $\mathcal{T}^\pi$ is a contraction mapping with $q = \gamma$. Note that the definition of a contraction uses a distance function $d$ for generality (every norm induces a distance function via $d(x, y) = ||x - y||$, but not every distance function comes from a norm). In all proofs in this post $|\cdot|$ and $\leq$ are applied elementwise to vectors, and the norm used is the max norm $||x||_\infty = \max(|x|) = \max(|x_1|, \dots, |x_n|)$, where $\max(\cdot) : \mathbb{R}^n \to \mathbb{R}$ chooses the largest element in a vector. Since $\mathbb{R}^n$ with the max norm is a complete metric space, the Banach fixed-point theorem applies.

$$ \begin{aligned} ||\mathcal{T}^\pi(v_1) - \mathcal{T}^\pi(v_2)||_\infty &= \max \big(|r_\pi + \gamma P_\pi v_1 - (r_\pi + \gamma P_\pi v_2)| \big) \\ &= \gamma \max \big(|P_\pi(v_1 - v_2)| \big) \\ &\leq \gamma \max \big(P_\pi|v_1 - v_2| \big) \\ &\leq \gamma \max \big(|v_1 - v_2| \big) \\ &= \gamma ||v_1 - v_2||_\infty \end{aligned} $$

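Thus $\mathcal{T}^\pi$ is a contraction. Both inequalities rely on the rows of $P_\pi$ containing only non-negative elements that sum to 1. For each row $i$, the triangle inequality gives the first inequality, and bounding every element of $|v_1 - v_2|$ by the largest one gives the last:

$$ \Big|\sum_{j} p_\pi(j | i) (v_1 - v_2)_j \Big| \leq \sum_{j} p_\pi(j | i) |v_1 - v_2|_j \leq \max \big(|v_1 - v_2| \big) \sum_{j} p_\pi(j | i) = \max \big(|v_1 - v_2| \big) $$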
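The result is easy to check numerically. The following is a minimal sketch, assuming a randomly generated row-stochastic $P_\pi$ and random $r_\pi$ (both hypothetical, as are all the names in the code). It tests the contraction inequality on random pairs of vectors and then compares iterative policy evaluation with the closed-form solution $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Hypothetical policy-induced dynamics: random row-stochastic P_pi, random r_pi.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.normal(size=n)


def bellman_operator(v):
    """T^pi(v) = r_pi + gamma * P_pi v."""
    return r_pi + gamma * P_pi @ v


def max_norm(x):
    return np.max(np.abs(x))


# Check the contraction inequality on random pairs (small slack for rounding).
for _ in range(1000):
    v1, v2 = rng.normal(size=n), rng.normal(size=n)
    lhs = max_norm(bellman_operator(v1) - bellman_operator(v2))
    assert lhs <= gamma * max_norm(v1 - v2) + 1e-9

# Iterative policy evaluation: apply the operator repeatedly from an arbitrary start.
v = rng.normal(size=n) * 100
for _ in range(500):
    v = bellman_operator(v)

# Closed-form fixed point of v = r_pi + gamma * P_pi v.
v_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print("distance to exact solution:", max_norm(v - v_exact))
```

After 500 applications the initial error has been scaled by at most $\gamma^{500}$, so the iterate and the exact solution should agree up to floating-point precision.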

Let $r(s, a)$ denote the expected immediate reward when selecting action $a$ in state $s$, and $p_\pi(s^\prime, a^\prime | s, a)$ the probability of transitioning to state $s^\prime$ and selecting action $a^\prime$ when selecting action $a$ in state $s$ and following policy $\pi$ after. The Bellman equation for the action-value function for policy $\pi$ can be defined as follows:

$$ \begin{aligned} q_{\pi}(s, a) &= \mathbb{E}_\pi \big[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) | S_t = s, A_t = a \big] \\ &= \sum_{r \in \mathcal{R}} p(r | s, a) r + \gamma \sum_{s^\prime \in \mathcal{S}} p(s^\prime | s, a) \sum_{a^\prime \in \mathcal{A}(s^\prime)} \pi(a^\prime | s^\prime) q_{\pi}(s^\prime, a^\prime) \\ &= r(s, a) + \gamma \sum_{s^\prime \in \mathcal{S}} \sum_{a^\prime \in \mathcal{A}(s^\prime)} p_\pi(s^\prime, a^\prime | s, a) q_{\pi}(s^\prime, a^\prime) \end{aligned} $$

for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$. Letting $n = |\mathcal{S}|$ and $n_s = |\mathcal{A}(s)|$, we can write the equation in matrix form:

$$ \begin{bmatrix} q_\pi(1, 1) \\ q_\pi(1, 2) \\ \vdots \\ q_\pi(n, n_n) \end{bmatrix}= \begin{bmatrix} r_\pi(1, 1) \\ r_\pi(1, 2) \\ \vdots \\ r_\pi(n, n_n) \end{bmatrix} +\gamma \begin{bmatrix} p_\pi(1, 1 | 1, 1) & p_\pi(1, 2 | 1, 1) & \dots & p_\pi(n, n_n | 1, 1) \\ p_\pi(1, 1 | 1, 2) & p_\pi(1, 2 | 1, 2) & \dots & p_\pi(n, n_n | 1, 2) \\ \vdots & \vdots & \ddots & \vdots \\ p_\pi(1, 1 | n, n_n) & p_\pi(1, 2 | n, n_n) & \dots & p_\pi(n, n_n | n, n_n) \end{bmatrix} \begin{bmatrix} q_\pi(1, 1) \\ q_\pi(1, 2) \\ \vdots \\ q_\pi(n, n_n) \end{bmatrix} $$

More compactly:

$$ q_\pi = r_\pi + \gamma P_\pi q_\pi $$

The only difference compared to the equation for the state-value function is that the vectors and matrices are larger. Thus, we can define an expected Bellman operator in identical fashion (with $n$ denoting the number of state-action pairs rather than the number of states) and the proof will be identical (the rows of $P_\pi$ still contain only non-negative elements that sum to 1).

Bellman optimality equations

For brevity I will go directly into the compact matrix form of the Bellman optimality equation for the optimal state-value function:

$$ v_* = \max_\pi(r_\pi + \gamma P_\pi v_*) $$

We can define an (expected) Bellman optimality operator $\mathcal{T}^* : \mathbb{R}^n \to \mathbb{R}^n$ as:

$$ \mathcal{T}^*(v) = \max_\pi(r_\pi + \gamma P_\pi v) $$

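Consider any two vectors $v_1, v_2 \in \mathbb{R}^n$, and let $\pi_1^* = \text{argmax}_\pi(r_\pi + \gamma P_\pi v_1)$ and $\pi_2^* = \text{argmax}_\pi(r_\pi + \gamma P_\pi v_2)$. Here the maximum is taken elementwise, i.e. for each state separately; because a policy can choose its actions in each state independently, a single policy attains the maximum in every state at once. Then we have: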

$$ \begin{aligned} \mathcal{T}^*(v_1) &= \max_\pi(r_\pi + \gamma P_\pi v_1) = r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 \geq r_{\pi_2^*} + \gamma P_{\pi_2^*} v_1 \\ \mathcal{T}^*(v_2) &= \max_\pi(r_\pi + \gamma P_\pi v_2) = r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2 \geq r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2 \end{aligned} $$

and

$$ \begin{aligned} \mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) &= r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_2^*} + \gamma P_{\pi_2^*} v_2) \\ &\leq r_{\pi_1^*} + \gamma P_{\pi_1^*} v_1 - (r_{\pi_1^*} + \gamma P_{\pi_1^*} v_2) \\ &= \gamma P_{\pi_1^*} (v_1 - v_2) \end{aligned} $$

Similarly we have $\mathcal{T}^*(v_2) - \mathcal{T}^*(v_1) \leq \gamma P_{\pi_2^*} (v_2 - v_1)$, which implies that $\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) \geq \gamma P_{\pi_2^*} (v_1 - v_2)$, and thus:

$$ \gamma P_{\pi_2^*} (v_1 - v_2) \leq \mathcal{T}^*(v_1) - \mathcal{T}^*(v_2) \leq \gamma P_{\pi_1^*} (v_1 - v_2) $$

Letting $\max\{\cdot, \cdot \} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^n$ choose the largest values between two vectors elementwise, we have:

$$ \begin{aligned} |\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)| &\leq \max \{|\gamma P_{\pi_2^*} (v_1 - v_2)|, |\gamma P_{\pi_1^*} (v_1 - v_2)| \} \\ &\leq \gamma \max \{P_{\pi_2^*} |v_1 - v_2|, P_{\pi_1^*} |v_1 - v_2| \} \end{aligned} $$

Letting $\max(\cdot) : \mathbb{R}^n \to \mathbb{R}$ choose the largest element in a vector (as previously defined), we have:

$$ \begin{aligned} ||\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)||_\infty &= \max \big(|\mathcal{T}^*(v_1) - \mathcal{T}^*(v_2)| \big) \\ &\leq \gamma \max \big(\max \{P_{\pi_2^*} |v_1 - v_2|, P_{\pi_1^*} |v_1 - v_2| \} \big) \\ &\leq \gamma \max \big(|v_1 - v_2| \big) \\ &= \gamma ||v_1 - v_2||_\infty \end{aligned} $$

Thus $\mathcal{T}^*$ is a contraction. Note that the last inequality is due to the rows of $P_{\pi_1^*}$ and $P_{\pi_2^*}$ containing only non-negative elements that sum to 1. As for Bellman equations, an almost identical proof can be written for action-values with the only difference being that we consider state-action pairs rather than states.
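As with the Bellman operator, the claim can be checked numerically. Below is a minimal sketch on a randomly generated MDP (the arrays P and R and all function names are hypothetical). It uses the fact that the elementwise maximum over policies equals a per-state maximum over actions, then runs value iteration from two very different starting vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 6, 3, 0.9

# Hypothetical random MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions))


def bellman_optimality_operator(v):
    """T*(v)(s) = max_a [R(s, a) + gamma * sum_s' P(s' | s, a) v(s')]."""
    return np.max(R + gamma * P @ v, axis=1)


def max_norm(x):
    return np.max(np.abs(x))


# Check the contraction inequality on random pairs (small slack for rounding).
for _ in range(1000):
    v1 = rng.normal(size=n_states) * 10
    v2 = rng.normal(size=n_states) * 10
    lhs = max_norm(bellman_optimality_operator(v1) - bellman_optimality_operator(v2))
    assert lhs <= gamma * max_norm(v1 - v2) + 1e-9


def value_iteration(v, sweeps=500):
    for _ in range(sweeps):
        v = bellman_optimality_operator(v)
    return v


# Two very different starting points should reach the same fixed point.
v_a = value_iteration(np.zeros(n_states))
v_b = value_iteration(rng.normal(size=n_states) * 100)
print("distance between the two results:", max_norm(v_a - v_b))
```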

Conclusion

We’re now ready to answer the questions from the introduction.

How do we know that the Bellman equations and Bellman optimality equations have unique solutions?

We can define the right hand side of the equations as operators and show that they are contractions, which means that they have unique fixed-points that can be solved for by starting with an arbitrary point and iteratively applying the operator, due to the Banach fixed-point theorem. Any solution to one of the equations must be a fixed-point for its corresponding operator, thus these fixed-points are the unique solutions to the equations.

Why can’t iterative policy evaluation or policy/value iteration get stuck in local optima?

Iterative policy evaluation and value iteration simply apply the Bellman operator or the Bellman optimality operator, respectively, over and over. As explained above, these operators have unique fixed-points that are reached in this manner from any starting point. In other words, there are no local optima to get stuck in. However, note that convergence is only guaranteed in the limit and one might have to settle for approximations, depending on the complexity of the MDP. Policy iteration is not as straightforward, but it’s closely connected to value iteration, and I find it quite intuitive that once we know that value iteration converges to the optimal state-value/action-value function, policy iteration will as well. From Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition):

Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep. In general, the entire class of truncated policy iteration algorithms can be thought of as sequences of sweeps, some of which use policy evaluation updates and some of which use value iteration updates. Because the max operation in (4.10) is the only difference between these updates this just means that the max operation is added to some sweeps of policy evaluation. All of these algorithms converge to an optimal policy for discounted finite MDPs.

Truncated policy iteration is the general method, with value iteration at one end of the spectrum, where only one iteration of policy evaluation is done before updating the policy greedily, and policy iteration at the other end, where policy evaluation runs until convergence before updating the policy greedily. We know that value iteration has only one fixed-point, so we only have to consider what happens when we add more iterations of policy evaluation before doing one step of value iteration (applying the Bellman optimality operator). One way to think about it is that we simply update the policy using more accurate state-values or action-values, and it’s hard to see how this would interfere with convergence.

Another way to think about it is that each step yields a policy that is at least as good as the previous one, due to the policy improvement theorem (see chapters 3 and 4 in Reinforcement Learning: An Introduction by Sutton and Barto (2nd edition)). And since each iteration essentially ends with a step of value iteration, we know that the algorithm can only get stuck in one fixed-point - the same as in value iteration.

Beyond intuitive explanations, there are of course formal proofs. From chapter 4 in Mathematical Foundations of Reinforcement Learning by Zhao (the proof is in the book):

The idea of the proof is to show that policy iteration converge faster than value iteration. Since value iteration has been proven to be convergent, the convergence of policy iteration immediately follows.

Note that faster here does not mean faster compute time or fewer total number of iterations, since the policy evaluation step potentially requires an infinite number of iterations, but instead that policy iteration requires equally many or fewer policy updates before convergence.
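To see the two algorithms side by side, here is a minimal sketch that runs policy iteration (with exact policy evaluation through a linear solve) and value iteration on the same randomly generated MDP. The MDP, the stopping tolerance, and all names are assumptions made for illustration. It is not the proof from the book, only a numerical illustration of the statement.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 8, 4, 0.9

# Hypothetical random MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions))
states = np.arange(n_states)


def greedy_policy(v):
    """Policy improvement: pick the action with the best one-step lookahead."""
    return np.argmax(R + gamma * P @ v, axis=1)


def evaluate_policy(policy):
    """Exact policy evaluation: solve v = r_pi + gamma * P_pi v."""
    P_pi, r_pi = P[states, policy], R[states, policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Policy iteration: evaluate, improve, repeat until the policy is stable.
policy = np.zeros(n_states, dtype=int)
policy_updates = 0
for _ in range(1000):  # safety cap, not expected to be reached
    v_pi = evaluate_policy(policy)
    improved = greedy_policy(v_pi)
    if np.array_equal(improved, policy):
        break
    policy = improved
    policy_updates += 1

# Value iteration: apply the Bellman optimality operator until updates are tiny.
v = np.zeros(n_states)
sweeps = 0
while True:
    v_next = np.max(R + gamma * P @ v, axis=1)
    sweeps += 1
    if np.max(np.abs(v_next - v)) < 1e-10:
        break
    v = v_next

print("policy updates (policy iteration):", policy_updates)
print("sweeps (value iteration):", sweeps)
print("max difference in values:", np.max(np.abs(v_pi - v_next)))
```

Both runs should end at the same value function. Note that each policy evaluation here is a full linear solve, which is the "potentially infinite number of iterations" from the note above replaced by a direct method.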

That’s it for this post. I hope you enjoyed the read and please feel free to reach out to me at c@cfml.se with any questions or feedback!

Neural Style Transfer 2

Sat, 08 Aug 2020 00:00:00 +0000

Linear Regression Library Comparison

Thu, 16 Jul 2020 00:00:00 +0000

Image Classification using PyTorch

Sun, 29 Mar 2020 00:00:00 +0000

Deep Q-Networks

Thu, 02 Jan 2020 00:00:00 +0000

Introduction

Deep Q-Networks (DQN) are artificial neural networks (ANN) that utilize a variant of Q-learning in order to update the parameters of the network. DQN was introduced in Playing Atari with Deep Reinforcement Learning (Mnih et al., 2013).

The aim of this article is to describe the core concepts of DQN and how it can be applied to the OpenAI Gym CartPole problem. The source code can be found at https://github.com/CarlFredriksson/openai_gym_problems.

Approximate RL Methods

In my previous post Temporal Difference Learning, I described the basic tabular version of Q-learning. Tabular reinforcement learning (RL) methods are great for problems with a small state space, but are not suited for problems where the state space is large. This is because of the memory required to store the table of action-values and the time it would take to fill the table.

Many RL methods can be extended to be approximate instead of tabular by using function approximators, such as linear models or ANNs. This makes them better suited for problems with large state spaces. Tabular methods define the action-value estimates $Q(s,a)$ as a function of state $s$ and action $a$. Approximate methods define them as $Q(s,a,w)$, where $w$ denotes the model parameters of the function approximator, such as the weights and biases of a neural network. Instead of updating the action-values themselves, approximate methods update the model parameters $w$. This is usually done using some form of gradient descent.

Assume that we know the true $q(s,a)$ and want our estimate $Q(s,a,w)$ to move towards it. A natural loss function $\mathcal{L}(w)$ to facilitate this is the mean squared error (MSE)

$$ \mathcal{L}(w) = \frac{1}{2} \big[q(s,a) - Q(s,a,w) \big]^2 $$

Let $\alpha$ be the learning rate. We can now do gradient descent on $\mathcal{L}(w)$

$$ \begin{aligned} w_{t+1} &= w_t - \alpha \nabla \mathcal{L}(w) \\ &= w_t - \frac{1}{2} \alpha \nabla \big[q(s,a) - Q(s,a,w) \big]^2 \\ &= w_t + \alpha \big[q(s,a) - Q(s,a,w) \big] \nabla Q(s,a,w) \end{aligned} $$

Of course, if we knew the true $q(s,a)$, the problem would already be solved. Let us denote the target we want our action-value estimates to move towards with $Q_{target}$. Replacing $q(s,a)$ with $Q_{target}$ in the update rule above gives us

$$ w_{t+1} = w_t + \alpha \big[Q_{target} - Q(s,a,w) \big] \nabla Q(s,a,w) $$

Let us now define $Q_{target}$ to be the target in Q-learning

$$ Q_{target} = R + \gamma \operatorname{max}_a Q(S^\prime,a,w) $$

This gives us

$$ w_{t+1} = w_t + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(s,a,w) \big] \nabla Q(s,a,w) $$

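This is the update rule for approximate Q-learning. Note that $Q_{target}$ is treated as a constant here: its dependence on $w$ is ignored when taking the gradient, which is why methods like this are often called semi-gradient methods.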

Approximate version of Q-learning

Algorithm parameters: $\alpha \in (0,1]$, $\gamma \in (0,1]$, small $\epsilon > 0$
Initialize model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R$, $S^\prime$
        If $S^\prime$ is terminal:
            $w \leftarrow w + \alpha \big[R - Q(S,A,w) \big] \nabla Q(S,A,w)$
        Else:
            $w \leftarrow w + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(S,A,w) \big] \nabla Q(S,A,w)$
        $S \leftarrow S^\prime$
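The update inside the inner loop can be sketched with a linear function approximator, for which $\nabla Q(S,A,w)$ is simply the feature vector. The feature map below (the state vector copied into the slot of the chosen action) and all function names are assumptions made for illustration, not part of the algorithm itself.

```python
import numpy as np


def features(state, action, n_actions):
    """Hypothetical feature map: the state vector placed in the slot of the chosen action."""
    phi = np.zeros(n_actions * state.size)
    phi[action * state.size:(action + 1) * state.size] = state
    return phi


def q_value(state, action, w, n_actions):
    # Linear model Q(s, a, w) = w . phi(s, a), so the gradient w.r.t. w is phi(s, a).
    return w @ features(state, action, n_actions)


def q_learning_update(w, s, a, r, s_next, terminal, alpha, gamma, n_actions):
    """One approximate Q-learning step; the target is treated as a constant."""
    target = r
    if not terminal:
        target += gamma * max(q_value(s_next, b, w, n_actions) for b in range(n_actions))
    td_error = target - q_value(s, a, w, n_actions)
    return w + alpha * td_error * features(s, a, n_actions)


n_actions = 2
w = np.zeros(4)
s0, s1 = np.array([0.0, 1.0]), np.array([1.0, 0.0])

# Terminal transition: the target is just the reward.
w = q_learning_update(w, s1, 1, 1.0, s1, True, alpha=0.5, gamma=0.9, n_actions=n_actions)
# Non-terminal transition: the target bootstraps from max_a Q(S', a, w).
w = q_learning_update(w, s0, 0, 0.0, s1, False, alpha=0.5, gamma=0.9, n_actions=n_actions)
print(w)
```

With a neural network the structure is the same; only the gradient computation is delegated to the framework.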

Experience Replay

DQN uses the same update rule as the one described in approximate Q-learning above. The difference is that $S$, $A$, $R$, and $S^\prime$ are added to a memory instead of immediately used for parameter updating. The parameters are updated using randomly sampled batches of $S A R S^\prime$ experiences from memory. This is called experience replay.

Experience replay has two major advantages. It makes more efficient use of previous experiences, since they are used for learning multiple times. It also has better convergence behavior when training a function approximator. One reason for this is that the experiences in a single trajectory are often correlated, but randomly sampling from the memory makes the data more like independent and identically distributed (IID) data. This is important since most supervised learning convergence proofs assume IID data.

One disadvantage with experience replay is that it is harder to use with multi-step learning algorithms, such as $Q(\lambda)$.
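A replay memory can be sketched in a few lines with a bounded deque and uniform random sampling. The class name ReplayMemory and its methods are hypothetical, chosen for this illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity memory of (S, A, R, S', done) transitions with uniform sampling."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, state_next, done):
        self.buffer.append((state, action, reward, state_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up within-trajectory correlation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.add(state=step, action=0, reward=1.0, state_next=step + 1, done=False)

batch = memory.sample(2)
print(len(memory), batch)
```

The capacity trades memory for diversity: a small memory keeps only recent (and therefore more correlated) experience, while a large one keeps transitions generated by much older versions of the policy.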

DQN

Algorithm parameters: $\alpha, \gamma, \epsilon_{max}, \epsilon_{min}, \epsilon_{decay} \in (0,1]$, ${batch\_size} \geq 1$
Initialize ANN model $Q(S,A,w)$ and model parameters $w \in \mathbb{R}^d$
Initialize $\epsilon \leftarrow \epsilon_{max}$
Initialize memory
Loop for each episode:
    Initialize $S$
    Loop until $S$ is terminal:
        Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
        Take action $A$, observe $R$, $S^\prime$
        Add $(S,A,R,S^\prime)$ to memory
        Sample random batch from memory
        Loop for each $(S,A,R,S^\prime)$ in batch:
            If $S^\prime$ is terminal:
                $w \leftarrow w + \alpha \big[R - Q(S,A,w) \big] \nabla Q(S,A,w)$
            Else:
                $w \leftarrow w + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a,w) - Q(S,A,w) \big] \nabla Q(S,A,w)$
        $\epsilon \leftarrow \operatorname{max}(\epsilon_{min}, \epsilon \times \epsilon_{decay})$
        $S \leftarrow S^\prime$
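It helps to know how quickly the exploration schedule bottoms out. The sketch below simulates the decay rule $\epsilon \leftarrow \operatorname{max}(\epsilon_{min}, \epsilon \times \epsilon_{decay})$ and compares the result with the closed form $\lceil \ln(\epsilon_{min} / \epsilon_{max}) / \ln(\epsilon_{decay}) \rceil$. The constants are the ones from the CartPole code in this post; the function name is hypothetical.

```python
import math

EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995


def steps_until_floor(eps_max, eps_min, decay):
    """Count how many decay steps it takes for epsilon to reach its minimum."""
    eps, steps = eps_max, 0
    while eps > eps_min:
        eps = max(eps_min, eps * decay)
        steps += 1
    return steps


simulated = steps_until_floor(EXPLORATION_MAX, EXPLORATION_MIN, EXPLORATION_DECAY)
closed_form = math.ceil(
    math.log(EXPLORATION_MIN / EXPLORATION_MAX) / math.log(EXPLORATION_DECAY)
)
print(simulated, closed_form)
```

Since the decay is applied once per environment step rather than once per episode, the agent acts almost greedily after fewer than a thousand steps, which is worth keeping in mind when tuning $\epsilon_{decay}$.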

OpenAI CartPole Problem

OpenAI Gym has many problems to try algorithms on. Let us apply DQN to the simple CartPole problem. Description from https://gym.openai.com/envs/CartPole-v1/ :

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

import random
import gym
import numpy as np
import matplotlib.pyplot as plt
from collections import deque
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

ENV_NAME = "CartPole-v1"
NUM_EPISODES = 100
MEMORY_SIZE = NUM_EPISODES * 1000
BATCH_SIZE = 20
GAMMA = 0.95
LEARNING_RATE = 0.001
EXPLORATION_MAX = 1.0
EXPLORATION_MIN = 0.01
EXPLORATION_DECAY = 0.995

class DQNAgent():
    def __init__(self, state_space_size, action_space_size):
        self.action_space_size = action_space_size
        self.exploration_rate = EXPLORATION_MAX
        self.memory = deque(maxlen=MEMORY_SIZE)

        self.model = Sequential()
        self.model.add(Dense(24, input_shape=(state_space_size,), activation="relu"))
        self.model.add(Dense(24, activation="relu"))
        self.model.add(Dense(action_space_size, activation="linear"))
        # Newer Keras versions expect `learning_rate`; `lr` is the deprecated spelling
        self.model.compile(loss="mse", optimizer=Adam(learning_rate=LEARNING_RATE))

    def select_action(self, state):
        # Select action epsilon-greedily
        if np.random.rand() < self.exploration_rate:
            return random.randrange(self.action_space_size)
        Q_values = self.model.predict(state)
        return np.argmax(Q_values[0])

    def remember(self, state, action, reward, state_next, done):
        self.memory.append((state, action, reward, state_next, done))

    def experience_replay(self):
        if len(self.memory) < BATCH_SIZE:
            return

        # Update model parameters using random samples from memory
        batch = random.sample(self.memory, BATCH_SIZE)
        for state, action, reward, state_next, done in batch:
            Q_update = reward if done else (reward + GAMMA * np.amax(self.model.predict(state_next)[0]))
            Q_values = self.model.predict(state)
            Q_values[0][action] = Q_update
            self.model.fit(state, Q_values, verbose=0)

    def update_exploration_rate(self):
        self.exploration_rate *= EXPLORATION_DECAY
        self.exploration_rate = max(EXPLORATION_MIN, self.exploration_rate)

def cartpole():
    # Create environment and agent
    env = gym.make(ENV_NAME)
    state_space_size = env.observation_space.shape[0]
    action_space_size = env.action_space.n
    dqn_agent = DQNAgent(state_space_size, action_space_size)

    # Simulate episodes
    scores = np.zeros(NUM_EPISODES)
    for episode in range(NUM_EPISODES):
        state = env.reset()
        state = np.expand_dims(state, axis=0)
        done = False
        while not done:
            #env.render()
            action = dqn_agent.select_action(state)
            state_next, reward, done, info = env.step(action)
            scores[episode] += reward
            state_next = np.expand_dims(state_next, axis=0)
            dqn_agent.remember(state, action, reward, state_next, done)
            dqn_agent.experience_replay()
            dqn_agent.update_exploration_rate()
            state = state_next
        print("Episode: {}, Score: {}".format(str(episode), str(scores[episode])))

    # Plot score
    plt.plot(np.arange(NUM_EPISODES), scores)
    plt.xlabel("Episode")
    plt.ylabel("Score")
    plt.show()

if __name__ == "__main__":
    cartpole()
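The experience_replay method above calls predict and fit once per sampled transition, which is simple but slow. As a possible alternative, the targets for a whole batch can be built with array operations. The sketch below only shows the target construction on plain NumPy arrays; build_targets is a hypothetical helper, not part of the repository, and I am assuming q_current and q_next would come from one batched model.predict call each.

```python
import numpy as np


def build_targets(q_current, q_next, actions, rewards, dones, gamma):
    """Return a copy of q_current where the taken actions are replaced by Q-learning targets."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    # Terminal transitions use the reward only, so the bootstrap term is masked out.
    not_done = 1.0 - np.asarray(dones, dtype=float)
    td_targets = rewards + gamma * not_done * np.max(q_next, axis=1)
    targets = np.array(q_current, dtype=float)  # copy, leave the input untouched
    targets[np.arange(len(actions)), actions] = td_targets
    return targets


# Tiny hand-made batch of two transitions with two actions each.
q_current = np.array([[1.0, 2.0], [3.0, 4.0]])
q_next = np.array([[0.5, 1.5], [2.0, 1.0]])
targets = build_targets(
    q_current, q_next, actions=[0, 1], rewards=[1.0, 1.0], dones=[False, True], gamma=0.5
)
print(targets)
```

The resulting array could then be passed to a single fit call for the whole batch. Note that this is not equivalent to the loop above: one gradient step on a batch of 20 differs from 20 sequential single-sample steps, so the learning rate may need adjusting.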

Conclusion

Thank you for reading and feel free to send me any questions.

ANN without ML Framework

Sun, 15 Dec 2019 00:00:00 +0000

Introduction

The aim of this article is to show how you can create an artificial neural network (ANN) without using a machine learning (ML) framework such as TensorFlow or PyTorch. This is not practical for most projects, but it’s a great exercise for building intuition about neural networks. The implemented network will be a simple feedforward neural network with a few customizable hyperparameters such as the number of hidden layers and their sizes. The network will be used to solve a simple regression problem.

The source code can be found at https://github.com/CarlFredriksson/ann_without_ml_framework.

Initialize Parameters

The only import we will need to create the network is

import numpy as np

Let’s start by creating a class for our network.

class SimpleNeuralNetwork:

Inside this class we are going to write some methods. Let’s start with the __init__ method that initializes the weights and biases of the network. The weights are initialized to normally distributed random numbers and the biases are initialized to zero.

def __init__(self, input_layer_size, hidden_layer_sizes, output_layer_size):
    self.W = []
    self.b = []
    if len(hidden_layer_sizes) == 0:
        self.add_layer(input_layer_size, output_layer_size)
    elif len(hidden_layer_sizes) >= 1:
        self.add_layer(input_layer_size, hidden_layer_sizes[0])
        for i in range(1, len(hidden_layer_sizes)):
            self.add_layer(hidden_layer_sizes[i-1], hidden_layer_sizes[i])
        self.add_layer(hidden_layer_sizes[-1], output_layer_size)

def add_layer(self, previous_layer_size, layer_size):
    self.W.append(np.random.randn(previous_layer_size, layer_size) * 0.01)
    self.b.append(np.zeros((1, layer_size)))

Forward Propagation

We have initialized the necessary member variables to write a forward propagation method. Let $m$ denote the batch size, $n$ the input layer size (layer 0), and $o$ the output layer size. Let $X \in \mathbb{R}^{m \times n}$ denote a batch of input vectors and $Y \in \mathbb{R}^{m \times o}$ their corresponding target output vectors. Let $W^{[l]} \in \mathbb{R}^{\text{size of }l-1 \times \text{size of }l}$ denote the weights, $b^{[l]} \in \mathbb{R}^{1 \times \text{size of }l}$ the biases, and $A^{[l]} \in \mathbb{R}^{m \times \text{size of }l}$ the activations of layer $l \in \{1,2,\ldots,L\}$ with layer $l = L$ being the output layer. All layers will use the ReLU activation function except the output layer, which will be linear since we will be using the network on a regression problem. More formally

$$ \begin{aligned} A^{[0]} &= X \\ Z^{[l]} &= A^{[l-1]} W^{[l]} + b^{[l]} \qquad \forall l \in \{1,2,\ldots,L\} \\ A^{[l]} &= \max(0, Z^{[l]}) \qquad \forall l \in \{1,2,\ldots,L-1\} \\ A^{[L]} &= Z^{[L]} \end{aligned} $$

Let $\mathcal{L}$ denote the loss. The loss function we will use is mean squared error (MSE). More formally

$$ \mathcal{L} = \frac{1}{2 m} \sum_{i=1}^m \sum_{j=1}^{o} \big[Y_{i,j} - A^{[L]}_{i,j} \big]^2 $$

We could divide by $o$ instead of 2 to make it more of a true mean, but dividing by 2 makes the gradient cleaner. Note that we store the computed $Z$ and $A$ values in order to use them for computing the gradient by backpropagation.

def propagate_forward(self, X, Y):
    batch_size = np.shape(X)[0]
    Z_cache = [None] * (len(self.W))
    A_cache = [None] * (len(self.W))

    A_prev = X
    for i in range(len(self.W)):
        Z = np.dot(A_prev, self.W[i]) + self.b[i]
        A = Z
        # If not at the output layer, apply relu
        if i < (len(self.W) - 1):
            A = np.maximum(0, Z)
        Z_cache[i] = Z
        A_cache[i] = A
        A_prev = A

    loss = 1/(2*batch_size) * np.sum((Y - A_cache[-1])**2)

    return loss, Z_cache, A_cache

Backpropagation

We will use backpropagation to efficiently compute the gradient of the loss function. Backpropagation might seem mysterious at first, but the idea is simply to use the chain rule of derivatives and the stored $Z$ and $A$ values for efficiency. We can work our way backwards

$$ \begin{aligned} \frac{\partial \mathcal{L}}{\partial A^{[L]}} &= \frac{1}{m} \big(A^{[L]} - Y \big) \\ \frac{\partial \mathcal{L}}{\partial Z^{[L]}} &= \frac{\partial \mathcal{L}}{\partial A^{[L]}} \frac{\partial A^{[L]}}{\partial Z^{[L]}} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \qquad \text{(since the output layer is linear)} \\ \frac{\partial \mathcal{L}}{\partial W^{[L]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial W^{[L]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} A^{[L-1]} \\ \frac{\partial \mathcal{L}}{\partial b^{[L]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial b^{[L]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \\ \frac{\partial \mathcal{L}}{\partial A^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial A^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} W^{[L]} \\ \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial A^{[L-1]}} \frac{\partial A^{[L-1]}}{\partial Z^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial A^{[L-1]}} \frac{\partial \max(0,Z^{[L-1]})}{\partial Z^{[L-1]}} \\ \frac{\partial \mathcal{L}}{\partial W^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \frac{\partial Z^{[L-1]}}{\partial W^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} A^{[L-2]} \\ \frac{\partial \mathcal{L}}{\partial b^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \frac{\partial Z^{[L-1]}}{\partial b^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \\ &\vdots \end{aligned} $$

A pattern emerges and the computation of partial derivatives can be implemented iteratively. Some of the matrix multiplications above might not have matching dimensions but I find it more productive to focus on those details when implementing, since we probably want to change the order of some computations to make it more efficient anyway (for example by not averaging over examples immediately). This is not the only possible implementation and I encourage you to explore alternatives.

def propagate_backward(self, X, Y, Z_cache, A_cache):
    batch_size = np.shape(X)[0]
    dW = [None] * (len(self.W))
    db = [None] * (len(self.b))

    # dA is short for dL/dA etc.
    dA = A_cache[-1] - Y
    dZ = dA
    for i in reversed(range(1, len(A_cache))):
        A = A_cache[i-1]
        dW[i] = (1/batch_size) * np.dot(A.T, dZ)
        db[i] = (1/batch_size) * np.sum(dZ, axis=0)
        dA = np.dot(dZ, self.W[i].T)
        dZ = np.multiply(dA, self.relu_derivative(Z_cache[i-1]))
    dW[0] = (1/batch_size) * np.dot(X.T, dZ)
    db[0] = (1/batch_size) * np.sum(dZ, axis=0)

    return dW, db

def relu_derivative(self, x):
    return (x > 0) * 1
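A common way to gain confidence in a hand-written backpropagation is a numerical gradient check. The following is a minimal, self-contained sketch for a network with one hidden ReLU layer and a linear output, using the same loss and the same backward formulas as above. It deliberately re-implements the forward and backward pass as small standalone functions (all names are hypothetical) instead of using the SimpleNeuralNetwork class.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # batch of 8 inputs with 3 features
Y = rng.normal(size=(8, 2))   # 2 regression targets per input
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 2)), np.zeros((1, 2))


def loss_fn():
    """Forward pass and loss 1/(2m) * sum((Y - A2)^2) for the current parameters."""
    A1 = np.maximum(0, X @ W1 + b1)
    A2 = A1 @ W2 + b2
    return np.sum((Y - A2) ** 2) / (2 * X.shape[0])


def analytic_gradients():
    """Backpropagation for the two-layer network, averaging over the batch at the end."""
    m = X.shape[0]
    Z1 = X @ W1 + b1
    A1 = np.maximum(0, Z1)
    dZ2 = (A1 @ W2 + b2) - Y
    dW2 = A1.T @ dZ2 / m
    db2 = dZ2.sum(axis=0, keepdims=True) / m
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)
    dW1 = X.T @ dZ1 / m
    db1 = dZ1.sum(axis=0, keepdims=True) / m
    return dW1, db1, dW2, db2


def numerical_gradient(param, eps=1e-6):
    """Central differences, perturbing one entry of param in place at a time."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        original = param[idx]
        param[idx] = original + eps
        loss_plus = loss_fn()
        param[idx] = original - eps
        loss_minus = loss_fn()
        param[idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad


max_errors = [
    np.max(np.abs(analytic - numerical_gradient(param)))
    for analytic, param in zip(analytic_gradients(), (W1, b1, W2, b2))
]
print(max_errors)
```

If the two gradients disagree by more than a small tolerance, the usual suspects are a transposed matrix, a missing division by the batch size, or a ReLU mask taken from the wrong layer.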

Training Step

During training we will alternate forward propagation and backpropagation. Let’s create a method that combines them for convenience.

def propagate(self, X, Y):
    loss, Z_cache, A_cache = self.propagate_forward(X, Y)
    dW, db = self.propagate_backward(X, Y, Z_cache, A_cache)
    return loss, A_cache[-1], dW, db

A single training step consists of one combined propagation and one gradient descent step.

loss, predictions, dW, db = neural_network.propagate(X_train, Y_train)
for j in range(len(neural_network.W)):
    neural_network.W[j] -= learning_rate * dW[j]
    neural_network.b[j] -= learning_rate * db[j]

Regression Problem

Let’s generate some quadratic data with random noise added.

def generate_random_data():
    X = np.expand_dims(np.linspace(-3, 3, num=200), axis=1)
    Y = X**2 - 2
    noise = np.random.normal(0, 2, size=X.shape)
    Y = Y + noise
    return X, Y

We can now use our SimpleNeuralNetwork class to create a model that can fit the generated data.

if __name__ == "__main__":
    X, Y = utils.generate_random_data()
    utils.plot_data(X, Y, "output/data.png")

    neural_network = SimpleNeuralNetwork(1, [10, 10], 1)
    learning_rate = 0.01
    num_iterations = 10000
    for i in range(num_iterations):
        loss, predictions, dW, db = neural_network.propagate(X, Y)
        for j in range(len(neural_network.W)):
            neural_network.W[j] -= learning_rate * dW[j]
            neural_network.b[j] -= learning_rate * db[j]
        if i % 1000 == 0:
            print("loss:", loss)

    loss, Z_cache, A_cache = neural_network.propagate_forward(X, Y)
    utils.plot_results(X, Y, A_cache[-1], "output/results.png")

Conclusion

I hope you found this article helpful. Thank you for reading and feel free to send me any questions.

Temporal Difference Learning

Sun, 10 Nov 2019 00:00:00 +0000

Introduction

“If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning” (Reinforcement Learning: An Introduction, Sutton, Barto, 2018).

TD learning combines ideas from Monte Carlo Methods (MC methods) and Dynamic Programming (DP). Like DP and MC methods, TD methods are a form of generalized policy iteration (GPI), which means that they alternate policy evaluation (estimation of value functions) and policy improvement (using value estimates to improve a policy). Like MC methods, TD methods don’t need to know the underlying dynamics of the model and instead learn from experience. Unlike MC methods but like DP, TD methods use bootstrapping, which means that value estimates are updated using other value estimates.

Unlike MC methods, which update value estimates and the target policy after each episode, TD methods update estimates and the target policy after each time step. This makes TD methods work on continuing tasks as well as episodic tasks. On episodic tasks, where MC methods can work as well, TD methods usually converge faster.

The aim of this article is to be a short introduction to TD learning. To read more about this subject, I recommend (Sutton, Barto, 2018). Later in the article there is going to be an implemented example of a couple of TD methods that solves a gridworld problem. The source code can be found at https://github.com/CarlFredriksson/temporal_difference_learning.

Sarsa

Sarsa is an on-policy TD method that uses action-value estimates $Q_\pi(s,a)$ to improve the target policy $\pi$. To ensure that the agent explores sufficiently, the policy $\pi$ has to be soft, which means that every action in every state has a probability greater than zero of being selected. A commonly used type of soft policy is the $\epsilon$-greedy policy, which selects the action that maximizes $Q_\pi(s,a)$ with probability $1-\epsilon$ and a random action with probability $\epsilon$.

In Sarsa, the action-value estimates are improved using the update rule

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \big[R_{t+1} + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t) \big] $$

The name Sarsa comes from the fact that $S_t,A_t,R_{t+1},S_{t+1},A_{t+1}$ are used in the update rule, which spells out SARSA if the subscripts are dropped.

Pseudocode for Sarsa with the target policy being any soft policy derived from $Q$ (e.g. $\epsilon$-greedy) is given below.

Sarsa

Algorithm parameters: $\alpha \in (0,1]$, $\gamma \in (0,1]$, small $\epsilon > 0$
Initialize $Q(s,a)$, for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal},\cdot) \leftarrow 0$
Loop for each episode:

Initialize $S$
Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
Loop until $S$ is terminal:

Take action $A$, observe $R$, $S^\prime$
Choose $A^\prime \in \mathcal{A}(S^\prime)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
$Q(S,A) \leftarrow Q(S,A) + \alpha \big[R + \gamma Q(S^\prime,A^\prime) - Q(S,A) \big]$
$S \leftarrow S^\prime$
$A \leftarrow A^\prime$

Q-learning

Q-learning is an off-policy TD method. Thus, it uses two different policies, the target policy $\pi$ and the behavior policy $b$. The target policy is used for value estimation and is the target for improvement, while the behavior policy is used for action selection. The behavior policy has to be soft in order to maintain sufficient exploration. The target policy can be completely greedy.

In Q-learning, the action-value estimates are improved using the update rule

$$ Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \big[R_{t+1} + \gamma \underset{a}{\operatorname{max}} Q(S_{t+1},a) - Q(S_t,A_t) \big] $$

The name Q-learning simply comes from the use of action value estimates $Q$.

Pseudocode for Q-learning with the target policy being the greedy policy derived from $Q$ and the behavior policy being any soft policy derived from $Q$ (e.g. $\epsilon$-greedy) is given below.

Q-learning

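Algorithm parameters: $\alpha \in (0,1]$, $\gamma \in (0,1]$, small $\epsilon > 0$
Initialize $Q(s,a)$, for all $s \in \mathcal{S}^+$, $a \in \mathcal{A}(s)$, arbitrarily except that $Q(\text{terminal},\cdot) \leftarrow 0$
Loop for each episode:

Initialize $S$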
Loop until $S$ is terminal:

Choose $A \in \mathcal{A}(S)$ using a soft policy derived from $Q$ (e.g. $\epsilon$-greedy)
Take action $A$, observe $R$, $S^\prime$
$Q(S,A) \leftarrow Q(S,A) + \alpha \big[R + \gamma \operatorname{max}_a Q(S^\prime,a) - Q(S,A) \big]$
$S \leftarrow S^\prime$
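
The Sarsa and Q-learning update rules differ only in the bootstrap target. As a rough illustration, the sketch below applies both updates to the same transition. The Q-values, reward, step size, and state and action names are made up for the example and are not taken from the gridworld problem.

```python
# Hypothetical action-value estimates for two states and two actions.
Q = {
    "s": {"left": 0.0, "right": 1.0},
    "s_next": {"left": 2.0, "right": 4.0},
}
alpha, gamma = 0.5, 1.0

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy target: uses the action actually selected in s_next."""
    target = r + gamma * Q[s_next][a_next]
    return Q[s][a] + alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy target: uses the greedy (maximum) value in s_next."""
    target = r + gamma * max(Q[s_next].values())
    return Q[s][a] + alpha * (target - Q[s][a])

# Same transition, where the behavior policy happened to explore with "left".
sarsa_value = sarsa_update(Q, "s", "right", -1.0, "s_next", "left")
q_learning_value = q_learning_update(Q, "s", "right", -1.0, "s_next")
print(sarsa_value)       # 1.0
print(q_learning_value)  # 2.0
```

Sarsa evaluates the policy that is actually followed, exploration included, while Q-learning evaluates the greedy policy regardless of which action is taken next.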

Windy Gridworld Example

We are going to use Example 6.5: Windy Gridworld (Sutton, Barto, 2018) as a problem for showcasing Python implementations of Sarsa and Q-learning. It is a standard gridworld problem with the addition of wind that moves the agent upward a different number of cells depending on which column the agent is in after taking an action. The state is the cell position of the agent and the available actions are UP, DOWN, LEFT, RIGHT in every state. The actions deterministically move the agent in their respective directions before wind is applied. If an action or the wind would move the agent outside of the gridworld, the agent is kept at the nearest cell inside the grid. The task is episodic and undiscounted with a reward of -1 given after every action, except when the goal state G is reached. The gridworld is visualized in the following image, with the blue line representing the optimal trajectory when starting from state S.

Note that MC methods cannot easily be used for this problem since not all policies are guaranteed to end up in a terminal state (the goal state). This can lead to infinite episodes if the number of steps isn’t limited. TD methods don’t have the same problem since they learn during episodes and will eventually end up in the goal state. Below is an implementation of the gridworld environment agents can interact with by calling the take_action function.

import numpy as np
import matplotlib.pyplot as plt
from enum import Enum

class Action(Enum):
    UP = 0
    DOWN = 1
    LEFT = 2
    RIGHT = 3

class Environment:
    def __init__(self, starting_state):
        self.wind = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        self.goal = (7, 3)
        self.state = starting_state

    def adjust_position(self):
        # X position
        if self.state[0] < 0:
            self.state = (0, self.state[1])
        elif self.state[0] > 9:
            self.state = (9, self.state[1])

        # Y Position
        if self.state[1] < 0:
            self.state = (self.state[0], 0)
        elif self.state[1] > 6:
            self.state = (self.state[0], 6)

    def take_action(self, action):
        # Move agent from action
        if action == Action.UP:
            self.state = (self.state[0], self.state[1] + 1)
        elif action == Action.DOWN:
            self.state = (self.state[0], self.state[1] - 1)
        elif action == Action.LEFT:
            self.state = (self.state[0] - 1, self.state[1])
        elif action == Action.RIGHT:
            self.state = (self.state[0] + 1, self.state[1])
        else:
            print("ERROR: INVALID ACTION SELECTED!")
        self.adjust_position()

        # Move agent from wind
        if self.state != self.goal:
            self.state = (self.state[0], self.state[1] + self.wind[self.state[0]])
            self.adjust_position()

        if self.state == self.goal:
            return 0
        return -1

Implementations of a Sarsa agent and a Q-learning agent:

class SarsaAgent:
    def __init__(self, alpha, epsilon):
        self.alpha = alpha
        self.epsilon = epsilon
        self.Q = np.zeros((10, 7, 4))
        self.a = 0

    def step(self, environment):
        # Take action
        s = environment.state
        r = environment.take_action(Action(self.a))
        s_prime = environment.state

        # Select next action and update Q
        a_prime = epsilon_greedy(self.epsilon, self.Q, s_prime)
        self.Q[s[0], s[1], self.a] += self.alpha * (r + self.Q[s_prime[0], s_prime[1], a_prime] - self.Q[s[0], s[1], self.a])
        self.a = a_prime

        # Check if goal has been reached
        if r == 0:
            return True
        return False

class QLearningAgent:
    def __init__(self, alpha, epsilon):
        self.alpha = alpha
        self.epsilon = epsilon
        self.Q = np.zeros((10, 7, 4))

    def step(self, environment):
        # Select action
        s = environment.state
        a = epsilon_greedy(self.epsilon, self.Q, s)

        # Take action
        r = environment.take_action(Action(a))
        s_prime = environment.state

        # Update Q
        self.Q[s[0], s[1], a] += self.alpha * (r + np.max(self.Q[s_prime[0], s_prime[1], :]) - self.Q[s[0], s[1], a])

        # Check if goal has been reached
        if r == 0:
            return True
        return False

def epsilon_greedy(epsilon, Q, s):
    a = np.argmax(Q[s[0], s[1], :])
    if np.random.rand() < epsilon:
        a = np.random.randint(0, 4)
    return a

Some code for running the agents and plotting the results:

if __name__ == "__main__":
    NUM_EPISODES = 200
    MAX_NUM_STEPS = 1000
    ALPHA = 0.5
    EPSILON = 0.1
    STARTING_STATE = (0, 3)

    agents = [
        SarsaAgent(ALPHA, EPSILON),
        QLearningAgent(ALPHA, EPSILON)
    ]
    num_steps = np.zeros((len(agents), NUM_EPISODES))

    # Agents learning
    for agent_index, agent in enumerate(agents):
        for episode in range(NUM_EPISODES):
            environment = Environment(STARTING_STATE)
            for step in range(MAX_NUM_STEPS):
                at_goal = agent.step(environment)
                # Record the number of steps taken so far in this episode
                num_steps[agent_index, episode] = step + 1
                if at_goal:
                    break

    # Plot results
    plt.plot(range(NUM_EPISODES), num_steps[0, :], label="Sarsa")
    plt.plot(range(NUM_EPISODES), num_steps[1, :], label="Q-learning")
    plt.legend()
    plt.xlabel("Episode")
    plt.ylabel("Number of steps to reach goal")
    plt.show()

Conclusion

We have described two of the most popular TD methods (and reinforcement learning methods in general), the on-policy method Sarsa and the off-policy method Q-learning. They both are one-step, model-free, and tabular TD methods. TD methods can be extended to n-step forms, which link them to MC methods, and to forms that include a model of the environment, which link them to DP. TD methods can also be extended to be non-tabular and instead use various forms of function approximation, for example by using artificial neural networks.

Thank you for reading and feel free to send me any questions.

Monte Carlo Methods

Sat, 02 Nov 2019 00:00:00 +0000

Introduction

Monte Carlo (MC) methods are a broad class of algorithms that use repeated random sampling to compute numerical results. The aim of this article is to be a short introduction to MC methods in reinforcement learning (RL). To read more about this subject, I recommend Reinforcement Learning: An Introduction (Sutton, Barto, 2018).

MC methods have three main advantages over Dynamic Programming (DP):

They can be used to learn optimal policies by interacting with the actual environment, without having any knowledge of the underlying Markov Decision Process (MDP).
They can be used with simulated environments that only need to generate sample transitions. Generating sample transitions is often much easier than completely specifying the dynamics function $p$ of the MDP model.
They can be focused on a small subset of states. States of special interest can be evaluated without the computational complexity of accurately evaluating all states.

Like DP, MC methods are a form of generalized policy iteration (GPI), which means that they alternate policy evaluation (estimation of value functions) and policy improvement (using value estimates to improve a policy). Unlike DP, MC methods do not use bootstrapping. In other words, value estimates are not computed using other value estimates.

We will be primarily interested in estimating the action-value function $q_\pi(s,a)$ since we want to be able to pick the actions that maximize expected return. We denote the estimates by $Q_\pi(s,a)$ and compute them by averaging sample returns in each state-action pair, similarly to contextual Multi-armed Bandits. The difference is that the underlying model is an MDP, which means that the states are interrelated.

This article is going to focus on MC methods for episodic tasks, that is, tasks where the agent eventually reaches a terminal state. After each episode is finished, the action-value estimates and the policy are updated. The action-value estimate of a state-action pair is updated such that it’s the return following the first visit, or every visit to the pair, averaged over all episodes. A visit to a state-action pair $(s,a)$ means that the agent was in state $s$ and took the action $a$. Both the first-visit method and the every-visit method converge to the actual action values as the number of visits to each state-action pair approaches infinity. As with DP, we don’t need perfect estimates and can truncate the value estimation before the policy is improved. Usually we will update the policy for a state immediately after an action-value has been updated in that state.
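
To make the difference between the first-visit and every-visit methods concrete, here is a minimal sketch that collects the sample returns for each state-action pair in a single episode. The episode, the state and action names, and the helper function are hypothetical and only meant as an illustration.

```python
from collections import defaultdict

def sample_returns(episode, gamma=1.0, first_visit=True):
    """Collect the sample returns following visits to each state-action pair.

    `episode` is a list of (state, action, reward) tuples, where the reward
    is the one received after taking the action.
    """
    # Compute the return G_t for every time step by working backwards.
    G = 0.0
    returns_at = []
    for _, _, reward in reversed(episode):
        G = gamma * G + reward
        returns_at.append(G)
    returns_at.reverse()

    collected = defaultdict(list)
    seen = set()
    for (state, action, _), G_t in zip(episode, returns_at):
        if first_visit and (state, action) in seen:
            continue  # only the first visit in the episode counts
        seen.add((state, action))
        collected[(state, action)].append(G_t)
    return dict(collected)

# Hypothetical episode in which the pair ("s1", "a") is visited twice.
episode = [("s1", "a", 1.0), ("s2", "b", 0.0), ("s1", "a", 2.0)]
first = sample_returns(episode, first_visit=True)
every = sample_returns(episode, first_visit=False)
print(first[("s1", "a")])  # [3.0]
print(every[("s1", "a")])  # [3.0, 2.0]
```

Averaging these lists over many episodes gives the first-visit and every-visit estimates, respectively.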

Later in the article there is going to be an implemented example of a Monte Carlo method that solves a simple version of blackjack. The source code can be found at https://github.com/CarlFredriksson/monte_carlo_methods.

On-policy Methods

The objective of all RL methods is to find policies that maximize expected return. To achieve this, agents have to balance exploitation of current knowledge (selecting actions that maximize $Q_\pi$) with exploration of other actions that have lower action-value estimates in order to learn better estimates. This is called the exploration versus exploitation tradeoff.

There are two approaches for ensuring exploration, on-policy and off-policy methods. On-policy methods use a single policy $\pi$ that is used both for selecting actions and for evaluation and improvement. Off-policy methods have two policies, one that is evaluated and improved and one that is used for action selection. In this section we are going to look at on-policy methods.

Exploring Starts

To be able to estimate the action-value of a state-action pair, the pair needs to be visited. One way to make sure that the agent visits all state-action pairs is to do exploring starts. This is done by starting the episodes in a random state and selecting a random starting action such that every state-action pair has a probability greater than zero of being selected. Exploring starts is useful for some problems but cannot be relied upon in general, particularly when learning from interaction with an actual environment.

Pseudocode for MC with exploring starts:

On-policy MC with exploring starts

Initialize:

$\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$N(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$

Loop for multiple episodes:

Select $(S_0,A_0)$ randomly such that all pairs have probability > 0
Generate an episode from $(S_0,A_0)$, following $\pi: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:

$G \leftarrow \gamma G + R_{t+1}$
If $(S_t,A_t) \notin \{(S_0,A_0),(S_1,A_1),\ldots,(S_{t-1},A_{t-1})\}$:

$N(S_t,A_t) \leftarrow N(S_t,A_t) + 1$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)} [G - Q(S_t,A_t)]$
$\pi(S_t) \leftarrow \operatorname{argmax}_a Q(S_t,a)$ (ties broken arbitrarily)

If you want discounting, set $0 < \gamma < 1$. If you don’t want discounting, which is often the case for episodic tasks, set $\gamma = 1$. The algorithm above uses the first-visit method. If you want to use the every-visit method instead, simply remove the if-statement that checks if a state-action pair has been visited in an earlier time step.
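
The update $Q \leftarrow Q + \frac{1}{N}[G - Q]$ in the pseudocode is an incremental way of computing the average of the sample returns, so the returns themselves never have to be stored. A small sketch with made-up returns shows that the incremental form agrees with the ordinary sample mean:

```python
def incremental_mean(returns):
    """Apply Q <- Q + (G - Q) / N for each return in turn."""
    Q, N = 0.0, 0
    for G in returns:
        N += 1
        Q += (G - Q) / N
    return Q

returns = [1.0, -1.0, 1.0, 1.0]  # made-up sample returns
batch_mean = sum(returns) / len(returns)
print(incremental_mean(returns), batch_mean)
```

Note that the first update uses $N = 1$ and therefore replaces the initial estimate with the first return, which is why the arbitrary initialization of $Q$ does not affect the average.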

Soft Policies

A common way to make sure that the agent explores is to use a soft policy. Soft policies satisfy $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. Over time, the policy is often shifted gradually closer to a deterministic optimal policy.

$\epsilon$-greedy policies are soft policies that select the action that maximizes $Q_\pi$ with probability $1 - \epsilon$, and a random action with probability $\epsilon$.

Pseudocode for on-policy MC with $\epsilon$-greedy policy:

On-policy MC with $\epsilon$-greedy policy

Initialize:

$\pi(a|s)$ to an arbitrary soft policy
$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$N(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$

Loop for multiple episodes:

Generate an episode following $\pi: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:

$G \leftarrow \gamma G + R_{t+1}$
If $(S_t,A_t) \notin \{(S_0,A_0),(S_1,A_1),\ldots,(S_{t-1},A_{t-1})\}$:

$N(S_t,A_t) \leftarrow N(S_t,A_t) + 1$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{1}{N(S_t,A_t)} [G - Q(S_t,A_t)]$
$A^* \leftarrow \operatorname{argmax}_a Q(S_t,a)$ (ties broken arbitrarily)
$\pi(A^*|S_t) \leftarrow 1 - \epsilon + \epsilon / |\mathcal{A}(S_t)|$
For all $a \neq A^* \in \mathcal{A}(S_t)$:

$\pi(a|S_t) \leftarrow \epsilon / |\mathcal{A}(S_t)|$
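
The policy improvement step at the end of the pseudocode can be sketched as a small function that returns the $\epsilon$-greedy action probabilities for one state. The function name and the example action values are assumptions made for illustration.

```python
def epsilon_greedy_probabilities(q_values, epsilon):
    """Return pi(a|s) for every action, given the estimates Q(s, a) in one state."""
    num_actions = len(q_values)
    # The first maximum wins ties.
    greedy_action = max(range(num_actions), key=lambda a: q_values[a])
    probabilities = [epsilon / num_actions] * num_actions
    probabilities[greedy_action] = 1 - epsilon + epsilon / num_actions
    return probabilities

probs = epsilon_greedy_probabilities([0.2, 1.5, -0.3, 0.0], epsilon=0.1)
print(probs)
```

These are the same probabilities you get by picking a uniformly random action with probability $\epsilon$ and the greedy action otherwise, since the random pick can also land on the greedy action.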

Off-policy Methods

Off-policy methods have two policies, the target policy $\pi$ and the behavior policy $b$. The policy that is evaluated and improved is $\pi$ while $b$ is used for action selection. $\pi$ is often a greedy deterministic policy while $b$ has to be a soft policy in order for the agent to explore sufficiently.

Importance Sampling

If action-value estimates are updated in the same way as on-policy methods while following $b$, they would be estimates of $q_b$ rather than the $q_\pi$ we are interested in. To correctly estimate $q_\pi$, we need to use importance sampling. When using importance sampling, returns are weighted by the relative probability of their trajectories occurring under the target and behavior policies. This relative probability is called the importance-sampling ratio.

Given a starting state-action pair $S_t,A_t$, the probability of the subsequent state-action trajectory, $S_{t+1},A_{t+1},S_{t+2},A_{t+2},\ldots,S_T$, occurring while following any policy $\pi$ is

$$ \begin{aligned} Pr&\{S_{t+1},A_{t+1},S_{t+2},A_{t+2},\ldots,S_T | S_t,A_t,A_{t+1:T-1} \sim \pi \} \\ &= p(S_{t+1} | S_t,A_t) \pi(A_{t+1} | S_{t+1}) p(S_{t+2} | S_{t+1},A_{t+1}) \pi(A_{t+2} | S_{t+2}) \ldots p(S_T | S_{T-1},A_{T-1}) \\ &= p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} \pi(A_k | S_k) p(S_{k+1} | S_k,A_k) \end{aligned} $$

where

$$ p(S_{t+1} | S_t,A_t) = \sum_r p(S_{t+1},R_{t+1} = r | S_t,A_t) $$

Thus, the importance-sampling ratio is

$$ \begin{aligned} \rho_{t:T-1} &= \frac{p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} \pi(A_k | S_k) p(S_{k+1} | S_k,A_k)}{p(S_{t+1} | S_t,A_t) \prod_{k=t+1}^{T-1} b(A_k | S_k) p(S_{k+1} | S_k,A_k)} \\ &= \frac{\prod_{k=t+1}^{T-1} \pi(A_k | S_k)}{\prod_{k=t+1}^{T-1} b(A_k | S_k)} \end{aligned} $$

The state-transition probabilities cancel and we are left with a definition that doesn’t require any knowledge about the MDP dynamics.
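
Since only the policy probabilities remain, the ratio is easy to compute from a recorded trajectory. The following is a minimal sketch with hypothetical policies stored as nested dictionaries. The state and action names are made up.

```python
def importance_sampling_ratio(trajectory, target_policy, behavior_policy):
    """Product of pi(A_k|S_k) / b(A_k|S_k) over the given (state, action) pairs."""
    ratio = 1.0
    for state, action in trajectory:
        ratio *= target_policy[state][action] / behavior_policy[state][action]
    return ratio

# Hypothetical policies over two states and two actions.
target_policy = {"s1": {"hit": 1.0, "stick": 0.0}, "s2": {"hit": 0.0, "stick": 1.0}}
behavior_policy = {"s1": {"hit": 0.5, "stick": 0.5}, "s2": {"hit": 0.5, "stick": 0.5}}

# Pairs (S_k, A_k) for k = t+1, ..., T-1. The starting pair (S_t, A_t) is given
# and therefore not part of the product.
matching = [("s1", "hit"), ("s2", "stick")]
deviating = [("s1", "hit"), ("s2", "hit")]
print(importance_sampling_ratio(matching, target_policy, behavior_policy))   # 4.0
print(importance_sampling_ratio(deviating, target_policy, behavior_policy))  # 0.0
```

A trajectory that the target policy would never produce gets a ratio of zero and therefore contributes nothing to the estimate.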

There are two types of importance sampling: ordinary importance sampling, which uses a simple average of the weighted returns, and weighted importance sampling, which uses a weighted average. Let the time step $t$ span over episodes in such a way that if the first episode is 100 steps long, the next episode will start at $t=101$. Let $\tau(s,a)$ be the set of time steps where the state-action pair $(s,a)$ was first visited if using the first-visit method and the set of time steps for all visits if using the every-visit method. Let $T(t)$ be the first termination time step after $t$. We can then estimate $q_\pi$ using ordinary importance sampling by

$$ Q(s,a) = \frac{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1} G_t}{|\tau(s,a)|} $$

and using weighted importance sampling by

$$ Q(s,a) = \frac{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \tau(s,a)} \rho_{t:T(t)-1}} $$

Ordinary importance sampling produces unbiased estimates but has larger, possibly infinite, variance. Weighted importance sampling produces biased estimates (the bias goes to zero as the number of samples grows) but has much lower variance, which is bounded when the returns are bounded. This often makes it preferred in practice.
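
The two estimators only differ in the denominator. A minimal sketch with made-up ratios and returns for a single state-action pair (the fallback value for a zero total weight is an assumption):

```python
def ordinary_importance_sampling(ratios, returns):
    """Simple average of the weighted returns."""
    return sum(rho * G for rho, G in zip(ratios, returns)) / len(returns)

def weighted_importance_sampling(ratios, returns):
    """Weighted average of the returns, with the ratios as weights."""
    total_weight = sum(ratios)
    if total_weight == 0:
        return 0.0  # assumed fallback when no sample has a nonzero weight
    return sum(rho * G for rho, G in zip(ratios, returns)) / total_weight

ratios = [4.0, 0.0, 2.0]
returns = [1.0, -1.0, -1.0]
ordinary = ordinary_importance_sampling(ratios, returns)  # (4 - 0 - 2) / 3
weighted = weighted_importance_sampling(ratios, returns)  # (4 - 2) / 6
print(ordinary, weighted)

# With a single sample the ordinary estimate can leave the range of possible returns.
print(ordinary_importance_sampling([4.0], [1.0]))  # 4.0
print(weighted_importance_sampling([4.0], [1.0]))  # 1.0
```

The single-sample case hints at where the difference in variance comes from: the weighted estimate is always a weighted average of observed returns, while the ordinary estimate is scaled by the raw ratios.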

Pseudocode for off-policy MC with weighted importance sampling:

Off-policy MC with weighted importance sampling

Initialize:

$Q(s,a) \in \mathbb{R}$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$C(s,a) \leftarrow 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
$\pi(s) \leftarrow \operatorname{argmax}_a Q(s,a)$ (ties broken consistently)

Loop for multiple episodes:

$b \leftarrow$ any soft policy
Generate an episode following $b: S_0,A_0,R_1,\ldots,S_{T-1},A_{T-1},R_T$
$G \leftarrow 0$
$W \leftarrow 1$
Loop for each step of the episode, $t=T-1,T-2,\ldots,0$:

$G \leftarrow \gamma G + R_{t+1}$
$C(S_t,A_t) \leftarrow C(S_t,A_t) + W$
$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \frac{W}{C(S_t,A_t)} [G - Q(S_t,A_t)]$
$\pi(S_t) \leftarrow \operatorname{argmax}_a Q(S_t,a)$ (ties broken consistently)
If $A_t \neq \pi(S_t)$: exit inner loop (proceed to next episode)
$W \leftarrow W \frac{1}{b(A_t|S_t)}$

The above algorithm uses the every-visit method. Note that $W$ and $C$ are used to implement weighted importance sampling incrementally.
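
As a rough sketch of the inner loop of the pseudocode, the function below processes one recorded episode backwards. The episode, the uniform behavior policy, and all names are hypothetical. The greedy target policy is recomputed from $Q$ instead of being stored.

```python
from collections import defaultdict

def update_from_episode(episode, Q, C, actions, behavior_prob, gamma=1.0):
    """Process one episode backwards with weighted importance sampling.

    episode: list of (state, action, reward) generated by the behavior policy.
    behavior_prob(action, state): probability b(action|state).
    """
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        C[(state, action)] += W
        Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
        # Greedy target policy; max returns the first maximum, so ties are broken consistently.
        greedy_action = max(actions, key=lambda a: Q[(state, a)])
        if action != greedy_action:
            break  # all earlier steps would get weight zero
        W /= behavior_prob(action, state)

def uniform_behavior(action, state):
    """Hypothetical behavior policy that picks each of two actions with probability 0.5."""
    return 0.5

Q = defaultdict(float)
C = defaultdict(float)
actions = ["hit", "stick"]
episode = [("s1", "hit", 0.0), ("s2", "stick", 1.0)]
update_from_episode(episode, Q, C, actions, uniform_behavior)
print(Q[("s1", "hit")], C[("s1", "hit")])  # 1.0 2.0
```

The last step of the episode is always used with weight 1, and each earlier step inherits the product of the ratios of the steps that follow it.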

Blackjack Example

The objective of blackjack is to obtain cards whose numerical sum is as big as possible without exceeding 21. All face cards count as 10 and aces count as 1 or 11. The game begins with two cards dealt to both dealer and player. The dealer has one of his cards face down. If the player has 21 immediately he wins, unless the dealer also has 21, in which case the game is a draw. If the player doesn’t have 21, he can choose between the actions hit and stick. If he chooses hit he draws a new card and if the sum is greater than 21 he goes bust and immediately loses. If it isn’t, he can once again choose between hit and stick. If he chooses stick, the action goes over to the dealer who plays the fixed strategy of sticking on any sum 17 or greater and hitting otherwise. If the dealer goes bust, the player wins. Otherwise, win, lose, or draw is determined by whose final sum is greater. This simplified version of blackjack has no splitting and cards are assumed to be picked from an infinite number of decks (or picked with replacement).

Playing blackjack is naturally formulated as a finite MDP where each game is an episode. Rewards are 0 everywhere except when reaching terminal states, where +1 is given for a win, 0 for a draw, and -1 for a loss. No discounting is done, thus the terminal rewards are also the returns. The state is a combination of the player sum, whether the player has a usable ace, and the dealer’s showing card. If the player sum is 11 or less it is always best to hit, thus the values for player sum we are interested in are 12-21. The player has a usable ace if he can count an ace as 11 without going bust. The dealer shows one card (ace-10). Thus there are 200 possible states. The available actions are hit or stick in every non-terminal state.
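
The count of 200 states follows from 10 player sums, 2 usable-ace values, and 10 dealer cards. A short sketch that enumerates them (the tuple layout is an assumption made for illustration):

```python
from itertools import product

player_sums = range(12, 22)      # 12 to 21
usable_ace = (True, False)
dealer_showing = range(1, 11)    # ace (1) to 10

states = list(product(player_sums, usable_ace, dealer_showing))
print(len(states))  # 200
```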

Note how much easier it is to generate sample transitions than to fully define the dynamics function $p$. Below is an implementation of an MC agent that interacts with the blackjack environment to learn an optimal policy. The agent uses on-policy exploring starts with the every-visit method. The final policy, which is probably optimal or close to optimal, is plotted after training.

import numpy as np
from enum import Enum
from copy import copy
import matplotlib as mpl
import matplotlib.pyplot as plt

class Action(Enum):
    HIT = 0
    STICK = 1

class State:
    def __init__(self, player_sum, usable_ace, dealer_showing):
        self.player_sum = player_sum
        self.usable_ace = usable_ace
        self.dealer_showing = dealer_showing
    
    def to_string(self):
        return "player_sum: {}, usable_ace: {}, dealer_showing: {}".format(
            self.player_sum, self.usable_ace, self.dealer_showing)

class Environment:
    def __init__(self, starting_state):
        self.state = starting_state

    def draw_card(self):
        cards = np.arange(10) + 1
        probabilities = np.append(np.ones(9)/13, 4/13)
        return np.random.choice(cards, p=probabilities)

    def take_action(self, action):
        if action == Action.HIT:
            # Get new card
            card = self.draw_card()
            if card == 1: # Ace worth 1 or 11
                if self.state.player_sum <= 10:
                    self.state.player_sum += 11
                    self.state.usable_ace = True
                else:
                    self.state.player_sum += 1
            else:
                self.state.player_sum += card

            # Check if player busted or needs to use an ace as 1
            if self.state.player_sum > 21:
                if self.state.usable_ace:
                    self.state.player_sum -= 10
                    self.state.usable_ace = False
                else: # Player busted
                    return -1
            return None
        
        # Player sticks, dealer now acts
        dealer_usable_ace = self.state.dealer_showing == 1
        dealer_sum = 11 if dealer_usable_ace else self.state.dealer_showing
        
        while dealer_sum < 17:
            # Dealer get new card
            card = self.draw_card()
            if card == 1: # Ace worth 1 or 11
                if dealer_sum <= 10:
                    dealer_sum += 11
                    dealer_usable_ace = True
                else:
                    dealer_sum += 1
            else:
                dealer_sum += card

            # Check if dealer busted or needs to use an ace as 1
            if dealer_sum > 21:
                if dealer_usable_ace:
                    dealer_sum -= 10
                    dealer_usable_ace = False
                else: # Dealer busted
                    return 1
        
        # Round is over
        if dealer_sum > self.state.player_sum:
            return -1
        elif dealer_sum < self.state.player_sum:
            return 1
        return 0

class Agent:
    # The policy is a deterministic mapping from state to action: 0 = HIT, 1 = STICK
    # Init policy to only stick on 20 and 21
    policy = np.zeros((2, 10, 10))
    policy[:, 8:10, :] = 1
    action_values = np.zeros((2, 10, 10, 2))
    num_visits = np.zeros(np.shape(action_values))

    def run_episode(self, environment, starting_action):
        history = []
        reward = None

        # Generate episode
        history.append((copy(environment.state), starting_action))
        reward = environment.take_action(starting_action)
        while reward is None:
            i = 0 if environment.state.usable_ace else 1
            j = environment.state.player_sum - 12
            k = environment.state.dealer_showing - 1
            action = Action(self.policy[i, j, k])
            history.append((copy(environment.state), action))
            reward = environment.take_action(action)

        # Use episode history to update action-values and the policy
        for state, action in history:
            i = 0 if state.usable_ace else 1
            j = state.player_sum - 12
            k = state.dealer_showing - 1
            l = action.value
            self.num_visits[i, j, k, l] += 1
            self.action_values[i, j, k, l] += (reward - self.action_values[i, j, k, l]) / self.num_visits[i, j, k, l]
            self.policy[i, j, k] = 0 if self.action_values[i, j, k, 0] > self.action_values[i, j, k, 1] else 1

if __name__ == "__main__":
    NUM_EPISODES = 10000000
    agent = Agent()

    # Agent learning policy
    for episode in range(NUM_EPISODES):
        if episode % 10000 == 0:
            print("STARTING EPISODE:", str(episode))
        random_starting_state = State(
            np.random.randint(12, high=22),
            True if np.random.randint(2) == 0 else False,
            np.random.randint(1, high=11)
        )
        environment = Environment(random_starting_state)
        random_starting_action = Action(np.random.randint(2))
        agent.run_episode(environment, random_starting_action)

    # Plot final policy
    ticks = np.arange(10)
    xtick_labels = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
    ytick_labels = range(12, 22)
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.imshow(agent.policy[0], cmap=mpl.colors.ListedColormap(["red", "blue"]), origin="lower")
    ax1.set_xticks(ticks)
    ax1.set_xticklabels(xtick_labels)
    ax1.set_xlabel("Dealer showing")
    ax1.set_yticks(ticks)
    ax1.set_yticklabels(ytick_labels)
    ax1.set_ylabel("Player sum")
    ax1.set_title("Usable ace")
    im2 = ax2.imshow(agent.policy[1], cmap=mpl.colors.ListedColormap(["red", "blue"]), origin="lower")
    ax2.set_xticks(ticks)
    ax2.set_xticklabels(xtick_labels)
    ax2.set_yticks(ticks)
    ax2.set_yticklabels(ytick_labels)
    ax2.set_title("No usable ace")
    fig.subplots_adjust(right=0.85)
    cbar_ax = fig.add_axes([0.9, 0.4, 0.02, 0.2])
    cbar = fig.colorbar(im2, cax=cbar_ax, ticks=[0, 1])
    cbar.ax.set_yticklabels(["Hit", "Stick"])
    plt.savefig("final_policy.png")

Conclusion

This has been a short introduction to Monte Carlo methods in reinforcement learning. On-policy and off-policy methods were discussed and a simple on-policy MC agent was implemented. The blackjack example is a good starting point for implementing MC methods because of its simplicity.

Thank you for reading and feel free to send me any questions.

Dynamic Programming

Mon, 14 Oct 2019 00:00:00 +0000

Introduction

Dynamic programming (DP) refers to a collection of methods in both mathematical optimization and computer programming. What all methods have in common is that they simplify a complicated problem by breaking it down into simpler sub-problems in a recursive manner. The focus of this article is going to be the role of DP in mathematical optimization and for solving finite Markov Decision Processes (MDPs) in particular. Later in the article there is going to be a segment about DP in computer programming.

Bellman equations are the key components of DP in mathematical optimization. To read more about them and MDPs in general, I refer you to my previous article Markov Decision Processes. As a reminder, the Bellman equations for the state-value function $v_{\pi}$ and action-value function $q_{\pi}$ are:

$$ v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{\pi}(s^\prime) \big) $$

$$ q_{\pi}(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) q_{\pi}(s^\prime,a^\prime) \big) $$

The Bellman optimality equations are the Bellman equations for the optimal state-value function $v_$ and optimal action-value function $q_$:

$$ v_*(s) = \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) $$

$$ q_*(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \max_{a^\prime} q_*(s^\prime,a^\prime) \big) $$

The equations are going to be used as update rules in iterative algorithms. Remember that it’s trivial to find an optimal policy when you know $v_*$ or $q_*$.

Note that DP algorithms have limited utility in reinforcement learning (RL) since they assume perfect knowledge of the dynamics of the MDP and are very computationally expensive. However, DP is still useful for many other problems and provides a good foundation for understanding many RL algorithms.

The aim of this article is to be a short introduction to DP with a focus on solving MDPs. To read more about this subject, I recommend Reinforcement Learning: An Introduction (Sutton, Barto, 2018). The source code can be found at https://github.com/CarlFredriksson/dynamic_programming.

Policy Evaluation

Policy evaluation is about computing estimates of value functions, for example the state-value function $v_{\pi}$ for a given policy $\pi$. To compute the exact solution, one could theoretically solve the system of linear Bellman equations. However, the computational complexity grows quickly with the size of the MDP and iterative methods are often more efficient. We can use the Bellman equation for $v_{\pi}$ as an update rule

$$ v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{k}(s^\prime) \big) $$

Applied iteratively, this update rule will make $v_{k+1}$ become closer to $v_{\pi}$ and eventually converge as $k \to \infty$. We of course don’t have infinite time and instead will have to accept approximations and stop when the updates become satisfactorily small. The updates can be done by remembering all the old values $v_{k}$ or by sweeping over the state space, updating values in place. If sweeping, the updates will use a mix of new and old values. This is not strictly the update rule as defined above, but $v_{k+1}$ will still converge to $v_{\pi}$. Sweeping uses less memory and usually converges faster so it’s normally the version that is used in DP.
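As a rough sketch of the stopping rule described above, the following in-place evaluation loop keeps sweeping until the largest change within a sweep falls below a threshold. The two-state MDP, the threshold name `theta`, and the dictionary layout of `p` are assumptions made for illustration and are not taken from the article's repository.

```python
import numpy as np

# Hypothetical MDP: p[(s, a)] is a list of (probability, next_state, reward).
# State 1 is terminal (absorbing, zero reward).
p = {
    (0, 0): [(0.5, 0, -1.0), (0.5, 1, -1.0)],
    (0, 1): [(1.0, 0, -1.0)],
    (1, 0): [(1.0, 1, 0.0)],
    (1, 1): [(1.0, 1, 0.0)],
}
policy = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi(a|s), equiprobable


def evaluate_policy(p, policy, gamma=0.9, theta=1e-8):
    """Sweep in place until the largest update in a sweep is below theta."""
    num_states, num_actions = policy.shape
    v = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            new_value = sum(
                policy[s, a] * prob * (r + gamma * v[s2])
                for a in range(num_actions)
                for prob, s2, r in p[(s, a)]
            )
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value  # in place: later states see this new value
        if delta < theta:
            return v


v = evaluate_policy(p, policy)
```

For this toy MDP the Bellman equation for state 0 reduces to $v = -1 + 0.675 v$, so the loop should end up close to $-1/0.325$.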

The idea of updating estimates using other estimates is called bootstrapping and is used extensively in RL.

Policy Improvement

To solve an MDP, our goal is to find an optimal policy. The next step after approximating $v_{\pi}$ using policy evaluation is to improve the current policy $\pi$. To create a new improved policy $\pi^\prime$, we go over each state and select actions such that expected return is maximized assuming $\pi$ being followed afterwards. In other words, we select actions greedily with respect to $q_\pi$:

$$ \begin{aligned} \pi^\prime(s) &= \underset{a}{\operatorname{argmax}} q_\pi(s,a) \\ &= \underset{a}{\operatorname{argmax}} E[R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t=s, A_t=a] \\ &= \underset{a}{\operatorname{argmax}} \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_\pi(s^\prime) \big) \end{aligned} $$

When selecting actions this way we know that the new policy $\pi^\prime$ is going to be as good as, or better than the old policy $\pi$. If $\pi^\prime$ is as good as, but not better than $\pi$ it means that $\pi$ and $\pi^\prime$ are both optimal. Like in policy evaluation, policy improvement is usually implemented as a sweep.
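The greedy selection above can be sketched as a one-step lookahead over the dynamics function. The MDP, the value estimates in `v`, and the function name `greedy_policy` are hypothetical and only serve to illustrate the argmax.

```python
import numpy as np

# Hypothetical MDP: p[(s, a)] is a list of (probability, next_state, reward).
p = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(0.8, 1, 1.0), (0.2, 0, 1.0)],
    (1, 0): [(1.0, 0, 2.0)],
    (1, 1): [(1.0, 1, 0.0)],
}
v = np.array([1.0, 5.0])  # assumed current estimate of v_pi


def greedy_policy(p, v, num_states, num_actions, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to v."""
    policy = np.zeros(num_states, dtype=int)
    for s in range(num_states):
        # q[a] = sum over (s', r) of p(s', r | s, a) * (r + gamma * v(s'))
        q = [
            sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
            for a in range(num_actions)
        ]
        policy[s] = int(np.argmax(q))
    return policy


new_policy = greedy_policy(p, v, num_states=2, num_actions=2)
```

With these numbers action 1 has the larger one-step lookahead value in both states, so `new_policy` selects action 1 everywhere.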

Policy Iteration

Now that we have tools for evaluating and improving policies we simply have to put them together. Policy iteration refers to alternating policy evaluation and policy improvement for multiple iterations until a satisfactory policy is found. Unless an optimal policy has been found, each iteration will produce a strictly better policy. Policy iteration will converge to an optimal policy in finite time since finite MDPs have a finite number of possible deterministic policies. To speed up convergence, policy evaluation will start with the value estimates of the previous policy.
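A minimal sketch of the full loop, assuming a small deterministic MDP stored as a lookup table. The table, the constants, and the function names are hypothetical. Evaluation is warm-started from the previous value estimates, as described above, and the loop stops when improvement no longer changes the policy.

```python
import numpy as np

# Hypothetical deterministic MDP: DYNAMICS[(state, action)] = (next_state, reward)
DYNAMICS = {
    (0, 0): (0, 0.0),
    (0, 1): (1, 1.0),
    (1, 0): (0, 2.0),
    (1, 1): (1, 0.0),
}
NUM_STATES, NUM_ACTIONS, GAMMA = 2, 2, 0.9


def evaluate(policy, v, theta=1e-10):
    """In-place iterative policy evaluation for a deterministic policy."""
    while True:
        delta = 0.0
        for s in range(NUM_STATES):
            s2, r = DYNAMICS[(s, int(policy[s]))]
            new_value = r + GAMMA * v[s2]
            delta = max(delta, abs(new_value - v[s]))
            v[s] = new_value
        if delta < theta:
            return v


def improve(v):
    """Greedy policy improvement with respect to v."""
    policy = np.zeros(NUM_STATES, dtype=int)
    for s in range(NUM_STATES):
        q = []
        for a in range(NUM_ACTIONS):
            s2, r = DYNAMICS[(s, a)]
            q.append(r + GAMMA * v[s2])
        policy[s] = int(np.argmax(q))
    return policy


def policy_iteration():
    policy = np.zeros(NUM_STATES, dtype=int)
    v = np.zeros(NUM_STATES)
    while True:
        v = evaluate(policy, v)  # warm start from the previous estimates
        new_policy = improve(v)
        if np.array_equal(new_policy, policy):
            return policy, v  # policy is stable, so it is greedy w.r.t. its own values
        policy = new_policy


policy, v = policy_iteration()
```

In this toy problem the best behaviour is to alternate between the two states, so the loop should settle on action 1 in state 0 and action 0 in state 1.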

Value Iteration

There is a drawback to policy iteration in that the policy evaluation step can be computationally expensive. When doing policy evaluation iteratively, $v_{\pi}$ converges only in the limit and we will have to accept approximations. But the ultimate goal is not to compute $v_{\pi}$ as accurately as possible each policy iteration, it is to find an optimal policy. How close to the real state-value function do we actually need to get? For the gridworld in example 4.1 from (Sutton, Barto, 2018), only three iterations of policy evaluation on the random policy were needed before the result of the corresponding policy improvement wouldn’t change. This can be seen in figure 4.1 from (Sutton, Barto, 2018).

This was only an example, but it holds in general that the policy evaluation step can be truncated without removing the convergence guarantees of policy iteration. An important special case is when policy evaluation is stopped after a single iteration. This is called value iteration. Instead of having two steps it can be combined into a single update rule

$$ v_{k+1}(s) = \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_k(s^\prime) \big) $$

This is almost the same update rule as policy evaluation, but using the Bellman optimality equation for $v_*$ instead. Value iteration is usually implemented as a sweep, with every sweep effectively being a combination of one policy evaluation sweep and one policy improvement sweep. Usually doing a few policy evaluation sweeps for every policy improvement sweep will lead to faster convergence. All policy iteration algorithms with a truncated policy evaluation step can be implemented as a sequence of sweeps, with some sweeps using policy evaluation updates and some using value iteration updates.
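To make the update rule concrete, here is a small sketch of value iteration with in-place sweeps on a hypothetical four-state corridor (an assumption for illustration, not the gridworld used later). Every move costs -1 and state 3 is terminal.

```python
import numpy as np

# Hypothetical corridor: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. DYNAMICS[(s, a)] = (next_state, reward)
DYNAMICS = {
    (0, 0): (0, -1), (0, 1): (1, -1),
    (1, 0): (0, -1), (1, 1): (2, -1),
    (2, 0): (1, -1), (2, 1): (3, -1),
    (3, 0): (3, 0), (3, 1): (3, 0),
}


def value_iteration(dynamics, num_states, num_actions, gamma=1.0, theta=1e-9):
    """Sweep with the Bellman optimality update until values stop changing."""
    v = np.zeros(num_states)
    while True:
        delta = 0.0
        for s in range(num_states):
            candidates = []
            for a in range(num_actions):
                s2, r = dynamics[(s, a)]
                candidates.append(r + gamma * v[s2])
            best = max(candidates)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v


v_star = value_iteration(DYNAMICS, num_states=4, num_actions=2)
```

The resulting values are the negated number of steps to the terminal state, which is what one would expect when every step costs -1 and there is no discounting.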

Generalized Policy Iteration

Policy iteration consists of policy evaluation and policy improvement steps that are alternated. As mentioned, the number of iterations of policy evaluation before doing policy improvement can be varied. There are other methods of finding optimal, or approximately optimal, policies that similarly alternate policy evaluation and improvement but vary other aspects. One such aspect is the granularity of the updates, which means that some states could get multiple updates before others get a single one. This is often the case for RL methods where the agent doesn’t have complete knowledge of the dynamics function $p$ and instead relies on exploration and sample returns.

The term generalized policy iteration (GPI) refers to the general idea of alternating policy evaluation and policy improvement. It is an important idea since almost all RL methods are a form of GPI.

Gridworld Example

We will use example 4.1 from (Sutton, Barto, 2018). States $1,2,…,14$ are non-terminal and state $15$ is terminal. State $15$ is in both the top-left and the bottom-right. Actions “up”, “down”, “right”, and “left” are available in every state. Actions deterministically move the agent in their respective direction except when the action would move the agent out of the gridworld. In that case the state remains unchanged. Actions in all non-terminal states give the agent a reward of -1. It is an undiscounted episodic task.

Since the dynamics are completely deterministic we can implement them as a lookup table $(s,a) \to (s^\prime,r)$. The actions $(0,1,2,3)$ correspond to “up”, “down”, “right”, and “left” respectively. The state numbers have all been reduced by one to make them start at zero which makes the policy evaluation and value iteration functions cleaner.

DYNAMICS = {
    (0, 0): (0, -1),
    (0, 1): (4, -1),
    (0, 2): (1, -1),
    (0, 3): (14, -1),
    (1, 0): (1, -1),
    (1, 1): (5, -1),
    (1, 2): (2, -1),
    (1, 3): (0, -1),
    ...
    (14, 0): (14, 0),
    (14, 1): (14, 0),
    (14, 2): (14, 0),
    (14, 3): (14, 0),
}
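Writing out all 60 entries by hand is error-prone, so a table like this could also be generated from the grid layout. The sketch below is an assumption about how that might look: the cell numbering (row-major on a 4x4 grid, both corner cells mapped to terminal state 14) and all names are hypothetical, chosen to agree with the entries listed above.

```python
GRID_SIZE = 4
TERMINAL = 14
# Action index -> (row change, column change): up, down, right, left
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}


def cell_to_state(row, col):
    """Map a grid cell to a zero-based state; both corners are the terminal state."""
    cell = row * GRID_SIZE + col
    if cell in (0, GRID_SIZE * GRID_SIZE - 1):
        return TERMINAL
    return cell - 1


def build_dynamics():
    # The terminal state loops onto itself with zero reward.
    dynamics = {(TERMINAL, action): (TERMINAL, 0) for action in MOVES}
    for row in range(GRID_SIZE):
        for col in range(GRID_SIZE):
            state = cell_to_state(row, col)
            if state == TERMINAL:
                continue
            for action, (d_row, d_col) in MOVES.items():
                new_row, new_col = row + d_row, col + d_col
                # Moves that would leave the grid keep the agent in place.
                if not (0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE):
                    new_row, new_col = row, col
                dynamics[(state, action)] = (cell_to_state(new_row, new_col), -1)
    return dynamics


generated = build_dynamics()
```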

A policy evaluation sweep can be implemented as follows:

def policy_evaluation_sweep(dynamics, policy, state_values):
    # One in-place sweep: states later in the loop already see updated values.
    num_states = len(state_values)
    for state in range(num_states):
        value = 0
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            # The task is undiscounted (gamma = 1), so no discount factor appears.
            value += policy[state, action] * (reward + state_values[new_state])
        state_values[state] = value

I also implemented a non-sweeping version for comparison with figure 4.1 from (Sutton, Barto, 2018), since they used a non-sweeping version when producing the results in that figure.

import numpy as np

def policy_evaluation(dynamics, policy, state_values):
    # Non-sweeping version: every update reads only the old values.
    num_states = len(state_values)
    new_state_values = np.zeros(num_states)
    for state in range(num_states):
        value = 0
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            value += policy[state, action] * (reward + state_values[new_state])
        new_state_values[state] = value
    return new_state_values

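Policy improvement can be implemented as follows. Despite its name, the function only makes the policy greedy with respect to the current state values and leaves the state values themselves unchanged: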

def value_iteration_sweep(dynamics, policy, state_values):
    num_states = len(state_values)
    for state in range(num_states):
        max_value = -np.inf
        num_actions = len(policy[state])
        for action in range(num_actions):
            new_state, reward = dynamics[(state, action)]
            value = reward + state_values[new_state]
            if value > max_value:
                max_value = value
                policy[state] = np.eye(num_actions)[action]

The policy was initialized to the random policy. When running some iterations of non-sweeping policy evaluation the results matched those in figure 4.1. Policy improvement applied to the state-values after at least three iterations of non-sweeping policy evaluation resulted in an optimal policy. The sweeping policy evaluation needed a few more iterations before policy improvement would find an optimal policy.

DP in Computer Programming

As mentioned in the introduction, DP refers to a collection of methods in both mathematical optimization and computer programming. For DP to be applicable to a programming problem, the problem needs to have optimal substructure and overlapping sub-problems.

Optimal substructure means that the solution to the problem can be constructed from optimal solutions to its sub-problems. Optimal substructures are usually described by recursion. An example of this is shortest path problems. If there is a path $A \to B \to C \to D$ that is the shortest path from $A$ to $D$, then $B \to C \to D$ is the shortest path from $B$ to $D$. In other words, the shortest path problem from $B$ to $D$ is nested inside the $A$ to $D$ problem and can be called a sub-problem.

Overlapping sub-problems means that a simple recursive algorithm will solve the same sub-problems over and over. If we can solve a problem by combining optimal solutions to non-overlapping sub-problems, DP is not applicable to that problem and the strategy divide and conquer might be used instead. Examples of divide and conquer algorithms are merge sort and quick sort.

DP has two approaches for avoiding having to solve the same sub-problems over and over. These are the top-down approach and the bottom-up approach. The top-down approach is to do recursion in the order the problem is formulated, but also store the result of solved sub-problems so that they don’t need to be computed more than once. Storing solutions to sub-problems is called memoization. The bottom-up approach is to solve sub-problems from the bottom first and using those solutions to build solutions to bigger sub-problems.

Fibonacci Sequence Example

Computing the nth member of the Fibonacci sequence is a good example of a problem that has both optimal substructure and overlapping sub-problems. The Fibonacci sequence $F_n$ is defined as:

$$ \begin{aligned} F_0 &= 0 \\ F_1 &= 1 \\ F_n &= F_{n-1} + F_{n-2} \end{aligned} $$

The following is a naive implementation:

def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

The naive implementation is computationally very inefficient since it computes the solutions to the same sub-problems over and over. For example, if we call fib(5) we produce the following call tree:

fib(5)
fib(4) + fib(3)
(fib(3) + fib(2)) + (fib(2) + fib(1))
((fib(2) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))
(((fib(1) + fib(0)) + fib(1)) + (fib(1) + fib(0))) + ((fib(1) + fib(0)) + fib(1))

You can see that fib(2) was computed from scratch three times. This implementation runs in exponential time (O(2^n)).
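The repeated work can be checked by counting calls. This is a small illustrative sketch; the counter and the name `fib_counted` are not part of the original code.

```python
from collections import Counter

calls = Counter()  # how many times each argument is evaluated


def fib_counted(n):
    calls[n] += 1
    if n <= 1:
        return n
    return fib_counted(n - 1) + fib_counted(n - 2)


result = fib_counted(5)
print(result, calls[2], sum(calls.values()))  # 5 3 15
```

A single call with n = 5 already triggers 15 function calls, three of them for fib(2).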

A DP implementation with the top-down approach:

memory = {0: 0, 1: 1}
def fib_memoization(n):
    if n not in memory:
        memory[n] = fib_memoization(n - 1) + fib_memoization(n - 2)
    return memory[n]

This version uses memoization and computes solutions to sub-problems only if they aren’t in the memory yet. This implementation runs in linear time (O(n)) but requires O(n) storage space.
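Python's standard library can also handle the bookkeeping. The sketch below uses `functools.lru_cache` as an alternative to the hand-written dictionary. The function name `fib_cached` is just an illustrative choice.

```python
from functools import lru_cache


@lru_cache(maxsize=None)  # unbounded cache: every solved sub-problem is kept
def fib_cached(n):
    if n <= 1:
        return n
    return fib_cached(n - 1) + fib_cached(n - 2)


print(fib_cached(50))  # 12586269025
```

Like the explicit version, this is still recursive, so very large n can hit Python's recursion limit.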

A DP implementation with the bottom-up approach:

def fib_bottom_up(n):
    if n == 0:
        return 0

    previous_fib = 0
    current_fib = 1
    for _ in range(n - 1):
        new_fib = previous_fib + current_fib
        previous_fib = current_fib
        current_fib = new_fib
    return current_fib

This version computes sub-problems from the bottom and builds upwards. The smaller values are computed first and are used to compute larger values. This implementation also runs in linear time but only requires constant (O(1)) storage space.

Both DP implementations actually take O(n^2) time for large $n$, since $F_n$ has O(n) digits and adding integers of that size takes O(n) time.

Conclusion

This has been a short introduction to dynamic programming. Splitting up a complex problem (for example computing an optimal policy in an MDP) into simpler, interdependent sub-problems (for example computing state-values) is a powerful idea that can be applied to a variety of problems. Understanding DP and the idea of GPI provides a good base for understanding many reinforcement learning algorithms.

Thank you for reading and feel free to send me any questions.

Markov Decision Processes

Tue, 08 Oct 2019 00:00:00 +0000

Introduction

A Markov decision process (MDP) is a mathematical formalization of a sequential decision-making process where actions affect both immediate reward and the next state. MDPs are similar to Multi-armed Bandits in that the agent repeatedly has to make decisions and receives immediate rewards depending on what action is selected. What makes MDPs more complex is that the actions affect the subsequent states, and thus affect future rewards as well. The goal of the agent is the same, to maximize expected return (expected cumulative future reward). In an MDP, each state could have different actions available. After an agent selects an action, the environment gives the agent a reward and presents a new state. Figure 3.1 from Reinforcement Learning: An Introduction (Sutton, Barto, 2018) illustrates the interaction between agent and environment.

At each time step $t=0,1,2,…$, the agent observes a representation of the environment’s state $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all states that aren’t terminal states (the set $\mathcal{S}^+$ additionally includes the terminal states). The agent selects an action $A_t \in \mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of all available actions in state $S_t$. The next time step, the agent receives a numerical reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$ that depends on the previous state and selected action. The agent observes a new state $S_{t+1}$ and the process continues. This process creates a sequence (or trajectory) of the following form:

$$ S_0,A_0,R_1,S_1,A_1,R_2,S_2,A_2,R_3,… $$

Dynamics Function $p$

This article is going to focus on finite MDPs, which are MDPs where the sets $\mathcal{S}, \mathcal{A}, \mathcal{R}$ all have a finite number of elements. Finite MDPs are defined by their dynamics function $p$:

$$ p(s^\prime,r|s,a) = \Pr\{S_t=s^\prime, R_t=r | S_{t-1}=s, A_{t-1}=a\} $$

In other words, $p$ defines the joint probability of observing state $s^\prime$ and receiving reward $r$ after being in state $s$ and taking action $a$.
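One possible way to hold such a function in code is a nested dictionary. The MDP below is entirely hypothetical (state names, actions, rewards, and probabilities are made up) and the helper names are illustrative. For each state-action pair the probabilities over $(s^\prime, r)$ must sum to one.

```python
import math

# Hypothetical MDP: p[(s, a)][(s_next, r)] = p(s_next, r | s, a)
p = {
    ("low", "search"): {("low", 1.0): 0.6, ("high", -3.0): 0.4},
    ("low", "wait"): {("low", 0.0): 1.0},
    ("high", "search"): {("high", 2.0): 0.7, ("low", 2.0): 0.3},
    ("high", "wait"): {("high", 0.5): 1.0},
}


def is_valid_dynamics(p):
    """Check that p(., . | s, a) is a probability distribution for every (s, a)."""
    return all(
        math.isclose(sum(outcomes.values()), 1.0)
        and all(prob >= 0 for prob in outcomes.values())
        for outcomes in p.values()
    )


def expected_reward(p, state, action):
    """r(s, a) = sum over (s', r) of r * p(s', r | s, a)."""
    return sum(prob * reward for (_, reward), prob in p[(state, action)].items())


print(is_valid_dynamics(p))
print(expected_reward(p, "low", "search"))
```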

Episodic and Continuous Tasks

MDPs can be divided into episodic and continuous tasks. Episodic tasks have a clear ending: the agent will eventually reach a terminal state. Continuous tasks have no clear ending and potentially continue forever. In both types of tasks the goal is to maximize expected return. However, the return can be defined in different ways.

The return is denoted $G_t$ and the final time step is denoted $T$. In episodic tasks, $T$ is finite and $G_t$ can be defined as:

$$ G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_{T} = \sum_{k=t+1}^{T} R_k $$

In continuous tasks, $T = \infty$ and $G_t$ could easily diverge. To make $G_t$ converge we add a parameter $0 \leq \gamma \leq 1$, called the discount rate. As long as $\gamma < 1$ and the rewards are bounded, the sum below is finite:

$$ G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + … = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

This is called discounting and means that rewards closer to the current time step will be prioritized. We can also define the return in a way that works for both episodic and continuous tasks:

$$ G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k $$

When the task is episodic we usually set $\gamma = 1$ (sometimes it might make sense to use discounting for an episodic task). When the task is continuous we usually set $\gamma < 1$.
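As a small illustration of the unified definition, the return of a finite reward sequence can be computed backwards using $G_t = R_{t+1} + \gamma G_{t+1}$. The function name and the example rewards are made up for this sketch.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ..., R_T]."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g  # G_k = R_{k+1} + gamma * G_{k+1}
    return g


print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([1.0, 2.0, 3.0], gamma=1.0))  # undiscounted sum = 6.0
```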

Examples

A wide array of problems can be formulated as MDPs. Some examples:

Chess: The states are the position of the pieces. Rewards are 0 except when reaching a terminal state (win, draw, loss). A reward of 1 is given for a win, 1/2 for a draw, and 0 for a loss. The actions are all legal moves in a given state and they deterministically change the state. Since there are terminal states that eventually will be reached it is an episodic task.
Cleaning robot: The states are combinations of the robot’s sensor readings, such as position and dust-detection sensors. Positive rewards are given when dust is cleaned up. You might also add a negative reward when the robot runs out of battery, to incentivize the robot to go to its recharging station before running out. The actions could be on the abstraction level of activating motors, or higher level like “move right”. It could be modelled either as an episodic or continuous task, depending on whether you want the robot to run all the time or not.
Temperature control: The states are temperature readings on a thermometer. Negative reward is given when a human manually has to adjust the temperature. The actions are how much power to send to different radiators. Will probably be a continuous task.

Note the importance of designing good rewards, such that if the agent learns to optimize expected return, it will also perform the task very well.

Solving MDPs

The objective of the agent is to maximize expected cumulative future reward. In other words, we want to find a policy that gives the agent as much expected return as possible. A policy is a mapping from states to probabilities of selecting each available action in that state. Solving an MDP usually means finding an optimal policy. When MDPs are small (the sets of states and actions are small) and the dynamics function $p$ is fully known, we can find the optimal policy by solving the Bellman equations (more on this later). MDPs are often too big for this to be computationally feasible. There is a collection of algorithms called dynamic programming (DP) that can be more efficient at finding the optimal policy. When the MDPs are too large for DP or when the dynamics function is not explicitly known (which is often the case for real-world applications), we use reinforcement learning (RL) instead. In RL the aim is usually to find a policy that is good but doesn’t have to be optimal.

The aim of this article is to introduce the basics of finite MDPs. To read more about this subject, I recommend the aforementioned (Sutton, Barto, 2018).

Policies and Value Functions

As mentioned above, a policy is a mapping from states to probabilities of selecting each available action in that state. More formally, if the agent is following policy $\pi$ at time $t$, then $\pi(a|s)$ is the probability that $A_t=a$ if $S_t=s$. A special case is deterministic policies where we denote the action that will be selected in state $s$ by $\pi(s)$.

Value functions are functions that map states, or state-action pairs, to expected return when following a specific policy $\pi$. You can think of value functions as how good a state, or state-action pair, is when following $\pi$. Value functions often can’t be computed exactly, but estimating them and using the estimates to make policy decisions is a part of almost all RL algorithms. There are two types of value functions, state-value functions and action-value functions.

The state-value function of a state $s$ when following policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when following $\pi$ from $s$. More formally:

$$ v_{\pi}(s) = E_{\pi}[G_t | S_t=s] = E_{\pi}\Big[\sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \Big| S_t=s \Big] $$

The action-value function of a state $s$ and action $a$ when following policy $\pi$, denoted $q_{\pi}(s,a)$, is the expected return when taking action $a$ in $s$, and then following $\pi$ from the new state. More formally:

$$ q_{\pi}(s,a) = E_{\pi}[G_t | S_t=s, A_t=a] = E_{\pi}\Big[\sum_{k=t+1}^{T} \gamma^{k-t-1} R_k \Big| S_t=s, A_t=a \Big] $$

A policy $\pi$ is better than or equal to policy $\pi^\prime$ if its expected return is greater than or equal to that of $\pi^\prime$ for all states. More formally, $\pi \geq \pi^\prime$ if and only if $v_{\pi}(s) \geq v_{\pi^\prime}(s)$ for all states $s$. There is always at least one optimal policy, and we denote all of them $\pi_*$. Even though there can be multiple optimal policies, they all share the same optimal state-value function, defined as $v_*(s)=\max_{\pi} v_{\pi}(s)=v_{\pi_*}(s)$. They also share the same optimal action-value function, defined as $q_*(s,a)=\max_{\pi} q_{\pi}(s,a)=q_{\pi_*}(s,a)$.

Bellman Equations

An important property of value functions that is used throughout RL and DP is that they are recursive. We can use this property to define Bellman equations. Let’s start with the Bellman equation for $v_{\pi}$:

$$ v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{\pi}(s^\prime) \big) $$

Derivation:

$$ \begin{aligned} v_{\pi}(s) &= E_{\pi}[G_t | S_t=s] \\ &= E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t=s] \\ &= \sum_a \pi(a|s) \sum_{s^\prime} \sum_{r} p(s^\prime,r|s,a) \big(r + \gamma E_{\pi}[G_{t+1} | S_{t+1}=s^\prime] \big) \\ &= \sum_a \pi(a|s) \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{\pi}(s^\prime) \big) \end{aligned} $$

Let’s continue with the Bellman equation for $q_{\pi}$:

$$ q_{\pi}(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) q_{\pi}(s^\prime,a^\prime) \big) $$

Derivation:

$$ \begin{aligned} q_{\pi}(s,a) &= E_{\pi}[G_t | S_t=s, A_t=a] \\ &= E_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a] \\ &= \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma E_{\pi}[G_{t+1} | S_{t+1}=s^\prime] \big) \\ &= \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_{\pi}(s^\prime) \big) \\ &= \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \sum_{a^\prime} \pi(a^\prime|s^\prime) q_{\pi}(s^\prime,a^\prime) \big) \end{aligned} $$

We can do the same thing for the optimal value functions. These are sometimes called Bellman optimality equations. Let’s start with the Bellman equation for $v_*$:

$$ v_*(s) = \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) $$

Derivation:

$$ \begin{aligned} v_*(s) &= v_{\pi_*}(s) \\ &= \max_a q_{\pi_*}(s,a) \\ &= \max_a E_{\pi_*}[G_t | S_t=s, A_t=a] \\ &= \max_a E_{\pi_*}[R_{t+1} + \gamma G_{t+1} | S_t=s, A_t=a] \\ &= \max_a E[R_{t+1} + \gamma v_*(S_{t+1}) | S_t=s, A_t=a] \\ &= \max_a \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) \end{aligned} $$

Let’s continue with the Bellman equation for $q_*$:

$$ q_*(s,a) = \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \max_{a^\prime} q_*(s^\prime,a^\prime) \big) $$

Derivation:

$$ \begin{aligned} q_*(s,a) &= E[R_{t+1} + \gamma \max_{a^\prime} q_*(S_{t+1},a^\prime) | S_t=s, A_t=a] \\ &= \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma \max_{a^\prime} q_*(s^\prime,a^\prime) \big) \end{aligned} $$

The equations must hold for all states, thus we have systems of linear equations we can solve to compute value functions. In the optimal case it is a bit trickier since they involve the non-linear operation max. The systems of non-linear equations can still be solved to compute the optimal value functions. Even though it is theoretically possible to solve the Bellman equations for all finite MDPs where the dynamics function is known, it becomes very computationally expensive as MDPs get larger.

If you know an optimal value function, it’s trivial to compute the optimal policy. If you have the optimal state-value function, simply look one step ahead in every state and select actions greedily. In other words, select actions by:

$$ \underset{a}{\operatorname{argmax}} \sum_{s^\prime,r} p(s^\prime,r|s,a) \big(r + \gamma v_*(s^\prime) \big) $$

If you have the optimal action-value function it is even easier. In all states, simply select the action with the highest action-value:

$$ \underset{a}{\operatorname{argmax}} q_*(s,a) $$

If there are ties between actions in either case, you can select any probability distribution between the tying actions.
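A minimal sketch of one such choice, assuming the optimal action-values for a single state are available as an array: spread the probability uniformly over the tied actions. The function name and the example values are hypothetical.

```python
import numpy as np


def greedy_distribution(q_values):
    """Uniform distribution over the actions that attain the maximum value."""
    q_values = np.asarray(q_values, dtype=float)
    is_best = np.isclose(q_values, q_values.max())
    return is_best / is_best.sum()


# Actions 1 and 2 tie for the highest value and share the probability mass.
probabilities = greedy_distribution([1.0, 3.0, 3.0, 0.5])
```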

Gridworld Example

Gridworlds are often used as examples of MDPs. Example 3.5 from (Sutton, Barto, 2018) is a simple example of a finite MDP. The left part of figure 3.2 from (Sutton, Barto, 2018) shows the gridworld. Each cell is a state and the agent can select between the actions: north, south, east, and west in all states. The actions deterministically move the agent in their respective directions, except when they would move the agent out of the grid. In that case the state will remain unchanged and the agent will receive a reward of -1. When the agent selects any action in state $A$, it will be moved to state $A^\prime$ and receive a reward of +10. When the agent selects any action in state $B$, it will be moved to state $B^\prime$ and receive a reward of +5. In all other cases the agent will receive a reward of 0.

The right part of figure 3.2 shows the state-value function $v_{\pi}$ for the policy where all actions are selected at random with equal probability in all states. Since there is no terminal state, discounting was used with rate $\gamma=0.9$. The authors of the book solved the Bellman equations to compute the value function.
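One way such a solution could be computed is to write the Bellman equations in matrix form, $v = r + \gamma P v$, and solve the linear system directly. The sketch below follows the rules described above. The grid size and the positions of $A$, $A^\prime$, $B$, and $B^\prime$ are my reading of the book's figure and should be treated as assumptions, as should all names in the code.

```python
import numpy as np

SIZE, GAMMA = 5, 0.9
# Assumed positions (row, col): A -> A' with reward +10, B -> B' with reward +5
SPECIAL = {(0, 1): ((4, 1), 10.0), (0, 3): ((2, 3), 5.0)}
MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # north, south, east, west


def step(row, col, move):
    """Deterministic dynamics: returns ((new_row, new_col), reward)."""
    if (row, col) in SPECIAL:
        return SPECIAL[(row, col)]
    new_row, new_col = row + move[0], col + move[1]
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return (new_row, new_col), 0.0
    return (row, col), -1.0  # bumped into the edge


num_states = SIZE * SIZE
P = np.zeros((num_states, num_states))  # state transition matrix under the random policy
r = np.zeros(num_states)                # expected immediate reward under the random policy
for row in range(SIZE):
    for col in range(SIZE):
        s = row * SIZE + col
        for move in MOVES:
            (new_row, new_col), reward = step(row, col, move)
            P[s, new_row * SIZE + new_col] += 0.25
            r[s] += 0.25 * reward

# Solve (I - gamma * P) v = r
v = np.linalg.solve(np.eye(num_states) - GAMMA * P, r)
print(np.round(v.reshape(SIZE, SIZE), 1))
```

If the assumed layout matches the book, the printed grid should be comparable to the right part of figure 3.2, with the value of $A$ below its immediate reward of 10 and the value of $B$ above its immediate reward of 5.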

The authors also solved the Bellman equations for $v_*$, and the optimal state-value function can be seen in the middle of figure 3.5 from (Sutton, Barto, 2018). The right part shows the corresponding optimal policies $\pi_*$. When there are multiple arrow-directions in a state, it means that any of those directions is an optimal action in that state.

Conclusion

This has been a short introduction to MDPs. The formalization is pretty simple but can model a great deal of problems and is fundamental to the majority of reinforcement learning.

Thank you for reading and feel free to send me any questions.

Multi-armed Bandits

Thu, 03 Oct 2019 00:00:00 +0000

Introduction

A multi-armed bandit problem is a problem where an agent repeatedly has to make a choice between actions that give the agent rewards. Each action has a different probability distribution that determines how much reward is given, and these distributions are not known to the agent. The objective of the agent is to maximize expected cumulative future reward, which is called expected return. The name multi-armed bandit comes from slot machines (one-armed bandits) and one can imagine the actions the agent takes as choosing which arm to pull on a slot machine with multiple arms.

Let’s say we have a $k$-armed bandit problem, with $k$ actions $a$. We can denote the action selected at time $t$ as $A_t$, and the reward received as $R_t$. The value of an action, $q_*(a)$, is defined as the expected reward given that $a$ is selected:

$$ q_*(a) = E[R_t|A_t=a] $$

If the agent knew the actual value for each action, it would be trivial to solve the problem by simply selecting the action with the highest value all the time. Instead the agent has to estimate the value of each action. We denote the estimate of the value of selecting action $a$ at time $t$ as $Q_t(a)$. If the agent’s value estimates are perfect, it can make perfect decisions, so we want $Q_t(a)$ to be as close to $q_*(a)$ as possible.

A bandit problem is a simple version of Reinforcement Learning (RL) problems. In full RL problems, the agent similarly wants to maximize expected return, but the selected actions also affect the state of the agent’s environment which affects future rewards and makes the problem more complicated. Even though bandit problems are simpler, studying them provides a good base for understanding many aspects of RL.

There is a type of bandit problem called contextual bandits where state has to be taken into account. For example you can imagine a problem with multiple $k$-armed bandits and each time step the agent is presented with a random one among them. The agent also receives some clue about what the current bandit is and can learn a policy. A policy is a mapping from states to probability distributions over actions that determines what action to take in any given state. Contextual bandits are closer to full RL problems since they have states, but are still simpler since actions do not affect future states.

The aim of this article is to introduce the basics of multi-armed bandit problems. To read more about this subject, I recommend Reinforcement Learning: An Introduction (Sutton, Barto, 2018). The source code can be found at https://github.com/CarlFredriksson/multi_armed_bandits.

Value Estimation

To estimate the value of each action we can set our estimates to an initial value $Q_0$ and then update the estimate for an action every time the agent takes that action. The general update rule is as follows:

$$ Q_{t+1}(a) = Q_t(a) + \alpha (R_t - Q_t(a)) $$

This update rule brings the estimate closer to the most recently observed reward $R_t$. How much the estimate should move towards the reward depends on $\alpha$, which is called the step size. The step size can be a constant or be computed in various ways. One of the more natural ways to compute $\alpha$ is to set it to $1/N(a)$, where $N(a)$ is the number of times action $a$ has been selected. This step size means that the value estimates will be the average reward received when taking that action. To see that this is the case we can let $Q_{n+1}(a)$ be the average of all received rewards when selecting action $a$ and manipulate the expression. For brevity, assume that we are looking at a single action and drop the $a$. Let $Q_{n}=Q_{n}(a)$, $n=N(a)$, and $R_n$ be the reward received when taking the action the $n$-th time:

$$ \begin{aligned} Q_{n+1} &= \frac{1}{n} \sum_{i=1}^{n} R_i \\ &= \frac{1}{n} \Big(R_{n} + \sum_{i=1}^{n-1} R_i \Big) \\ &= \frac{1}{n} \Big(R_n + (n-1) \frac{1}{n-1} \sum_{i=1}^{n-1} R_i \Big) \\ &= \frac{1}{n} \Big(R_n + (n-1) Q_n \Big) \\ &= \frac{1}{n} \Big(R_n + n Q_n - Q_n \Big) \\ &= Q_n + \frac{1}{n} \Big(R_n - Q_n \Big) \end{aligned} $$

Thus with $\alpha = 1/N(a)$, the estimate $Q_{t}(a)$ will be the average of all received rewards and will converge to the real value $q_*(a)$ by the law of large numbers. This step size is very good when the problem is stationary, in other words, when $q_*(a)$ stays the same for all $t$. However, when the problem is non-stationary and $q_*(a)$ changes, $\alpha = 1/N(a)$ might not lead to good estimates in the long term. This is because the observed rewards affect the estimates less and less the more times actions get selected. In other words, $Q_{t}(a)$ will change less as time goes on. When the problem is non-stationary it is more appropriate to use a constant $\alpha \in (0,1]$. This leads to recent rewards affecting the estimates more than older ones, with the extreme $\alpha = 1$ leading to estimates only caring about the most recent reward. Thus a constant $\alpha$ will make the agent more adaptable to changing $q_*(a)$.
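The difference between the two step sizes can be seen on a tiny example. The reward sequence and the initial estimate of 0 are made up for illustration. The same update function is used with $\alpha = 1/n$ and with a constant $\alpha = 0.5$.

```python
def update(estimate, reward, alpha):
    """General update rule: move the estimate a fraction alpha towards the reward."""
    return estimate + alpha * (reward - estimate)


rewards = [1.0, 3.0, 2.0, 6.0]  # hypothetical rewards for a single action

q_average = 0.0
q_constant = 0.0
for n, reward in enumerate(rewards, start=1):
    q_average = update(q_average, reward, 1 / n)   # sample average
    q_constant = update(q_constant, reward, 0.5)   # constant step size

print(q_average)   # equals the mean of the rewards, 3.0
print(q_constant)  # 3.9375, pulled towards the most recent reward of 6.0
```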

Action Selection

Now we have discussed a couple of options for estimating the value of actions, but how should the agent select actions? The first thing that comes to mind is probably to be greedy and simply select the action with the highest estimated value. In other words, the agent exploits its current knowledge in order to get as much immediate reward as possible. This is an important part of action selection since the objective is to maximize expected return. However, if an agent is too greedy and only exploits, it might not learn accurate value estimates. If an agent has inaccurate value estimates, it can select a sub-optimal action when it exploits. This is why the opposite of exploitation, exploration, is also important when selecting actions.

Exploration is the act of sacrificing immediate reward for gaining knowledge about the problem, in this case to get better value estimates. There is an important tradeoff in RL (and many other fields) called exploration versus exploitation. This is a tradeoff because an agent can’t do both at the same time. The tradeoff needs to be balanced, since if the agent exploits too much it might take a long time to learn what the best action is, or it might never learn it. On the other hand, if the agent explores too much it will miss out on a lot of reward because it keeps selecting sub-optimal actions.

There are many methods for balancing the tradeoff. One of the simpler ones is $\epsilon$-greedy, which means that the agent is greedy with probability $1-\epsilon$, and selects a random action with probability $\epsilon$.

Action-value Methods

When a method for value estimation is combined with a method for action selection it is called an action-value method. To roughly assess the performance of different action-value methods we can use a set of randomly generated $k$-armed bandit problems.

The code below generates multiple bandit problems with different values $q_*(a)$. It runs $\epsilon$-greedy and greedy only methods ($\epsilon$-greedy with $\epsilon=0$) with different parameter settings on the generated problems. The results are averaged over the generated problems and plots of both reward over time and percentage of optimal action selected over time are saved. The same procedure is also done for the non-stationary case, when $q_*(a)$ changes over time.

import numpy as np
import matplotlib.pyplot as plt
import math

def my_argmax(Q_values):
    """
    Returns the index of the maximum value in Q_values.
    In case of ties, one of the tying indices is selected uniformly at random and returned.
    """
    max_val = -math.inf
    ties = []
    for i in range(len(Q_values)):
        Q_val = Q_values[i]
        if Q_val > max_val:
            max_val = Q_val
            ties = [i]
        elif Q_val == max_val:
            ties.append(i)

    return np.random.choice(ties)

def run_epsilon_greedy(q, num_steps, epsilon, const_alpha=None, Q_0=0, stationary=True):
    """
    Run a single run of the epsilon greedy algorithm.
    Returns the rewards and an array where 1 means the optimal action was selected and 0 means it wasn't.
    """
    R = np.zeros((num_steps,))
    optimal = np.zeros((num_steps,))
    Q = np.ones(np.shape(q)) * Q_0
    N = np.zeros(np.shape(q))

    # Run steps
    for t in range(0, num_steps):
        if not stationary:
            q = q + np.random.normal(scale=0.05, size=np.shape(q))

        # Choose action greedily by default and randomly by probability epsilon
        a = my_argmax(Q)
        if np.random.uniform(low=0.0, high=1.0) < epsilon:
            # np.random.randint excludes `high`, so use len(Q) to make every action selectable
            a = np.random.randint(low=0, high=len(Q))
        N[a] += 1
        if a == np.argmax(q):
            optimal[t] = 1

        # Get reward
        R[t] = np.random.normal(loc=q[a])

        # Update Q
        alpha = const_alpha
        if const_alpha is None:
            alpha = 1 / N[a]
        Q[a] = Q[a] + alpha * (R[t] - Q[a])

    return R, optimal

def run_experiments(parameters, num_runs, num_steps, k, stationary=True):
    """
    Run multiple runs of epsilon greedy.
    Returns the averaged results and optimal action selections.
    """
    bandits = np.random.normal(loc=0, scale=1, size=(num_runs, k))
    R = np.zeros((len(parameters), num_runs, num_steps))
    optimal = np.zeros(np.shape(R))

    for i in range(len(parameters)):
        for j in range(num_runs):
            if j % 100 == 0:
                print("Parameter set", i, "- Starting run", j)
            q = np.copy(bandits[j])
            epsilon = parameters[i]["epsilon"]
            const_alpha = parameters[i]["const_alpha"]
            Q_0 = parameters[i]["Q_0"]
            R[i, j], optimal[i, j] = run_epsilon_greedy(q, num_steps, epsilon, const_alpha, Q_0, stationary)

    return np.mean(R, axis=1), np.mean(optimal, axis=1)

def plot_subplot(y, y_label, parameters, title=None):
    """Plot a subplot of results or optimal action selections."""
    num_plots = np.shape(y)[0]
    x = np.arange(0, np.shape(y)[1])

    # Build one legend entry per parameter set (passed in explicitly instead of read from a global)
    for i in range(num_plots):
        epsilon = parameters[i]["epsilon"]
        const_alpha = parameters[i]["const_alpha"]
        Q_0 = parameters[i]["Q_0"]
        alpha = const_alpha if const_alpha is not None else "1/n"
        label = (
            "epsilon=" + str(epsilon) +
            ", alpha=" + str(alpha) +
            ", Q_0=" + str(Q_0)
        )
        plt.plot(x, y[i], label=label)

    plt.legend()
    plt.xlabel("Step")
    plt.ylabel(y_label)
    if title is not None:
        plt.title(title)

def plot_results(file_name, R_avg, optimal_avg, parameters, title):
    """Plot the averaged results and optimal action selections."""
    plt.figure(figsize=(10, 8))
    plt.subplot(2, 1, 1)
    plot_subplot(R_avg, "Average reward", parameters, title)
    plt.subplot(2, 1, 2)
    plot_subplot(optimal_avg * 100, "Optimal action %", parameters)
    plt.yticks([0, 20, 40, 60, 80, 100], labels=["0%", "20%", "40%", "60%", "80%", "100%"])
    plt.tight_layout()
    plt.savefig(file_name)

if __name__ == "__main__":
    parameters = [
        { "epsilon": 0.1, "const_alpha": None, "Q_0": 0 },
        { "epsilon": 0.1, "const_alpha": 0.1, "Q_0": 0 },
        { "epsilon": 0.01, "const_alpha": 0.1, "Q_0": 0 },
        { "epsilon": 0, "const_alpha": 0.1, "Q_0": 0 },
        { "epsilon": 0, "const_alpha": 0.1, "Q_0": 5 },
    ]
    num_runs = 2000
    num_steps = 10000
    k = 10
    R_avg, optimal_avg = run_experiments(parameters, num_runs, num_steps, k)
    plot_results("results_stationary.png", R_avg, optimal_avg, parameters, "Stationary")
    R_avg, optimal_avg = run_experiments(parameters, num_runs, num_steps, k, stationary=False)
    plot_results("results_non-stationary.png", R_avg, optimal_avg, parameters, "Non-stationary")

Plots for the stationary problems:

The greedy method with estimates initialized to 0 ($Q_0=0$) performed the worst. It can be hard to see in the reward plot, but it performed slightly worse than the $\epsilon$-greedy methods with $\epsilon=0.1$. Of the $\epsilon$-greedy methods with $\epsilon=0.1$, the one with $\alpha=1/N(a)$ performed slightly better than the one with a constant $\alpha=0.1$. The $\epsilon$-greedy method with $\epsilon=0.01$ improves slowly since it explores only 1% of the time. Eventually it learns good enough estimates and overtakes the other $\epsilon$-greedy methods since it selects sub-optimal actions less often.

The optimal action percentage plot is easier to read and it clearly shows why the greedy method with $Q_0=0$ performed the worst. It does not explore at all and often misses the optimal action. However, the greedy method with $Q_0=5$ actually performed the best. The reason is that the high initial estimates force the agent to try out all actions in the beginning, so it learns good value estimates. After value estimates are learned, the purely greedy method has the advantage of never selecting actions with lower value estimates. This trick is called optimistic initial values and a similar effect can be achieved by an $\epsilon$-greedy method with a large initial $\epsilon$ that is reduced to 0 after some time.

Plots for the non-stationary problems:

This time the greedy method with $Q_0=5$ did not perform as well. It starts out by learning good value estimates, but can’t adapt to the changing values. This can be seen in the decreasing optimal action percentage. The $\epsilon$-greedy method with $\alpha=1/N(a)$ performed much worse on the non-stationary version, as predicted. Since the updates to the value estimates get smaller and smaller, it can’t adapt to the changing values.

There are many other action-value methods, such as methods using Upper-Confidence-Bound action selection (UCB). When estimating values there are varying degrees of uncertainty, depending on how often an action has been taken. UCB quantifies this uncertainty as confidence bounds around value estimates and uses these when selecting actions. There are also methods that don’t use value estimates, such as gradient bandit algorithms.
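
For reference, the UCB rule in Sutton and Barto selects $A_t = \arg\max_a \big[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \big]$, where $c > 0$ controls the amount of exploration and an action with $N_t(a)=0$ is considered maximizing. Below is a minimal sketch of that rule. The function `ucb_select`, its arguments, and the example numbers are illustrative and not part of the experiment code above.

```python
import math

def ucb_select(Q, N, t, c=2.0):
    """Return the index maximizing Q[a] + c * sqrt(ln(t) / N[a]).

    Untried actions (N[a] == 0) are selected first. Ties go to the lowest index.
    """
    best_a, best_score = 0, -math.inf
    for a in range(len(Q)):
        if N[a] == 0:
            return a
        # The bonus shrinks as an action is tried more often and grows slowly with t
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a

Q = [1.0, 0.9, 0.2]   # made-up value estimates
N = [50, 2, 48]       # made-up selection counts
chosen = ucb_select(Q, N, t=100)
print(chosen)
```

In this example the second action is chosen even though its estimate is not the highest, because it has been tried only twice and therefore has a wide confidence bound.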

Conclusion

This has been a short introduction to multi-armed bandits. The model is simple but can be used for modelling repeated decision problems in the real world. An example of such a problem could be clinical trials where the effects of different drugs on patients with the same illness are compared. For each patient it has to be decided which drug to treat them with. The reward could be 1 if the patient recovers and 0 otherwise. Multi-armed bandits are also useful for introducing important concepts in RL, such as value estimation and the exploration vs exploitation tradeoff.

Thank you for reading and feel free to send me any questions.

Conditional Generative Adversarial Networks

Tue, 22 Jan 2019 00:00:00 +0000

Introduction

A Generative Adversarial Network (GAN) takes noise as input and generates data that resembles examples from the training set. A Conditional Generative Adversarial Network (CGAN) is a type of GAN that conditions on additional information. The additional information makes it possible to have more control over the data generation. CGANs were introduced in Conditional Generative Adversarial Nets (Mirza, Osindero, 2014).

A GAN can be conditioned on many types of information. For example, if you have a training set of animal images and their respective class labels, you can condition the model on animal class labels to be able to generate specific animals. Another example is to condition on images in order to do image-to-image translation, such as in Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, Efros, 2017).

In this post we are going to use TensorFlow to implement a CGAN for generating specific digits that look handwritten. The dataset we are going to use is the MNIST database of handwritten digits. In my last post I wrote about DCGANs and generated random handwritten digits, but in this post we are going to condition the GAN on digit labels in order to generate specific digits.

The source code can be found at https://github.com/CarlFredriksson/cgan_tensorflow.

Theory

For an introduction to the theory behind GANs, see my previous post Generative Adversarial Networks. The only thing that has changed is that the generator $G$ and the discriminator $D$ now condition on additional information $y$.

One training step consists of:

Sampling a mini batch of $m$ noise vectors $\{z^{(1)},\dots,z^{(m)}\}$
Sampling a mini batch of $m$ training examples $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$
Updating $D$ by doing one gradient descent step on its loss function:

$$ J_D = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(x^{(i)}, y^{(i)}\big) + \log\big(1 - D\big(G\big(z^{(i)},y^{(i)}\big),y^{(i)}\big)\big)\Big] $$

Updating $G$ by doing one gradient descent step on its loss function:

$$ J_G = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(G\big(z^{(i)},y^{(i)}\big),y^{(i)}\big)\Big] $$
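
To make the two loss functions concrete, here is a small worked example in plain Python. The discriminator outputs are made-up numbers for a mini batch of size $m=2$, and the function names are chosen for this sketch only.

```python
import math

def discriminator_loss(d_real, d_fake):
    """J_D: mean of -[log D(x, y) + log(1 - D(G(z, y), y))] over the mini batch."""
    m = len(d_real)
    return -sum(math.log(r) + math.log(1 - f) for r, f in zip(d_real, d_fake)) / m

def generator_loss(d_fake):
    """J_G: mean of -log D(G(z, y), y) over the mini batch."""
    return -sum(math.log(f) for f in d_fake) / len(d_fake)

# Made-up discriminator outputs for a mini batch of m = 2
d_real = [0.9, 0.8]  # D(x, y) on real examples
d_fake = [0.1, 0.3]  # D(G(z, y), y) on generated examples

J_D = discriminator_loss(d_real, d_fake)
J_G = generator_loss(d_fake)
print(J_D, J_G)
```

Here the discriminator is doing well (high outputs on real examples, low outputs on fakes), so $J_D$ is small while $J_G$ is large. If the generator fooled the discriminator more often, `d_fake` would move towards 1 and $J_G$ would shrink.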

Implementation

Import Packages

These are the packages we need to import.

import math
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from tensorflow.keras.datasets import mnist
from sklearn.utils import shuffle

Load MNIST Data

You could download the dataset from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import it from tensorflow.keras.datasets. Note that in the GAN post we didn’t need to store the labels, but this time we are going to condition on them.

X_train, Y_train = utils.load_mnist_data()

def load_mnist_data():
    (X_train, Y_train), _ = mnist.load_data()
    return X_train, Y_train

In order to visualize the data we can plot some examples.

NUM_DIGITS = 10
SAMPLE_SIZE = NUM_DIGITS**2

utils.plot_sample(X_train[:SAMPLE_SIZE], "output/mnist_data.png")

def plot_sample(sample, path):
    n = int(np.sqrt(sample.shape[0]))
    fig = plt.figure(figsize=(8, 8))
    for i in range(n*n):
        ax = plt.subplot(n, n, i + 1)
        ax.imshow(sample[i], cmap=plt.get_cmap("gray"))
        ax.axis("off")
        ax.set_xticklabels([])
        ax.set_yticklabels([])
    fig.subplots_adjust(hspace=0.025, wspace=0.025)
    plt.savefig(path, bbox_inches="tight")
    plt.clf()
    plt.close()

Preprocessing

The generator is going to have a tanh activation at the output layer, so its outputs lie between -1 and 1. To match this, we rescale the pixel values of the training images from the range $[0,255]$ to $[-1,1]$. We also add a channel dimension of depth 1 since the images are grayscale.

X_train = utils.preprocess_images(X_train)

def preprocess_images(X):
    X = X / 255
    X = X - 0.5
    X = X * 2
    X = np.expand_dims(X, axis=-1)
    return X

The labels also need some preprocessing. We will convert them to one-hot vectors and add some dimensions in order to prepare them for convolutional layers. Note that dimensions are not added in the baseline model.

Y_train = preprocess_labels(Y_train)

def preprocess_labels(Y):
    Y = utils.convert_to_one_hot(Y)
    Y = np.expand_dims(Y, axis=1)
    Y = np.expand_dims(Y, axis=1)
    return Y
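
The helper `utils.convert_to_one_hot` is not shown in this post. The sketch below uses a plausible NumPy stand-in for it, which is an assumption and not necessarily what the repository does, to show the shapes that `preprocess_labels` produces.

```python
import numpy as np

def convert_to_one_hot(Y, num_classes=10):
    """Hypothetical stand-in for utils.convert_to_one_hot: integer labels -> one-hot rows."""
    return np.eye(num_classes)[Y]

def preprocess_labels(Y):
    Y = convert_to_one_hot(Y)
    # Add two singleton spatial dimensions: (m, 10) -> (m, 1, 1, 10)
    Y = np.expand_dims(Y, axis=1)
    Y = np.expand_dims(Y, axis=1)
    return Y

labels = np.array([3, 0, 9])
one_hot = preprocess_labels(labels)
print(one_hot.shape)
```

The resulting shape `(m, 1, 1, 10)` lines up with the noise shape `(m, 1, 1, 100)`, which is what allows the two to be concatenated along the channel axis in the convolutional generator.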

Postprocessing

Later we are going to plot samples of generated images and we will need a function that is the inverse of the preprocess function.

def postprocess_images(X):
    X = np.squeeze(X, axis=-1)
    X = X / 2
    X = X + 0.5
    X = X * 255
    return X
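
A quick way to convince yourself that the two functions are inverses is a round trip on random data. The random array below is only a stand-in for a batch of MNIST images.

```python
import numpy as np

def preprocess_images(X):
    X = X / 255
    X = X - 0.5
    X = X * 2
    return np.expand_dims(X, axis=-1)

def postprocess_images(X):
    X = np.squeeze(X, axis=-1)
    X = X / 2
    X = X + 0.5
    return X * 255

# Stand-in for a batch of two 28x28 grayscale images with values in [0, 255]
images = np.random.randint(0, 256, size=(2, 28, 28)).astype(np.uint8)
scaled = preprocess_images(images)
restored = postprocess_images(scaled)

print(scaled.shape, scaled.min(), scaled.max())
print(np.allclose(restored, images))
```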

Partition Training Data into Mini Batches

The training data is randomly shuffled and partitioned into mini batches.

BATCH_SIZE = 128

mini_batches = utils.random_mini_batches(X_train, Y_train, BATCH_SIZE)

def random_mini_batches(X_train, Y_train, batch_size):
    mini_batches = []
    m = X_train.shape[0]
    X_train, Y_train = shuffle(X_train, Y_train)

    # Partition into mini-batches
    num_complete_batches = math.floor(m / batch_size)
    for i in range(num_complete_batches):
        startIndex = i * batch_size
        endIndex = (i + 1) * batch_size
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))

    # Handling the case that the last mini-batch < batch_size
    if m % batch_size != 0:
        startIndex = num_complete_batches * batch_size
        endIndex = m
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))

    return mini_batches
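
To see what this partitioning yields for MNIST, the sketch below repeats only the size arithmetic for the 60000 training images and the batch size used in this post. The helper name `batch_sizes` is made up for the example.

```python
import math

def batch_sizes(m, batch_size):
    """Sizes of the mini batches produced for m examples, using the same partition logic."""
    num_complete_batches = math.floor(m / batch_size)
    sizes = [batch_size] * num_complete_batches
    # The leftover examples form one final, smaller mini batch
    if m % batch_size != 0:
        sizes.append(m - num_complete_batches * batch_size)
    return sizes

sizes = batch_sizes(60000, 128)
print(len(sizes), sizes[-1])
```

So each epoch consists of 468 full mini batches plus one final mini batch of 96 examples.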

Baseline Model

Let’s start by creating a simple CGAN model in order to establish baseline performance. The generator is a simple feedforward neural network (FNN) that takes 100 dimensional noise vectors $Z$ and 10 dimensional one-hot vectors $Y$ as input. It outputs images with dimensions $(28,28,1)$ and pixel values in the range $(-1,1)$.

def generator(Z, Y):
    with tf.variable_scope("Generator"):
        x = tf.concat([Z, Y], 1)
        x = tf.layers.dense(x, 128, activation="relu")
        x = tf.layers.dense(x, 784, activation="tanh")
        x = tf.reshape(x, [-1, 28, 28, 1])

    return x

The discriminator is also a simple FNN that takes images $X$ and one-hot vectors $Y$ as input. It outputs single values representing guesses of whether an input image is a real or fake image of the given class $y$.

def discriminator(X, Y, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.layers.flatten(X)
        x = tf.concat([x, Y], 1)
        x = tf.layers.dense(x, 128, activation="relu")
        x = tf.layers.dense(x, 1, activation="sigmoid")

    return x

If you want the rest of the code for the baseline model check out the source code. After training the CGAN for 100 epochs I got the following result.

CDCGAN Model

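Let’s now create a deep convolutional CGAN (CDCGAN) model which is better suited for our problem. $Y_{fill}$ will contain the same labels as in $Y$, with each one-hot vector copied to every pixel position so that it fills a shape of $(28,28,10)$ per example. This is done so that the labels can be concatenated with the images along the channel axis, which fits the additional information into the purely convolutional architecture of the discriminator.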
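
The filling can be done with NumPy broadcasting, which is also how the training loop later in this post builds its `Y_fill_batch`. A small standalone sketch with two made-up labels:

```python
import numpy as np

m, height, width, num_classes = 2, 28, 28, 10
labels = np.array([3, 7])

# One-hot labels with shape (m, 1, 1, num_classes), as fed to the generator
Y = np.eye(num_classes)[labels].reshape(m, 1, 1, num_classes)

# Broadcasting against an array of ones copies each one-hot vector to every pixel position
Y_fill = Y * np.ones((m, height, width, num_classes))

print(Y_fill.shape)
print(Y_fill[0, :, :, 3].min(), Y_fill[0, :, :, 7].max())
```

For the first example (label 3), channel 3 of `Y_fill` is 1 everywhere and all other channels are 0, so after concatenation every spatial location the discriminator looks at carries the label.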

NOISE_DIM = 100

X = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], X_train.shape[3]))
Y = tf.placeholder(tf.float32, shape=(None, 1, 1, NUM_DIGITS))
Y_fill = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], NUM_DIGITS))
Z = tf.placeholder(tf.float32, shape=(None, 1, 1, NOISE_DIM))
is_training = tf.placeholder(tf.bool, shape=())
G, D_real, D_fake = create_cdcgan(X, Y, Y_fill, Z, is_training)

def create_cdcgan(X, Y, Y_fill, Z, is_training):
    G = generator(Z, Y, is_training)
    D_real = discriminator(X, Y_fill, is_training)
    D_fake = discriminator(G, Y_fill, is_training, reuse=True)

    return G, D_real, D_fake

The generator has the same architecture as in the DCGANs post, except for the additional input $Y$.

def generator(Z, Y, is_training):
    with tf.variable_scope("Generator"):
        x = tf.concat([Z, Y], 3)
        # x.shape: (?, 1, 1, 110)

        x = tf.layers.conv2d_transpose(x, 256, 7, strides=1, padding="valid", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d_transpose(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d_transpose(x, 1, 5, strides=2, padding="same")
        x = tf.nn.tanh(x)
        # x.shape: (?, 28, 28, 1)

    return x

The discriminator also has the same architecture as in the DCGANs post, except for the additional input $Y_{fill}$.

def discriminator(X, Y_fill, is_training, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.concat([X, Y_fill], 3)
        # x.shape: (?, 28, 28, 11)

        x = tf.layers.conv2d(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d(x, 256, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d(x, 1, 7, strides=1, padding="valid")
        x = tf.nn.sigmoid(x)
        # x.shape: (?, 1, 1, 1)

    return x

Create Training Steps

We will use the same optimizers as in the DCGANs post.

LEARNING_RATE = 0.0002
BETA1 = 0.5

G_loss_func, D_loss_func = utils.create_loss_funcs(D_real, D_fake)
G_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Generator")
D_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Discriminator")
G_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(G_loss_func, var_list=G_vars)
D_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(D_loss_func, var_list=D_vars)

def create_loss_funcs(D_real, D_fake):
    eps = 1e-12
    G_loss_func = tf.reduce_mean(-tf.log(D_fake + eps))
    D_loss_func = tf.reduce_mean(-(tf.log(D_real + eps) + tf.log(1 - D_fake + eps)))

    return G_loss_func, D_loss_func

Train Model

I trained the model for 20 epochs and plotted a sample of generated images after each epoch.

NUM_EPOCHS = 20

# Start session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training loop
    for epoch in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in mini_batches:
            Z_batch = utils.generate_Z_batch((X_batch.shape[0], 1, 1, NOISE_DIM))
            Y_fill_batch = Y_batch * np.ones((X_batch.shape[0], Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))

            # Compute losses
            G_loss, D_loss = sess.run([G_loss_func, D_loss_func], feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            print("Epoch [{0}/{1}] - G_loss: {2}, D_loss: {3}".format(epoch, NUM_EPOCHS - 1, G_loss, D_loss))

            # Run training steps
            _ = sess.run(G_train_step, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            _ = sess.run(D_train_step, feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})

        # Plot generated images
        Y_batch = np.ones((NUM_DIGITS, NUM_DIGITS))
        Y_batch = (Y_batch * np.arange(NUM_DIGITS)).T.reshape((SAMPLE_SIZE, )).astype(int)
        Y_batch = preprocess_labels(Y_batch)
        Y_fill_batch = Y_batch * np.ones((SAMPLE_SIZE, Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))
        Z_batch = utils.generate_Z_batch((SAMPLE_SIZE, 1, 1, NOISE_DIM))
        gen_imgs = sess.run(G, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
        gen_imgs = utils.postprocess_images(gen_imgs)
        utils.plot_sample(gen_imgs, "output/cdcgan/cdcgan_gen_data_" + str(epoch) + ".png")

def generate_Z_batch(size):
    return np.random.uniform(low=-1, high=1, size=size)

Images generated after the 20th epoch:

Conclusion

It was interesting to see that the results were considerably better for the conditioned version of the DCGAN compared to the unconditioned one in the DCGANs post. This is probably because the additional information forces the generator and discriminator to be more specific. The unconditioned DCGAN generated some images that looked like combinations of two digits, which barely happened with the conditioned version.

Thank you for reading and feel free to send me any questions.

Deep Convolutional Generative Adversarial Networks

Sun, 16 Dec 2018 00:00:00 +0000

Introduction

I recently wrote a post about Generative Adversarial Networks (GANs). A Deep Convolutional Generative Adversarial Network (DCGAN) is a type of GAN where the generator is a deep neural network utilizing transpose convolutions and the discriminator is a deep convolutional neural network (CNN). DCGANs were introduced in the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (Radford, Metz, & Chintala, 2016).

In this post I’m going to show how you can use TensorFlow to implement a DCGAN for generating digits that look handwritten. The dataset I used is the MNIST database of handwritten digits, which I also used in my post about Digit Recognition.

For an introduction to the theory behind DCGANs I refer you to my previous post about Generative Adversarial Networks. The competition between the generator and discriminator remains the same. The generator takes noise as input and generates fake images, and the discriminator tries to distinguish fake images from real images in the training set. We don’t have to change the loss functions, which is a great feature of GANs. The only difference with DCGANs is how the generator and discriminator are implemented.

The source code can be found at https://github.com/CarlFredriksson/dcgan_tensorflow.

Implementation

Import Packages

These are the packages we need to import.

import math
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from tensorflow.keras.datasets import mnist

Load MNIST Data

You could download the dataset from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import it from tensorflow.keras.datasets. Since we are doing unsupervised learning we don’t need to store the labels.

X_train = utils.load_mnist_data()

def load_mnist_data():
    (X_train, _), _ = mnist.load_data()
    return X_train

In order to visualize the data we can plot some examples.

SAMPLE_SIZE = 100

utils.plot_sample(X_train[:SAMPLE_SIZE], "output/mnist_data.png")

def plot_sample(sample, path):
    n = int(np.sqrt(sample.shape[0]))
    fig = plt.figure(figsize=(8, 8))
    for i in range(n*n):
        ax = plt.subplot(n, n, i + 1)
        ax.imshow(sample[i], cmap=plt.get_cmap("gray"))
        ax.axis("off")
        ax.set_xticklabels([])
        ax.set_yticklabels([])
    fig.subplots_adjust(hspace=0.025, wspace=0.025)
    plt.savefig(path, bbox_inches="tight")
    plt.clf()
    plt.close()

Preprocessing

The generator is going to have a tanh activation at the output layer, so we rescale the pixel values of the training images from the range $[0,255]$ to $[-1,1]$. We also add a channel dimension of depth 1 since the images are grayscale.

X_train = utils.preprocess_images(X_train)

def preprocess_images(X):
    X = X / 255
    X = X - 0.5
    X = X * 2
    X = np.expand_dims(X, axis=-1)
    return X

Postprocessing

Later we are going to plot samples of generated images and we will need a function that is the inverse of the preprocess function.

def postprocess_images(X):
    X = np.squeeze(X, axis=-1)
    X = X / 2
    X = X + 0.5
    X = X * 255
    return X

Partition Training Data into Mini Batches

The training data is randomly shuffled and partitioned into mini batches.

BATCH_SIZE = 128

mini_batches = utils.random_mini_batches(X_train, BATCH_SIZE)

def random_mini_batches(X, batch_size):
    mini_batches = []
    m = X.shape[0]
    np.random.shuffle(X)

    # Partition into mini-batches
    num_complete_batches = math.floor(m / batch_size)
    for i in range(num_complete_batches):
        batch = X[i * batch_size : (i + 1) * batch_size]
        mini_batches.append(batch)

    # Handling the case that the last mini-batch < batch_size
    if m % batch_size != 0:
        batch = X[num_complete_batches * batch_size : m]
        mini_batches.append(batch)

    return mini_batches

Baseline Model

Let’s start by creating a simple GAN model in order to establish baseline performance. The generator is a simple feedforward neural network (FNN) that takes 100 dimensional noise vectors $Z$ as input, and outputs images with dimensions $(28,28,1)$ and pixel values in the range $(-1,1)$. The discriminator is also a simple FNN that takes images of dimensions $(28,28,1)$ as input and outputs single values representing guesses of whether an input image is real or fake.

def generator(Z):
    with tf.variable_scope("Generator"):
        x = tf.layers.dense(Z, 128, activation="relu")
        x = tf.layers.dense(x, 784, activation="tanh")
        x = tf.reshape(x, [-1, 28, 28, 1])
    
    return x

def discriminator(X, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.layers.flatten(X)
        x = tf.layers.dense(x, 128, activation="relu")
        x = tf.layers.dense(x, 1, activation="sigmoid")

    return x

If you want the rest of the code for the baseline model check out the source code. After training the GAN for 100 epochs I got the following result.

DCGAN Model

Let’s now create a DCGAN model which is better suited for our problem.

Create Generator

The generator takes 100 dimensional noise vectors $Z$ as input and outputs images with dimensions $(28,28,1)$ and pixel values in the range $(-1,1)$. The network architecture is heavily inspired by the architecture guidelines suggested in the original DCGAN paper (Radford, Metz, & Chintala, 2016).

We will use transpose convolutions as a way to do upsampling in a learnable way. Traditional methods of upsampling like nearest-neighbor interpolation or bilinear interpolation are methods without learnable parameters. For generating images we want the network to be able to learn its own upsampling method in order to be able to create convincing fake images.

def generator(Z, is_training):
    with tf.variable_scope("Generator"):
        x = Z
        # x.shape: (?, 1, 1, 100)

        x = tf.layers.conv2d_transpose(x, 256, 7, strides=1, padding="valid", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d_transpose(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d_transpose(x, 1, 5, strides=2, padding="same")
        x = tf.nn.tanh(x)
        # x.shape: (?, 28, 28, 1)

    return x
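
The shape comments in the generator can be checked with a little arithmetic. Under TensorFlow's padding conventions, a transpose convolution with "same" padding multiplies the spatial size by the stride, and with "valid" padding the output size is `input * stride + max(kernel - stride, 0)`. The helper below is written for this post as a sketch and is not a TensorFlow function.

```python
def transpose_conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a transpose convolution under TensorFlow's padding conventions."""
    if padding == "same":
        return input_size * stride
    if padding == "valid":
        return input_size * stride + max(kernel_size - stride, 0)
    raise ValueError("padding must be 'same' or 'valid'")

# (kernel_size, stride, padding) for the three transpose convolutions in the generator
layers = [(7, 1, "valid"), (5, 2, "same"), (5, 2, "same")]

size = 1  # the noise has spatial size 1x1
sizes = [size]
for kernel_size, stride, padding in layers:
    size = transpose_conv_output_size(size, kernel_size, stride, padding)
    sizes.append(size)

print(sizes)
```

The sizes go from 1 to 7 to 14 to 28, matching the comments in the code above.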

Create Discriminator

The discriminator is a CNN that takes images of dimensions $(28,28,1)$ as input and outputs single values representing guesses of whether an input image is real or fake. The network looks like a normal CNN for binary image classification with a few modifications. Most notably, we use strided convolutions instead of pooling layers for downsampling. This is so the network can learn its own downsampling, similarly to the upsampling in the generator.

def discriminator(X, is_training, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = X
        # x.shape: (?, 28, 28, 1)

        x = tf.layers.conv2d(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d(x, 256, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d(x, 1, 7, strides=1, padding="valid")
        x = tf.nn.sigmoid(x)
        # x.shape: (?, 1, 1, 1)

    return x
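
The discriminator mirrors the generator, and its shape comments can be verified the same way. For a regular convolution in TensorFlow, "same" padding gives `ceil(input / stride)` and "valid" padding gives `ceil((input - kernel + 1) / stride)`. As before, the helper is a sketch for this post only.

```python
import math

def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution under TensorFlow's padding conventions."""
    if padding == "same":
        return math.ceil(input_size / stride)
    if padding == "valid":
        return math.ceil((input_size - kernel_size + 1) / stride)
    raise ValueError("padding must be 'same' or 'valid'")

# (kernel_size, stride, padding) for the three convolutions in the discriminator
layers = [(5, 2, "same"), (5, 2, "same"), (7, 1, "valid")]

size = 28  # MNIST images are 28x28
sizes = [size]
for kernel_size, stride, padding in layers:
    size = conv_output_size(size, kernel_size, stride, padding)
    sizes.append(size)

print(sizes)
```

The final 7x7 "valid" convolution collapses the 7x7 feature map to a single value per image, which plays the role a dense output layer would play in an ordinary classifier.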

Create DCGAN

The overall structure is the same as in a regular GAN. The generator $G$ generates fake examples from noise. The discriminator $D$ takes both real examples from training data and fake examples from $G$ as input, and outputs guesses on which examples are real and which are fake.

Note that I used the placeholder is_training mostly out of habit. It’s going to be set to true for both training and generation of images. When doing supervised learning there is a clear distinction between training a model and using it for inference. When batch normalization layers are applied during training, the normalization step is done using mean and variance of the current mini-batch. When doing inference, the normalization step uses population mean and population variance instead, which are statistics estimated during training using moving mean and variance.

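Since we are generating new data from noise and not doing any inference, there is no need to change the batch norm layers from training mode to inference mode. I tried both and found that keeping training mode on worked the best. When images were generated with inference mode on (is_training set to false) they were slightly blurry, and more importantly the generator produced the same digit (9) in most cases. One likely contributor is that tf.layers.batch_normalization only updates its moving mean and variance when the ops in tf.GraphKeys.UPDATE_OPS are run together with the training steps, which the code in this post does not do, so inference mode would normalize with statistics that were never updated.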
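
The difference between the two modes can be illustrated with a simplified NumPy version of batch normalization. This is a sketch under stated assumptions: it normalizes over the batch axis only, leaves out the learnable scale and offset, and uses momentum 0.99 and epsilon 0.001, the defaults of tf.layers.batch_normalization. The data is random and made up.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, training, momentum=0.99, eps=1e-3):
    """Simplified batch normalization over axis 0, without learnable scale and offset.

    Returns the normalized batch and the (possibly updated) moving statistics.
    """
    if training:
        # Training mode: normalize with the statistics of the current mini batch
        mean, var = x.mean(axis=0), x.var(axis=0)
        moving_mean = momentum * moving_mean + (1 - momentum) * mean
        moving_var = momentum * moving_var + (1 - momentum) * var
    else:
        # Inference mode: normalize with the moving statistics collected so far
        mean, var = moving_mean, moving_var
    return (x - mean) / np.sqrt(var + eps), moving_mean, moving_var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 4))
moving_mean, moving_var = np.zeros(4), np.ones(4)

train_out, moving_mean, moving_var = batch_norm(x, moving_mean, moving_var, training=True)
infer_out, _, _ = batch_norm(x, moving_mean, moving_var, training=False)

print(train_out.mean(axis=0))
print(infer_out.mean(axis=0))
```

In training mode the output is centered around zero by construction. In inference mode the result depends entirely on how well the moving statistics match the data, and after a single update they are still close to their initial values, so the output is far from centered.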

NOISE_DIM = 100

X = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], X_train.shape[3]))
Z = tf.placeholder(tf.float32, [None, 1, 1, NOISE_DIM])
is_training = tf.placeholder(tf.bool, shape=())
G, D_real, D_fake = create_gan(X, Z, is_training)

def create_gan(X, Z, is_training):
    G = generator(Z, is_training)
    D_real = discriminator(X, is_training)
    D_fake = discriminator(G, is_training, reuse=True)
    
    return G, D_real, D_fake

Create Training Steps

Just like in a regular GAN we will use separate training steps for the generator and discriminator. We will also use the same loss functions as in a regular GAN. As suggested in the DCGAN paper, we will use Adam optimizers with parameters $\beta_1$ and learning rate set to 0.5 and 0.0002 respectively.

BETA1 = 0.5
LEARNING_RATE = 0.0002

G_loss_func, D_loss_func = utils.create_loss_funcs(D_real, D_fake)
G_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Generator")
D_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Discriminator")
G_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(G_loss_func, var_list=G_vars)
D_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(D_loss_func, var_list=D_vars)

def create_loss_funcs(D_real, D_fake):
    eps = 1e-12
    G_loss_func = tf.reduce_mean(-tf.log(D_fake + eps))
    D_loss_func = tf.reduce_mean(-(tf.log(D_real + eps) + tf.log(1 - D_fake + eps)))

    return G_loss_func, D_loss_func

Train Model

I trained the model for 20 epochs and plotted a sample of generated images after each epoch.

NUM_EPOCHS = 20

# Start session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training loop
    for epoch in range(NUM_EPOCHS):
        for X_batch in mini_batches:
            Z_batch = utils.generate_Z_batch((X_batch.shape[0], 1, 1, NOISE_DIM))

            # Compute losses
            G_loss, D_loss = sess.run([G_loss_func, D_loss_func], feed_dict={X: X_batch, Z: Z_batch, is_training: True})
            print("Epoch [{0}/{1}] - G_loss: {2}, D_loss: {3}".format(epoch, NUM_EPOCHS - 1, G_loss, D_loss))

            # Run training steps
            _ = sess.run(G_train_step, feed_dict={Z: Z_batch, is_training: True})
            _ = sess.run(D_train_step, feed_dict={X: X_batch, Z: Z_batch, is_training: True})

        # Plot generated images
        Z_batch = utils.generate_Z_batch((SAMPLE_SIZE, 1, 1, NOISE_DIM))
        gen_imgs = sess.run(G, feed_dict={Z: Z_batch, is_training: True})
        gen_imgs = utils.postprocess_images(gen_imgs)
        utils.plot_sample(gen_imgs, "output/dcgan/dcgan_gen_data_" + str(epoch) + ".png")

def generate_Z_batch(size):
    return np.random.uniform(low=-1, high=1, size=size)

Images generated after the 20th epoch:

Conclusion

This was a fun project and I’m happy about how it turned out. Not all of the generated images could fool a human, but most of them look like they could be handwritten digits to me. Although generating handwritten digits is not the most exciting objective, it serves as a good introductory task for learning how to implement and tune DCGANs.

Thank you for reading and feel free to send me any questions.

Generative Adversarial Networks

Mon, 12 Nov 2018 00:00:00 +0000

Introduction

Generative Adversarial Networks (GANs) have been getting a lot of attention recently. They have been used for many different tasks, such as generating anime characters (Jin et al., 2017, https://arxiv.org/abs/1708.05509), generating photos of fake celebrities (Karras, Aila, Laine, & Lehtinen, 2017, https://arxiv.org/abs/1710.10196), increasing resolution in photos (Ledig et al., 2016, https://arxiv.org/abs/1609.04802), and general image-to-image translation (Isola, Zhu, Zhou, & Efros, 2017, https://arxiv.org/abs/1611.07004).

GANs were introduced in (Goodfellow et al., 2014, https://arxiv.org/abs/1406.2661) as an unsupervised learning model, trained on data without labels. The model learns to take noise as input and output data that resembles examples from the training set. There are versions of GANs that can utilize additional information as input, such as conditional GANs (CGANs), but in this post I’m going to focus on the basic version.

I’m going to explain some of the theory behind GANs and then show how you can implement a simple version in TensorFlow. The source code can be found at https://github.com/CarlFredriksson/gan_tensorflow.

Theory

A GAN consists of two parts, a generator $G$ and a discriminator $D$. The objective for $G$ is to take random noise $z$ as input and generate a fake example that looks like it comes from the real training data. The objective for $D$ is to take an example $x$ from the training data or a fake example $G(z)$ as input and output either 0 or 1. If the example is from the training data $D$ should output 1, and if the example is fake $D$ should output 0. In other words, $D$ is a binary classifier on whether the input is real (part of the training data) or fake (generated by $G$).

Note that $G$ is what we are ultimately interested in. After training, $G$ is what we will use in our applications to generate new data. $D$ is there to help train $G$. Training using an adversary in this way is a powerful idea that lets us define simple loss functions for $G$ and $D$ that don’t need to be changed between different tasks. GANs remove the need for handcrafting specific loss functions for many problems.

Minimax Game

$G$ and $D$ have directly competing objectives, and the situation can be modelled as a minimax zero-sum game where $G$ and $D$ are the players. When $G$ learns to create better fakes, it forces $D$ to get better at detecting them and vice versa. The competition makes $G$ and $D$ better and better until a steady state is reached.

An analogy could be a cop vs a criminal making counterfeit money. As the cop gets better at detecting fake bills, the criminal has to learn more complex counterfeiting methods, which in turn forces the cop to get even better, and so on.

The game that $G$ and $D$ play can be modelled by the value function $V(D,G)$:

$$ \underset{G}{min} \thinspace \underset{D}{max} \thinspace V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[log \thinspace D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[log(1 - D(G(z)))] \tag 1 $$

Training

One training step consists of:

Sample a mini batch of $m$ noise vectors $\{z^{(1)},\dots,z^{(m)}\}$
Sample a mini batch of $m$ training examples $\{x^{(1)},\dots,x^{(m)}\}$
Update $D$ by doing one gradient descent step on its loss function:

$$ J_D = - \frac{1}{m} \sum_{i=1}^{m} \Big[log \thinspace D\big(x^{(i)}\big) + log\big(1 - D\big(G\big(z^{(i)}\big)\big)\big)\Big] \tag 2 $$

Update $G$ by doing one gradient descent step on its loss function:

$$ J_G = - \frac{1}{m} \sum_{i=1}^{m} \Big[log \thinspace D\big(G\big(z^{(i)}\big)\big)\Big] \tag 3 $$

I chose to minimize $-log \thinspace D(G(z))$, rather than $log(1 - D(G(z)))$. This is suggested in the original GAN paper to provide stronger gradients for $G$ early in training, while keeping the same fixed point dynamics of $G$ and $D$ as in equation 1. Since the performance of $G$ is poor in the early stages, $D$ can easily distinguish fake examples from real training data, which results in small gradients for $G$. If needed, the imbalance can be combated further by slowing down the learning of $D$. This can be done by running more than one training step for $G$ for every step of $D$ or lowering the learning rate of $D$.
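
The difference in gradient strength can be checked numerically. The sketch below assumes that $D$ ends in a sigmoid, so that $D(G(z)) = \sigma(a)$ for some pre-activation value $a$ (the name $a$ and the helper functions are introduced here for illustration only). It compares the derivative of each generator loss with respect to $a$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    """Derivative of log(1 - sigmoid(a)) with respect to a."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """Derivative of -log(sigmoid(a)) with respect to a."""
    return sigmoid(a) - 1.0

# Very negative a means D is confident that the example is fake
for a in [-6.0, -3.0, 0.0, 3.0]:
    print("a = {0:5.1f}  D(G(z)) = {1:.4f}  saturating: {2:.4f}  non-saturating: {3:.4f}".format(
        a, sigmoid(a), grad_saturating(a), grad_non_saturating(a)))
```

When $D$ confidently rejects a fake ($a$ very negative), the gradient of $log(1 - D(G(z)))$ is close to 0, while the gradient of $-log \thinspace D(G(z))$ is close to -1, so $G$ still receives a useful learning signal.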

Implementation

Let us now implement a GAN. I chose to keep it simple in order to focus on the basic ideas. We are going to generate a two-dimensional dataset using a quadratic function, and implement $G$ and $D$ as small feedforward neural networks (FNNs).

Import Packages

These are the packages we need to import.

import math
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

Generate Data

The data is generated using the function $y=x^2 + 5$. The data contains two-dimensional points without labels that are shuffled and partitioned into mini-batches for training.

data = generate_data()
plot_data(data)
mini_batches = random_mini_batches(data, BATCH_SIZE)

def generate_data(n=1000, scale=10):
    X = scale*(np.random.random_sample(n) - 0.5)
    Y = X**2 + 5

    return np.array([X, Y]).T

def plot_data(data):
    plt.plot(data[:, 0], data[:, 1], "o")
    plt.savefig("output/data.png", bbox_inches="tight")
    plt.clf()

def random_mini_batches(data, batch_size):
    mini_batches = []
    m = data.shape[0]
    np.random.shuffle(data)

    # Partition into mini-batches
    num_complete_batches = math.floor(m / batch_size)
    for i in range(num_complete_batches):
        batch = data[i * batch_size : (i + 1) * batch_size]
        mini_batches.append(batch)

    # Handling the case that the last mini-batch < batch_size
    if m % batch_size != 0:
        batch = data[num_complete_batches * batch_size : m]
        mini_batches.append(batch)

    return mini_batches
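
As a quick sanity check of the partitioning logic, the following sketch computes only the sizes of the mini-batches that result from splitting $m$ examples. The helper batch_sizes is hypothetical and not part of the project code.

```python
import math

def batch_sizes(m, batch_size):
    """Sizes of the mini-batches when m examples are split as in random_mini_batches."""
    sizes = [batch_size] * math.floor(m / batch_size)
    # The last mini-batch holds the leftover examples, if there are any
    if m % batch_size != 0:
        sizes.append(m % batch_size)
    return sizes

sizes = batch_sizes(1000, 32)
print(len(sizes), sizes[0], sizes[-1])  # 32 32 8
```

With 1000 examples and a batch size of 32 there are 31 complete mini-batches and one final mini-batch of 8 examples. Note that np.random.shuffle shuffles the array in place, so the data array passed to random_mini_batches is reordered as well.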

Create Model

It is time to create the GAN. Let’s start by defining two placeholders X and Z. X is the placeholder for batches of training data, and Z is the placeholder for batches of random noise.

X = tf.placeholder(tf.float32, [None, 2])
Z = tf.placeholder(tf.float32, [None, 2])

Let’s create two functions for defining the generator $G$ and the discriminator $D$.

def generator(Z, hidden_sizes=[10, 10]):
    with tf.variable_scope("GAN/Generator"):
        h1 = tf.layers.dense(Z, hidden_sizes[0], activation=tf.nn.leaky_relu)
        h2 = tf.layers.dense(h1, hidden_sizes[1], activation=tf.nn.leaky_relu)
        out = tf.layers.dense(h2, 2)

    return out

def discriminator(X, hidden_sizes=[10, 10], reuse=False):
    with tf.variable_scope("GAN/Discriminator", reuse=reuse):
        h1 = tf.layers.dense(X, hidden_sizes[0], activation=tf.nn.leaky_relu)
        h2 = tf.layers.dense(h1, hidden_sizes[1], activation=tf.nn.leaky_relu)
        out = tf.layers.dense(h2, 1, activation=tf.nn.sigmoid)

    return out

Using our generator and discriminator functions, we can create the GAN model. The tensors we are interested in are the output of $G$, the output of $D$ when real training data is used as input, and the output of $D$ when the input is generated fakes from $G$. The reuse flag is set to true when we call the discriminator function for the second time, since we only want one set of variables for $D$.

gen_out, disc_out_real, disc_out_fake = create_model(X, Z)

def create_model(X, Z):
    gen_out = generator(Z)
    disc_out_real = discriminator(X)
    disc_out_fake = discriminator(gen_out, reuse=True)

    return gen_out, disc_out_real, disc_out_fake

Train Model

To train the model we need to create loss functions, which are defined in equations 2 and 3. Note the addition of the small value EPS in order to avoid taking the log of 0.

EPS = 1e-12

disc_loss = tf.reduce_mean(-(tf.log(disc_out_real + EPS) + tf.log(1 - disc_out_fake + EPS)))
gen_loss = tf.reduce_mean(-tf.log(disc_out_fake + EPS))

Since we are alternating between training $G$ and $D$, we need to create a separate training step for each.

LEARNING_RATE = 0.001

gen_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="GAN/Generator")
disc_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="GAN/Discriminator")

gen_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(gen_loss, var_list=gen_vars)
disc_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE).minimize(disc_loss, var_list=disc_vars)

Now we are ready to train the model. The performance of the generator is plotted every 250th epoch, and some of the results for one training run can be seen below.

NUM_EPOCHS = 5000
BATCH_SIZE = 32

# Start session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training loop
    for epoch in range(NUM_EPOCHS):
        for X_batch in mini_batches:
            Z_batch = generate_Z_batch(X_batch.shape[0])
            _, disc_loss_train = sess.run([disc_train_step, disc_loss], feed_dict={X: X_batch, Z: Z_batch})
            _, gen_loss_train = sess.run([gen_train_step, gen_loss], feed_dict={Z: Z_batch})
        if (epoch % 250) == 0:
            print("Epoch " + str(epoch) + " - plotting generated data")
            plot_generated_data(data, lambda Z_batch: sess.run(gen_out, feed_dict={Z: Z_batch}), epoch)

    print("Training finished - plotting generated data")
    plot_generated_data(data, lambda Z_batch: sess.run(gen_out, feed_dict={Z: Z_batch}), NUM_EPOCHS - 1)

def plot_generated_data(data, gen_func, epoch):
    Z_batch = generate_Z_batch(data.shape[0])
    gen_data = gen_func(Z_batch)
    plt.plot(data[:, 0], data[:, 1], "o")
    plt.plot(gen_data[:, 0], gen_data[:, 1], "o")
    plt.title("Epoch " + str(epoch))
    plt.savefig("output/gen_data_" + str(epoch) + ".png", bbox_inches="tight")
    plt.clf()
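
The helper generate_Z_batch is called in the training loop but not shown in this post. A minimal sketch could look like the following, under the assumption that the noise is drawn uniformly from [-1, 1) (as in the DCGAN post) and has two dimensions to match the Z placeholder. The actual implementation in the repository may differ.

```python
import numpy as np

NOISE_DIM = 2  # Assumed to match the second dimension of the Z placeholder

def generate_Z_batch(batch_size, noise_dim=NOISE_DIM):
    """Hypothetical helper: uniform noise of shape (batch_size, noise_dim)."""
    return np.random.uniform(low=-1, high=1, size=(batch_size, noise_dim))

Z_batch = generate_Z_batch(32)
print(Z_batch.shape)  # (32, 2)
```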

After 5000 epochs, $G$ has learned to generate data points that resemble the training data pretty well.

Conclusion

GANs are based on the idea of adversarial learning, and are capable of incredible performance on a wide variety of problems. In order to focus on the basic properties of GANs, this post showed how to implement a simple GAN and apply it to a simple task. I urge you to check out some of the recent papers to see really cool applications of GANs if you haven’t already.

I hope this post was useful to you, and feel free to send me any questions.

Sentiment Classification

Tue, 23 Oct 2018 00:00:00 +0000

Introduction

Sentiment classification is a Natural Language Processing (NLP) task where the objective is to predict the sentiment of some text input. Examples include predicting a one to five star rating from a product review, or predicting the emotion of a text message. In this post I’m going to show you how I implemented sentiment classification for predicting whether a movie review is positive or negative.

The dataset I used is the “Large Movie Review Dataset” which can be found at http://ai.stanford.edu/~amaas/data/sentiment/. It was introduced in the paper Learning Word Vectors for Sentiment Analysis by Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (ACL 2011). The dataset contains 25000 highly polar (distinctly positive or negative) movie reviews for training and 25000 for testing. I kept the training data as my training set, but split the test data into a validation set used to evaluate different models, and a test set used to evaluate if the final model generalizes well.

The project was implemented in Keras, and the source code can be found at https://github.com/CarlFredriksson/sentiment_classification.

Implementation

Load Data

After downloading the dataset, you will see that both the training and test data are divided into directories of negative and positive examples. Each example has its own file, which means that we need to open every file, read it, and append the text to the correct input set (X_train or X_test). We also need to append a label of 0 or 1 to the corresponding output sets Y_train or Y_test for each text example. I chose to use 0 for examples with negative sentiment, and 1 for examples with positive sentiment.

import os

X_train, Y_train, X_test, Y_test = sc_utils.load_data()

def load_data():
    # neg = 0, pos = 1
    X_train, Y_train, X_test, Y_test = [], [], [], []

    # Prepare paths
    data_path = os.path.join(os.getcwd(), "data")
    train_neg_path = os.path.join(data_path, "train", "neg")
    train_pos_path = os.path.join(data_path, "train", "pos")
    test_neg_path = os.path.join(data_path, "test", "neg")
    test_pos_path = os.path.join(data_path, "test", "pos")

    # Load training data
    for filename in os.listdir(train_neg_path):
        with open(os.path.join(train_neg_path, filename), encoding="utf8") as f:
            review = f.readline()
            X_train.append(review)
            Y_train.append(0)
    for filename in os.listdir(train_pos_path):
        with open(os.path.join(train_pos_path, filename), encoding="utf8") as f:
            review = f.readline()
            X_train.append(review)
            Y_train.append(1)
    print("Training data loaded")

    # Load test data
    for filename in os.listdir(test_neg_path):
        with open(os.path.join(test_neg_path, filename), encoding="utf8") as f:
            review = f.readline()
            X_test.append(review)
            Y_test.append(0)
    for filename in os.listdir(test_pos_path):
        with open(os.path.join(test_pos_path, filename), encoding="utf8") as f:
            review = f.readline()
            X_test.append(review)
            Y_test.append(1)
    print("Test data loaded")

    return X_train, Y_train, X_test, Y_test

Preprocess Data

The first thing we want to do is to encode texts as sequences of numbers, where each number corresponds to a word index. To create such word indices we need to know what words are used in the dataset. Fortunately, there is a vocabulary file provided with the dataset which contains all used words.

To help with the task of encoding text as sequences of word indices we can use the Keras class Tokenizer. The Tokenizer assigns an index for each word it considers distinct in the vocabulary. There is a default filter that disregards the symbols !"#$%&()*+,-./:;<=>?@[\]^_`{|}~, which makes the Tokenizer not consider some lines in the vocabulary as words. For example, the vocabulary contains many emojis (such as :) and :-p) which are disregarded because of this filter. Changing the default filter might be a worthwhile experiment, but I did not bother for this project.

After initializing the tokenizer on our vocabulary using tokenizer.fit_on_texts(vocab), we can convert our texts to sequences of word indices. The sequences are then padded to make them equal length, which is important for training in batches. Sequences that are too short get zeroes added at the end, and sequences that are too long get clipped.

After converting the processed training and test data to numpy arrays, the data is shuffled and the test data is divided into equally sized validation and test sets.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.utils import shuffle

INPUT_LENGTH = 100
X_train, Y_train, X_val, Y_val, X_test, Y_test, tokenizer = sc_utils.preprocess_data(X_train, Y_train, X_test, Y_test, INPUT_LENGTH)

def preprocess_data(X_train, Y_train, X_test, Y_test, input_len):
    # Load vocabulary
    vocab_path = os.path.join(os.getcwd(), "data", "imdb.vocab")
    with open(vocab_path, encoding="utf8") as f:
        vocab = f.read().splitlines()
    print("Vocabulary length before tokenizing: " + str(len(vocab)))

    # Prepare tokenizer, the out of vocabulary token for GloVe is "unk"
    tokenizer = Tokenizer(oov_token="unk")
    tokenizer.fit_on_texts(vocab)
    vocab_len = len(tokenizer.word_index) + 1
    print("Vocabulary length after tokenizing: " + str(vocab_len))

    # Convert text to sequences of indices
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)

    # Pad sequences with zeros so they all have the same length
    X_train = pad_sequences(X_train, maxlen=input_len, padding="post")
    X_test = pad_sequences(X_test, maxlen=input_len, padding="post")

    # Convert training and test data to numpy arrays
    X_train = np.array(X_train, dtype="float32")
    Y_train = np.array(Y_train, dtype="float32")
    X_test = np.array(X_test, dtype="float32")
    Y_test = np.array(Y_test, dtype="float32")

    # Shuffle training and test data
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # Split test data into validation and test sets
    split_index = int(0.5 * X_test.shape[0])
    X_val = X_test[split_index:]
    Y_val = Y_test[split_index:]
    X_test = X_test[:split_index]
    Y_test = Y_test[:split_index]

    return X_train, Y_train, X_val, Y_val, X_test, Y_test, tokenizer
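
The padding and clipping step can be illustrated without Keras. The function below is a small re-implementation written for this example, not the Keras code itself. It mimics pad_sequences with padding="post" and the default truncating="pre", which means that zeros are appended at the end while sequences that are too long lose tokens from the front.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding="post") with the default truncating="pre".

    Assumes maxlen > 0.
    """
    # Too long: keep the last maxlen tokens (tokens are dropped from the front)
    seq = seq[-maxlen:]
    # Too short: append the padding value at the end
    return seq + [value] * (maxlen - len(seq))

print(pad_post([5, 8, 2], 5))              # [5, 8, 2, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7], 5))  # [3, 4, 5, 6, 7]
```

To keep the beginning of long sequences instead, pad_sequences accepts truncating="post".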

Create Embedding Matrix

A common tool in NLP is word embeddings, which are vectors that characterize words. Each word has a corresponding vector, and words with similar meaning are expected to have embedding vectors that are close to each other in the vector space. Using word embeddings has many advantages, one of which is making models more robust to rarely seen words. Word embeddings can be learned as part of the model, or you can use pretrained embeddings. I tried both approaches and found that using pretrained embeddings was a better choice for this project since it made the model overfit less. There are many types of word embeddings. I decided to use pretrained GloVe embeddings which can be downloaded from https://nlp.stanford.edu/projects/glove/.

In order to use the embeddings in our models, we need to create a matrix where each row contains an embedding vector for the word that has a word index equal to the index of that row.

embedding_matrix = sc_utils.create_embedding_matrix(tokenizer)

def create_embedding_matrix(tokenizer):
    # Load GloVe embedding vectors
    embedding_path = os.path.join(os.getcwd(), "data", "glove.6B", "glove.6B.100d.txt")
    word_to_embedding = {}
    with open(embedding_path, encoding="utf8") as f:
        for line in f.readlines():
            values = line.split()
            word = values[0]
            embedding_vec = np.asarray(values[1:], dtype="float32")
            word_to_embedding[word] = embedding_vec
    print("Embedding vectors loaded")

    # Create embedding matrix
    embedding_vec_dim = 100
    vocab_len = len(tokenizer.word_index) + 1
    embedding_matrix = np.zeros((vocab_len, embedding_vec_dim))
    for word, i in tokenizer.word_index.items():
        embedding_vec = word_to_embedding.get(word)
        if embedding_vec is not None:
            embedding_matrix[i] = embedding_vec
    print("Embedding matrix created")

    return embedding_matrix
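
Here is a toy version of the same construction, with a made-up four-word index and three-dimensional vectors standing in for tokenizer.word_index and the GloVe file. All words and values are invented for illustration. It shows that words without a pretrained vector keep an all-zero row, and that embedding a sequence amounts to indexing rows of the matrix.

```python
import numpy as np

# Toy stand-ins for tokenizer.word_index and the loaded GloVe vectors
word_index = {"unk": 1, "good": 2, "movie": 3, "zzzqx": 4}
word_to_embedding = {
    "unk": np.array([0.1, 0.1, 0.1]),
    "good": np.array([0.5, 0.2, 0.9]),
    "movie": np.array([0.3, 0.8, 0.4]),
}

# Row i holds the vector of the word with index i, row 0 is reserved for padding
embedding_matrix = np.zeros((len(word_index) + 1, 3))
for word, i in word_index.items():
    vec = word_to_embedding.get(word)
    if vec is not None:
        embedding_matrix[i] = vec

# Embedding a padded sequence of word indices is plain row indexing
sequence = np.array([2, 3, 4, 0])
embedded = embedding_matrix[sequence]
print(embedded)
```

The third row of the result is all zeros because "zzzqx" has no pretrained vector, and the fourth is all zeros because index 0 is the padding value.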

Create Models

Now it is time to create some models. First let’s create a very simple model to get a baseline performance. The baseline model simply averages the word embeddings of an input. The Embedding layer is used to convert sequences of word indices to sequences of embedding vectors. Note that since we are using pretrained embeddings that we don’t want to train further, the weights are initialized to our embedding matrix and the trainable flag is set to false. Also note that the parameter mask_zero is set to true. This masks the padded zeros in the input sequences, which is good since we only need them to make sure that the input lengths are equal. The output is a dense layer with a single output and sigmoid activation since we are doing binary classification.

from keras.models import Sequential, Input, Model
from keras.layers import Dense, Flatten, Embedding, Average, Activation, Lambda, Dropout, LSTM, Bidirectional
from keras.initializers import Constant
import keras.backend as K
from keras import regularizers

def create_baseline_model(embedding_matrix, input_len):
    model = Sequential()
    model.add(Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
        embeddings_initializer=Constant(embedding_matrix), input_length=input_len,
        trainable=False, mask_zero=True))
    model.add(Lambda(lambda x: K.mean(x, axis=1)))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
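
To make the averaging step concrete, the NumPy sketch below contrasts two ways of averaging the embedding vectors of a padded sequence: a plain mean over all time steps, and a mean over the non-padded positions only. The function names and toy values are assumptions for illustration. The sketch only shows the arithmetic; it does not settle which of the two a particular Keras version computes for the Lambda layer above.

```python
import numpy as np

def plain_mean(embedded):
    """Average over all time steps, padded positions included."""
    return embedded.mean(axis=0)

def masked_mean(embedded, indices):
    """Average over the time steps whose word index is not the padding value 0."""
    mask = indices != 0
    return embedded[mask].sum(axis=0) / max(mask.sum(), 1)

# A sequence with two real words followed by two padded positions
indices = np.array([2, 3, 0, 0])
embedded = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]])

print(plain_mean(embedded))            # [1.  1.5]
print(masked_mean(embedded, indices))  # [2. 3.]
```

The plain mean shrinks towards zero as the amount of padding grows, so short reviews get averages of smaller magnitude than long ones. The masked mean is independent of the amount of padding.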

Let’s now create a more complex model. This model uses a recurrent neural network (RNN) with LSTM layers, which is very common for many types of sequential data. I tried several hyperparameter settings such as varying the number of LSTM, Dropout, and Dense layers. I also varied the output sizes of the LSTM and Dense layers, and the rate of the Dropout layers. The model I ended up with, by choosing the one with the highest accuracy on the validation set, has two layers of LSTM units and one extra Dense layer. Note that the parameter return_sequences is set to true for the first LSTM layer. This makes the layer return all outputs, instead of returning the last one only, which is required when connecting two recurrent layers.

def create_rnn_model(embedding_matrix, input_len):
    model = Sequential()
    model.add(Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
        embeddings_initializer=Constant(embedding_matrix), input_length=input_len,
        trainable=False, mask_zero=True))
    model.add(LSTM(64, return_sequences=True, recurrent_dropout=0.5))
    model.add(Dropout(0.5))
    model.add(LSTM(64))
    model.add(Dense(64, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

I also tried some versions of a bidirectional RNN and some versions of an RNN where the embeddings were learned instead of pretrained. None of them performed better on the validation set. Implementations of two of these models are available with the source code but omitted here.

Train and Evaluate Models

I used a batch size of 200 for all training runs. I trained the baseline model for 500 epochs since it took a lot of epochs before the training loss converged and each epoch ran very quickly. The baseline model scored 77.16% accuracy on the validation set.

model = model_factory.create_baseline_model(embedding_matrix, INPUT_LENGTH)
model.fit(X_train, Y_train, batch_size=200, epochs=500)
val_loss, val_accuracy = model.evaluate(X_val, Y_val, verbose=0)

I only trained the RNN model for 30 epochs since the training loss converged much quicker. The RNN model scored 85.45% accuracy on the validation set. Since this is the final model I also evaluated it on the test set. The accuracy on the test set was 85.43%, which suggests that the model generalizes well (at least on highly polar movie reviews).

model = model_factory.create_rnn_model(embedding_matrix, INPUT_LENGTH)
model.fit(X_train, Y_train, batch_size=200, epochs=30)
val_loss, val_accuracy = model.evaluate(X_val, Y_val, verbose=0)
test_loss, test_accuracy = model.evaluate(X_test, Y_test, verbose=0)

Conclusion

Sentiment classification can be used for many applications, for example when analyzing customer reviews or survey responses. In marketing and business development one might be interested in extracting customer information from textual data, such as the percentage of people that are happy with your product. It can be used to assess the emotion of a user that inputs text or speech into an application, which could be of interest in all kinds of applications ranging from chat bots to medical self-help systems.

In order to further improve the performance of the model there are more things to try that I didn’t bother with. One of them is the input length of sequences. I set it to 100, which means that only 100 words of each review are considered. Note that pad_sequences truncates from the front by default (truncating="pre"), so with the preprocessing code above these are the last 100 words of a review rather than the first 100. Many reviews in the dataset are much longer, and using more words might improve the classification accuracy. However, this would make the training times slower. Another thing to try is to change the dimensionality of the word embeddings. The pretrained GloVe embeddings come with vectors of dimensions 50, 100, 200, and 300, but I only tried the 100 dimensional ones.

I hope this post helped you out, and feel free to send me any questions.

Facial Landmark Detection

Fri, 05 Oct 2018 00:00:00 +0000

Introduction

Landmark detection is a computer vision problem where an algorithm tries to find the locations of landmarks (also called keypoints) in an image. In facial landmark detection the landmarks correspond to facial features such as the center of the eyes or the edges of the mouth. After the algorithm predicts the location of the landmarks, they can be used for applications such as applying filters to the image (like Snapchat filters), detecting the emotional state of the person in the image, or classifying whether the person is male or female.

In this post I will show you how I implemented facial landmark detection in Keras. The dataset I used can be found at https://www.kaggle.com/c/facial-keypoints-detection/data. It contains grayscale images of size (96, 96) and up to 15 landmarks for each image.

The source code can be found at https://github.com/CarlFredriksson/facial_landmark_detection.

Implementation

Load Data

The dataset contains csv files for training and testing. The test data contains only images and no landmarks, and is thus of no use except for manual visual testing. I discarded the test data and instead tested the final model with images I found on my own. The training data contains 7049 rows, where each row contains landmark coordinates separated by commas and image pixels separated by spaces. Unfortunately many of the rows have missing landmark columns and after discarding those rows, we are left with 2140 labeled examples. After shuffling, I split the labeled examples into a training set and a validation set. The fraction of examples that is put in the validation set is determined by the parameter validation_split.

import os
import numpy as np
from pandas.io.parsers import read_csv
from sklearn.utils import shuffle

X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)

def load_data(validation_split):
    # Create path to csv file
    cwd = os.getcwd()
    csv_path = os.path.join(cwd, "data/training.csv")

    # Load data from csv file into data frame, drop all rows that have missing values
    data_frame = read_csv(csv_path)
    print(data_frame["Image"].count())
    data_frame = data_frame.dropna()
    print(data_frame["Image"].count())

    # Convert the rows of the image column from pixel values separated by spaces to numpy arrays
    data_frame["Image"] = data_frame["Image"].apply(lambda img: np.fromstring(img, sep=" "))

    # Create numpy matrix from image column by stacking the rows vertically
    X_data = np.vstack(data_frame["Image"].values)

    # Normalize pixel values to (0, 1) range
    X_data = X_data / 255

    # Convert to float32, which is the default for Keras
    X_data = X_data.astype("float32")

    # Reshape each row from one dimensional arrays to (height, width, num_channels) = (96, 96, 1)
    X_data = X_data.reshape(-1, 96, 96, 1)

    # Extract labels representing the coordinates of facial landmarks
    Y_data = data_frame[data_frame.columns[:-1]].values

    # Normalize coordinates to (0, 1) range
    Y_data = Y_data / 96
    Y_data = Y_data.astype("float32")

    # Shuffle data
    X_data, Y_data = shuffle(X_data, Y_data)

    # Split data into training set and validation set
    split_index = int(X_data.shape[0] * (1 - validation_split))
    X_train = X_data[:split_index]
    Y_train = Y_data[:split_index]
    X_val = X_data[split_index:]
    Y_val = Y_data[split_index:]

    return X_train, Y_train, X_val, Y_val
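
The image conversion can be tried out on a toy row. The example below uses a made-up 2x2 image instead of a real 96x96 one, and parses the string with str.split rather than np.fromstring. The steps are otherwise the same: parse the space separated pixel values, scale them by 255, and reshape to (height, width, num_channels).

```python
import numpy as np

# A made-up "Image" cell holding a 2x2 grayscale image
row = "0 255 128 64"

# Parse the space separated pixel values and scale them to the [0, 1] range
pixels = np.array(row.split(), dtype="float32") / 255

# Reshape to (height, width, num_channels)
image = pixels.reshape(2, 2, 1)
print(image.shape)  # (2, 2, 1)
```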

Create Models

Landmark detection is a type of regression problem, since given an input image, the objective is to predict 30 numbers corresponding to the 15 two-dimensional landmarks. To achieve this objective we will create a convolutional neural network (CNN). I experimented with different network architectures and chose the one that resulted in the lowest validation loss. The hyperparameters that I varied were: the number of convolutional layers, the filter size and number of filters of the convolutional layers, the number of fully connected layers and their size, and the number of dropout layers (which ended up at zero). Many of the historically well-performing CNN models shrink the height and width of the layer inputs and grow the number of filters as the network gets deeper, which is also the case for the model that I ended up with. The height and width are unchanged by the convolutional layers as the “same” padding scheme is used. Thus the shrinking of height and width happens solely in the max-pooling layers. The final model is pretty simple and not that deep, which is great for training times.

I also created a baseline model to compare with. The baseline model is a simple feedforward neural network with one hidden layer.

Both models use an Adam optimizer and mean squared error as the loss function.

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Conv2D, MaxPool2D, Dropout

def create_baseline_model():
    model = Sequential()
    model.add(Flatten(input_shape=(96, 96, 1)))
    model.add(Dense(512, activation="relu"))
    model.add(Dense(30))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

def create_cnn_model():
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(96, 96, 1), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (5, 5), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(128, (5, 5), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Flatten())

    model.add(Dense(512, activation="relu"))
    model.add(Dense(30))
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
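To make the effect of “same” padding and max-pooling concrete, the layer shapes of create_cnn_model can be traced by hand. The sketch below is only an illustration: the helper names are made up for this example and it assumes stride 1 convolutions and the default pooling stride (equal to the pool size), which matches the Keras defaults used above.

```python
def conv_same_shape(height, width, num_filters):
    # "same" padding with stride 1 keeps height and width unchanged,
    # only the number of channels changes to the number of filters
    return height, width, num_filters

def max_pool_shape(height, width, channels, pool_size=2):
    # A 2x2 max-pool with stride 2 halves height and width
    return height // pool_size, width // pool_size, channels

def trace_cnn_shapes(input_shape=(96, 96, 1), filters=(32, 64, 128)):
    shapes = [input_shape]
    height, width, channels = input_shape
    for num_filters in filters:
        height, width, channels = conv_same_shape(height, width, num_filters)
        shapes.append((height, width, channels))
        height, width, channels = max_pool_shape(height, width, channels)
        shapes.append((height, width, channels))
    # Size of the vector that the Flatten layer hands to the first Dense layer
    flattened_size = height * width * channels
    return shapes, flattened_size

shapes, flattened_size = trace_cnn_shapes()
for shape in shapes:
    print(shape)
print("Flattened size:", flattened_size)
```

The height and width go from 96 to 48 to 24 to 12 while the number of channels grows from 1 to 128, so the first dense layer receives 12 * 12 * 128 = 18432 inputs.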

Train Models

I trained the models for 100 epochs using a batch size of 200. After training, the final loss values are saved in a text file, the histories of loss values are plotted together, and the model architectures and weights are saved. In Keras you can save a model in a single .h5 file that contains the architecture, weights, and optimizer state or you can save the architecture and weights separately. Since I won’t continue training the model after saving it I don’t need the optimizer state, and I opted to save the architecture and weights separately, which saves some disk space.

import fld_utils
import model_factory

X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)
NUM_EPOCHS = 100

# Run models
models = []
model_names = []
models.append(model_factory.create_baseline_model())
model_names.append("baseline")
models.append(model_factory.create_cnn_model())
model_names.append("cnn")
fld_utils.run_models(X_train, Y_train, X_val, Y_val, models, model_names, NUM_EPOCHS)

def run_models(X_train, Y_train, X_val, Y_val, models, model_names, num_epochs):
    results_file = open("output/results.txt", "w")
    histories = []

    for model, model_name in zip(models, model_names):
        # Train model
        history = model.fit(X_train, Y_train, batch_size=200, epochs=num_epochs, validation_data=(X_val, Y_val))
        histories.append(history)

        # Evaluate
        final_train_loss = model.evaluate(X_train, Y_train, verbose=0)
        final_val_loss = model.evaluate(X_val, Y_val, verbose=0)
        results_file.write(model_name + "> final_train_loss: " + str(final_train_loss) + ", final_val_loss: " + str(final_val_loss) + "\n")

        # Save model
        model.save_weights("saved_models/" + model_name + "_model_weights.h5")
        with open("saved_models/" + model_name + "_model_architecture.json", "w") as f:
            f.write(model.to_json())

    plot_histories(histories, model_names, "histories.png")
    results_file.close()

def plot_histories(histories, model_names, plot_name):
    for history, model_name in zip(histories, model_names):
        plt.plot(history.epoch, np.array(history.history["loss"]), label=model_name + " train loss")
        plt.plot(history.epoch, np.array(history.history["val_loss"]), label=model_name + " val loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss (Mean squared error)")
    plt.legend()
    plt.yscale("log")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

Predict Landmarks

Let us plot some landmark predictions on a randomly chosen image from the validation set to get a sense of the performance of the models. The first image is always chosen, but since the data is shuffled before being partitioned, it is a randomly chosen image. The landmarks are extracted from the network output by grouping the 30 values into 15 two-dimensional points. Note that the network predictions need to be multiplied by the image size since the values we have trained with were normalized.

import numpy as np
import fld_utils

X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)
img = X_val[0]
img_size_x, img_size_y = img.shape[0], img.shape[1]

# Plot correct landmarks
landmarks = fld_utils.extract_landmarks(Y_val[0], img.shape[0], img.shape[1])
fld_utils.save_img_with_landmarks(img, landmarks, "data_visual.png", gray_scale=True)

# Baseline model
model = fld_utils.load_model("baseline")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, img_size_x, img_size_y)
fld_utils.save_img_with_landmarks(img, landmarks, "baseline_prediction.png", gray_scale=True)

# CNN model
model = fld_utils.load_model("cnn")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, img_size_x, img_size_y)
fld_utils.save_img_with_landmarks(img, landmarks, "cnn_prediction.png", gray_scale=True)

import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json

def load_model(model_name):
    with open("saved_models/" + model_name + "_model_architecture.json", "r") as f:
        model = model_from_json(f.read())
    model.load_weights("saved_models/" + model_name + "_model_weights.h5")
    return model

def extract_landmarks(y_pred, img_size_x, img_size_y):
    landmarks = []
    for i in range(0, len(y_pred), 2):
        landmark_x, landmark_y = y_pred[i] * img_size_x, y_pred[i+1] * img_size_y
        landmarks.append((landmark_x, landmark_y))
    return landmarks

def save_img_with_landmarks(img, landmarks, plot_name, gray_scale=False):
    if gray_scale:
        plt.imshow(np.squeeze(img), cmap=plt.get_cmap("gray"))
    else:
        plt.imshow(np.squeeze(img))
    for landmark in landmarks:
        plt.plot(landmark[0], landmark[1], "go")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

The correct landmarks:

The baseline prediction:

The CNN prediction:

We can see that the baseline model performs horribly but the CNN model performs very well on the chosen image.

Sunglasses Filter

Now we are finally ready to use our model in an application. I have created a simple script that applies a sunglasses filter to a face image. The predicted landmarks are used to scale, position, and rotate the sunglasses image. Since we have trained on normalized grayscale images of size (96, 96), we need to load the input image in grayscale mode, resize, and normalize it. To apply the filter we paste the processed sunglasses image on top of the original version of the input image that hasn’t had any preprocessing done to it. I normally use the library cv2 for image processing, but in this script PIL is also used. I used PIL because it has convenient functions for rotating images and pasting images on top of each other.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import fld_utils

# Load original image
face_img_path = "input/picard.png"
orig_img = cv2.imread(face_img_path)
orig_img = cv2.cvtColor(orig_img, cv2.COLOR_BGR2RGB)
# The image array has shape (height, width, channels), so x corresponds to shape[1]
orig_size_y, orig_size_x = orig_img.shape[0], orig_img.shape[1]

# Prepare input image
img = cv2.imread(face_img_path, cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, dsize=(96, 96), interpolation=cv2.INTER_AREA)
img = np.expand_dims(img, axis=2)
img = img / 255
img = img.astype("float32")

# Predict landmarks
model = fld_utils.load_model("cnn")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, orig_size_x, orig_size_y)

# Save original image with landmarks on top
fld_utils.save_img_with_landmarks(orig_img, landmarks, "test_img_prediction.png")

# Extract x and y values from landmarks of interest
left_eye_center_x = int(landmarks[0][0])
left_eye_center_y = int(landmarks[0][1])
right_eye_center_x = int(landmarks[1][0])
right_eye_center_y = int(landmarks[1][1])
left_eye_outer_x = int(landmarks[3][0])
right_eye_outer_x = int(landmarks[5][0])

# Load images using PIL
# PIL has better functions for rotating and pasting compared to cv2
face_img = Image.open(face_img_path)
sunglasses_img = Image.open("input/sunglasses.png")

# Resize sunglasses
sunglasses_width = int((left_eye_outer_x - right_eye_outer_x) * 1.4)
sunglasses_height = int(sunglasses_img.size[1] * (sunglasses_width / sunglasses_img.size[0]))
sunglasses_resized = sunglasses_img.resize((sunglasses_width, sunglasses_height))

# Rotate sunglasses
eye_angle_radians = np.arctan((right_eye_center_y - left_eye_center_y) / (left_eye_center_x - right_eye_center_x))
sunglasses_rotated = sunglasses_resized.rotate(np.degrees(eye_angle_radians), expand=True, resample=Image.BICUBIC)

# Compute positions such that the center of the sunglasses is
# positioned at the center point between the eyes
# Use the size after rotation, since expand=True enlarges the image
x_offset = int(sunglasses_rotated.size[0] * 0.5)
y_offset = int(sunglasses_rotated.size[1] * 0.5)
pos_x = int((left_eye_center_x + right_eye_center_x) / 2) - x_offset
pos_y = int((left_eye_center_y + right_eye_center_y) / 2) - y_offset

# Paste sunglasses on face image
face_img.paste(sunglasses_rotated, (pos_x, pos_y), sunglasses_rotated)
face_img.save("output/test_img_sunglasses.png")

Conclusion

This project was very fun and I’m pleased with how it turned out. I was impressed with the performance of the model given its simplicity. I did not spend a lot of time on hyperparameter tuning, and I might revisit this step if I use facial landmark detection in a project that needs a more accurate model. Besides tuning hyperparameters, performance could also be improved by training on more data. This could be done by finding a bigger dataset, labeling images on your own, and/or data augmentation. If data augmentation is done, one has to be careful to make sure that the images are still correctly labeled. For example, if an image is flipped horizontally the landmarks also have to be flipped in the same way. However, the order of the landmarks must be consistent between images. If a landmark that represents the center of the left eye is flipped, it now represents the center of the right eye. Thus for a flipped image, the order of landmarks that represent left and right versions of a facial feature has to be swapped to maintain consistency.
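The flipping rule described above could be sketched roughly as follows. This is not code from the project: the function name and the index pairs in EXAMPLE_SWAP_PAIRS are hypothetical, and the real pairs depend on the landmark order of the dataset. The sketch assumes normalized coordinates in the flat order (x0, y0, x1, y1, ...) and mirrors x as 1 - x, which ignores the possible one-pixel offset between pixel indices and continuous coordinates.

```python
import numpy as np

# Hypothetical (left, right) landmark index pairs, for illustration only
EXAMPLE_SWAP_PAIRS = [(0, 1), (2, 4), (3, 5)]

def flip_horizontally(img, y, swap_pairs):
    """Flip an image of shape (height, width, channels) and its normalized landmarks."""
    # Reverse the width axis to mirror the image
    flipped_img = img[:, ::-1, :]

    # Reshape the flat vector into one (x, y) row per landmark (this makes a copy)
    landmarks = np.array(y, dtype="float32").reshape(-1, 2)

    # Mirror the normalized x coordinates, the y coordinates are unchanged
    landmarks[:, 0] = 1.0 - landmarks[:, 0]

    # Swap left and right versions so the landmark order stays consistent
    for left_index, right_index in swap_pairs:
        landmarks[[left_index, right_index]] = landmarks[[right_index, left_index]]

    return flipped_img, landmarks.reshape(-1)

# Usage with random stand-in data of the same shapes as in this post
rng = np.random.default_rng(0)
img = rng.random((96, 96, 1)).astype("float32")
y = rng.random(30).astype("float32")
flipped_img, flipped_y = flip_horizontally(img, y, EXAMPLE_SWAP_PAIRS)
print(flipped_img.shape, flipped_y.shape)
```

Flipping twice with the same pairs returns the original image and (up to floating point rounding) the original landmarks, which is a handy sanity check before training on augmented data.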

Regression using Keras

Sat, 29 Sep 2018 00:00:00 +0000

Introduction

I have started experimenting with Keras, which is a high-level neural networks API. It provides an interface on top of TensorFlow, CNTK, or Theano. So far I have been using Keras with TensorFlow as a backend. Keras makes prototyping and experimentation faster and easier, while removing some of the flexibility that TensorFlow has. If you use the TensorFlow backend you can combine the two when the need arises for functionality that Keras does not provide on its own.

In this blog post I am going to show you how to implement simple regression in Keras. We are implementing linear regression as a baseline model, and a small feedforward neural network as a more complex model. For more information about regression, and how to implement it in TensorFlow, check out my previous post Regression Techniques.

The source code can be found at https://github.com/CarlFredriksson/regression_using_keras.

Implementation

Import Modules

We will need to import the following modules:

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

Generate Data

Since this project is only a proof of concept, I decided to use randomly generated datasets. The generated data is of the form $y=x^2-2$ for $0 \leq x \leq 3$ with some normally distributed random noise added to simulate real world data.

def generate_random_data():
    X = np.expand_dims(np.linspace(0, 3, num=200), axis=1)
    Y = X**2 - 2
    noise = np.random.normal(0, 1, size=X.shape)
    Y = Y + noise
    Y = Y.astype("float32")
    return X, Y

Let us generate datasets for training and validation.

X_train, Y_train = generate_random_data()
X_val, Y_val = generate_random_data()

To visualize the datasets we can plot them using Matplotlib.

def plot_data(X, Y, plot_name):
    plt.scatter(X, Y, color="blue")
    plt.grid()
    plt.xlabel("x")
    plt.ylabel("y")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

plot_data(X_train, Y_train, "data_train.png")
plot_data(X_val, Y_val, "data_val.png")

Create Baseline Model

To implement simple linear regression we can use a neural network without hidden layers. In Keras we use a single dense layer for this. A dense layer is a normal fully connected layer. Note that the first layer of a sequential Keras model (which is also the only layer in this case) needs to specify the input shape. To finish creating a model we need to compile it, while specifying what optimizer we want and what loss function to use.

def create_baseline_model():
    model = Sequential()
    model.add(Dense(1, input_shape=(1,)))
    model.compile(optimizer=SGD(lr=0.001), loss="mean_squared_error")
    return model
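Since a single dense unit with mean squared error is ordinary linear regression, its best possible training loss can be computed in closed form, which gives a useful reference point for the gradient descent result. The following is a minimal sketch using NumPy only; the helper names and the random seed are my own choices for the illustration and are not part of the project.

```python
import numpy as np

def fit_line_least_squares(X, Y):
    """Return (slope, intercept) minimizing the mean squared error of slope * x + intercept."""
    slope, intercept = np.polyfit(X[:, 0], Y[:, 0], deg=1)
    return slope, intercept

def mean_squared_error(X, Y, slope, intercept):
    return float(np.mean((slope * X + intercept - Y) ** 2))

# Same data distribution as generate_random_data above, with an arbitrary seed
rng = np.random.default_rng(0)
X = np.expand_dims(np.linspace(0, 3, num=200), axis=1)
Y = X**2 - 2 + rng.normal(0, 1, size=X.shape)

slope, intercept = fit_line_least_squares(X, Y)
print("slope:", slope, "intercept:", intercept)
print("MSE:", mean_squared_error(X, Y, slope, intercept))
```

The training loss of the Keras baseline model should approach this value from above as training progresses, so a much higher final loss would suggest that the learning rate or the number of epochs needs adjusting.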

Create More Complex Model

The more complex model will contain three hidden layers of ten neurons each. Do not forget to add an activation function to the hidden layers.

def create_nn_model():
    model = Sequential()
    model.add(Dense(10, input_shape=(1,), activation="relu"))
    model.add(Dense(10, activation="relu"))
    model.add(Dense(10, activation="relu"))
    model.add(Dense(1))
    model.compile(optimizer=SGD(lr=0.001), loss="mean_squared_error")
    return model

Train Models

Training a model in Keras is very simple with the model.fit function. It is one of the nice high-level functions that allow us to get up and running very quickly. Since the training set I generated is very small, I decided to use the whole set as a single batch.

history = model.fit(X_train, Y_train, batch_size=X_train.shape[0], epochs=10000, validation_data=(X_val, Y_val))

The model.fit function returns a history of the training and validation losses. To evaluate a model it is often a good idea to plot the history.

def plot_history(history, plot_name):
    plt.plot(history.epoch, np.array(history.history["loss"]), label="Train loss")
    plt.plot(history.epoch, np.array(history.history["val_loss"]), label="Val loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss (Mean squared error)")
    plt.legend()
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

For the baseline model:

plot_history(history, "history_baseline.png")

For the more complex model:

plot_history(history, "history_nn.png")

We can also evaluate models using the model.evaluate function. This returns the final loss for the specified dataset.

final_train_loss = model.evaluate(X_train, Y_train, verbose=0)
final_val_loss = model.evaluate(X_val, Y_val, verbose=0)

For the run that resulted in the plots above, the final training and validation losses were 1.566 and 1.621 for the baseline model, and 1.080 and 1.184 for the more complex model.

Using the Trained Models

In order to output predictions for a given dataset we can use the model.predict function.

Y_predict = model.predict(X_val)

The predictions can be plotted on top of the data using the following function:

def plot_results(X, Y, Y_predict, plot_name):
    plt.scatter(X, Y, color="blue")
    plt.plot(X, Y_predict, color="red")
    plt.grid()
    plt.xlabel("x")
    plt.ylabel("y")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

For the baseline model:

plot_results(X_val, Y_val, Y_predict, "results_baseline.png")

For the more complex model:

plot_results(X_val, Y_val, Y_predict, "results_nn.png")

Conclusion

Keras is a powerful library that lets developers iterate quickly. For this simple project there was no need for deep networks, but Keras was developed with deep learning in mind, and is well suited for creating deeper and more complex models.

Digit Recognition

Tue, 18 Sep 2018 00:00:00 +0000

Introduction

Digit recognition is a form of multi-class classification where the inputs are images of handwritten digits, and the outputs are the corresponding numbers (0-9). Classifying handwritten text or numbers is important for many real world scenarios. For example, a postal service can scan postal codes on envelopes in order to automate grouping of envelopes that should be sent to the same city. Digit recognition is one of the simplest versions of classifying handwriting, and is a good first project for getting into the field of image classification.

I have implemented a small convolutional neural network (CNN) in TensorFlow, that achieves a decent performance on the MNIST test set (≈1% classification error). MNIST is a database containing gray scale images of handwritten digits. The database contains a training set of 60000 examples and a test set of 10000 examples. Read more on http://yann.lecun.com/exdb/mnist/.

The source code can be found at https://github.com/CarlFredriksson/digit_recognition.

Implementation

Loading the MNIST Data

You could download the data from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import from keras.datasets.

from tensorflow.keras.datasets import mnist

def load_data():
    (X_train, Y_train), (X_test, Y_test) = mnist.load_data()

    return X_train, Y_train, X_test, Y_test

In order to visualize the data I plotted the first four training examples.

def visualize_data(X, Y, plot_name):
    plt.subplot(221)
    plt.imshow(X[0], cmap=plt.get_cmap("gray"))
    plt.title("y: " + str(Y[0]))
    plt.subplot(222)
    plt.imshow(X[1], cmap=plt.get_cmap("gray"))
    plt.title("y: " + str(Y[1]))
    plt.subplot(223)
    plt.imshow(X[2], cmap=plt.get_cmap("gray"))
    plt.title("y: " + str(Y[2]))
    plt.subplot(224)
    plt.imshow(X[3], cmap=plt.get_cmap("gray"))
    plt.title("y: " + str(Y[3]))
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

Preprocessing

A standard preprocessing step is to normalize the inputs, thus we divide all pixel values by 255 to put them in the range 0 to 1. In the output layer we are going to use the softmax activation, thus we need to convert the outputs from (0-9) to one hot vectors. A one hot vector is a vector with a single 1 and 0 in the other positions. For example, the one hot vectors for 0, 3, and 9 are $[1,0,0,0,0,0,0,0,0,0]$, $[0,0,0,1,0,0,0,0,0,0]$, and $[0,0,0,0,0,0,0,0,0,1]$ respectively. Finally, since we are building a CNN, we also want to add a channels dimension to the input. We only need one channel since the images are in grayscale.

def preprocess_data(X_train, Y_train, X_test, Y_test):
    # Normalize image pixel values from 0-255 to 0-1
    X_train = X_train / 255
    X_test = X_test / 255

    # Change y values from 0-9 to one hot vectors
    Y_train = convert_to_one_hot(Y_train)
    Y_test = convert_to_one_hot(Y_test)

    # Add channels dimension
    X_train = np.expand_dims(X_train, axis=3)
    X_test = np.expand_dims(X_test, axis=3)

    return X_train, Y_train, X_test, Y_test

def convert_to_one_hot(Y):
    Y_onehot = np.zeros((len(Y), Y.max() + 1))
    Y_onehot[np.arange(len(Y)), Y] = 1

    return Y_onehot

To significantly speed up the training process, we are going to divide the training set into mini batches. Instead of doing one step of gradient descent after processing the whole training set, we are going to do one gradient descent step after each mini batch. I used a mini batch size of 200, which gives us 300 mini batches.

def random_mini_batches(X_train, Y_train, mini_batch_size):
    mini_batches = []
    m = X_train.shape[0] # Number of training examples

    # Shuffle training examples
    permutation = list(np.random.permutation(m))
    X_shuffled = X_train[permutation]
    Y_shuffled = Y_train[permutation]

    # Partition into mini-batches
    num_complete_mini_batches = math.floor(m / mini_batch_size)
    for i in range(num_complete_mini_batches):
        X_mini_batch = X_shuffled[i * mini_batch_size : (i + 1) * mini_batch_size]
        Y_mini_batch = Y_shuffled[i * mini_batch_size : (i + 1) * mini_batch_size]
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)

    # Handling the case that the last mini-batch < mini_batch_size
    if m % mini_batch_size != 0:
        X_mini_batch = X_shuffled[num_complete_mini_batches * mini_batch_size : m]
        Y_mini_batch = Y_shuffled[num_complete_mini_batches * mini_batch_size : m]
        mini_batch = (X_mini_batch, Y_mini_batch)
        mini_batches.append(mini_batch)

    return mini_batches

Creating the Model

Now it is time to create the CNN model. To have a short training time I chose a small model with a single convolutional layer. The simple model yields pretty decent results (≈1% classification error), but deeper models can perform even better. A list of the best models can be found at http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html.

The model can be summarized as: convolutional->max_pool->dropout->flatten->dense->dense->softmax. The dropout layer is used to reduce overfitting. The dense layers are fully connected layers. Note the placeholder training_flag which will decide when the dropout should be applied. We only want dropout applied when training, thus the placeholder is set to false as a default.

def create_model(height, width, channels, num_classes):
    tf.reset_default_graph()

    X = tf.placeholder(dtype=tf.float32, shape=(None, height, width, channels), name="X")
    Y = tf.placeholder(dtype=tf.float32, shape=(None, num_classes), name="Y")
    training_flag = tf.placeholder_with_default(False, shape=())

    conv1 = tf.layers.conv2d(X, filters=32, kernel_size=5, strides=1, padding="same", activation=tf.nn.relu)
    pool1 = tf.layers.max_pooling2d(conv1, pool_size=2, strides=2, padding="valid")
    # Dropout does not apply by default, training=True is needed to make the layer do anything
    # We only want dropout applied during training
    dropout = tf.layers.dropout(pool1, rate=0.2, training=training_flag) 
    flatten = tf.layers.flatten(dropout)
    dense1 = tf.layers.dense(flatten, 128, activation=tf.nn.relu)
    Y_hat = tf.layers.dense(dense1, num_classes, activation=tf.nn.softmax, name="Y_hat")

    # Compute cost
    J = compute_cost(Y, Y_hat)

    return X, Y, training_flag, Y_hat, J
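To get a feeling for how small this model is, the number of trainable parameters can be counted by hand. The sketch below is a back-of-the-envelope calculation with helper names that I made up for the illustration; it assumes the layer settings in create_model above and MNIST inputs of size (28, 28, 1) with 10 classes.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # One kernel_size x kernel_size x in_channels kernel plus one bias per filter
    return kernel_size * kernel_size * in_channels * filters + filters

def dense_params(num_inputs, num_units):
    # One weight per input and unit, plus one bias per unit
    return num_inputs * num_units + num_units

height, width, channels, num_classes = 28, 28, 1, 10

conv_params = conv2d_params(kernel_size=5, in_channels=channels, filters=32)

# "same" padding keeps 28x28, the 2x2 max-pool with stride 2 halves it to 14x14
flattened_size = (height // 2) * (width // 2) * 32

dense1_params = dense_params(flattened_size, 128)
output_params = dense_params(128, num_classes)
total_params = conv_params + dense1_params + output_params

print("conv:", conv_params)
print("dense1:", dense1_params)
print("output:", output_params)
print("total:", total_params)
```

Almost all of the roughly 805 thousand parameters sit in the first dense layer, since it connects 14 * 14 * 32 = 6272 flattened values to 128 units. Max-pooling and dropout have no trainable parameters.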

To compute the cost I used the standard softmax cost function. Let $n_c$ be the number of classes ($n_c=10$ in our case). For a single training example with correct one hot vector $y$ and softmax output $\hat{y}$, the cost is:

$$ J(y,\hat{y}) = -\sum_{i=1}^{n_c} y_i \log{\hat{y}_i} $$

To compute the full cost we average over all training examples in the current mini batch. Note that a very small value is added to logarithm inputs, to avoid taking the log of 0.

def compute_cost(Y, Y_hat):
    # Add small value epsilon to tf.log() calls to avoid taking the log of 0
    epsilon = 1e-10
    J = tf.reduce_mean(-tf.reduce_sum(Y * tf.log(Y_hat + epsilon), axis=1), name="J")

    return J
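The cost can be checked by hand on a tiny example. The following NumPy sketch mirrors the formula above, but it is not the TensorFlow code from the project: the function name is my own, and the example uses three classes instead of ten to keep the numbers readable.

```python
import numpy as np

def softmax_cross_entropy(Y, Y_hat, epsilon=1e-10):
    """Average over examples of -sum_i y_i * log(y_hat_i)."""
    # epsilon avoids taking the log of 0, as in compute_cost above
    per_example = -np.sum(Y * np.log(Y_hat + epsilon), axis=1)
    return float(np.mean(per_example))

# Two examples with one hot labels for class 2 and class 0
Y = np.array([[0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
# Made-up softmax outputs, each row sums to 1
Y_hat = np.array([[0.1, 0.2, 0.7],
                  [0.5, 0.3, 0.2]])

cost = softmax_cross_entropy(Y, Y_hat)
print("cost:", cost)
```

Because $y$ is a one hot vector, only the predicted probability of the correct class contributes to each term, so the cost here is the average of $-\log 0.7$ and $-\log 0.5$.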

Training the Model

To train the model I used an Adam optimizer with a learning rate of 0.001. I trained for 10 epochs. After training, the graph and variables are saved as a TensorFlow SavedModel. Classification accuracy on the training and test sets is computed by dividing the number of correctly classified examples by the size of the data set. The model achieves a test set accuracy of about 99%, or in other words 1% error. Error is often used instead of accuracy when comparing models with very high accuracy.

def run_model(X, Y, training_flag, Y_hat, J, X_train, Y_train, X_test, Y_test, mini_batches, LEARNING_RATE, NUM_EPOCHS):
    # Create train op
    optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for epoch in range(NUM_EPOCHS):
            for (X_mini_batch, Y_mini_batch) in mini_batches:
                _, J_train = sess.run([train_op, J], feed_dict={X: X_mini_batch, Y: Y_mini_batch, training_flag: True})
            print("epoch: " + str(epoch) + ", J_train: " + str(J_train))

        # Final costs
        J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
        J_test = sess.run(J, feed_dict={X: X_test, Y: Y_test})

        # Compute training accuracy
        Y_pred = sess.run(Y_hat, feed_dict={X: X_train, Y: Y_train})
        accuracy_train = compute_accuracy(Y_pred, Y_train)

        # Compute test accuracy
        Y_pred = sess.run(Y_hat, feed_dict={X: X_test, Y: Y_test})
        accuracy_test = compute_accuracy(Y_pred, Y_test)

        # Save model
        tf.saved_model.simple_save(sess, "saved_model", inputs={"X": X, "Y": Y}, outputs={"Y_hat": Y_hat})

    return J_train, J_test, accuracy_train, accuracy_test

def compute_accuracy(Y_pred, Y_real):
    Y_pred = np.argmax(Y_pred, axis=1)
    Y_real = np.argmax(Y_real, axis=1)
    num_correct = np.sum(Y_pred == Y_real)
    accuracy = num_correct / Y_real.shape[0]

    return accuracy

Testing on My Own Handwritten Image

I made a simple black and white image in Paint.

The image needs to be loaded and preprocessed. The preprocessing involves inverting the colors, since all of the images in MNIST have a black background. The image also needs to be resized to (28,28) which is the size of the MNIST images. As before we divide by 255 to normalize input values, and add a channel dimension. This time we also add a dimension to the start, since the model expects sets of input images.

def load_img(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    return img

def preprocess_img(img, size, invert_colors=False):
    if invert_colors:
        img = cv2.bitwise_not(img)
    img = cv2.resize(img, dsize=size, interpolation=cv2.INTER_CUBIC)
    img = img / 255
    img = np.expand_dims(img, axis=0)
    img = np.expand_dims(img, axis=3)

    return img

To test the trained model on my own image I loaded the graph and variables from the SavedModel in another session. To make a prediction we need to get the value of the output Y_hat/Softmax:0, with our handwritten image as input X:0. The value of Y:0 is irrelevant at this point; we just need to provide something that fits the expected dimensions. The suffix :0 is part of the TensorFlow tensor naming scheme: it selects the first output of the named operation.

# Start session
with tf.Session() as sess:
    # Load model
    tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], "saved_model")

    # Predict digit for test image
    y_pred = sess.run("Y_hat/Softmax:0", feed_dict={"X:0": img, "Y:0": Y_train})[0]
    y_pred = np.argmax(y_pred)
    print("Predicted digit: " + str(y_pred))

As expected the output is Predicted digit: 2.

Conclusion

I am satisfied with how this project went, and it impressed me that such a simple model can get such decent results. In fact, a simple feedforward neural network (FNN) can also achieve pretty respectable results on MNIST. The CNN model I showed is slightly more complex than the small FNN I tried, but I chose the CNN since the classification error was lower (the FNN got about 2%) and the training time was still very fast.

The source code can be found at https://github.com/CarlFredriksson/digit_recognition.

Thank you for reading, and feel free to send me any questions.

Binary Classification

Thu, 06 Sep 2018 00:00:00 +0000

Introduction

Classification is an important Machine Learning task where an algorithm learns to associate input features with one of a finite number of classes. For example, given the pixels of an animal image as input, the task is to output what animal is in that image. Binary classification is a special case where the number of classes is two, for example true or false, good or bad, cat or no cat etc.

I have implemented two binary classification models in TensorFlow, one using simple logistic regression and another using a shallow feedforward neural network (FNN). Logistic regression is a linear classifier, which means that it can only divide input data using a straight line (or plane/hyperplane depending on the dimensionality). Using an FNN with a logistic node as output allows us to divide data using more complex functions. I decided to only use two-dimensional data for visualization purposes, but the code can easily be adapted to data with higher dimensionality.

The source code can be found at https://github.com/CarlFredriksson/binary_classification.

Implementation

Import Modules

We will need to import the following modules:

import os
import numpy as np
import tensorflow as tf
import matplotlib as mpl
import matplotlib.pyplot as plt

Generate Data

Since this project is only a proof of concept, I decided to use generated training data instead of looking for some real world data. I generated linear data that can be divided using a straight line, and set one side of the line to be class 0 and the other class 1. Some data points had their classes flipped randomly in order to simulate noise that would exist in the real world. Classes were flipped with higher probability for the data points that were close to the boundary line.

def generate_linear_data(num_data_points):
    X = np.random.rand(num_data_points, 2)

    # Set classes with the line x_2 = 0.5 as a boundary
    Y = np.expand_dims([0 if (X[i, 1] > 0.5) else 1 for i in range(num_data_points)], axis=1)

    # Flip the class of the points randomly, with higher probability if the point is close to the boundary
    for i in range(Y.shape[0]):
        distance = abs(X[i, 1] - 0.5)
        flip_prob = max(0, 0.3 - distance)
        if np.random.rand() < flip_prob:
            Y[i] = (Y[i] + 1) % 2

    return X, Y

I also implemented a function for generating non-linear data in the form of a circle. This time I did not bother with adding more noise, since logistic regression will be completely useless on this data set anyway, and it will be sufficient to show the difference between the classification models.

def generate_non_linear_data(num_data_points):
    X = np.random.rand(num_data_points, 2)

    # Set classes with a ring centered at (0.5, 0.5) as a boundary
    Y = np.expand_dims([0 if (np.linalg.norm(X[i] - np.array([0.5, 0.5])) < 0.3) else 1 for i in range(num_data_points)], axis=1)

    return X, Y

Let us generate some training data that will be used to train the models.

Let us also generate some test data that will be used to evaluate the models.

I used the following function to plot the data sets:

def plot_data(X, Y, plot_name):
    colors = Y[:, 0]
    plt.scatter(X[:, 0], X[:, 1], c=colors, cmap=mpl.colors.ListedColormap(["red", "blue"]), edgecolors=["black"])
    plt.xlabel("x_1")
    plt.ylabel("x_2")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()

Logistic Regression

The first model uses simple logistic regression which outputs $\hat{y} = \sigma(x^T w + b)$, where $\sigma$ is the sigmoid function, $x$ is an input vector, $w$ is a weight vector, and $b$ is a scalar bias. We predict class 0 if $\hat{y} \leq 0.5$ and class 1 if $\hat{y} > 0.5$. The objective is to find the best decision boundary. A decision boundary divides the input space so that one class is predicted on one side of the boundary and the other class is predicted on the other side. We want a cost function that leads to a decision boundary that predicts the correct class for as many training examples as possible. Let $y \in \{0,1\}$ be the correct class for a training example $x$, and $\hat{y} = \sigma(x^T w + b)$. The logistic cost function for one training example is:

$$ L(y, \hat{y}) = -(y \log{\hat{y}} + (1 - y) \log{(1 - \hat{y})}) $$

For a training set of many examples we want to compute the average cost. Let $X_{train}$ be the training set with $|X_{train}|$ number of training examples, and $Y_{train}$ the correct classes for those examples. We have the complete logistic cost function:

$$ J(X_{train}, Y_{train}, \hat{y}) = \frac{1}{|X_{train}|} \sum_{x \in X_{train}} L(Y_{train}(x), \hat{y}(x)) $$

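The reason for using this cost function instead of the squared error cost is that, combined with the sigmoid output, it is convex in the parameters $w$ and $b$, whereas the squared error cost is not. Convexity means that gradient descent cannot get stuck in a local minimum that is not also a global minimum.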
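Before looking at the TensorFlow version, the cost function and its gradient descent updates can be sketched in plain NumPy. This is only an illustration under my own assumptions: the function names, the zero initialization, the learning rate, the number of epochs, and the noise-free data are choices made for this example, not the settings used in the project.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(Y, Y_hat, epsilon=1e-10):
    # Average of L(y, y_hat) over all examples, epsilon avoids log(0)
    return float(-np.mean(Y * np.log(Y_hat + epsilon) + (1 - Y) * np.log(1 - Y_hat + epsilon)))

def train_logistic_regression(X, Y, learning_rate=0.5, num_epochs=2000):
    w = np.zeros((X.shape[1], 1))
    b = 0.0
    costs = []
    for _ in range(num_epochs):
        Y_hat = sigmoid(X @ w + b)
        costs.append(logistic_cost(Y, Y_hat))
        # Gradients of the average cost: X^T (y_hat - y) / m for w, mean(y_hat - y) for b
        error = Y_hat - Y
        w -= learning_rate * (X.T @ error) / X.shape[0]
        b -= learning_rate * float(np.mean(error))
    return w, b, costs

# Noise-free data with the line x_2 = 0.5 as a boundary (class 0 above, class 1 below)
rng = np.random.default_rng(0)
X = rng.random((200, 2))
Y = (X[:, 1:2] <= 0.5).astype("float64")

w, b, costs = train_logistic_regression(X, Y)
predictions = (sigmoid(X @ w + b) > 0.5).astype("float64")
accuracy = float(np.mean(predictions == Y))
print("initial cost:", costs[0], "final cost:", costs[-1], "accuracy:", accuracy)
```

With zero initialization every prediction starts at 0.5, so the initial cost is $\log 2 \approx 0.693$, and the cost decreases from there as the boundary moves toward the line $x_2 = 0.5$.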

def logistic_regression_2D(X_train, Y_train, X_test, Y_test, learning_rate, num_epochs, db_plot_name):
    tf.reset_default_graph()

    # Create parameters
    W = tf.get_variable("W", shape=(2, 1), initializer=tf.contrib.layers.xavier_initializer())
    b = tf.get_variable("b", shape=(1, 1), initializer=tf.zeros_initializer())

    # Forward propagation
    X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name="X")
    Y = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="Y")
    Y_hat = tf.sigmoid(tf.matmul(X, W) + b)

    # Compute cost, add small value epsilon to tf.log() calls to avoid taking the log of 0
    epsilon = 1e-10
    J = -tf.reduce_mean(Y * tf.log(Y_hat + epsilon) + (1 - Y) * tf.log(1 - Y_hat + epsilon))

    # Create train op
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for i in range(num_epochs):
            sess.run(train_op, feed_dict={X: X_train, Y: Y_train})
            if i % 1000 == 0:
                # Only evaluate the cost when it is actually logged
                J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
                print("i: " + str(i) + ", J_train: " + str(J_train))

        # Evaluate
        J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
        J_test = sess.run(J, feed_dict={X: X_test, Y: Y_test})

        # Plot decision boundary
        predict_func = lambda X_grid: sess.run(Y_hat, feed_dict={X: X_grid, Y: Y_train})
        bc_utils.plot_decision_boundary(X_train, Y_train, predict_func, db_plot_name)

        return J_train, J_test

def plot_decision_boundary(X_train, Y_train, predict_func, plot_name):
    interval = np.arange(-0.1, 1.1, 0.001)
    X_1, X_2 = np.meshgrid(interval, interval)
    X_grid = np.c_[X_1.ravel(), X_2.ravel()]
    Y_grid = predict_func(X_grid)
    predictions_grid = np.array([round(x[0]) for x in Y_grid])
    predictions_grid = predictions_grid.reshape(X_1.shape)
    plt.contourf(X_1, X_2, predictions_grid, cmap=mpl.colors.ListedColormap(["#ff6868", "#6875ff"]))
    plt.scatter(X_train[:, 0], X_train[:, 1], c=Y_train[:, 0], cmap=mpl.colors.ListedColormap(["red", "blue"]), edgecolors=["black"])
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.clf()
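To make the grid construction in plot_decision_boundary easier to follow, here is a small NumPy-only sketch of the same steps on a coarse 4x4 grid. The classifier `predict_func` is a hypothetical stand-in (a fixed sigmoid of $x_1 + x_2 - 1$), not a trained model.

```python
import numpy as np

# A coarse grid keeps the example small; the function above uses a step of 0.001
interval = np.arange(0.0, 1.0, 0.25)
X_1, X_2 = np.meshgrid(interval, interval)
X_grid = np.c_[X_1.ravel(), X_2.ravel()]  # One (x_1, x_2) row per grid point


def predict_func(X):
    """Hypothetical classifier: sigmoid of x_1 + x_2 - 1, returned with shape (m, 1)."""
    z = X[:, :1] + X[:, 1:] - 1.0
    return 1.0 / (1.0 + np.exp(-z))


Y_grid = predict_func(X_grid)

# Threshold at 0.5 and restore the grid layout expected by plt.contourf
predictions_grid = (Y_grid[:, 0] > 0.5).astype(int).reshape(X_1.shape)

print(X_grid.shape, predictions_grid.shape)
print(predictions_grid)
```

Each cell of `predictions_grid` holds the predicted class for one grid point, which is what the filled contour plot colors.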

As one might expect, this simple model yielded a good result when applied to our generated linear data. The final cost after 20000 epochs with a learning rate of 0.1 was 0.2094 on the training set and 0.2434 on the test set.

As we have already discussed, logistic regression is a linear classifier and should perform poorly on a non-linear data set. This was the case for our generated non-linear data. The algorithm could not find a good decision boundary, and instead predicted the same class for the whole domain. The final cost after 20000 epochs with a learning rate of 0.1 was 0.5892 on the training set and 0.5800 on the test set.

Neural Network Classification

The second model uses a small FNN. The network has two hidden layers with ten neurons each. ReLU is used as an activation function for all layers except the output layer, which uses the sigmoid function. Note that if the hidden layers are removed, we are left with simple logistic regression. I used the same cost function as in the first model.

def nn_binary_classification_2D(X_train, Y_train, X_test, Y_test, learning_rate, num_epochs, db_plot_name):
    tf.reset_default_graph()

    # Create parameters
    W_1 = tf.get_variable("W_1", shape=(2, 10), initializer=tf.contrib.layers.xavier_initializer())
    b_1 = tf.get_variable("b_1", shape=(1, 10), initializer=tf.zeros_initializer())

    W_2 = tf.get_variable("W_2", shape=(10, 10), initializer=tf.contrib.layers.xavier_initializer())
    b_2 = tf.get_variable("b_2", shape=(1, 10), initializer=tf.zeros_initializer())

    W_3 = tf.get_variable("W_3", shape=(10, 1), initializer=tf.contrib.layers.xavier_initializer())
    b_3 = tf.get_variable("b_3", shape=(1, 1), initializer=tf.zeros_initializer())

    # Forward propagation
    X = tf.placeholder(dtype=tf.float32, shape=(None, 2), name="X")
    Y = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="Y")

    Y_hat = tf.matmul(X, W_1) + b_1
    Y_hat = tf.nn.relu(Y_hat)
    Y_hat = tf.matmul(Y_hat, W_2) + b_2
    Y_hat = tf.nn.relu(Y_hat)
    Y_hat = tf.sigmoid(tf.matmul(Y_hat, W_3) + b_3)

    # Compute cost, add small value epsilon to tf.log() calls to avoid taking the log of 0
    epsilon = 1e-10
    J = -tf.reduce_mean(Y * tf.log(Y_hat + epsilon) + (1 - Y) * tf.log(1 - Y_hat + epsilon))

    # Create train op
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for i in range(num_epochs):
            sess.run(train_op, feed_dict={X: X_train, Y: Y_train})
            if i % 1000 == 0:
                # Only evaluate the cost when it is actually logged
                J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
                print("i: " + str(i) + ", J_train: " + str(J_train))

        # Evaluate
        J_train = sess.run(J, feed_dict={X: X_train, Y: Y_train})
        J_test = sess.run(J, feed_dict={X: X_test, Y: Y_test})

        # Plot decision boundary
        predict_func = lambda X_grid: sess.run(Y_hat, feed_dict={X: X_grid, Y: Y_train})
        bc_utils.plot_decision_boundary(X_train, Y_train, predict_func, db_plot_name)

        return J_train, J_test
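The forward pass of this network can also be written in a few lines of NumPy, which makes the layer shapes explicit. This is a sketch only: the random weights are an assumption standing in for trained parameters, and the function names are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def relu(z):
    return np.maximum(z, 0.0)


def forward(X, params):
    """Forward pass: two ReLU hidden layers followed by a sigmoid output."""
    W_1, b_1, W_2, b_2, W_3, b_3 = params
    h_1 = relu(X @ W_1 + b_1)        # (m, 2)  -> (m, 10)
    h_2 = relu(h_1 @ W_2 + b_2)      # (m, 10) -> (m, 10)
    return sigmoid(h_2 @ W_3 + b_3)  # (m, 10) -> (m, 1)


rng = np.random.default_rng(0)

# Random weights stand in for trained parameters (assumption for illustration)
params = (
    rng.normal(size=(2, 10)), np.zeros((1, 10)),
    rng.normal(size=(10, 10)), np.zeros((1, 10)),
    rng.normal(size=(10, 1)), np.zeros((1, 1)),
)

X = rng.uniform(size=(5, 2))
Y_hat = forward(X, params)
predictions = (Y_hat > 0.5).astype(int)

print(Y_hat.shape, predictions.ravel())
```

Dropping the two hidden layers leaves only `sigmoid(X @ W + b)`, which is the logistic regression model from the previous section.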

The FNN model also yielded a good result when applied to our generated linear data. The final cost after 20000 epochs with a learning rate of 0.1 was 0.1979 on the training set and 0.2452 on the test set. It overfit the training set more than the first model did, as the cost difference between the training set and the test set was greater, and we can see that the decision boundary was not linear. Some overfitting is to be expected since the FNN model can represent more complex functions, and we did not use any anti-overfitting techniques such as regularization.
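The overfitting comparison is simple arithmetic on the costs reported above. As a quick check, using only the numbers from this post:

```python
# Final costs reported in this post for the linear data set
costs = {
    "logistic regression": {"train": 0.2094, "test": 0.2434},
    "FNN": {"train": 0.1979, "test": 0.2452},
}

# Gap between test and training cost, used here as a rough overfitting indicator
gaps = {name: round(c["test"] - c["train"], 4) for name, c in costs.items()}

for name, gap in gaps.items():
    print(f"{name}: test - train = {gap:.4f}")
```

The gap is 0.0340 for logistic regression and 0.0473 for the FNN.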

The big difference in performance comes when we switch to the non-linear data set. The more flexible FNN model can learn to find an approximately circular decision boundary, and fit the training set very well. The final cost after 20000 epochs with a learning rate of 0.1 was 0.0048 on the training set and 0.1214 on the test set. This time both costs were smaller but the overfitting greater, which can be explained by the relatively small amount of noise in the generated non-linear data, which allows a better fit to the training set.

Conclusion

Logistic regression can be a good choice when you are sure that a linear decision boundary will be sufficient. Otherwise you should probably use a neural network or a similarly flexible model. You might have to put in some work in order to find a good network architecture and to avoid overfitting, but the resulting model should have better results on most non-linear data sets.

The source code can be found at https://github.com/CarlFredriksson/binary_classification.

Thank you for reading, and feel free to send me any questions.

Regression Techniques

Thu, 30 Aug 2018 00:00:00 +0000

Introduction

Regression is an important part of Machine Learning, where one tries to find an underlying function that is a good fit to some noisy data. As an example, you might want to predict how much you could sell a house for given data from other house sales. The data would contain input/output pairs of house features (such as size, number of rooms, location, etc.) and corresponding price.

I have implemented two models in TensorFlow, one using simple linear regression and another using a shallow feedforward neural network (FNN). For visualization purposes, I decided to only use one-dimensional data, but the code can easily be adapted to data with multiple dimensions.

The source code can be found at https://github.com/CarlFredriksson/regression_techniques.

Implementation

Import Modules

We will need to import the following modules:

import os
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

Generate Data

Since this project is only a proof of concept, I decided to use generated training data instead of looking for some real world data.

def generate_random_data(a=1, b=2, non_linear=False, noise_stdev=1):
    """Generate data with random noise."""
    X = np.expand_dims(np.linspace(-5, 5, num=50), axis=1)
    Y = a * X + b
    if non_linear:
        Y = a * X**2 + b
    noise = np.random.normal(0, noise_stdev, size=X.shape)
    Y += noise
    Y = Y.astype("float32")

    return X, Y

The function can generate linear data of the form $y=ax+b$ and non-linear data of the form $y=ax^2+b$. I used $a=1$ and $b=2$ for all data sets.

Let us generate some training data that will be used to train the models.

Let us also generate some test data that will be used to evaluate the models.
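The exact calls are not shown here, so the following is a minimal sketch of how the data sets could be generated, assuming that training and test sets are simply independent draws from generate_random_data. The seed and the variable names are illustrative assumptions.

```python
import numpy as np


def generate_random_data(a=1, b=2, non_linear=False, noise_stdev=1):
    """Same logic as the function above, repeated so the sketch runs on its own."""
    X = np.expand_dims(np.linspace(-5, 5, num=50), axis=1)
    Y = a * X**2 + b if non_linear else a * X + b
    Y = Y + np.random.normal(0, noise_stdev, size=X.shape)
    return X, Y.astype("float32")


np.random.seed(0)  # Illustrative seed, only to make the sketch reproducible

# Two independent draws: identical inputs X, different noise in Y
X_train, Y_train = generate_random_data()
X_test, Y_test = generate_random_data()

# The same for the non-linear case y = a * x^2 + b
X_train_nl, Y_train_nl = generate_random_data(non_linear=True)
X_test_nl, Y_test_nl = generate_random_data(non_linear=True)

print(X_train.shape, Y_train.shape, Y_train.dtype)
```

Because the inputs are a fixed linspace, the training and test sets differ only in their noise.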

Linear Regression

The first model I implemented used simple linear regression. This technique assumes that the underlying function is of the form $y=ax+b$, and the objective is to find the values for $a$ and $b$ that best fit the data. We need to define a cost function that makes us achieve this objective, and a very common one is squared error cost. Let us define $m$ as the number of training examples, $X_{train}$ as the input features, and $Y_{train}$ as the correct outputs for those features. Both $X_{train}$ and $Y_{train}$ have the dimensions $(m,1)$. Let us also define $Y_{estimate}$ as the values we get using our estimated $a$ and $b$ values ($a_{estimate}$ and $b_{estimate}$). The cost is defined as:

$$ J(X_{train}, Y_{train}, Y_{estimate}) = \frac{1}{m} \sum_{x \in X_{train}} (Y_{train}(x) - Y_{estimate}(x))^2 $$

or, ignoring the noise in $Y_{train}$, in other words:

$$ J(X_{train}, a, b, a_{estimate}, b_{estimate}) = \frac{1}{m} \sum_{x \in X_{train}} ((ax+b) - (a_{estimate}x+b_{estimate}))^2 $$

def linear_regression_1D(X_train, Y_train, X_test, Y_test, learning_rate, num_iterations):
    y_est = []

    # Create parameters
    a_estimate = tf.Variable(0, dtype=tf.float32)
    b_estimate = tf.Variable(0, dtype=tf.float32)

    # Compute cost
    X_placeholder = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="X_placeholder")
    Y_placeholder = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="Y_placeholder")
    Y_estimate = a_estimate * X_placeholder + b_estimate
    J = tf.reduce_mean((Y_placeholder - Y_estimate)**2)

    # Create train op
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for i in range(num_iterations):
            sess.run(train_op, feed_dict={X_placeholder: X_train, Y_placeholder: Y_train})

        # Create estimated Y values
        y_est, J_train = sess.run([Y_estimate, J], feed_dict={X_placeholder: X_train, Y_placeholder: Y_train})
        J_test = sess.run(J, feed_dict={X_placeholder: X_test, Y_placeholder: Y_test})

    return y_est, J_train, J_test
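To see what the optimizer does with this cost, here is a plain-Python sketch of gradient descent on $a_{estimate}$ and $b_{estimate}$, using the analytical gradients of the squared error cost. The learning rate and iteration count are assumptions picked for this example, and the data is noise-free so the expected answer ($a=1$, $b=2$) is known.

```python
def squared_error_cost(xs, ys, a_est, b_est):
    """Mean squared error J for the line y = a_est * x + b_est."""
    m = len(xs)
    return sum((y - (a_est * x + b_est)) ** 2 for x, y in zip(xs, ys)) / m


def gradient_descent(xs, ys, learning_rate, num_iterations):
    """Minimize J with plain gradient descent, starting from a = b = 0."""
    a_est, b_est = 0.0, 0.0
    m = len(xs)
    for _ in range(num_iterations):
        residuals = [y - (a_est * x + b_est) for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to a_est and b_est
        grad_a = -2.0 / m * sum(r * x for r, x in zip(residuals, xs))
        grad_b = -2.0 / m * sum(residuals)
        a_est -= learning_rate * grad_a
        b_est -= learning_rate * grad_b
    return a_est, b_est


# Noise-free data from y = 1 * x + 2 on 50 evenly spaced points in [-5, 5]
xs = [-5 + 10 * i / 49 for i in range(50)]
ys = [1 * x + 2 for x in xs]

a_est, b_est = gradient_descent(xs, ys, learning_rate=0.01, num_iterations=2000)
print(a_est, b_est, squared_error_cost(xs, ys, a_est, b_est))
```

With noisy data the estimates would land near, but not exactly on, the true values, and the final cost would be close to the noise variance.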

As one might expect, this simple model yielded a good result when applied to our generated linear data. It fit the training set pretty well with a final cost of 0.9422.

More importantly, it also had good results on the test set with a final cost of 1.0851.

However, when the model is used on a non-linear data set, it obviously fails miserably. On the training set the final cost was 53.182.

On the test set the final cost was 62.886.

Neural Network Regression

The second model I implemented uses a small FNN. The network has two hidden layers with ten neurons each. ReLU is used as an activation function for all layers except the output layer, which has a single linear neuron. Note that if the hidden layers are removed, we are left with simple linear regression. I used the same cost function as in the first model.

def nn_regression_1D(X_train, Y_train, X_test, Y_test, learning_rate, num_iterations):
    tf.reset_default_graph()
    y_est = []
    
    # Create parameters
    W_1 = tf.get_variable("W_1", shape=(1, 10), initializer=tf.contrib.layers.xavier_initializer())
    b_1 = tf.get_variable("b_1", shape=(1, 10), initializer=tf.zeros_initializer())

    W_2 = tf.get_variable("W_2", shape=(10, 10), initializer=tf.contrib.layers.xavier_initializer())
    b_2 = tf.get_variable("b_2", shape=(1, 10), initializer=tf.zeros_initializer())

    W_3 = tf.get_variable("W_3", shape=(10, 1), initializer=tf.contrib.layers.xavier_initializer())
    b_3 = tf.get_variable("b_3", shape=(1, 1), initializer=tf.zeros_initializer())

    # Forward propagation
    X_placeholder = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="X_placeholder")
    Y_placeholder = tf.placeholder(dtype=tf.float32, shape=(None, 1), name="Y_placeholder")

    X = tf.matmul(X_placeholder, W_1) + b_1
    X = tf.nn.relu(X)
    X = tf.matmul(X, W_2) + b_2
    X = tf.nn.relu(X)
    Y_estimate = tf.matmul(X, W_3) + b_3

    # Compute cost
    J = tf.reduce_mean((Y_placeholder - Y_estimate)**2)

    # Create training operation
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    train_op = optimizer.minimize(J)

    # Start session
    with tf.Session() as sess:
        # Initialize variables
        sess.run(tf.global_variables_initializer())

        # Training loop
        for i in range(num_iterations):
            sess.run(train_op, feed_dict={X_placeholder: X_train, Y_placeholder: Y_train})

        # Create estimated Y values
        y_est, J_train = sess.run([Y_estimate, J], feed_dict={X_placeholder: X_train, Y_placeholder: Y_train})
        J_test = sess.run(J, feed_dict={X_placeholder: X_test, Y_placeholder: Y_test})
    
    return y_est, J_train, J_test

The FNN model also yielded a good result when applied to our generated linear data. It fit the training set even better than the first model with a final cost of 0.7937.

However, it performed slightly worse on the test set, with a final cost of 1.1286 compared to 1.0851 for the first model. The larger gap between training and test performance is due to overfitting. Overfitting can be combated by introducing regularization or dropout, getting more data, and/or reducing the size of the network.

The second model shows its strength when used on the non-linear data set. Since it does not assume the form of the underlying function, it can fit to non-linear data as well. The final cost on the training set was 1.0929.

It once again did some overfitting, and the final cost on the test set was 1.6068.

Conclusion

A simple linear regression model can be a good choice when you are absolutely sure that the underlying function is linear. Otherwise I would use a neural network or a similarly flexible model. You might have to put in more work to find a good network architecture and to avoid overfitting, but you do not have to worry about the possible non-linearity of your data set.

The source code can be found at https://github.com/CarlFredriksson/regression_techniques.

Thank you for reading, and feel free to send me any questions.

Neural Style Transfer

Mon, 20 Aug 2018 00:00:00 +0000

Introduction

Neural Style Transfer (NST) is one of the coolest techniques in deep learning. It is an algorithm that generates a new image starting from a content image (the cat in the image above) and a style image. The objective is for the generated image to contain the “content” of the content image, but have the same “style” as the style image. NST is a form of transfer learning which uses a pretrained convolutional neural network (CNN). Instead of training the weights of the network, we train the generated image, which is used as an input to the network, until the objective is sufficiently achieved. The algorithm was created by Gatys et al. (2015).

I have implemented NST in TensorFlow using a pretrained VGG-19 network trained on the ImageNet dataset. To avoid having to search the Internet for pretrained weights and having to create the layers myself, I used a pretrained Keras model from keras.applications. I extracted the layers of the Keras model and used them to create a TensorFlow graph.

The source code can be found at https://github.com/CarlFredriksson/neural_style_transfer.

Implementation

Import modules

To get started we need to import some modules.

import os
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import MaxPooling2D

Define constants

Let us define some constants we will need.

CONTENT_IMG_PATH = "./input/cat.jpg"
STYLE_IMG_PATH = "./input/starry_night.jpg"
GENERATED_IMG_PATH = "./output/generated_img.jpg"
IMG_SIZE = (400, 300)
NUM_COLOR_CHANNELS = 3
ALPHA = 10
BETA = 40
NOISE_RATIO = 0.6
CONTENT_LAYER_INDEX = 13
STYLE_LAYER_INDICES = [1, 4, 7, 12, 17]
STYLE_LAYER_COEFFICIENTS = [0.2, 0.2, 0.2, 0.2, 0.2]
NUM_ITERATIONS = 500
LEARNING_RATE = 2
VGG_IMAGENET_MEANS = np.array([103.939, 116.779, 123.68]).reshape((1, 1, 3)) # In blue-green-red order
LOG_GRAPH = False

The purpose of some of these is self-evident, and the others will be explained later on. Feel free to play around with different constant values and see how the results are affected.

Load and preprocess images

We need to load the content and style images, and do some simple preprocessing before they can be used as input to the network. I used the library OpenCV (cv2) for image processing.

def load_img(path, size, color_means):
    """Load image from path, preprocess it, and return the image."""
    img = cv2.imread(path)
    img = cv2.resize(img, dsize=size, interpolation=cv2.INTER_CUBIC)
    img = img.astype("float32")
    img -= color_means
    img = np.expand_dims(img, axis=0)

    return img

content_img = load_img(CONTENT_IMG_PATH, IMG_SIZE, VGG_IMAGENET_MEANS)
style_img = load_img(STYLE_IMG_PATH, IMG_SIZE, VGG_IMAGENET_MEANS)

The images are normalized using the means of each color channel for ImageNet, contained in the constant VGG_IMAGENET_MEANS. Note that the order of the colors is BGR (Blue Green Red) and not RGB, because the network we will use is trained on BGR images and OpenCV uses BGR by default.

We will also need a function for saving images. Remember to add the color means that were subtracted in preprocessing.

def save_img(img, path, color_means):
    """Save image to path after postprocessing."""
    img += color_means
    img = np.clip(img, 0, 255)
    img = img.astype("uint8")
    cv2.imwrite(path, img)

Initialize the generated image

The generated image is initialized as a combination of the content image and random noise. We could initialize to pure noise, but this would result in longer training time before the content of the generated image resembles the content of the content image.

def create_noisy_img(img, noise_ratio):
    """Add noise to img and return it."""
    noise = np.random.uniform(-20, 20, (img.shape[0], img.shape[1], img.shape[2], img.shape[3])).astype("float32")
    noisy_img = noise_ratio * noise + (1 - noise_ratio) * img

    return noisy_img

generated_img_init = create_noisy_img(content_img, NOISE_RATIO)

Cost function

In order to satisfy the objective of training a generated image to have the content of the content image and the style of the style image, we need to define a cost function that can be optimized. We will define two separate parts of this function, the content cost and the style cost, which we will combine into a total cost.

Content cost

How can we make sure that the content in the generated image matches the content of the content image? In a CNN, the activations of shallow layers detect low level features such as edges. The activations of deep layers detect higher level features, such as objects like cats or cars. We will use deep layer activations to ensure that the high level features of the generated image are similar to the high level features of the content image.

Let $a^{(c)}$ be the activations of some deep layer when the content image is used as input to the network, and $a^{(g)}$ be the activations when the generated image is used. Let $n_h, n_w, n_c$ be the height, width, and number of channels of the chosen layer. The content cost is defined as:

$$ J_{content}(a^{(c)},a^{(g)}) = \frac{1}{4 \times n_h \times n_w \times n_c} \sum_{\text{all entries}}(a^{(c)} - a^{(g)})^2 $$

def content_cost(a_c, a_g):
    """Return a tensor representing the content cost."""
    _, n_h, n_w, n_c = a_c.shape

    return (1/(4 * n_h * n_w * n_c)) * tf.reduce_sum(tf.square(tf.subtract(a_c, a_g)))
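A NumPy version of the same formula is handy for sanity-checking the normalization on tiny inputs. The function name `content_cost_np` is an assumption for this sketch and is not part of the original code.

```python
import numpy as np


def content_cost_np(a_c, a_g):
    """NumPy version of the content cost for activations of shape (1, n_h, n_w, n_c)."""
    _, n_h, n_w, n_c = a_c.shape
    return np.sum((a_c - a_g) ** 2) / (4 * n_h * n_w * n_c)


a_c = np.zeros((1, 2, 3, 4))
a_g = np.ones((1, 2, 3, 4))

# Each of the 2 * 3 * 4 = 24 entries differs by 1, so the sum of squares is 24
# and the cost is 24 / (4 * 24) = 0.25
print(content_cost_np(a_c, a_g))

# Identical activations give a cost of zero
print(content_cost_np(a_c, a_c))
```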

Style cost

The next question is how we can make sure that the style of the generated image matches the style of the style image. The style cost function is a bit more complicated than the content cost function. To start off, we need to know how to compute the Gram matrix for a set of vectors $(v_1,\dots,v_n)$. The entries $G_{ij}$ of a Gram matrix are computed by:

$$ G_{ij} = v_i \cdot v_j = v_i^T v_j $$

Intuitively $G_{ij}$ is one way to measure how similar $v_i$ is to $v_j$, since if they are similar their dot product should be large. We will compute Gram matrices by unrolling the activations of a layer into a matrix with one row for each channel of that layer, and then taking the matrix product between the matrix and its transpose. The resulting Gram matrix $G$ is also called a style matrix and has dimensions $(n_c,n_c)$. As an example of what the style matrix captures: if the channel (also called filter) $i$ is detecting vertical textures, and $j$ is detecting horizontal textures, then $G_{ij}$ measures how much vertical and horizontal textures occur together in the image.
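The unrolling step is easy to get wrong, so here is a small NumPy sketch that builds the Gram matrix for random stand-in activations and cross-checks it against a direct sum over spatial positions. The shapes and the random data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_w, n_c = 3, 4, 2
a = rng.normal(size=(1, n_h, n_w, n_c))  # Stand-in for layer activations

# Unroll so that each row holds all spatial positions of one channel.
# Reversing the axes gives shape (n_c, n_w, n_h, 1), which reshapes cleanly.
unrolled = np.transpose(a).reshape(n_c, n_h * n_w)

# Gram (style) matrix: dot products between every pair of channel rows
G = unrolled @ unrolled.T

# Cross-check: sum over spatial positions of products of channel activations
G_check = np.einsum("hwi,hwj->ij", a[0], a[0])

print(G.shape, np.allclose(G, G_check))
```

The result is a symmetric $(n_c, n_c)$ matrix, and its diagonal entries measure how strongly each individual channel is activated across the image.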

Using the gram matrices $G^{(s)}$ and $G^{(g)}$ for the style and generated images respectively, the style cost for a given layer $l$ is defined as:

$$ J_{style}^{[l]}(G^{(s)},G^{(g)}) = \frac{1}{4 \times n_c^2 \times (n_h \times n_w)^2} \sum_{i=1}^{n_c} \sum_{j=1}^{n_c} {(G_{ij}^{(s)} - G_{ij}^{(g)})^2} $$

Unlike the content cost, the style cost normally uses several layers with corresponding coefficients $\lambda^{[l]}$:

$$ J_{style} = \sum_l \lambda^{[l]} J_{style}^{[l]} $$

def style_cost(a_s_layers, a_g_layers, style_layer_coefficients):
    """Return a tensor representing the style cost."""
    style_cost = 0
    for i in range(len(a_s_layers)):
        # Compute gram matrix for the activations of the style image
        a_s = a_s_layers[i]
        _, n_h, n_w, n_c = a_s.shape
        a_s_unrolled = tf.reshape(tf.transpose(a_s), [n_c, n_h*n_w])
        a_s_gram = tf.matmul(a_s_unrolled, tf.transpose(a_s_unrolled))

        # Compute gram matrix for the activations of the generated image
        a_g = a_g_layers[i]
        a_g_unrolled = tf.reshape(tf.transpose(a_g), [n_c, n_h*n_w])
        a_g_gram = tf.matmul(a_g_unrolled, tf.transpose(a_g_unrolled))

        # Compute style cost for the current layer
        style_cost_layer = (1/(4 * n_c**2 * (n_w* n_h)**2)) * tf.reduce_sum(tf.square(tf.subtract(a_s_gram, a_g_gram)))

        style_cost += style_cost_layer * style_layer_coefficients[i]
    
    return style_cost
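One property worth seeing in action is that the Gram matrix sums over all spatial positions, so the style cost ignores where in the image a texture appears. The NumPy sketch below mirrors the function above (the names `gram` and `style_cost_np` are assumptions for this example) and shows that flipping the activations vertically leaves the style cost at zero.

```python
import numpy as np


def gram(a):
    """Gram matrix of activations with shape (1, n_h, n_w, n_c)."""
    n_c = a.shape[-1]
    unrolled = np.transpose(a).reshape(n_c, -1)
    return unrolled @ unrolled.T


def style_cost_np(a_s_layers, a_g_layers, coefficients):
    """Weighted sum of per-layer style costs."""
    total = 0.0
    for a_s, a_g, coeff in zip(a_s_layers, a_g_layers, coefficients):
        _, n_h, n_w, n_c = a_s.shape
        diff = gram(a_s) - gram(a_g)
        total += coeff * np.sum(diff ** 2) / (4 * n_c**2 * (n_h * n_w) ** 2)
    return total


rng = np.random.default_rng(0)

# Two fake "layers" with different shapes stand in for VGG activations
layers = [rng.normal(size=(1, 4, 4, 3)), rng.normal(size=(1, 2, 2, 5))]
flipped = [a[:, ::-1, :, :] for a in layers]  # Same values, mirrored vertically
other = [rng.normal(size=a.shape) for a in layers]

cost_flipped = style_cost_np(layers, flipped, [0.5, 0.5])
cost_other = style_cost_np(layers, other, [0.5, 0.5])
print(cost_flipped, cost_other)
```

The content cost, in contrast, compares activations position by position, which is why the two costs capture different aspects of an image.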

Total cost

Let $\alpha$ and $\beta$ be constants. The total cost is defined as a linear combination of the content cost and style cost:

$$ J = \alpha J_{content} + \beta J_{style} $$

def total_cost(content_cost, style_cost, alpha, beta):
    """Return a tensor representing the total cost."""
    return alpha * content_cost + beta * style_cost

Create graph

Now it is time to create the computation graph and store the tensors that correspond to the layer activations we are interested in. Let us start by creating a TensorFlow variable which we will use as an input to the network.

input_var = tf.Variable(content_img, dtype=tf.float32, expected_shape=(None, None, None, NUM_COLOR_CHANNELS), name="input_var")

The variable is initialized to the content image, and will later be assigned the style and generated images. Now let us get the pretrained VGG-19 Keras model, extract the layers, and store the output tensors we will need. Note that the MaxPooling layers are swapped for average pooling, which Gatys et al. report gives slightly more appealing results.

def create_output_tensors(input_variable, content_layer_index, style_layer_indices):
    """
    Create output tensors, using a pretrained Keras VGG19-model.
    Return tensors for content and style layers.
    """
    vgg_model = VGG19(weights="imagenet", include_top=False)
    layers = [l for l in vgg_model.layers]

    x = layers[1](input_variable)
    x_content_tensor = x
    x_style_tensors = []
    if 1 in style_layer_indices:
        x_style_tensors.append(x)

    for i in range(2, len(layers)):
        # Use layers from vgg model, but swap max pooling layers for average pooling
        if type(layers[i]) == MaxPooling2D:
            x = tf.nn.avg_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        else:
            x = layers[i](x)

        # Store appropriate layer outputs
        if i == content_layer_index:
            x_content_tensor = x
        if i in style_layer_indices:
            x_style_tensors.append(x)

    return x_content_tensor, x_style_tensors

x_content, x_styles = create_output_tensors(input_var, CONTENT_LAYER_INDEX, STYLE_LAYER_INDICES)

The constant CONTENT_LAYER_INDEX=13 makes us use the activations of layer block4_conv2 for computing content cost. The constant STYLE_LAYER_INDICES = [1, 4, 7, 12, 17] makes us use the activations of the layers block1_conv1, block2_conv1, block3_conv1, block4_conv1, and block5_conv1 for computing style cost. Feel free to use other layers, or change the STYLE_LAYER_COEFFICIENTS. You can use vgg_model.summary() to see the available layers.
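If you prefer to choose layers by name, a small lookup table can translate names into indices. The hard-coded list below is an assumption about the layer order of the Keras VGG19 model with include_top=False (index 0 being the input layer, whose actual name may differ); verify it against vgg_model.summary() before relying on it.

```python
# Assumed layer order of Keras VGG19 with include_top=False (index 0 is the input layer)
VGG19_LAYER_NAMES = [
    "input",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_conv4", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_conv4", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_conv4", "block5_pool",
]


def layer_indices(names):
    """Translate layer names into indices usable for the constants above."""
    return [VGG19_LAYER_NAMES.index(name) for name in names]


style_indices = layer_indices(
    ["block1_conv1", "block2_conv1", "block3_conv1", "block4_conv1", "block5_conv1"]
)
content_index = layer_indices(["block4_conv2"])[0]

print(style_indices, content_index)
```

The printed indices match the values of STYLE_LAYER_INDICES and CONTENT_LAYER_INDEX used in this post.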

Create training operation

Now it is time to create the training operation. Note that the Keras session is used instead of tf.Session(). This is because we have used a Keras model, and the weights for that model are contained in the Keras session. It is important not to use tf.global_variables_initializer() in this case, since the pretrained weights would be re-initialized. Instead we use tf.variables_initializer([input_var]). Also note that the AdamOptimizer has variables that we initialize using tf.variables_initializer(optimizer.variables()).

optimizer = tf.train.AdamOptimizer(LEARNING_RATE)

# Use the Keras session instead of creating a new one
with K.get_session() as sess:
    sess.run(tf.variables_initializer([input_var]))

    # Extract the layer activations for content and style images
    a_content = sess.run(x_content, feed_dict={K.learning_phase(): 0})
    sess.run(input_var.assign(style_img))
    a_styles = sess.run(x_styles, feed_dict={K.learning_phase(): 0})

    # Define the cost function
    J_content = content_cost(a_content, x_content)
    J_style = style_cost(a_styles, x_styles, STYLE_LAYER_COEFFICIENTS)
    J_total = total_cost(J_content, J_style, ALPHA, BETA)

    # Log the graph. To display use "tensorboard --logdir=log".
    if LOG_GRAPH:
        writer = tf.summary.FileWriter("log", sess.graph)
        writer.close()

    # Assign the generated random initial image as input
    sess.run(input_var.assign(generated_img_init))

    # Create the training operation
    train_op = optimizer.minimize(J_total, var_list=[input_var])
    sess.run(tf.variables_initializer(optimizer.variables()))

Train the generated image

It is finally time to train the generated image! The generated image is saved every 20th iteration so you can see the progress. The following code should be inside the with K.get_session() as sess: block created above.

for i in range(NUM_ITERATIONS):
    sess.run(train_op)

    if (i%20) == 0:
        print(
            "Iteration: " + str(i) +
            ", Content cost: " + "{:.2e}".format(sess.run(J_content)) +
            ", Style cost: " + "{:.2e}".format(sess.run(J_style)) +
            ", Total cost: " + "{:.2e}".format(sess.run(J_total))
        )

        # Save the generated image
        generated_img = sess.run(input_var)[0]
        save_img(generated_img, GENERATED_IMG_PATH, VGG_IMAGENET_MEANS)

# Save the generated image
generated_img = sess.run(input_var)[0]
save_img(generated_img, GENERATED_IMG_PATH, VGG_IMAGENET_MEANS)

Conclusion

I am happy with how the project turned out. Neural Style Transfer is a really cool technique, and it is always nice to have visual results that can be appreciated by non-ML people!

The way I extracted the layers from the pretrained Keras model does seem slightly “hacky”, but overall I feel like it is a decent alternative that I probably will use again.

The source code can be found at https://github.com/CarlFredriksson/neural_style_transfer.

Thank you for reading, and feel free to send me any questions.