Conditional Generative Adversarial Networks
Introduction
A Generative Adversarial Network (GAN) takes noise as input and generates data that resembles examples from the training set. A Conditional Generative Adversarial Network (CGAN) is a type of GAN that conditions on additional information. This additional information makes it possible to have more control over the data generation. CGANs were introduced in Conditional Generative Adversarial Nets (Mirza, Osindero, 2014).
A GAN can be conditioned on many types of information. For example, if you have a training set of animal images and their respective class labels, you can condition the model on the class labels in order to generate specific animals. Another example is to condition on images in order to do image-to-image translation, such as in Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, Efros, 2017).
In this post we are going to use TensorFlow to implement a CGAN for generating specific digits that look handwritten. The dataset we are going to use is the MNIST database of handwritten digits. In my last post I wrote about DCGANs and generated random handwritten digits, but in this post we are going to condition the GAN on digit labels in order to generate specific digits.
The source code can be found at https://github.com/CarlFredriksson/cgan_tensorflow.
Theory
For an introduction to the theory behind GANs, see my previous post Generative Adversarial Networks. The only thing that has changed is that the generator $G$ and the discriminator $D$ now condition on additional information $y$.
One training step consists of:
- Sampling a mini batch of $m$ noise vectors $\{z^{(1)},\dots,z^{(m)}\}$
- Sampling a mini batch of $m$ training examples $\{(x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)})\}$
- Updating $D$ by doing one gradient descent step on its loss function:
$$ J_D = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(x^{(i)}, y^{(i)}\big) + \log\big(1 - D\big(G\big(z^{(i)},y^{(i)}\big),y^{(i)}\big)\big)\Big] $$
- Updating $G$ by doing one gradient descent step on its loss function:
$$ J_G = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(G\big(z^{(i)},y^{(i)}\big),y^{(i)}\big)\Big] $$
Implementation
Import Packages
These are the packages we need to import.
import math
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from tensorflow.keras.datasets import mnist
from sklearn.utils import shuffle
Load MNIST Data
You could download the dataset from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import it from tensorflow.keras.datasets. Note that in the GAN post we didn’t need to store the labels, but this time we are going to condition on them.
X_train, Y_train = utils.load_mnist_data()
def load_mnist_data():
    (X_train, Y_train), _ = mnist.load_data()
    return X_train, Y_train
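As a quick sanity check (my own addition, assuming the standard Keras MNIST split), the loaded arrays should look like this:

print(X_train.shape)  # (60000, 28, 28), uint8 pixel values in [0, 255]
print(Y_train.shape)  # (60000,), integer class labels 0-9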
In order to visualize the data we can plot some examples.
NUM_DIGITS = 10
SAMPLE_SIZE = NUM_DIGITS**2
utils.plot_sample(X_train[:SAMPLE_SIZE], "output/mnist_data.png")
def plot_sample(sample, path):
    n = int(np.sqrt(sample.shape[0]))
    fig = plt.figure(figsize=(8, 8))
    for i in range(n*n):
        ax = plt.subplot(n, n, i + 1)
        ax.imshow(sample[i], cmap=plt.get_cmap("gray"))
        ax.axis("off")
        ax.set_xticklabels([])
        ax.set_yticklabels([])
    fig.subplots_adjust(hspace=0.025, wspace=0.025)
    plt.savefig(path, bbox_inches="tight")
    plt.clf()
    plt.close()
Preprocessing
The generator is going to have a tanh activation at the output layer. Thus we want to normalize the pixel values of the training images from the range $[0,255]$ to $[-1,1]$. We also add a channel dimension of depth 1 since the images were loaded in grayscale.
X_train = utils.preprocess_images(X_train)
def preprocess_images(X):
    # Scale pixel values from [0, 255] to [-1, 1]
    X = X / 255
    X = X - 0.5
    X = X * 2
    # Add a channel dimension: (m, 28, 28) -> (m, 28, 28, 1)
    X = np.expand_dims(X, axis=-1)
    return X
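After preprocessing, the images should have a channel dimension and values in $[-1,1]$ (a quick check of my own, not part of the original script):

print(X_train.shape)                  # (60000, 28, 28, 1)
print(X_train.min(), X_train.max())   # -1.0 1.0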
The labels also need some preprocessing. We will convert them to one-hot vectors and add some dimensions in order to prepare them for convolutional layers. Note that dimensions are not added in the baseline model.
Y_train = preprocess_labels(Y_train)
def preprocess_labels(Y):
    # Convert integer labels to one-hot vectors: (m,) -> (m, 10)
    Y = utils.convert_to_one_hot(Y)
    # Add two singleton dimensions for the convolutional model: (m, 10) -> (m, 1, 1, 10)
    Y = np.expand_dims(Y, axis=1)
    Y = np.expand_dims(Y, axis=1)
    return Y
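The convert_to_one_hot helper presumably lives in utils in the source code and is not shown in this post; a minimal sketch of what it could look like:

def convert_to_one_hot(Y, num_classes=10):
    # Map integer labels (m,) to one-hot vectors (m, num_classes)
    one_hot = np.zeros((Y.shape[0], num_classes))
    one_hot[np.arange(Y.shape[0]), Y] = 1
    return one_hot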
Postprocessing
Later we are going to plot samples of generated images and we will need a function that is the inverse of the preprocess function.
def postprocess_images(X):
    # Remove the channel dimension and map pixel values back from [-1, 1] to [0, 255]
    X = np.squeeze(X, axis=-1)
    X = X / 2
    X = X + 0.5
    X = X * 255
    return X
Partition Training Data into Mini Batches
The training data is randomly shuffled and partitioned into mini batches.
BATCH_SIZE = 128
mini_batches = utils.random_mini_batches(X_train, Y_train, BATCH_SIZE)
def random_mini_batches(X_train, Y_train, batch_size):
    mini_batches = []
    m = X_train.shape[0]
    X_train, Y_train = shuffle(X_train, Y_train)
    # Partition into mini-batches
    num_complete_batches = math.floor(m / batch_size)
    for i in range(num_complete_batches):
        startIndex = i * batch_size
        endIndex = (i + 1) * batch_size
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))
    # Handle the case where the last mini-batch has fewer than batch_size examples
    if m % batch_size != 0:
        startIndex = num_complete_batches * batch_size
        endIndex = m
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))
    return mini_batches
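As a concrete example: with the full MNIST training set of 60,000 images and BATCH_SIZE = 128, this produces 469 mini-batches, 468 full batches of 128 examples plus one final batch of 96 examples.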
Baseline Model
Let’s start by creating a simple CGAN model in order to establish baseline performance. The generator is a simple feedforward neural network (FNN) that takes 100-dimensional noise vectors $Z$ and 10-dimensional one-hot vectors $Y$ as input. It outputs images with dimensions $(28,28,1)$ and pixel values in the range $(-1,1)$.
def generator(Z, Y):
    with tf.variable_scope("Generator"):
        # Concatenate the noise vector with the one-hot label vector
        x = tf.concat([Z, Y], 1)
        x = tf.layers.dense(x, 128, activation=tf.nn.relu)
        x = tf.layers.dense(x, 784, activation=tf.nn.tanh)
        # Reshape the flat 784-dimensional output to image format
        x = tf.reshape(x, [-1, 28, 28, 1])
        return x
The discriminator is also a simple FNN that takes images $X$ and one-hot vectors $Y$ as input. It outputs a single value per example, representing the discriminator's guess of whether the input is a real image of the given class $y$.
def discriminator(X, Y, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        # Flatten the image and concatenate it with the one-hot label vector
        x = tf.layers.flatten(X)
        x = tf.concat([x, Y], 1)
        x = tf.layers.dense(x, 128, activation=tf.nn.relu)
        x = tf.layers.dense(x, 1, activation=tf.nn.sigmoid)
        return x
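As a rough sketch of how the baseline graph could be wired together (the variable names here are my own and may differ from the actual repository):

NOISE_DIM = 100
X = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
Y = tf.placeholder(tf.float32, shape=(None, NUM_DIGITS))
Z = tf.placeholder(tf.float32, shape=(None, NOISE_DIM))
G = generator(Z, Y)
D_real = discriminator(X, Y)
D_fake = discriminator(G, Y, reuse=True)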
If you want the rest of the code for the baseline model, check out the source code. After training the CGAN for 100 epochs I got the following result.
CDCGAN Model
Let’s now create a deep convolutional CGAN (CDCGAN) model, which is better suited to our problem. $Y_{fill}$ contains the same labels as $Y$, copied to fill $(28,28,10)$ so that they can be concatenated with the input images along the channel dimension. This is done in order to fit the additional information into the purely convolutional architecture of the discriminator.
NOISE_DIM = 100
X = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], X_train.shape[3]))
Y = tf.placeholder(tf.float32, shape=(None, 1, 1, NUM_DIGITS))
Y_fill = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], NUM_DIGITS))
Z = tf.placeholder(tf.float32, shape=(None, 1, 1, NOISE_DIM))
is_training = tf.placeholder(tf.bool, shape=())
G, D_real, D_fake = create_cdcgan(X, Y, Y_fill, Z, is_training)
def create_cdcgan(X, Y, Y_fill, Z, is_training):
    G = generator(Z, Y, is_training)
    D_real = discriminator(X, Y_fill, is_training)
    D_fake = discriminator(G, Y_fill, is_training, reuse=True)
    return G, D_real, D_fake
The generator has the same architecture as in the DCGANs post, except for the additional input $Y$.
def generator(Z, Y, is_training):
    with tf.variable_scope("Generator"):
        x = tf.concat([Z, Y], 3)
        # x.shape: (?, 1, 1, 110)
        x = tf.layers.conv2d_transpose(x, 256, 7, strides=1, padding="valid", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)
        x = tf.layers.conv2d_transpose(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)
        x = tf.layers.conv2d_transpose(x, 1, 5, strides=2, padding="same")
        x = tf.nn.tanh(x)
        # x.shape: (?, 28, 28, 1)
        return x
The discriminator also has the same architecture as in the DCGANs post, except for the additional input $Y_{fill}$.
def discriminator(X, Y_fill, is_training, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.concat([X, Y_fill], 3)
        # x.shape: (?, 28, 28, 11)
        x = tf.layers.conv2d(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)
        x = tf.layers.conv2d(x, 256, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)
        x = tf.layers.conv2d(x, 1, 7, strides=1, padding="valid")
        x = tf.nn.sigmoid(x)
        # x.shape: (?, 1, 1, 1)
        return x
Create Training Steps
We will use the same optimizers as in the DCGANs post.
LEARNING_RATE = 0.0002
BETA1 = 0.5
G_loss_func, D_loss_func = utils.create_loss_funcs(D_real, D_fake)
G_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Generator")
D_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Discriminator")
G_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(G_loss_func, var_list=G_vars)
D_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(D_loss_func, var_list=D_vars)
def create_loss_funcs(D_real, D_fake):
    # Small constant to avoid taking the log of zero
    eps = 1e-12
    G_loss_func = tf.reduce_mean(-tf.log(D_fake + eps))
    D_loss_func = tf.reduce_mean(-(tf.log(D_real + eps) + tf.log(1 - D_fake + eps)))
    return G_loss_func, D_loss_func
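A common alternative (not what the code above does, since our discriminator already applies a sigmoid at its output) is to have the discriminator return raw logits and use tf.nn.sigmoid_cross_entropy_with_logits, which is numerically stable without the eps term. A sketch of that variant:

def create_loss_funcs_from_logits(D_real_logits, D_fake_logits):
    # Discriminator: real images should be classified as 1, generated images as 0
    D_loss_func = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(D_real_logits), logits=D_real_logits)
        + tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(D_fake_logits), logits=D_fake_logits))
    # Generator: generated images should be classified as 1 by the discriminator
    G_loss_func = tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(D_fake_logits), logits=D_fake_logits))
    return G_loss_func, D_loss_func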
Train Model
I trained the model for 20 epochs and plotted a sample of generated images after each epoch.
NUM_EPOCHS = 20
# Start session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Training loop
    for epoch in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in mini_batches:
            Z_batch = utils.generate_Z_batch((X_batch.shape[0], 1, 1, NOISE_DIM))
            Y_fill_batch = Y_batch * np.ones((X_batch.shape[0], Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))
            # Compute losses
            G_loss, D_loss = sess.run([G_loss_func, D_loss_func], feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            print("Epoch [{0}/{1}] - G_loss: {2}, D_loss: {3}".format(epoch, NUM_EPOCHS - 1, G_loss, D_loss))
            # Run training steps
            _ = sess.run(G_train_step, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            _ = sess.run(D_train_step, feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
        # Plot generated images
        Y_batch = np.ones((NUM_DIGITS, NUM_DIGITS))
        Y_batch = (Y_batch * np.arange(NUM_DIGITS)).T.reshape((SAMPLE_SIZE, )).astype(int)
        Y_batch = preprocess_labels(Y_batch)
        Y_fill_batch = Y_batch * np.ones((SAMPLE_SIZE, Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))
        Z_batch = utils.generate_Z_batch((SAMPLE_SIZE, 1, 1, NOISE_DIM))
        gen_imgs = sess.run(G, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
        gen_imgs = utils.postprocess_images(gen_imgs)
        utils.plot_sample(gen_imgs, "output/cdcgan/cdcgan_gen_data_" + str(epoch) + ".png")
def generate_Z_batch(size):
    return np.random.uniform(low=-1, high=1, size=size)
Images generated after the 20th epoch:
Conclusion
It was interesting to see that the results were considerably better for the conditioned version of the DCGAN compared to the unconditioned one in the DCGANs post. This is probably because the additional information forces the generator and discriminator to be more specific. The unconditioned DCGAN generated some images that looked like combinations of two digits, which barely happened with the conditioned version.
Thank you for reading and feel free to send me any questions.