Conditional Generative Adversarial Networks

Introduction

A Generative Adversarial Network (GAN) takes noise as input and generates data that resembles examples from the training set. A Conditional Generative Adversarial Network (CGAN) is a type of GAN that conditions on additional information. The additional information makes it possible to have more control over the data generation. CGANs were introduced in Conditional Generative Adversarial Nets (Mirza, Osindero, 2014).

A GAN can be conditioned on many types of information. For example, if you have a training set of animal images and their respective class labels, you can condition the model on the class labels in order to generate specific animals. Another example is to condition on images in order to do image-to-image translation, such as in Image-to-Image Translation with Conditional Adversarial Networks (Isola, Zhu, Zhou, Efros, 2017).

In this post we are going to use TensorFlow to implement a CGAN for generating specific digits that look handwritten. The dataset we are going to use is the MNIST database of handwritten digits. In my last post I wrote about DCGANs and generated random handwritten digits, but in this post we are going to condition the GAN on digit labels in order to generate specific digits.

The source code can be found at https://github.com/CarlFredriksson/cgan_tensorflow.

Theory

For an introduction to the theory behind GANs, see my previous post Generative Adversarial Networks. The only thing that has changed is that the generator $G$ and the discriminator $D$ now condition on additional information $y$.

One training step consists of updating $D$ and $G$ by minimizing the following losses, computed over a mini-batch of size $m$:

$$ J_D = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(x^{(i)}, y^{(i)}\big) + \log\big(1 - D\big(G\big(z^{(i)}, y^{(i)}\big), y^{(i)}\big)\big)\Big] $$

$$ J_G = - \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(G\big(z^{(i)}, y^{(i)}\big), y^{(i)}\big)\Big] $$
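These per-mini-batch losses are sample estimates of the two-player minimax objective from Mirza and Osindero (2014), where both networks are conditioned on $y$:

$$ \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\big[\log D(x \mid y)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z \mid y))\big)\big] $$

Note that $J_G$ uses the non-saturating heuristic $-\log D\big(G(z, y), y\big)$ instead of $\log\big(1 - D\big(G(z, y), y\big)\big)$, which gives the generator stronger gradients early in training when the discriminator easily rejects its samples.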

Implementation

Import Packages

These are the packages we need to import.

import math
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from tensorflow.keras.datasets import mnist
from sklearn.utils import shuffle

Load MNIST Data

You could download the dataset from http://yann.lecun.com/exdb/mnist/, but a convenient alternative is to import it from tensorflow.keras.datasets. Note that in the GAN post we didn’t need to store the labels, but this time we are going to condition on them.

X_train, Y_train = utils.load_mnist_data()
def load_mnist_data():
    (X_train, Y_train), _ = mnist.load_data()
    return X_train, Y_train
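As a quick sanity check, mnist.load_data returns 60,000 training images of $28 \times 28$ pixels along with their integer labels:

print(X_train.shape)  # (60000, 28, 28)
print(Y_train.shape)  # (60000,)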

In order to visualize the data we can plot some examples.

NUM_DIGITS = 10
SAMPLE_SIZE = NUM_DIGITS**2
utils.plot_sample(X_train[:SAMPLE_SIZE], "output/mnist_data.png")
def plot_sample(sample, path):
    n = int(np.sqrt(sample.shape[0]))
    fig = plt.figure(figsize=(8, 8))
    for i in range(n*n):
        ax = plt.subplot(n, n, i + 1)
        ax.imshow(sample[i], cmap=plt.get_cmap("gray"))
        ax.axis("off")
        ax.set_xticklabels([])
        ax.set_yticklabels([])
    fig.subplots_adjust(hspace=0.025, wspace=0.025)
    plt.savefig(path, bbox_inches="tight")
    plt.clf()
    plt.close()

[Image: a sample of MNIST training images]

Preprocessing

The generator is going to have a tanh activation at the output layer. Thus we want to normalize the pixel values of the training images from the range $[0, 255]$ to $[-1, 1]$. We also add a channel dimension of depth 1 since the images were loaded in grayscale.

X_train = utils.preprocess_images(X_train)
def preprocess_images(X):
    X = X / 255
    X = X - 0.5
    X = X * 2
    X = np.expand_dims(X, axis=-1)
    return X

The labels also need some preprocessing. We will convert them to one-hot vectors and add two singleton dimensions, giving each label the shape $(1, 1, 10)$ that the convolutional layers expect. Note that these dimensions are not added in the baseline model.

Y_train = preprocess_labels(Y_train)
def preprocess_labels(Y):
    Y = utils.convert_to_one_hot(Y)
    Y = np.expand_dims(Y, axis=1)
    Y = np.expand_dims(Y, axis=1)
    return Y
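The helper convert_to_one_hot is not shown in this post (it is part of the utils module in the source code). Here is a minimal sketch of an equivalent implementation; the num_classes default is my own addition, matching the ten MNIST digits:

def convert_to_one_hot(Y, num_classes=10):
    # Map integer labels of shape (m,) to one-hot vectors of shape (m, num_classes)
    one_hot = np.zeros((Y.shape[0], num_classes))
    one_hot[np.arange(Y.shape[0]), Y] = 1
    return one_hot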

Postprocessing

Later we are going to plot samples of generated images and we will need a function that is the inverse of the preprocess function.

def postprocess_images(X):
    X = np.squeeze(X, axis=-1)
    X = X / 2
    X = X + 0.5
    X = X * 255
    return X

Partition Training Data into Mini Batches

The training data is randomly shuffled and partitioned into mini batches.

BATCH_SIZE = 128
mini_batches = utils.random_mini_batches(X_train, Y_train, BATCH_SIZE)
def random_mini_batches(X_train, Y_train, batch_size):
    mini_batches = []
    m = X_train.shape[0]
    X_train, Y_train = shuffle(X_train, Y_train)

    # Partition into mini-batches
    num_complete_batches = math.floor(m / batch_size)
    for i in range(num_complete_batches):
        startIndex = i * batch_size
        endIndex = (i + 1) * batch_size
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))

    # Handling the case that the last mini-batch < batch_size
    if m % batch_size != 0:
        startIndex = num_complete_batches * batch_size
        endIndex = m
        X_batch = X_train[startIndex : endIndex]
        Y_batch = Y_train[startIndex : endIndex]
        mini_batches.append((X_batch, Y_batch))

    return mini_batches

Baseline Model

Let’s start by creating a simple CGAN model in order to establish baseline performance. The generator is a simple feedforward neural network (FNN) that takes 100-dimensional noise vectors $Z$ and 10-dimensional one-hot label vectors $Y$ as input. It outputs images with dimensions $(28, 28, 1)$ and pixel values in the range $(-1, 1)$.

def generator(Z, Y):
    with tf.variable_scope("Generator"):
        x = tf.concat([Z, Y], 1)
        x = tf.layers.dense(x, 128, activation="relu")
        x = tf.layers.dense(x, 784, activation="tanh")
        x = tf.reshape(x, [-1, 28, 28, 1])

    return x

The discriminator is also a simple FNN that takes images $X$ and one-hot label vectors $Y$ as input. It outputs a single value per example, representing the estimated probability that the input image is a real image of the given class $y$.

def discriminator(X, Y, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.layers.flatten(X)
        x = tf.concat([x, Y], 1)
        x = tf.layers.dense(x, 128, activation="relu")
        x = tf.layers.dense(x, 1, activation="sigmoid")

    return x
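The generator and discriminator still need to be wired into a training graph. A minimal sketch of how that could look, with placeholder shapes matching the description above (the exact baseline setup is in the repository; the names here are illustrative):

NOISE_DIM = 100
X = tf.placeholder(tf.float32, shape=(None, 28, 28, 1))
Y = tf.placeholder(tf.float32, shape=(None, 10))
Z = tf.placeholder(tf.float32, shape=(None, NOISE_DIM))

G = generator(Z, Y)
D_real = discriminator(X, Y)
D_fake = discriminator(G, Y, reuse=True)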

If you want the rest of the code for the baseline model, check out the source code. After training the CGAN for 100 epochs, I got the following result.

[Image: digits generated by the baseline CGAN after 100 epochs]

CDCGAN Model

Let’s now create a deep convolutional CGAN (CDCGAN) model, which is better suited to our problem. $Y_{fill}$ will contain the same labels as $Y$, copied (broadcast) to fill a tensor of shape $(28, 28, 10)$. This is done in order to fit the label information into the purely convolutional architecture of the discriminator, where it is concatenated with the image channels.

NOISE_DIM = 100
X = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], X_train.shape[3]))
Y = tf.placeholder(tf.float32, shape=(None, 1, 1, NUM_DIGITS))
Y_fill = tf.placeholder(tf.float32, shape=(None, X_train.shape[1], X_train.shape[2], NUM_DIGITS))
Z = tf.placeholder(tf.float32, shape=(None, 1, 1, NOISE_DIM))
is_training = tf.placeholder(tf.bool, shape=())
G, D_real, D_fake = create_cdcgan(X, Y, Y_fill, Z, is_training)
def create_cdcgan(X, Y, Y_fill, Z, is_training):
    G = generator(Z, Y, is_training)
    D_real = discriminator(X, Y_fill, is_training)
    D_fake = discriminator(G, Y_fill, is_training, reuse=True)

    return G, D_real, D_fake
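For concreteness, a batch of labels of shape $(m, 1, 1, 10)$ can be turned into a matching $Y_{fill}$ batch of shape $(m, 28, 28, 10)$ with NumPy broadcasting; this mirrors what the training loop further down does:

Y_fill_batch = Y_batch * np.ones((Y_batch.shape[0], 28, 28, NUM_DIGITS))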

The generator has the same architecture as in the DCGANs post, except for the additional input $Y$.

def generator(Z, Y, is_training):
    with tf.variable_scope("Generator"):
        x = tf.concat([Z, Y], 3)
        # x.shape: (?, 1, 1, 110)

        x = tf.layers.conv2d_transpose(x, 256, 7, strides=1, padding="valid", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d_transpose(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d_transpose(x, 1, 5, strides=2, padding="same")
        x = tf.nn.tanh(x)
        # x.shape: (?, 28, 28, 1)

    return x

The discriminator also has the same architecture as in the DCGANs post, except for the additional input $Y_{fill}$.

def discriminator(X, Y_fill, is_training, reuse=False):
    with tf.variable_scope("Discriminator", reuse=reuse):
        x = tf.concat([X, Y_fill], 3)
        # x.shape: (?, 28, 28, 11)

        x = tf.layers.conv2d(x, 128, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 14, 14, 128)

        x = tf.layers.conv2d(x, 256, 5, strides=2, padding="same", use_bias=False)
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.leaky_relu(x)
        # x.shape: (?, 7, 7, 256)

        x = tf.layers.conv2d(x, 1, 7, strides=1, padding="valid")
        x = tf.nn.sigmoid(x)
        # x.shape: (?, 1, 1, 1)

    return x

Create Training Steps

We will use the same optimizers as in the DCGANs post.

LEARNING_RATE = 0.0002
BETA1 = 0.5
G_loss_func, D_loss_func = utils.create_loss_funcs(D_real, D_fake)
G_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Generator")
D_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="Discriminator")
G_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(G_loss_func, var_list=G_vars)
D_train_step = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE, beta1=BETA1).minimize(D_loss_func, var_list=D_vars)
def create_loss_funcs(D_real, D_fake):
    eps = 1e-12
    G_loss_func = tf.reduce_mean(-tf.log(D_fake + eps))
    D_loss_func = tf.reduce_mean(-(tf.log(D_real + eps) + tf.log(1 - D_fake + eps)))

    return G_loss_func, D_loss_func

Train Model

I trained the model for 20 epochs and plotted a sample of generated images after each epoch.

NUM_EPOCHS = 20
# Start session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    # Training loop
    for epoch in range(NUM_EPOCHS):
        for (X_batch, Y_batch) in mini_batches:
            Z_batch = utils.generate_Z_batch((X_batch.shape[0], 1, 1, NOISE_DIM))
            Y_fill_batch = Y_batch * np.ones((X_batch.shape[0], Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))

            # Compute losses
            G_loss, D_loss = sess.run([G_loss_func, D_loss_func], feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            print("Epoch [{0}/{1}] - G_loss: {2}, D_loss: {3}".format(epoch, NUM_EPOCHS - 1, G_loss, D_loss))

            # Run training steps
            _ = sess.run(G_train_step, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
            _ = sess.run(D_train_step, feed_dict={X: X_batch, Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})

        # Plot generated images
        Y_batch = np.ones((NUM_DIGITS, NUM_DIGITS))
        Y_batch = (Y_batch * np.arange(NUM_DIGITS)).T.reshape((SAMPLE_SIZE, )).astype(int)
        Y_batch = preprocess_labels(Y_batch)
        Y_fill_batch = Y_batch * np.ones((SAMPLE_SIZE, Y_fill.shape[1], Y_fill.shape[2], Y_fill.shape[3]))
        Z_batch = utils.generate_Z_batch((SAMPLE_SIZE, 1, 1, NOISE_DIM))
        gen_imgs = sess.run(G, feed_dict={Y: Y_batch, Y_fill: Y_fill_batch, Z: Z_batch, is_training: True})
        gen_imgs = utils.postprocess_images(gen_imgs)
        utils.plot_sample(gen_imgs, "output/cdcgan/cdcgan_gen_data_" + str(epoch) + ".png")
def generate_Z_batch(size):
    return np.random.uniform(low=-1, high=1, size=size)

Images generated after the 20th epoch:

[Image: digits generated after the 20th epoch]

Conclusion

It was interesting to see that the results were considerably better for the conditioned version of the DCGAN compared to the unconditioned one in the DCGANs post. This is probably because the additional information forces the generator and discriminator to be more specific. The unconditioned DCGAN generated some images that looked like combinations of two digits, which barely happened with the conditioned version.

Thank you for reading and feel free to send me any questions.