Facial Landmark Detection

Facial landmarks


Landmark detection is a computer vision problem where an algorithm tries to find the locations of landmarks (also called keypoints) in an image. In facial landmark detection the landmarks correspond to facial features such as the center of the eyes or the edges of the mouth. After the algorithm predicts the location of the landmarks, they can be used for applications such as applying filters to the image (like Snapchat filters), detecting the emotional state of the person in the image, or classifying whether the person is male or female.

In this post I will show you how I implemented facial landmark detection in Keras. The dataset I used can be found at https://www.kaggle.com/c/facial-keypoints-detection/data. It contains grayscale images of size (96, 96) and up to 15 landmarks for each image.

The source code can be found at https://github.com/CarlFredriksson/facial_landmark_detection.


Load Data

The dataset contains CSV files for training and testing. The test data contains only images and no landmarks, and is thus of no use except for manual visual testing. I discarded the test data and instead tested the final model with images I found on my own. The training data contains 7049 rows, where each row contains landmark coordinates separated by commas and image pixels separated by spaces. Unfortunately, many of the rows have missing landmark values, and after discarding those rows we are left with 2140 labeled examples. After shuffling, I split the labeled examples into a training set and a validation set. The fraction of examples put in the validation set is determined by the parameter validation_split.
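To make the row format concrete, here is a minimal sketch (with made-up values, not actual rows from the dataset) of how the two kinds of fields are parsed:

```python
import numpy as np

# A made-up miniature "row": two landmark coordinate columns and a 2x2 "image"
# (real rows have 30 coordinate columns and 96 * 96 = 9216 pixel values)
landmark_cols = ["66.0", "39.0"]   # comma-separated columns in the CSV
image_field = "238 236 40 0"       # single space-separated pixel string

coords = np.array(landmark_cols, dtype="float32")
pixels = np.array(image_field.split(), dtype="float32").reshape(2, 2)

print(coords)        # [66. 39.]
print(pixels.shape)  # (2, 2)
```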

import os
import numpy as np
from pandas.io.parsers import read_csv
from sklearn.utils import shuffle

# Usage (load_data is part of fld_utils, defined below)
X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)

def load_data(validation_split):
    # Create path to csv file
    cwd = os.getcwd()
    csv_path = os.path.join(cwd, "data/training.csv")

    # Load data from csv file into data frame, drop all rows that have missing values
    data_frame = read_csv(csv_path)
    data_frame = data_frame.dropna()

    # Convert the rows of the image column from pixel values separated by spaces to numpy arrays
    data_frame["Image"] = data_frame["Image"].apply(lambda img: np.fromstring(img, sep=" "))

    # Create numpy matrix from image column by stacking the rows vertically
    X_data = np.vstack(data_frame["Image"].values)

    # Normalize pixel values to (0, 1) range
    X_data = X_data / 255

    # Convert to float32, which is the default for Keras
    X_data = X_data.astype("float32")

    # Reshape each row from one dimensional arrays to (height, width, num_channels) = (96, 96, 1)
    X_data = X_data.reshape(-1, 96, 96, 1)

    # Extract labels representing the coordinates of facial landmarks
    Y_data = data_frame[data_frame.columns[:-1]].values

    # Normalize coordinates to (0, 1) range
    Y_data = Y_data / 96
    Y_data = Y_data.astype("float32")

    # Shuffle data
    X_data, Y_data = shuffle(X_data, Y_data)

    # Split data into training set and validation set
    split_index = int(X_data.shape[0] * (1 - validation_split))
    X_train = X_data[:split_index]
    Y_train = Y_data[:split_index]
    X_val = X_data[split_index:]
    Y_val = Y_data[split_index:]

    return X_train, Y_train, X_val, Y_val

Create Models

Landmark detection is a type of regression problem: given an input image, the objective is to predict 30 numbers corresponding to the 15 two-dimensional landmarks. To achieve this objective we will create a convolutional neural network (CNN). I experimented with different network architectures and chose the one that resulted in the lowest validation loss. The hyperparameters I varied were the number of convolutional layers, the filter size and number of filters of the convolutional layers, the number of fully connected layers and their size, and the number of dropout layers (which ended up at zero). Many of the historically well-performing CNN models shrink the height and width of the layer inputs and grow the number of filters as the network gets deeper, which is also the case for the model that I ended up with. The height and width are unchanged by the convolutional layers since the “same” padding scheme is used; the shrinking of height and width happens solely in the max-pooling layers. The final model is pretty simple and not that deep, which keeps training times short.
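As a quick sanity check on the shapes (a plain-Python sketch, not Keras code): the “same”-padded convolutions leave height and width unchanged, so each of the three max-pooling layers halves them while the filter count grows:

```python
# Trace (height, width, channels) through the CNN described above:
# "same"-padded convolutions keep height and width, each 2x2 max-pool halves them.
shape = (96, 96, 1)
for num_filters in (32, 64, 128):
    shape = (shape[0], shape[1], num_filters)         # conv layer, "same" padding
    shape = (shape[0] // 2, shape[1] // 2, shape[2])  # 2x2 max-pooling
    print(shape)
# (48, 48, 32)
# (24, 24, 64)
# (12, 12, 128)
print(shape[0] * shape[1] * shape[2])  # 18432 values feed the dense layers
```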

I also created a baseline model to compare with. The baseline model is a simple feedforward neural network with one hidden layer.

Both models use an Adam optimizer and mean squared error as the loss function.
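Since the loss is just the mean squared error over the 30 normalized coordinates, it is easy to compute by hand; here is a small NumPy sketch with made-up values:

```python
import numpy as np

# Made-up normalized landmark vectors: 30 values = 15 (x, y) pairs
y_true = np.full(30, 0.5, dtype="float32")
y_pred = y_true + 0.1  # every coordinate off by 0.1 (~9.6 pixels at 96x96)

mse = np.mean((y_pred - y_true) ** 2)
print(round(float(mse), 4))  # 0.01
```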

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Conv2D, MaxPool2D, Dropout
def create_baseline_model():
    model = Sequential()
    model.add(Flatten(input_shape=(96, 96, 1)))
    model.add(Dense(512, activation="relu"))
    model.add(Dense(30))  # output layer: 15 (x, y) landmark pairs
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
def create_cnn_model():
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(96, 96, 1), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(64, (5, 5), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Conv2D(128, (5, 5), activation="relu", padding="same"))
    model.add(MaxPool2D(pool_size=(2, 2)))

    model.add(Flatten())
    model.add(Dense(512, activation="relu"))
    model.add(Dense(30))  # output layer: 15 (x, y) landmark pairs
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

Train Models

I trained the models for 100 epochs using a batch size of 200. After training, the final loss values are saved in a text file, the histories of loss values are plotted together, and the model architectures and weights are saved. In Keras you can save a model in a single .h5 file that contains the architecture, weights, and optimizer state, or you can save the architecture and weights separately. Since I won’t continue training the model after saving it, I don’t need the optimizer state, so I opted to save the architecture and weights separately, which saves some disk space.

import fld_utils
import model_factory

NUM_EPOCHS = 100

X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)

# Create and run models
models = [model_factory.create_baseline_model(), model_factory.create_cnn_model()]
model_names = ["baseline", "cnn"]
fld_utils.run_models(X_train, Y_train, X_val, Y_val, models, model_names, NUM_EPOCHS)
def run_models(X_train, Y_train, X_val, Y_val, models, model_names, num_epochs):
    results_file = open("output/results.txt", "w")
    histories = []

    for model, model_name in zip(models, model_names):
        # Train model
        history = model.fit(X_train, Y_train, batch_size=200, epochs=num_epochs, validation_data=(X_val, Y_val))
        histories.append(history)

        # Evaluate
        final_train_loss = model.evaluate(X_train, Y_train, verbose=0)
        final_val_loss = model.evaluate(X_val, Y_val, verbose=0)
        results_file.write(model_name + "> final_train_loss: " + str(final_train_loss) + ", final_val_loss: " + str(final_val_loss) + "\n")

        # Save model architecture and weights separately
        model.save_weights("saved_models/" + model_name + "_model_weights.h5")
        with open("saved_models/" + model_name + "_model_architecture.json", "w") as f:
            f.write(model.to_json())

    results_file.close()
    plot_histories(histories, model_names, "histories.png")
def plot_histories(histories, model_names, plot_name):
    for history, model_name in zip(histories, model_names):
        plt.plot(history.epoch, np.array(history.history["loss"]), label=model_name + " train loss")
        plt.plot(history.epoch, np.array(history.history["val_loss"]), label=model_name + " val loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss (Mean squared error)")
    plt.legend()
    plt.savefig("output/" + plot_name, bbox_inches="tight")


Predict Landmarks

Let us plot some landmark predictions on a randomly chosen image from the validation set to get a sense of the performance of the models. The first image is always chosen, but since the data is shuffled before being partitioned, it is effectively a randomly chosen image. The landmarks are extracted from the network output by grouping the 30 values into 15 two-dimensional points. Note that the network predictions need to be multiplied by the image size, since the values we trained on were normalized.
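As a sketch of this grouping and rescaling (with a made-up prediction vector), the 30 outputs can also be reshaped into 15 points in one vectorized step:

```python
import numpy as np

# Made-up normalized prediction vector: 30 values = 15 (x, y) pairs
y_pred = np.linspace(0.1, 0.9, 30, dtype="float32")

# Group into (15, 2) points and scale back to 96x96 pixel coordinates
landmarks = y_pred.reshape(-1, 2) * 96

print(landmarks.shape)  # (15, 2)
print(landmarks[0])     # first point, roughly (9.6, 12.25)
```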

import numpy as np
import fld_utils

X_train, Y_train, X_val, Y_val = fld_utils.load_data(validation_split=0.2)
img = X_val[0]
img_size_x, img_size_y = img.shape[1], img.shape[0]  # shape is (height, width, channels)

# Plot correct landmarks
landmarks = fld_utils.extract_landmarks(Y_val[0], img_size_x, img_size_y)
fld_utils.save_img_with_landmarks(img, landmarks, "data_visual.png", gray_scale=True)

# Baseline model
model = fld_utils.load_model("baseline")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, img_size_x, img_size_y)
fld_utils.save_img_with_landmarks(img, landmarks, "baseline_prediction.png", gray_scale=True)

# CNN model
model = fld_utils.load_model("cnn")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, img_size_x, img_size_y)
fld_utils.save_img_with_landmarks(img, landmarks, "cnn_prediction.png", gray_scale=True)
import matplotlib.pyplot as plt
from keras.models import model_from_json
def load_model(model_name):
    with open("saved_models/" + model_name + "_model_architecture.json", "r") as f:
        model = model_from_json(f.read())
    model.load_weights("saved_models/" + model_name + "_model_weights.h5")
    return model
def extract_landmarks(y_pred, img_size_x, img_size_y):
    landmarks = []
    for i in range(0, len(y_pred), 2):
        landmark_x, landmark_y = y_pred[i] * img_size_x, y_pred[i+1] * img_size_y
        landmarks.append((landmark_x, landmark_y))
    return landmarks
def save_img_with_landmarks(img, landmarks, plot_name, gray_scale=False):
    # Start a fresh figure so landmarks from earlier calls don't accumulate
    plt.figure()
    if gray_scale:
        plt.imshow(np.squeeze(img), cmap=plt.get_cmap("gray"))
    else:
        plt.imshow(img)
    for landmark in landmarks:
        plt.plot(landmark[0], landmark[1], "go")
    plt.savefig("output/" + plot_name, bbox_inches="tight")
    plt.close()

The correct landmarks:

Correct landmarks

The baseline prediction:

Baseline prediction

The CNN prediction:

CNN prediction

We can see that the baseline model performs horribly, while the CNN model performs very well on the chosen image.

Sunglasses Filter

Now we are finally ready to use our model in an application. I have created a simple script that applies a sunglasses filter to a face image. The predicted landmarks are used to scale, position, and rotate the sunglasses image. Since we have trained on normalized grayscale images of size (96, 96), we need to load the input image in grayscale mode, resize it, and normalize it. To apply the filter, we paste the processed sunglasses image on top of the original version of the input image, which hasn’t had any preprocessing done to it. I normally use cv2 (OpenCV) for image processing, but this script also uses PIL, which has convenient functions for rotating images and pasting images on top of each other.

import numpy as np
import cv2
from PIL import Image
import matplotlib.pyplot as plt
import fld_utils

# Load original image
face_img_path = "input/picard.png"
orig_img = cv2.imread(face_img_path)
orig_img = cv2.cvtColor(orig_img, cv2.COLOR_BGR2RGB)
orig_size_x, orig_size_y = orig_img.shape[1], orig_img.shape[0]  # shape is (height, width, channels)

# Prepare input image
img = cv2.imread(face_img_path, cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, dsize=(96, 96), interpolation=cv2.INTER_AREA)
img = np.expand_dims(img, axis=2)
img = img / 255
img = img.astype("float32")

# Predict landmarks
model = fld_utils.load_model("cnn")
y_pred = model.predict(np.expand_dims(img, axis=0))[0]
landmarks = fld_utils.extract_landmarks(y_pred, orig_size_x, orig_size_y)

# Save original image with landmarks on top
fld_utils.save_img_with_landmarks(orig_img, landmarks, "test_img_prediction.png")

# Extract x and y values from landmarks of interest
left_eye_center_x = int(landmarks[0][0])
left_eye_center_y = int(landmarks[0][1])
right_eye_center_x = int(landmarks[1][0])
right_eye_center_y = int(landmarks[1][1])
left_eye_outer_x = int(landmarks[3][0])
right_eye_outer_x = int(landmarks[5][0])

# Load images using PIL
# PIL has better functions for rotating and pasting compared to cv2
face_img = Image.open(face_img_path)
sunglasses_img = Image.open("input/sunglasses.png")

# Resize sunglasses
sunglasses_width = int((left_eye_outer_x - right_eye_outer_x) * 1.4)
sunglasses_height = int(sunglasses_img.size[1] * (sunglasses_width / sunglasses_img.size[0]))
sunglasses_resized = sunglasses_img.resize((sunglasses_width, sunglasses_height))

# Rotate sunglasses
eye_angle_radians = np.arctan((right_eye_center_y - left_eye_center_y) / (left_eye_center_x - right_eye_center_x))
sunglasses_rotated = sunglasses_resized.rotate(np.degrees(eye_angle_radians), expand=True, resample=Image.BICUBIC)

# Compute positions such that the center of the sunglasses is
# positioned at the center point between the eyes
x_offset = int(sunglasses_width * 0.5)
y_offset = int(sunglasses_height * 0.5)
pos_x = int((left_eye_center_x + right_eye_center_x) / 2) - x_offset
pos_y = int((left_eye_center_y + right_eye_center_y) / 2) - y_offset

# Paste sunglasses on face image, using the sunglasses image itself as the
# alpha mask, and save the result
face_img.paste(sunglasses_rotated, (pos_x, pos_y), sunglasses_rotated)
face_img.save("output/test_img_sunglasses.png")

Test image prediction

Test image sunglasses


Conclusion

This project was very fun and I’m pleased with how it turned out. I was impressed with the performance of the model given its simplicity. I did not spend a lot of time on hyperparameter tuning, and I might revisit this step if I use facial landmark detection in a project that needs a more accurate model. Besides tuning hyperparameters, performance could also be improved by training on more data. This could be done by finding a bigger dataset, labeling images on your own, and/or data augmentation. If data augmentation is done, one has to be careful to make sure that the images are still correctly labeled. For example, if an image is flipped horizontally, the landmarks also have to be flipped in the same way. However, the order of the landmarks must be consistent between images: if a landmark that represents the center of the left eye is flipped, it now represents the center of the right eye. Thus, for a flipped image, the order of landmarks that represent left and right versions of a facial feature has to be swapped to maintain consistency.
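As a sketch of such a horizontal-flip augmentation (the helper and the index pairs below are hypothetical; the correct swap pairs depend on the dataset’s landmark column ordering):

```python
import numpy as np

def flip_horizontal(img, landmarks, swap_pairs):
    """Flip a (height, width, channels) image and its normalized (N, 2) landmarks.

    swap_pairs lists index pairs of left/right landmarks to exchange,
    so that e.g. "left eye center" still refers to the left eye after flipping.
    """
    flipped_img = img[:, ::-1, :]        # mirror the width axis
    flipped = landmarks.copy()
    flipped[:, 0] = 1.0 - flipped[:, 0]  # mirror normalized x coordinates
    for i, j in swap_pairs:
        flipped[[i, j]] = flipped[[j, i]]  # restore left/right semantics
    return flipped_img, flipped

# Tiny made-up example: a left/right eye pair that mirrors onto itself
img = np.zeros((96, 96, 1), dtype="float32")
landmarks = np.array([[0.3, 0.4], [0.7, 0.4]], dtype="float32")
flipped_img, flipped = flip_horizontal(img, landmarks, swap_pairs=[(0, 1)])
print(flipped)  # unchanged (up to float rounding): the mirrored pair maps back to itself
```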