ANN without ML Framework


The aim of this article is to show how you can create an artificial neural network (ANN) without using a machine learning (ML) framework such as TensorFlow or PyTorch. This is not practical for most projects, but it’s a great exercise for building intuition about neural networks. The implemented network will be a simple feedforward neural network with a few customizable hyperparameters such as the number of hidden layers and their sizes. The network will be used to solve a simple regression problem.

The source code can be found at

Initialize Parameters

The only import we will need to create the network is

import numpy as np

Let’s start by creating a class for our network.

class SimpleNeuralNetwork:

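Inside this class we are going to write some methods. Let’s start with the __init__ method that initializes the weights and biases of the network. The weights are initialized to small normally distributed random numbers (standard normal samples scaled by 0.01) and the biases are initialized to zero. Each call to add_layer creates the weight matrix and bias vector that connect one layer to the next.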

def __init__(self, input_layer_size, hidden_layer_sizes, output_layer_size):
    self.W = []
    self.b = []
    if len(hidden_layer_sizes) == 0:
        self.add_layer(input_layer_size, output_layer_size)
    elif len(hidden_layer_sizes) >= 1:
        self.add_layer(input_layer_size, hidden_layer_sizes[0])
        for i in range(1, len(hidden_layer_sizes)):
            self.add_layer(hidden_layer_sizes[i-1], hidden_layer_sizes[i])
        self.add_layer(hidden_layer_sizes[-1], output_layer_size)

def add_layer(self, previous_layer_size, layer_size):
    self.W.append(np.random.randn(previous_layer_size, layer_size) * 0.01)
    self.b.append(np.zeros((1, layer_size)))
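
To see which shapes this initialization produces, here is a minimal standalone sketch that mirrors the logic above outside the class. The helper name `init_parameters` and its list-of-sizes argument are assumptions made for illustration and are not part of the article's source.

```python
import numpy as np


def init_parameters(layer_sizes):
    """Hypothetical helper: build weights and biases for consecutive layer sizes."""
    W, b = [], []
    for previous_size, size in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Small random weights and zero biases, as in add_layer above.
        W.append(np.random.randn(previous_size, size) * 0.01)
        b.append(np.zeros((1, size)))
    return W, b


# Same architecture as SimpleNeuralNetwork(1, [10, 10], 1).
W, b = init_parameters([1, 10, 10, 1])
weight_shapes = [w.shape for w in W]
bias_shapes = [v.shape for v in b]
print(weight_shapes)  # [(1, 10), (10, 10), (10, 1)]
print(bias_shapes)    # [(1, 10), (1, 10), (1, 1)]
```

A network with two hidden layers therefore has three weight matrices and three bias vectors, one pair per connection between consecutive layers.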

Forward Propagation

We have initialized the necessary member variables to write a forward propagation method. Let $m$ denote the batch size, $n$ the input layer size (layer 0), and $o$ the output layer size. Let $X \in \mathbb{R}^{m \times n}$ denote a batch of input vectors and $Y \in \mathbb{R}^{m \times o}$ their corresponding target output vectors. Let $W^{[l]} \in \mathbb{R}^{(\text{size of } l-1) \times (\text{size of } l)}$ denote the weights, $b^{[l]} \in \mathbb{R}^{1 \times (\text{size of } l)}$ the biases, and $A^{[l]} \in \mathbb{R}^{m \times (\text{size of } l)}$ the activations of layer $l \in \{1,2,\ldots,L\}$, with layer $l = L$ being the output layer. All layers will use the ReLU activation function except the output layer, which will be linear since we will be using the network on a regression problem. More formally:

$$ \begin{aligned}
A^{[0]} &= X \\
Z^{[l]} &= A^{[l-1]} W^{[l]} + b^{[l]} \qquad \forall l \in \{1,2,\ldots,L\} \\
A^{[l]} &= \max(0, Z^{[l]}) \qquad \forall l \in \{1,2,\ldots,L-1\} \\
A^{[L]} &= Z^{[L]}
\end{aligned} $$

Let $\mathcal{L}$ denote the loss. The loss function we will use is mean squared error (MSE). More formally

$$ \mathcal{L} = \frac{1}{2 m} \sum_{i=1}^m \sum_{j=1}^{o} \big[Y_{i,j} - A^{[L]}_{i,j} \big]^2 $$

We could divide by $o$ instead of 2 to make it more of a true mean, but dividing by 2 makes the gradient cleaner. Note that we store the computed $Z$ and $A$ values in order to use them for computing the gradient by backpropagation.

def propagate_forward(self, X, Y):
    batch_size = np.shape(X)[0]
    Z_cache = [None] * (len(self.W))
    A_cache = [None] * (len(self.W))

    A_prev = X
    for i in range(len(self.W)):
        Z = np.dot(A_prev, self.W[i]) + self.b[i]
        A = Z
        # If not at the output layer, apply relu
        if i < (len(self.W) - 1):
            A = np.maximum(0, Z)
        Z_cache[i] = Z
        A_cache[i] = A
        A_prev = A

    loss = 1/(2*batch_size) * np.sum((Y - A_cache[-1])**2)

    return loss, Z_cache, A_cache
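
To make the forward pass concrete, here is a small worked example with hand-picked numbers that can be checked on paper. The weights, biases, input and target below are arbitrary values chosen for illustration; they do not come from the article's source.

```python
import numpy as np

X = np.array([[1.0, 2.0]])            # m = 1 example, n = 2 inputs
Y = np.array([[5.0]])                 # o = 1 output
W1 = np.array([[1.0, -1.0],
               [0.5, -1.0]])          # input layer -> hidden layer (2 units)
b1 = np.zeros((1, 2))
W2 = np.array([[3.0],
               [4.0]])                # hidden layer -> output layer
b2 = np.array([[1.0]])

Z1 = np.dot(X, W1) + b1               # [[ 2., -3.]]
A1 = np.maximum(0, Z1)                # [[2., 0.]], ReLU zeroes the negative entry
Z2 = np.dot(A1, W2) + b2              # [[7.]]
A2 = Z2                               # linear output layer
loss = 1 / (2 * X.shape[0]) * np.sum((Y - A2) ** 2)
print(loss)  # 2.0
```

The prediction is 7 while the target is 5, so the loss is $(5 - 7)^2 / 2 = 2$.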


Backward Propagation

We will use backpropagation to efficiently compute the gradient of the loss function with respect to every weight and bias. Backpropagation might seem mysterious at first, but the idea is simply to apply the chain rule and reuse the stored $Z$ and $A$ values for efficiency. We can work our way backwards from the output layer:

$$ \begin{aligned}
\frac{\partial \mathcal{L}}{\partial A^{[L]}} &= \frac{1}{m} \big( A^{[L]} - Y \big) \\
\frac{\partial \mathcal{L}}{\partial Z^{[L]}} &= \frac{\partial \mathcal{L}}{\partial A^{[L]}} \frac{\partial A^{[L]}}{\partial Z^{[L]}} = \frac{\partial \mathcal{L}}{\partial A^{[L]}} \qquad \text{(since the output layer is linear)} \\
\frac{\partial \mathcal{L}}{\partial W^{[L]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial W^{[L]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} A^{[L-1]} \\
\frac{\partial \mathcal{L}}{\partial b^{[L]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial b^{[L]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \\
\frac{\partial \mathcal{L}}{\partial A^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L]}} \frac{\partial Z^{[L]}}{\partial A^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L]}} W^{[L]} \\
\frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial A^{[L-1]}} \frac{\partial A^{[L-1]}}{\partial Z^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial A^{[L-1]}} \frac{\partial \max(0,Z^{[L-1]})}{\partial Z^{[L-1]}} \\
\frac{\partial \mathcal{L}}{\partial W^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \frac{\partial Z^{[L-1]}}{\partial W^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} A^{[L-2]} \\
\frac{\partial \mathcal{L}}{\partial b^{[L-1]}} &= \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \frac{\partial Z^{[L-1]}}{\partial b^{[L-1]}} = \frac{\partial \mathcal{L}}{\partial Z^{[L-1]}} \\
&\vdots
\end{aligned} $$

A pattern emerges, and the computation of partial derivatives can be implemented iteratively. Some of the matrix multiplications above do not have matching dimensions as written (transposes and the sum over the batch for the biases are left out), but I find it more productive to focus on those details when implementing, since we probably want to change the order of some computations to make it more efficient anyway (for example by not averaging over examples immediately). This is not the only possible implementation and I encourage you to explore alternatives.

def propagate_backward(self, X, Y, Z_cache, A_cache):
    batch_size = np.shape(X)[0]
    dW = [None] * (len(self.W))
    db = [None] * (len(self.b))

    # dA is short for dL/dA etc.
    dA = A_cache[-1] - Y
    dZ = dA
    for i in reversed(range(1, len(A_cache))):
        A = A_cache[i-1]
        dW[i] = (1/batch_size) * np.dot(A.T, dZ)
        db[i] = (1/batch_size) * np.sum(dZ, axis=0)
        dA = np.dot(dZ, self.W[i].T)
        dZ = np.multiply(dA, self.relu_derivative(Z_cache[i-1]))
    # The first layer's input is X rather than a cached activation
    dW[0] = (1/batch_size) * np.dot(X.T, dZ)
    db[0] = (1/batch_size) * np.sum(dZ, axis=0)

    return dW, db

def relu_derivative(self, x):
    return (x > 0) * 1
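
One way to gain confidence in a hand-written backward pass is a gradient check: compare the analytic gradients with central finite differences of the loss. The sketch below re-implements the network as plain functions so that it runs on its own. The names `forward`, `backward` and `numerical_gradient`, the layer sizes, and the seed are all assumptions made for illustration, not part of the article's source.

```python
import numpy as np


def forward(W, b, X, Y):
    """Forward pass with ReLU hidden layers and a linear output layer."""
    A, Zs, As = X, [], []
    for i in range(len(W)):
        Z = np.dot(A, W[i]) + b[i]
        A = np.maximum(0, Z) if i < len(W) - 1 else Z
        Zs.append(Z)
        As.append(A)
    loss = np.sum((Y - As[-1]) ** 2) / (2 * X.shape[0])
    return loss, Zs, As


def backward(W, X, Y, Zs, As):
    """Analytic gradients, following the same steps as propagate_backward."""
    m = X.shape[0]
    dW, db = [None] * len(W), [None] * len(W)
    dZ = As[-1] - Y
    for i in reversed(range(len(W))):
        A_prev = X if i == 0 else As[i - 1]
        dW[i] = np.dot(A_prev.T, dZ) / m
        db[i] = np.sum(dZ, axis=0, keepdims=True) / m
        if i > 0:
            # Propagate to the previous layer and apply the ReLU derivative.
            dZ = np.dot(dZ, W[i].T) * (Zs[i - 1] > 0)
    return dW, db


def numerical_gradient(W, b, X, Y, params, eps=1e-6):
    """Central finite differences for every entry of every array in params."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            original = p[idx]
            p[idx] = original + eps
            loss_plus = forward(W, b, X, Y)[0]
            p[idx] = original - eps
            loss_minus = forward(W, b, X, Y)[0]
            p[idx] = original  # restore the parameter
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads


rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
W = [rng.normal(size=(a, c)) for a, c in zip(sizes[:-1], sizes[1:])]
b = [rng.normal(size=(1, c)) for c in sizes[1:]]
X = rng.normal(size=(5, 2))
Y = rng.normal(size=(5, 1))

loss, Zs, As = forward(W, b, X, Y)
dW, db = backward(W, X, Y, Zs, As)
num_dW = numerical_gradient(W, b, X, Y, W)
num_db = numerical_gradient(W, b, X, Y, b)
max_error = max(
    np.max(np.abs(a - n)) for a, n in zip(dW + db, num_dW + num_db)
)
print("largest difference:", max_error)
```

If the analytic and numerical gradients agree, the largest difference should be many orders of magnitude smaller than the gradients themselves. Pre-activations that lie very close to the ReLU kink at zero can make finite differences inaccurate, so an occasional larger mismatch is not necessarily a bug.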

Training Step

During training we will alternate forward propagation and backpropagation. Let’s create a method that combines them for convenience.

def propagate(self, X, Y):
    loss, Z_cache, A_cache = self.propagate_forward(X, Y)
    dW, db = self.propagate_backward(X, Y, Z_cache, A_cache)
    return loss, A_cache[-1], dW, db

A single training step consists of one combined propagation and one gradient descent step.

loss, predictions, dW, db = neural_network.propagate(X_train, Y_train)
for j in range(len(neural_network.W)):
    neural_network.W[j] -= learning_rate * dW[j]
    neural_network.b[j] -= learning_rate * db[j]

Regression Problem

Let’s generate some quadratic data with random noise added.

def generate_random_data():
    X = np.expand_dims(np.linspace(-3, 3, num=200), axis=1)
    Y = X**2 - 2
    noise = np.random.normal(0, 2, size=X.shape)
    Y = Y + noise
    return X, Y

We can now use our SimpleNeuralNetwork class to create a model that can fit the generated data.

if __name__ == "__main__":
    X, Y = utils.generate_random_data()
    utils.plot_data(X, Y, "output/data.png")

    neural_network = SimpleNeuralNetwork(1, [10, 10], 1)
    learning_rate = 0.01
    num_iterations = 10000
    for i in range(num_iterations):
        loss, predictions, dW, db = neural_network.propagate(X, Y)
        for j in range(len(neural_network.W)):
            neural_network.W[j] -= learning_rate * dW[j]
            neural_network.b[j] -= learning_rate * db[j]
        if i % 1000 == 0:
            print("loss:", loss)

    loss, Z_cache, A_cache = neural_network.propagate_forward(X, Y)
    utils.plot_results(X, Y, A_cache[-1], "output/results.png")
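
When reading the printed loss values, it helps to know roughly what a good fit looks like. Because the noise has standard deviation 2, a model that recovered the underlying curve exactly would still be expected to have a loss of about $\sigma^2 / 2 = 2$ under the loss definition used here. The sketch below estimates that reference value on data generated the same way as above; the fixed seed and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, an assumption for reproducibility
X = np.expand_dims(np.linspace(-3, 3, num=200), axis=1)
Y_true = X**2 - 2                            # noise-free quadratic
Y = Y_true + rng.normal(0, 2, size=X.shape)  # same noise level as above

# Loss of the noise-free curve under the article's definition (factor 1 / (2m)).
reference_loss = 1 / (2 * X.shape[0]) * np.sum((Y - Y_true) ** 2)
print("reference loss:", reference_loss)
```

A training loss that levels off near this value suggests the network has captured the quadratic trend, while a loss far below it would hint that the network has started to fit the noise.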



I hope you found this article helpful. Thank you for reading and feel free to send me any questions.