Modern neural network architectures are deep, typically involving three or more hidden layers. They use backpropagation to compute the gradients that drive weight updates through all layers. Backpropagation was popularized by the 1986 Nature paper "Learning representations by back-propagating errors" by Rumelhart, Hinton, and Williams. It is one of the most important algorithms in machine learning.
In this article I will derive generic backpropagation formulas so that we can move on to more interesting uses of neural networks, such as image classification, in the next articles.
Usual Setup of a Multi-Layer Perceptron
Figure 1 shows a typical neural network consisting of two hidden layers. During the forward pass, the input vector X(k-1) is fed into the first layer k, forwarded to the next hidden layer k+1, and then into the output neuron at output layer k+2. During backpropagation, the gradient of the online error with respect to the weights is calculated from the output neuron back to the beginning of the network, through layer k+1 and then layer k.

In Figure 1 we have weights wi,j(k), where i refers to the node number in the current layer, j refers to the j'th output from the layer before, and k refers to the current layer's number. zi(k) is the output from node i at layer k. It equals xi(k) if the node is a linear neuron. If the node is a sigmoid neuron, then it is the following:
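Assuming xi(k) here denotes the neuron's weighted-sum input, the sigmoid output is the standard logistic function:

```latex
z_i^{(k)} = \sigma\!\left(x_i^{(k)}\right) = \frac{1}{1 + e^{-x_i^{(k)}}}
```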
More generally, xi(k) is the input coming from node i at the previous layer k.
z values at each layer k, k+1 and k+2 are calculated as follows:
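In this notation, each node first forms the weighted sum of its inputs and then applies its activation f (the identity for a linear unit, σ for a sigmoid unit). For example, at layer k+1 this would read:

```latex
x_l^{(k+1)} = \sum_i w_{l,i}^{(k+1)}\, z_i^{(k)}, \qquad z_l^{(k+1)} = f\!\left(x_l^{(k+1)}\right)
```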
Backpropagation at Layer k+2
First, we are interested in computing the gradient of the online error with respect to the weights wk,l(k+2) of node k at the last layer k+2. The subscript k is the node number and the (k+2) in brackets is the layer number:
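Assuming xk(k+2) denotes the output neuron's weighted-sum input, the chain rule splits this gradient into two factors, Part 1 and Part 2:

```latex
\frac{\partial J}{\partial w_{k,l}^{(k+2)}} =
\underbrace{\frac{\partial J}{\partial x_k^{(k+2)}}}_{\text{Part 1}}
\cdot
\underbrace{\frac{\partial x_k^{(k+2)}}{\partial w_{k,l}^{(k+2)}}}_{\text{Part 2}}
```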
We note that for both sigmoid and linear units, the following equations hold:
This is because Part 1 simplifies to -(yi – f(xi)) for both sigmoid and linear units, and Part 2 follows from the equations for zk(k+2) above. yi is the desired output of the last neuron and f(xi) is its actual output.
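Putting the two parts together (assuming squared error for the linear unit and cross-entropy for the sigmoid unit, both of which yield the same simplification of Part 1), the gradient at the output layer would be:

```latex
\frac{\partial J}{\partial w_{k,l}^{(k+2)}} = -\left(y_i - f(x_i)\right) z_l^{(k+1)}
```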
Backpropagation at Layer k+1
Next we move on to layer k+1 and derive the gradient of the online error with respect to the weights wl,i(k+1):
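As before, the chain rule splits this gradient into Part 1 and Part 2:

```latex
\frac{\partial J}{\partial w_{l,i}^{(k+1)}} =
\underbrace{\frac{\partial J}{\partial x_l^{(k+1)}}}_{\text{Part 1}}
\cdot
\underbrace{\frac{\partial x_l^{(k+1)}}{\partial w_{l,i}^{(k+1)}}}_{\text{Part 2}}
```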
We note that Part 2 is equal to xi(k) according to the equations for zl(k+1) above.
For Part 1, we expand the partial derivative by the chain rule:
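Expanding Part 1 through the layer's output zl(k+1) gives two further factors, Part 3 and Part 4:

```latex
\frac{\partial J}{\partial x_l^{(k+1)}} =
\underbrace{\frac{\partial J}{\partial z_l^{(k+1)}}}_{\text{Part 3}}
\cdot
\underbrace{\frac{\partial z_l^{(k+1)}}{\partial x_l^{(k+1)}}}_{\text{Part 4}}
```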
Part 3 expands into the following:
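Since zl(k+1) feeds every node k at layer k+2, Part 3 would take the form:

```latex
\frac{\partial J}{\partial z_l^{(k+1)}} =
\sum_{k} \frac{\partial J}{\partial x_k^{(k+2)}}\, w_{k,l}^{(k+2)}
```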
This can be calculated by summation over all nodes k at layer k+2. Note that the partial of J with respect to the next layer's inputs is exactly Part 1 of the backpropagation at layer k+2.
Part 4 expands into the following:
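Part 4 is just the derivative of the activation; for a sigmoid unit it would be:

```latex
\frac{\partial z_l^{(k+1)}}{\partial x_l^{(k+1)}} = \sigma'\!\left(x_l^{(k+1)}\right) = z_l^{(k+1)}\left(1 - z_l^{(k+1)}\right)
```

For a linear unit it equals 1, and for a ReLU it is the step function, which the code below uses.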
Backpropagation at Layer k (optional)
Similarly, we derive the gradient of the online error with respect to the weights at layer k:
We expand Part 1 by the chain rule:
Then,
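Following the same pattern as at layer k+1, the layer-k gradient would combine the upstream sum, the local activation derivative, and the local input:

```latex
\frac{\partial J}{\partial w_{i,j}^{(k)}} =
\left(\sum_{l} \frac{\partial J}{\partial x_l^{(k+1)}}\, w_{l,i}^{(k+1)}\right)
f'\!\left(x_i^{(k)}\right) x_j^{(k-1)}
```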
General case of backpropagation
Generally, use the following to compute the gradient at layer k:
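Writing δi(k) for ∂J/∂xi(k), the general recursion can be stated as:

```latex
\delta_i^{(k)} = f'\!\left(x_i^{(k)}\right) \sum_{m} \delta_m^{(k+1)}\, w_{m,i}^{(k+1)},
\qquad
\frac{\partial J}{\partial w_{i,j}^{(k)}} = \delta_i^{(k)}\, x_j^{(k-1)}
```

At the output layer the recursion bottoms out with δ = -(y - f(x)).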
Once the gradient of the online error with respect to the weights is found, the weights are updated by iterating through the neurons at each layer and applying the following update rule:
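With learning rate η, the standard gradient-descent update reads:

```latex
w_{i,j}^{(k)} \leftarrow w_{i,j}^{(k)} - \eta\, \frac{\partial J}{\partial w_{i,j}^{(k)}}
```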
Implementation-wise, there are two important methods in the training routine of a neural network. The first is forward_pass, the second is back_prop:
def forward_pass(self, x_values):
    # Layer one: weighted sum (stored in z), then ReLU activation
    for neuron in self.layer_one:
        neuron.update_output(x_values)
        neuron.z = neuron.output
        if neuron.type == "relu":
            neuron.output = activation_function_relu(neuron.z)
    # Layer two: weighted sum of layer-one outputs, then sigmoid
    for neuron in self.layer_two:
        inputs = self.get_inputs(self.layer_one)
        neuron.update_output(inputs)
        neuron.z = neuron.output
        if neuron.type == "sigmoid":
            neuron.output = apply_sigmoid(neuron.z)
def back_prop(self, y_actual, x_values):
    # One-hot encode the desired output
    y_desired_values = [1 if y_actual == label else 0 for label in labels_include]
    # Update weights on last-layer neurons
    for id, neuron in enumerate(self.layer_two):
        neuron.partial_online_partial_weights = np.zeros(len(neuron.weights))
        # Part 1 at the output layer: f(x) - y
        neuron.partial_online_partial_output = neuron.output - y_desired_values[id]
        inputs = self.get_inputs(self.layer_one)
        for id, el in enumerate(neuron.partial_online_partial_weights):
            neuron.partial_online_partial_weights[id] = neuron.partial_online_partial_output * inputs[id]
    # Update weights on layer-one neurons
    for id_neuron, neuron in enumerate(self.layer_one):
        # Part 3: sum over next-layer nodes (weight index shifted by 1 for the bias)
        neuron.partial_online_partial_output = 0
        for id_next_neuron, next_neuron in enumerate(self.layer_two):
            neuron.partial_online_partial_output += next_neuron.partial_online_partial_output * next_neuron.weights[id_neuron + 1]
        # Part 4: ReLU derivative
        partial_a_partial_z = 1 if neuron.z > 0 else 0
        neuron.partial_online_partial_output *= partial_a_partial_z
        neuron.partial_online_partial_weights = np.zeros(len(neuron.weights))
        for id, el in enumerate(neuron.partial_online_partial_weights):
            neuron.partial_online_partial_weights[id] = neuron.partial_online_partial_output * x_values[id]
    # Apply the gradient updates
    for id_neuron, neuron in enumerate(self.layer_one):
        neuron.update_weights(neuron.partial_online_partial_weights)
    for id_neuron, neuron in enumerate(self.layer_two):
        neuron.update_weights(neuron.partial_online_partial_weights)
Both methods are called in the train function of a neural network class:
def train(self, X_train, y_train):
    for rec_num in range(int(self.num_of_records / 10)):
        print("rec_num: ", rec_num)
        y_actual = y_train[rec_num]
        # Prepend the bias input x0 = 1 to the feature vector
        x_values = np.zeros(self.input_numbers + 1)
        x_values[0] = 1
        x_values[1:] = X_train[rec_num][:]
        self.forward_pass(x_values)
        self.back_prop(y_actual, x_values)
In the code I only used one hidden layer.
Conclusion
Backpropagation remains the backbone of modern neural networks, providing the mechanism through which models learn from data. By efficiently computing gradients layer by layer, it enables deep architectures to adjust millions of parameters and uncover complex patterns. In this article, I derived the mathematical foundations of backpropagation for a network with two hidden layers, showing how the chain rule lies at the heart of this learning process.
While the math is essential, understanding backpropagation also opens the door to practical challenges such as vanishing gradients, weight initialization, and the choice of activation functions. These challenges have inspired innovations like ReLU activations, batch normalization, and advanced optimizers that continue to push deep learning forward.
Ultimately, backpropagation is not just a formula; it is the key principle that makes neural networks trainable and powerful. Whether applied in image recognition, natural language processing, or reinforcement learning, its role is indispensable. By grasping both the theoretical underpinnings and practical implications, you build a foundation for exploring more advanced architectures and optimization techniques in the ever-evolving field of deep learning.
