Backpropagation: Fundamentals of Neural Networks Part 3


Modern neural network architectures are deep, typically involving three or more hidden layers. They use backpropagation to compute the gradients that drive weight updates through all layers. Backpropagation was popularized by the 1986 Nature paper "Learning representations by back-propagating errors". It is one of the most important algorithms in machine learning.

In this article I derive generic backpropagation formulas so that we can move on to more interesting uses of neural networks, such as image classification, in the next articles.

Usual Setup of a Multi-Layer Perceptron

Figure 1 shows a typical neural network consisting of two hidden layers. During the forward pass, the input vector x(k-1) is fed into the first hidden layer k, whose outputs are forwarded to the next hidden layer k+1 and then to the output neuron at layer k+2. During backpropagation, the gradient of the online error with respect to the weights is calculated from the output neuron backwards through layers k+1 and k.

Figure 1. A typical neural network with multiple hidden layers.

In Figure 1 we have weights \(w_{i,j}(k)\), where i is the node number in the current layer, j indexes the outputs of the previous layer, and k is the current layer's number. \(z_i(k)\) is the weighted input (pre-activation) of node i at layer k, and \(x_i(k)\) is that node's output. If the node is a linear neuron, then \(x_i(k) = z_i(k)\). If the node is a sigmoid neuron, then:

\[ x_i(k) = g(z_i(k)) = \frac{1}{1+e^{-z_i(k)}} \]

From the perspective of layer k+1, \(x_i(k)\) is the input arriving from node i of the previous layer k.

z values at each layer k, k+1 and k+2 are calculated as follows:

\[z_i(k) = w_{i,0}(k) + \sum_j w_{i,j}(k) * x_j(k-1)\] \[z_l(k+1) = w_{l,0}(k+1) + \sum_i w_{l,i}(k+1) * x_i(k)\] \[z_k(k+2) = w_{k,0}(k+2) + \sum_l w_{k,l}(k+2) * x_l(k+1)\]
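The forward pass through one such layer can be sketched in NumPy. This is an illustrative sketch only: the weight matrix, bias vector, and input values are made up, and the bias term plays the role of \(w_{i,0}(k)\) above.

```python
import numpy as np

def layer_forward(W, b, x_prev, activation="sigmoid"):
    """Compute z_i = b_i + sum_j W[i, j] * x_prev[j], then apply the activation."""
    z = b + W @ x_prev                      # one z per node in this layer
    if activation == "sigmoid":
        x = 1.0 / (1.0 + np.exp(-z))        # x_i = g(z_i)
    else:                                   # linear unit: x_i = z_i
        x = z
    return z, x

# Two inputs feeding a layer of three sigmoid nodes (illustrative values)
W = np.array([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]])
b = np.array([0.0, 0.1, -0.1])
x_prev = np.array([1.0, 2.0])
z, x = layer_forward(W, b, x_prev)
```

Stacking several such calls, each layer's `x` becoming the next layer's `x_prev`, gives the forward pass through the whole network.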

Backpropagation at Layer k+2

First, we are interested in computing the gradient of the online error with respect to the weights \(w_{k,l}(k+2)\) of node k at the last layer k+2. The subscript k is the node number and (k+2) in brackets is the layer number:

\[ \frac{\partial J_{online}(d,w)}{\partial w_{k,l}(k+2)} = \frac{\partial J_{online}(d,w)}{\partial z_{k}(k+2)} \frac{\partial z_k(k+2)}{\partial w_{k,l}(k+2)} \]

We note that for both sigmoid and linear units, the following equation holds:

\[ \textbf{Part 1. } \frac{\partial J_{online}(d,w)}{\partial z_{k}(k+2)} * \textbf{Part 2. } \frac{\partial z_k(k+2)}{\partial w_{k,l}(k+2)} = -(y_{i} - f(x_i)) * x_l(k+1) \]

This is because Part 1 simplifies to \(-(y_i - f(x_i))\) for both sigmoid and linear units (paired with the cross-entropy and squared-error losses, respectively), and Part 2 follows from the equation for \(z_k(k+2)\) above. \(y_i\) is the desired output of the last neuron and \(f(x_i)\) is its actual output.
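The output-layer gradient \(-(y_i - f(x_i)) * x_l(k+1)\) can be sketched directly. The numeric values below are illustrative, not from any particular network:

```python
import numpy as np

def output_layer_gradient(y_desired, f_actual, x_prev):
    """Gradient of the online error w.r.t. the output node's weights:
    dJ/dw_l = -(y - f) * x_l, where x_prev are the previous layer's outputs."""
    delta = -(y_desired - f_actual)      # Part 1: dJ/dz at the output node
    return delta * np.asarray(x_prev)    # Part 2 multiplies in each input x_l

# Desired output 1.0, actual output 0.73, three inputs from layer k+1
grads = output_layer_gradient(y_desired=1.0, f_actual=0.73, x_prev=[0.5, -0.2, 0.9])
```

Note the sign: when the actual output undershoots the target, the gradient pushes weights on positive inputs upward under the update rule derived later.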

Backpropagation at Layer k+1

Next we move on to layer k+1 and derive the gradient of the online error with respect to the weights \(w_{l,i}(k+1)\):

\[ \frac{\partial J_{online}(d,w)}{\partial w_{l,i}(k+1)} = \textbf{Part 1.} \frac{\partial J_{online}(d,w)}{\partial z_{l}(k+1)} \textbf{Part 2.} \frac{\partial z_l(k+1)}{\partial w_{l,i}(k+1)} = \frac{\partial J_{online}(d,w)}{\partial z_{l}(k+1)} * x_i(k) \]

We note that Part 2 is equal to \(x_i(k)\) according to the equation for \(z_l(k+1)\) above.

For part 1, we expand the partial derivatives by chain rule:

\[ \frac{\partial J_{online}(d,w)}{\partial z_{l}(k+1)} = \textbf{Part 3. } \frac{\partial J_{online}(d_i, w)}{\partial x_l(k+1)} * \textbf{Part 4. } \frac{\partial x_l(k+1)}{\partial z_l(k+1)} \]

Part 3 expands, again by the chain rule, into a sum over all nodes k at layer k+2:

\[ \textbf{Part 3. } \frac{\partial J_{online}(d_i, w)}{\partial x_l(k+1)} = \sum_k (\frac{\partial J_{online}(d_i, w)}{\partial z_k(k+2)} * \frac{\partial z_k(k+2)}{\partial x_l(k+1)}) = \sum_k (\frac{\partial J_{online}(d_i, w)}{\partial z_k(k+2)} * w_{k,l}(k+2)) \]

This sum runs over all nodes k at layer k+2; the factor \(\partial J_{online} / \partial z_k(k+2)\) is exactly Part 1 from the backpropagation step at layer k+2.
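The sum over downstream nodes is just a matrix-vector product. The sketch below uses made-up numbers; the columns of the weight matrix drop the bias term, since biases do not feed back through this sum:

```python
import numpy as np

def backprop_to_outputs(dJ_dz_next, W_next):
    """dJ/dx_l(k+1) = sum over downstream nodes k of dJ/dz_k(k+2) * w_{k,l}(k+2).
    W_next[k][l] is w_{k,l}(k+2); returns one dJ/dx value per node l."""
    return np.asarray(W_next).T @ np.asarray(dJ_dz_next)

dJ_dz_next = np.array([0.3, -0.2])                       # dJ/dz at two layer-(k+2) nodes
W_next = np.array([[0.5, 1.0, -0.5], [0.2, 0.0, 0.4]])   # rows: node k; cols: node l
dJ_dx = backprop_to_outputs(dJ_dz_next, W_next)
```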

Part 4 expands into the following:

\[ \frac{\partial x_l(k+1)}{\partial z_l(k+1)} = 1 \] \[ \textbf{if node } l \textbf{ at layer } k+1 \textbf{ is a linear unit} \]
\[ \textbf{OR} \] \[ \frac{\partial x_l(k+1)}{\partial z_l(k+1)} = \frac{e^{-z_l(k+1)}}{(1+e^{-z_l(k+1)})^2} = g(z_l(k+1))(1-g(z_l(k+1))) \] \[ \textbf{where } g(z) = \frac{1}{1+e^{-z}} \textbf{, if node } l \textbf{ at layer } k+1 \textbf{ is a sigmoid unit} \]
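Both cases of Part 4 can be checked numerically; the identity \(g'(z) = g(z)(1-g(z))\) is verified below against a central finite difference. The function names are mine, not from any library:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def activation_derivative(z, kind="sigmoid"):
    """Part 4: dx/dz is 1 for a linear unit, g(z)(1 - g(z)) for a sigmoid unit."""
    if kind == "linear":
        return np.ones_like(np.asarray(z, dtype=float))
    g = sigmoid(z)
    return g * (1.0 - g)

# Check g'(z) = g(z)(1 - g(z)) against a finite-difference estimate
z0, h = 0.4, 1e-6
numeric = (sigmoid(z0 + h) - sigmoid(z0 - h)) / (2 * h)
analytic = activation_derivative(z0)
```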

Backpropagation at Layer k (optional)

Similarly, we derive the gradient of the online error with respect to the weights at layer k:

\[ \frac{\partial J_{online} (d_i, w)}{\partial w_{i,j}(k)} = \textbf{Part 1. } \frac{\partial J_{online}(d_i, w)}{\partial z_i(k)} * \frac{\partial z_i(k)}{\partial w_{i,j}(k)} =\frac{\partial J_{online}(d_i, w)}{\partial z_i(k)} * x_j(k-1) \]

We expand Part 1 by chain rule:

\[ \frac{\partial J_{online}(d_i, w)}{\partial z_i(k)} = \frac{\partial J_{online}(d_i, w)}{\partial x_i(k)} * \frac{\partial x_i(k)}{\partial z_i(k)} \]

Then,

\[ \frac{\partial J_{online}(d_i, w)}{\partial x_i(k)} = \sum_l (\frac{\partial J_{online}(d_i, w)}{\partial z_l(k+1)} * \frac{\partial z_l(k+1)}{\partial x_i(k)}) = \sum_l (\frac{\partial J_{online}(d_i, w)}{\partial z_l(k+1)} * w_{l,i}(k+1)) \]

General case of backpropagation

Generally, use the following to compute the gradient at layer k:

\[ \frac{\partial J_{online}}{\partial w_{i,j}(k)} = \delta_i(k) * x_j(k-1) \] \[ \textbf{where } \delta_i(k) \textbf{ is the error at node } i \textbf{ of layer } k \textbf{:} \] \[ \delta_i(k) = \frac{\partial J_{online}}{\partial z_i(k)} = \frac{\partial x_i(k)}{\partial z_i(k)} * \sum_l \delta_l(k+1) * w_{l,i}(k+1) \] \[ \textbf{computed recursively from the layer in front, starting at the last unit at layer } K \textbf{:} \] \[ \delta_i(K) = -(y - f(x,w)) \]
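As a sanity check, the delta recursion with \(\delta = \partial J / \partial z\) can be sketched in NumPy and verified against a finite-difference gradient. Everything below is an illustrative assumption, not the network from Figure 1: a single sigmoid hidden layer, a linear output node, a squared-error loss, tiny made-up weights, and biases omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_deltas(weights, zs, y, f):
    """Sketch of the general recursion, taking delta_i(k) = dJ/dz_i(k).
    weights[k] has shape (nodes at layer k, nodes at layer k-1); hidden
    layers are assumed sigmoid, the output unit linear, biases omitted."""
    num_layers = len(weights)
    deltas = [None] * num_layers
    # Output layer: dJ/dz = -(y - f) for the matching loss/activation pairing
    deltas[-1] = np.atleast_1d(-(y - f))
    # Hidden layers: delta(k) = g'(z(k)) * W(k+1)^T @ delta(k+1)
    for k in range(num_layers - 2, -1, -1):
        g_prime = sigmoid(zs[k]) * (1.0 - sigmoid(zs[k]))
        deltas[k] = g_prime * (weights[k + 1].T @ deltas[k + 1])
    return deltas

# Tiny example: 2 inputs -> 2 sigmoid hidden nodes -> 1 linear output
W1 = np.array([[0.2, -0.1], [0.05, 0.3]])
W2 = np.array([[0.7, -0.4]])
x0 = np.array([1.0, 0.5])
z1 = W1 @ x0; x1 = sigmoid(z1)
z2 = W2 @ x1; f = z2[0]
deltas = backprop_deltas([W1, W2], [z1, z2], y=1.0, f=f)
# Weight gradient at the hidden layer: dJ/dW1[i, j] = delta_i * x0[j]
grad_W1 = np.outer(deltas[0], x0)
```

Perturbing a single weight and recomputing the loss \(J = \frac{1}{2}(y-f)^2\) reproduces the same gradient, which is a useful debugging technique for any hand-written backpropagation.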

Once the gradient of the online error with respect to the weights is found, the weights are updated by iterating through the neurons at each layer and applying the following update rule, where \(\alpha\) is the learning rate:

\[ w \leftarrow w - \alpha * \frac{\partial J_{online}}{\partial w} \]
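A minimal sketch of this update rule; the learning rate and the weight and gradient values below are illustrative:

```python
import numpy as np

def sgd_update(w, grad, lr=0.1):
    """One online gradient-descent step: w <- w - lr * dJ/dw."""
    return w - lr * np.asarray(grad)

w = np.array([0.5, -0.3, 0.8])
grad = np.array([0.2, -0.1, 0.05])
w_new = sgd_update(w, grad, lr=0.1)
```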

Implementation-wise, there are two important methods in the training routine of a neural network: forward_pass and back_prop.

def forward_pass(self, x_values):
    # Layer 1: compute each neuron's weighted input z, then apply ReLU
    for neuron in self.layer_one:
        neuron.update_output(x_values)
        neuron.z = neuron.output
        if neuron.type == "relu":
            neuron.output = activation_function_relu(neuron.z)
    # Layer 2: feed layer-1 outputs forward, then apply the sigmoid
    for neuron in self.layer_two:
        inputs = self.get_inputs(self.layer_one)
        neuron.update_output(inputs)
        neuron.z = neuron.output
        if neuron.type == "sigmoid":
            neuron.output = apply_sigmoid(neuron.z)

def back_prop(self, y_actual, x_values):
    # One-hot encode the desired output
    y_desired_values = [1 if y_actual == label else 0 for label in labels_include]

    # Gradients for the last-layer neurons: delta = output - y_desired
    for id_neuron, neuron in enumerate(self.layer_two):
        neuron.partial_online_partial_weights = np.zeros(len(neuron.weights))
        neuron.partial_online_partial_output = neuron.output - y_desired_values[id_neuron]
        inputs = self.get_inputs(self.layer_one)
        for id_weight in range(len(neuron.partial_online_partial_weights)):
            neuron.partial_online_partial_weights[id_weight] = neuron.partial_online_partial_output * inputs[id_weight]

    # Gradients for layer-1 neurons: sum downstream deltas, multiply by ReLU'(z)
    for id_neuron, neuron in enumerate(self.layer_one):
        neuron.partial_online_partial_output = 0
        for next_neuron in self.layer_two:
            # weights[0] is the bias, so this neuron's weight sits at id_neuron + 1
            neuron.partial_online_partial_output += next_neuron.partial_online_partial_output * next_neuron.weights[id_neuron + 1]
        partial_a_partial_z = 1 if neuron.z > 0 else 0  # ReLU derivative
        neuron.partial_online_partial_output *= partial_a_partial_z

        neuron.partial_online_partial_weights = np.zeros(len(neuron.weights))
        for id_weight in range(len(neuron.partial_online_partial_weights)):
            neuron.partial_online_partial_weights[id_weight] = neuron.partial_online_partial_output * x_values[id_weight]

    # Apply the gradient-descent update to both layers
    for neuron in self.layer_one:
        neuron.update_weights(neuron.partial_online_partial_weights)
    for neuron in self.layer_two:
        neuron.update_weights(neuron.partial_online_partial_weights)

Both methods are called in the train function of a neural network class:

def train(self, X_train, y_train):
    # Train on a subset (one tenth) of the records
    for rec_num in range(int(self.num_of_records / 10)):
        print("rec_num: ", rec_num)
        y_actual = y_train[rec_num]
        # Prepend the bias input x[0] = 1 to the feature vector
        x_values = np.zeros(self.input_numbers + 1)
        x_values[0] = 1
        x_values[1:] = X_train[rec_num][:]

        self.forward_pass(x_values)
        self.back_prop(y_actual, x_values)

In this code I used only one hidden layer.

Conclusion

Backpropagation remains the backbone of modern neural networks, providing the mechanism through which models learn from data. By efficiently computing gradients layer by layer, it enables deep architectures to adjust millions of parameters and uncover complex patterns. In this article, I derived the mathematical foundations of backpropagation for a network with two hidden layers, showing how the chain rule lies at the heart of this learning process.

While the math is essential, understanding backpropagation also opens the door to practical challenges such as vanishing gradients, weight initialization, and the choice of activation functions. These challenges have inspired innovations like ReLU activations, batch normalization, and advanced optimizers that continue to push deep learning forward.

Ultimately, backpropagation is not just a formula—it is the key principle that makes neural networks trainable and powerful. Whether applied in image recognition, natural language processing, or reinforcement learning, its role is indispensable. By grasping both the theoretical underpinnings and practical implications, you build a foundation for exploring more advanced architectures and optimization techniques in the ever-evolving field of deep learning.

