Introduction to Sigmoid Function
Previously we covered Single-Node Perceptron neural networks. The next important topics to study are the sigmoid neuron of a neural network and logistic regression. The sigmoid function is defined as:

g(z) = 1 / (1 + e^(-z))
This function originates in mathematical biology and was used in various fields long before its adoption in artificial neural networks. Here’s a brief timeline:
Early Origins in Mathematics and Biology
- Mid-19th Century: Pierre François Verhulst introduced the sigmoid function, under the name logistic function, in 1838 to model population growth. It describes how populations grow rapidly at first but slow down as they approach a carrying capacity.
- 1920s-1930s: The sigmoid (logistic) function gained prominence in statistics for modeling probabilities in logistic regression, with contributions from statisticians such as Ronald Fisher and later refinements by others.
Adoption in Neural Networks
- 1943: Warren McCulloch and Walter Pitts introduced the concept of artificial neurons.
- 1980s: The sigmoid function became widely used in artificial neural networks during the resurgence of interest in connectionist models, popularized by Rumelhart, Hinton, and Williams in their seminal 1986 paper, “Learning representations by back-propagating errors.” Backpropagation made the sigmoid function particularly prominent, because its differentiability made it ideal for computing gradients.
The sigmoid function was preferred because:
- It is smooth and differentiable, enabling gradient-based learning.
- It maps input values to a range between 0 and 1, allowing for probabilistic interpretations.
Although sigmoid functions have since been largely replaced by ReLU and other activation functions due to their limitations (e.g., vanishing gradients), their use in the 1980s and 1990s was critical for the development of modern neural networks.
Logistic Regression and its Relation to Sigmoid Neuron
Logistic regression is a statistical and machine learning method used for binary classification, and it is closely akin to a sigmoid neuron in a neural network. It predicts the probability of an outcome belonging to one of two classes (e.g., 0 or 1, “yes” or “no,” “spam” or “not spam”).
Features of Logistic Regression
- Probabilistic Output. Logistic regression predicts probabilities for each class.
- Sigmoid Function. Logistic regression uses the sigmoid function, also known as the logistic function, which maps the output of the linear model to a value between 0 and 1.
- Binary Classification. It is primarily used for two-class (binary) problems. You can extend it to multi-class problems using techniques like one-vs-rest or softmax regression.
- Decision Boundary. Logistic regression learns a linear decision boundary in the feature space, which separates the two classes.
- Training. The weights are learned by maximizing the log-likelihood of the observed data or, equivalently, minimizing the logistic loss, using techniques like gradient descent (see the sketch after this list).
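As a quick illustration of these features, here is a minimal scikit-learn sketch on a synthetic two-class dataset (the dataset parameters and variable names below are illustrative, not taken from a particular source):

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two clusters of points, one per class
X, y = make_blobs(n_samples=200, n_features=2, centers=2, random_state=0)

# Fit logistic regression: a sigmoid applied to a linear model of the features
clf = LogisticRegression()
clf.fit(X, y)

# Probabilistic output: P(class 1 | x) for the first five samples
print(clf.predict_proba(X[:5])[:, 1])

# Hard class labels, obtained by thresholding the probability at 0.5
print(clf.predict(X[:5]))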
Applications of Logistic Regression
- Medical Diagnosis. Predicting the presence or absence of a disease (e.g., “diabetic” or “non-diabetic”).
- Spam Detection. Classifying emails as “spam” or “not spam.”
- Customer Churn. Predicting whether a customer will leave a service.
- Credit Scoring. Assessing the likelihood of loan repayment.
Limitations of Logistic Regression:
- Assumes a linear relationship between input features and the log-odds of the outcome.
- Not ideal for complex datasets with non-linear decision boundaries unless feature engineering is applied (a small sketch follows this list).
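To make the second limitation concrete, here is a small scikit-learn sketch of how polynomial feature engineering lets logistic regression fit a non-linear boundary (the dataset and the degree are arbitrary examples):

from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# A dataset whose two classes cannot be separated by a straight line
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

# Plain logistic regression is limited to a linear boundary
linear_model = LogisticRegression().fit(X, y)
print("accuracy with raw features:", linear_model.score(X, y))

# Adding degree-2 polynomial features lets the linear model bend its boundary
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)
print("accuracy with polynomial features:", poly_model.score(X, y))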
Logistic regression remains a foundational tool in machine learning and statistics.
Sigmoid Neuron-Based Perceptron
A sigmoid-based perceptron is an enhancement of the classic perceptron that incorporates the sigmoid activation function to enable smoother decision boundaries and probabilistic outputs. Whereas the original perceptron can only handle linearly separable data with binary outputs, the sigmoid-based perceptron maps inputs to a continuous range of values between 0 and 1. This is particularly useful for classification tasks where outputs represent probabilities.
The sigmoid function is defined as:

g(z) = 1 / (1 + e^(-z))

Here, z is the linear combination of the inputs:

z = w · x + b

where w are the weights, x are the input features, and b is the bias.
The smooth, S-shaped curve of the sigmoid function allows the model to:
- Handle Non-Linearity: By introducing non-linearity, the sigmoid-based perceptron can approximate complex decision boundaries.
- Output Probabilities: The function’s output can be interpreted as the probability of an input belonging to a particular class.
The sigmoid-based perceptron serves as a building block for more advanced neural network architectures. It plays a crucial role in early models for tasks like binary classification, logistic regression, and probabilistic modeling.
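As a minimal sketch of what a single sigmoid neuron computes, assuming made-up weights, bias, and input values:

import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2])   # weights (illustrative values)
b = 0.1                     # bias
x = np.array([2.0, 0.3])    # input features

z = np.dot(w, x) + b        # linear combination z = w · x + b
print(sigmoid(z))           # probability-like output between 0 and 1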
Applications of Sigmoid Neuron-Based Perceptron Neural Network
Sigmoid neurons play a foundational role in neural network architectures, particularly in early applications of machine learning. Their ability to output probabilities between 0 and 1 makes them ideal for binary classification and other probabilistic tasks. Below are some key applications of sigmoid neurons:
- Binary Classification – e.g., spam detection, disease prediction, sentiment analysis.
- Logistic Regression – Sigmoid neurons are at the core of logistic regression.
- Probability Estimation – In addition to classification, sigmoid neurons are used to estimate probabilities.
- Early Neural Networks – Before advanced activation functions like ReLU were developed, sigmoid neurons were the standard building block in neural networks.
- Control Systems – e.g., robot motion control, dynamic decision-making in autonomous systems.
- Economics and Behavioral Modeling – Predicting choices in economics, psychology, and marketing.
Sigmoid Neuron Architecture and Mathematical Derivations
The sigmoid node in a neural network sits at the end, in the output layer. It takes the output of a perceptron node from the previous layer and passes it through the sigmoid function.
The classification problem solved by the sigmoid neuron is learning to map the output y into class 0 or class 1:

y ∈ {0, 1}

The problem can also be formulated as learning the probability of class 1 given the input features x:

P(y = 1 | x)

The sigmoid function is defined as:

g(z) = 1 / (1 + e^(-z)), where z = w · x + b
The sigmoid neuron is a convenient addition to the single-node perceptron. It classifies data points into a binary class {0, 1} or, alternatively, provides the probability of class 1, mapping y into [0, 1]. With a single-node perceptron it was possible to linearly separate data points based on the real-valued output y; with the addition of a sigmoid neuron, the output y becomes a binary class of either 1 or 0. The input feature vector x provides the values for the horizontal and vertical axes.
One way of estimating the error is to round the probabilistic output to either 1 or 0 and compare it with the actual output class. The output of the function f is decided by predicting:

ŷ = 1 if g(z) ≥ 0.5, otherwise ŷ = 0
Alternatively, use a minimize-error / maximize-likelihood approach:
L stands for likelihood, which is the probability of seeing the data D given the weights w. To optimize the classifier, the error is minimized by maximizing the likelihood.
Now let’s define the probability P of seeing the data D given the weights w:

P(D | w) = L(w) = Π_i g(z_i)^(y_i) · (1 − g(z_i))^(1 − y_i)

where z_i = w · x_i + b for each training example (x_i, y_i).
The weight updates will be defined in terms of the partial derivatives of the error. To simplify the calculations, let’s use a trick and maximize the log-likelihood instead:

log L(w) = Σ_i [ y_i · log g(z_i) + (1 − y_i) · log(1 − g(z_i)) ]
Then we differentiate the online (per-example) error J, the negative log-likelihood of a single training example:

J = −[ y · log g(z) + (1 − y) · log(1 − g(z)) ]
Step 2 requires knowledge of two partial derivatives. First, we find:

∂g(z)/∂z = g(z) · (1 − g(z))

And:

∂z/∂w_i = x_i
Then we substitute into equation (3) and go back to the derivative of the online error J. If we take the base of the logarithm to be e, we get:

∂J/∂w_i = −(y − g(z)) · x_i
Weights are updated according to:

w_i ← w_i − η · ∂J/∂w_i = w_i + η · (y − g(z)) · x_i

where η is the learning rate.
One can also maximize the likelihood directly by setting the derivatives of the likelihood L to zero:

∂L/∂w_i = 0
Steps 4 and 5 are common in neural networks. Step 4 (modified) is usually done when you implement logistic regression.
NumPy Implementation of a Sigmoid Neuron
import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate two linearly separable clusters of points, one per class
num_inputs = 1000
num_features = 2
X, y = make_blobs(n_samples=num_inputs, n_features=num_features, centers=2, cluster_std=0.5, random_state=0)

# Initialize weights: one weight per feature plus one for the bias term
learning_rate = 0.01
weights = np.random.rand(num_features + 1)
new_weights = np.zeros(weights.shape)

def sigmoid(Z):
    # Map the linear combination Z to the range (0, 1)
    return 1 / (1 + math.e ** (-Z))

def update_weights(record_num):
    # Prepend 1 so the first weight acts as the bias
    x_values = [1, X[record_num, 0], X[record_num, 1]]
    for i in range(len(weights)):
        Z = np.dot(x_values, weights)
        # Gradient of the per-sample error: -x_i * (y - g(Z))
        loss = -x_values[i] * (y[record_num] - sigmoid(Z))
        new_weights[i] = weights[i] - learning_rate * loss

# One pass over the data, updating the weights after every sample (online learning)
for record_num in range(num_inputs):
    update_weights(record_num)
    weights = np.copy(new_weights)

def prediction(x_values):
    y_values_predicted = np.zeros(x_values.shape[0])
    for i in range(len(x_values)):
        x_values_full = [1, x_values[i, 0], x_values[i, 1]]
        Z = np.dot(x_values_full, weights)
        y_values_predicted[i] = sigmoid(Z)
        # Threshold the probability at 0.5 to obtain a hard class label
        if y_values_predicted[i] >= 0.5:
            y_values_predicted[i] = 1
        else:
            y_values_predicted[i] = 0
    return y_values_predicted

# Classify the training points and plot them colored by predicted class
predictions = prediction(X)
plt.scatter(X[:, 0], X[:, 1], c=predictions)
plt.show()
Experimenting with Program Output
Linearly Separable Clusters
The sigmoid neuron learns to classify data points with input vector x into class 1 or class 0, denoted by the output y. It places a linear boundary in the space of input vectors x and then determines which cluster a data point belongs to from the output value y.
All pictures show how two clusters of points get separated by a linear boundary. The model can also output probabilities in the range [0, 1], as shown in the following picture:
As points move away from the center of the purple cluster, the probability of a point belonging to the purple class decreases, and it continues to drop as points move toward the center of the yellow cluster. This change is captured by the color shift between purple and yellow.
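To reproduce this kind of probability view with the NumPy implementation above (assuming that listing has already been run, so weights, X, sigmoid, np, and plt are defined), one can skip the 0.5 thresholding and color each point by the raw sigmoid output:

# Color each point by its predicted probability instead of its hard class
probs = np.array([sigmoid(np.dot([1, x1, x2], weights)) for x1, x2 in X])
plt.scatter(X[:, 0], X[:, 1], c=probs, cmap="viridis")
plt.colorbar(label="P(class 1 | x)")
plt.show()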
Comparison of Perceptrons, Sigmoid Neurons, and Modern Neural Network Layers
Neural networks have evolved significantly from their origins, with perceptrons and sigmoid neurons serving as foundational concepts. Below is a comparison of perceptrons, sigmoid neurons, and modern neural network layers.
| Feature | Perceptrons | Sigmoid Neurons | Modern Neural Network Layers |
|---|---|---|---|
| Activation Function | Step function (binary output: 0 or 1) | Sigmoid function (g(z) = 1/(1+e^-z)) | Advanced functions like ReLU, tanh, softmax, etc. (watch Activation Functions in Neural Networks) |
| Output Range | Binary (0 or 1) | Continuous range (0 to 1, interpretable as probability) | Varies: ranges include [0, 1], (-1, 1), or unbounded, depending on the activation function |
| Non-Linearity | Not capable of modeling non-linear relationships | Introduces non-linearity via sigmoid activation | Strong non-linearity with diverse activation options, enabling deep learning |
| Learning Capability | Can only solve linearly separable problems | Handles non-linear decision boundaries | Can model highly complex, hierarchical patterns |
| Training Algorithm | Perceptron learning rule (basic weight updates) | Gradient descent | Stochastic gradient descent (SGD) and advanced optimizers like Adam, RMSprop, etc. |
| Use Cases | Early binary classification tasks | Binary classification, logistic regression | Image recognition, natural language processing, time-series analysis, etc. |
| Limitations | Inability to learn non-linear patterns | Susceptible to vanishing gradients; limited in deep networks | Addresses vanishing gradients (ReLU) and supports large-scale, efficient learning |
Key Advancements:
- Perceptrons laid the groundwork for artificial neural networks but were limited to linear decision boundaries.
- Sigmoid Neurons introduced smooth transitions and probabilistic outputs, enabling basic non-linear learning and making logistic regression possible.
- Modern Neural Network Layers (e.g., convolutional layers, recurrent layers) have leveraged advanced activation functions, optimization techniques, and architectures like CNNs and transformers to achieve state-of-the-art performance across diverse domains.
This evolution illustrates the growing complexity and power of neural networks. These models are capable of tackling real-world problems at scale.
Vanishing Gradient Problem
The vanishing gradient problem is a fundamental challenge in training deep neural networks. It occurs when the gradients of the loss function become very small as they are propagated back through the network during training. This can lead to several issues:
- Slow Training: Extremely small gradients cause very slow updates to the weights, making it difficult for the network to learn effectively.
- Poor Performance: The network may fail to capture meaningful patterns, especially in the earlier layers, because their weights are not updated adequately.
- Layer Inactivation: Layers closer to the input may effectively “stop learning” because their gradients diminish to near-zero.
The vanishing gradient problem is particularly severe in networks with activation functions like the sigmoid or tanh. This is because their derivatives are small for input values far from zero.
The graph of the derivative of the sigmoid, g′(z) = g(z)(1 − g(z)), looks like this:

The derivative g′(z) = g(z)(1 − g(z)) reaches its maximum of 0.25 when g(z) = 0.5 (that is, at z = 0), but it becomes very small when z is large in magnitude (positive or negative). This small derivative causes gradients to shrink exponentially as they are backpropagated through layers.
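A quick numerical illustration of how small the sigmoid derivative becomes away from zero, and how these factors shrink when multiplied across layers (the z values and layer count below are arbitrary examples):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # g'(z) = g(z) * (1 - g(z)), with a maximum of 0.25 at z = 0
    g = sigmoid(z)
    return g * (1.0 - g)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_derivative(z))   # 0.25, ~0.105, ~0.0066, ~0.000045

# Even at the maximum of 0.25, chaining ten sigmoid layers multiplies
# gradients by at most 0.25 ** 10, which is roughly 1e-6
print(0.25 ** 10)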
Solution to Vanishing Gradient Problem
To address this problem, alternatives to the sigmoid, such as ReLU, were developed. The ReLU function is defined as:

ReLU(z) = max(0, z)

Its derivative is:

1 for z > 0, and 0 for z ≤ 0
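A small NumPy sketch of ReLU and its derivative, for contrast with the sigmoid (the sample values are illustrative):

import numpy as np

def relu(z):
    # max(0, z), applied elementwise
    return np.maximum(0, z)

def relu_derivative(z):
    # 1 for positive inputs, 0 otherwise; the gradient does not saturate for z > 0
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(z))  # [0. 0. 0. 1. 1.]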
Advantages of ReLU:
- Non-Saturating Gradients: For positive inputs, the gradient is always 1, preventing it from vanishing.
- Efficient Computation: Simpler computation compared to sigmoid or tanh.
- Sparsity: By outputting 0 for negative inputs, ReLU introduces sparsity in the network, improving computational efficiency.
Limitations:
- Dying ReLU Problem: Some neurons can output 0 for all inputs (due to large negative weights), effectively becoming inactive.
The vanishing gradient problem posed significant challenges in training deep networks, especially in earlier architectures that relied on sigmoid or tanh activations. The introduction of ReLU and its variants, along with other innovations such as batch normalization and skip connections, has largely mitigated this issue, enabling the training of extremely deep networks with remarkable performance across a variety of tasks. These advancements have been pivotal in the rise of deep learning and have revolutionized its applications in fields like computer vision and natural language processing.
Conclusion
The sigmoid neuron represents a significant milestone in the evolution of neural networks, bridging the gap between the simplicity of perceptrons and the complexity of modern deep learning architectures. By introducing the sigmoid activation function, neural networks gained the ability to handle non-linear relationships and produce probabilistic outputs, making them suitable for a wide range of classification and decision-making tasks. Despite its historical importance, the limitations of the sigmoid neuron, such as the vanishing gradient problem, highlighted the need for improved activation functions and training techniques.
Modern neural networks have addressed these challenges through several innovations. These include ReLU and its variants, batch normalization, and architectures such as ResNet. These developments have enabled the training of deeper and more powerful models. This allows neural networks to achieve state-of-the-art results in diverse applications. These applications range from image recognition to natural language processing.
The principles of the sigmoid neuron continue to inform and inspire advancements in artificial intelligence. As the foundation of logistic regression and early neural networks, sigmoid neurons remain an essential concept for understanding the fundamentals of machine learning and the progress of deep learning. Through continued innovation, the legacy of the sigmoid neuron endures in shaping the future of intelligent systems.