Introduction
Neural networks have been part of AI research for a long time. It all started with the perceptron, the simplest artificial neural network, introduced in 1958 by Frank Rosenblatt. Despite its simplicity, it serves as a fundamental building block of artificial neural networks and has several practical applications. It excels at linearly separable problems and performs well in tasks requiring binary classification or basic decision making.
Applications of Single-Node Perceptron Neural Network
Below are key real-world applications of a single-node perceptron:
- Binary Classification Problems – Spam Detection or Credit Approval. A single-node perceptron can classify data into two distinct groups based on a linear boundary.
- Image Recognition – Handwritten Digit Recognition or Edge Detection in images. Single-layer perceptrons can identify simple image patterns. Identifies whether a digit is a 0 or 1 based on pixel intensity thresholds. Detects basic shapes and edges in images by evaluating pixel contrast.
- Signal Processing – Noise Filtering or Fault Detection in Systems. Differentiates between useful signals and background noise in audio or communication channels. Identifies whether a specific signal indicates a fault or an acceptable operational state.
- Medical Diagnosis – Disease Prediction. Classifies patients as high-risk or low-risk for a specific disease based on certain metrics like blood pressure or cholesterol levels.
- Sentiment Analysis – Product Reviews. Single-layer perceptrons can classify text as positive or negative sentiment based on predefined keywords, categorizing user reviews as satisfied or dissatisfied.
When the perceptron was created, neural networks could only learn to recognize simple patterns such as digits. Backpropagation emerged in 1974, and the multilayer perceptron followed in 1986. With these advancements, neural networks grew in complexity and began to solve harder problems.
One of the most important components of training a neural network is the gradient, so the next section presents what the gradient means and how it is derived mathematically.
Gradient Introduction
In mathematics and machine learning, a gradient represents the rate of change or slope of a function with respect to its input variables. It is a vector that points in the direction of the steepest ascent of the function. Its magnitude indicates the rate of change in that direction.
The gradient has broad applications across disciplines, including:
- Optimization Algorithms: Gradient Descent, Stochastic Gradient Descent (SGD).
- Physics: Represents the direction and magnitude of maximum change in fields like temperature or pressure.
- Computer Graphics: Used in shading and rendering techniques.
- Machine Learning: Training neural networks and fine-tuning model parameters.
Understanding gradient requires familiarity with calculus.
Gradient Theory
Step 1.
Suppose we have the following nonlinear multivariate function:
The gradient of f is defined as:
We also show that:
The last equality holds because g depends only on x: the partial derivative with respect to x treats y and z as constants. The same reasoning applies to the partial derivatives with respect to y and z.
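As a sketch of this step, assume for concreteness a separable form $f(x, y, z) = g(x) + h(y) + k(z)$ (an assumption made here for illustration, matching the remark that $g$ depends only on $x$):

$$\nabla f = \left(\frac{\partial f}{\partial x},\; \frac{\partial f}{\partial y},\; \frac{\partial f}{\partial z}\right), \qquad \frac{\partial f}{\partial x} = \frac{\partial}{\partial x}\bigl(g(x) + h(y) + k(z)\bigr) = \frac{dg}{dx}.$$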
Step 2.
Now we take the total derivative of f:
This is equivalent to:
If we divide both sides by dt, we get:
Rewriting this in dot product notation:
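This step is the multivariable chain rule; written out as a sketch:

$$df = \frac{\partial f}{\partial x}\,dx + \frac{\partial f}{\partial y}\,dy + \frac{\partial f}{\partial z}\,dz,$$
$$\frac{df}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt} + \frac{\partial f}{\partial z}\frac{dz}{dt} = \nabla f \cdot \mathbf{v}, \qquad \mathbf{v} = \left(\frac{dx}{dt},\; \frac{dy}{dt},\; \frac{dz}{dt}\right).$$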
Step 3.
Suppose f = c (constant), then:
Since f does not change with time t, the movement described by the velocity v stays within a level curve (or level surface) of f in x, y, z. The gradient of f is therefore perpendicular to the velocity v: velocity moves a point along the level curve (no change in f), while the gradient moves a point perpendicular to the level curve (f changes). This means the gradient is the direction in which f changes. I now show that this change is positive and is the steepest, producing the largest increase in the value of f.
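In symbols (sketch):

$$f = c \;\Rightarrow\; \frac{df}{dt} = \nabla f \cdot \mathbf{v} = 0 \;\Rightarrow\; \nabla f \perp \mathbf{v}.$$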
Directional Derivative
Next, we prove that moving along the gradient of f results in the largest increase in the value of f. The directional derivative is defined as:
It is derived by taking the total derivative with respect to the arc length s:
When the angle between the gradient and the unit vector is zero (theta = 0), the cosine equals 1, meaning the unit vector points along the gradient vector. This gives the largest value of the directional derivative.
So the gradient of f points in the direction of the largest increase of f. The magnitude of the gradient vector |grad f| gives the rate of increase of f in the direction of the gradient.
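In symbols, with $\hat{u}$ a unit vector and $\theta$ the angle between $\nabla f$ and $\hat{u}$ (sketch):

$$\frac{df}{ds} = \nabla f \cdot \hat{u} = |\nabla f|\,|\hat{u}|\cos\theta = |\nabla f|\cos\theta,$$

which is maximized at $\theta = 0$, where $\hat{u}$ points along $\nabla f$ and the rate of increase equals $|\nabla f|$.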
To learn more about the gradient and the directional derivative, see MIT OpenCourseWare Lecture 12.
Single Node Perceptron – The Basics of Neural Network
A Single-Layer/Single-Node Perceptron is a perceptron that can only learn patterns that are linearly separable. It works well for tasks where a straight line (or hyperplane) can separate the data into different categories. Figure 1 below shows the network architecture of the perceptron:
Components of a Single-Node Perceptron:
- Input Features. Usually numbers representing values along the x-axis, image pixels, or text encoded as numbers.
- Weights. Every input feature is given a weight that establishes how much of an impact it has on the output. These weights are adjusted during training to find their ideal values.
- Summation Function. The perceptron combines the inputs with their corresponding weights to compute the weighted sum of its inputs.
- Output. The value computed by the network, which during training is compared against the known desired output values.
- Bias. A bias term such as w0 shifts the decision boundary and adds flexibility to learning.
- Input, Hidden, and Output layers. The three kinds of layers common to most neural networks (a single-node perceptron has no hidden layer).
Additionally, there can be an activation function: the weighted sum is passed through it and compared against a threshold to produce a binary output (0 or 1).
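Putting the components together, a common way to write the single-node perceptron's output is shown below (a sketch, not necessarily the exact notation of Figure 1; here $x_0 = 1$ stands in for the bias input and $\phi$ is a threshold activation):

$$\hat{y} = \phi\!\left(\sum_{i=0}^{n} w_i x_i\right), \qquad \phi(z) = \begin{cases} 1 & \text{if } z > \text{threshold} \\ 0 & \text{otherwise.} \end{cases}$$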
The perceptron adjusts its weights and bias using a learning algorithm such as gradient descent. The goal is to minimize the error in predicting the output.
Generalization error:
The goal is to find weights that minimize the error function. One way is direct optimization: the derivatives of the error with respect to each parameter must equal 0:
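A standard squared-error form, used here as a sketch consistent with the per-sample error defined later, is:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl(y_n - \hat{y}_n\bigr)^2, \qquad \frac{\partial E}{\partial w_i} = 0 \quad \text{for every weight } w_i.$$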
An alternative technique for solving the same optimization problem is the gradient-descent method.
Gradient-Descent Method – Optimization by Neural Network
The gradient-descent method moves the weights in the direction that decreases the error. The goal is to change the parameter value of w according to the gradient:
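In symbols (a sketch), each weight takes a step against its error gradient, scaled by the learning rate $\alpha$:

$$w_i \;\leftarrow\; w_i - \alpha\,\frac{\partial E}{\partial w_i}.$$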
There is one variable that needs more attention.
In the context of machine learning, alpha (α) refers to the learning rate, a critical hyperparameter used in optimization algorithms like gradient descent. The learning rate determines the size of the steps the algorithm takes toward minimizing the error during the training process. It controls how quickly or slowly the model updates its weights in response to the calculated gradient of the loss function.
A high learning rate (α) can lead to faster convergence but risks overshooting the minimum. A low learning rate ensures more precise convergence but may require significantly more iterations, increasing computational cost. Some common learning rate values are 0.01 and 0.001.
Minimizing Errors – Goal of Neural Network
Error for entire dataset:
Error for individual sample:
Here the division by 2 simplifies the derivative; it does not affect the correctness of the optimization, because scaling the error by a constant does not change where its minimum lies.
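Written out (a sketch matching the squared-error loss used in the implementation below):

$$E = \frac{1}{2}\sum_{n=1}^{N}\bigl(y_n - \hat{y}_n\bigr)^2, \qquad e_n = \frac{1}{2}\bigl(y_n - \hat{y}_n\bigr)^2, \qquad \frac{\partial e_n}{\partial \hat{y}_n} = -\bigl(y_n - \hat{y}_n\bigr),$$

where the factor of 1/2 cancels the 2 produced by differentiating the square.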
Here I have chosen to use online learning rather than learning from batches. Batch Learning and Online Learning are two fundamental approaches to training machine learning models. Each method has its strengths, weaknesses, and ideal use cases, depending on the nature of the data and computational resources.
For batch learning, the model is trained using the entire dataset, or large chunks of it, at once. Training proceeds in epochs, where the model iterates over the dataset multiple times. It requires all the data to be available before training begins and is suitable for static datasets that do not change over time.
For online learning, the model is trained incrementally, using one data point (or a small mini-batch) at a time, and it updates its parameters immediately after processing each instance. It can handle streaming data arriving in real time and is suitable for dynamic datasets where the data is constantly changing.
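A minimal sketch of the difference between the two update loops, using toy data invented purely for illustration (this is not the dataset built in the implementation below):

import numpy as np

# Toy data: y = 2x plus a little noise
X_all = np.linspace(0, 1, 100)
y_all = 2 * X_all + 0.1 * np.random.rand(100)
alpha_learn = 0.01

# Batch learning: one weight update per full pass over the dataset
w_batch = 0.0
for epoch in range(50):
    grad = np.mean((w_batch * X_all - y_all) * X_all)  # gradient averaged over all samples
    w_batch -= alpha_learn * grad

# Online learning: one weight update per individual sample
w_online = 0.0
for x_n, y_n in zip(X_all, y_all):
    grad = (w_online * x_n - y_n) * x_n               # gradient from a single sample
    w_online -= alpha_learn * grad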
Weight Updates
Next is the weight-update formula. The weights are updated to make the online error smaller:
This results in the following equations for weight updates:
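For the single-node perceptron with the per-sample squared error above, applying the chain rule gives (a sketch; here $x_0 = 1$ so that $w_0$ plays the role of the bias):

$$\frac{\partial e_n}{\partial w_i} = -\bigl(y_n - \hat{y}_n\bigr)\,x_i, \qquad w_i \;\leftarrow\; w_i + \alpha\,\bigl(y_n - \hat{y}_n\bigr)\,x_i.$$

This is the rule implemented by the weight-update line in the NumPy code below.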
There are two additional hyperparameters that are relevant to the weight updates: epoch and iterations.
An epoch refers to one complete pass of the entire training dataset through the learning algorithm. In each epoch, the model processes every training sample, calculates the loss, and updates its weights accordingly.
An iteration refers to one update of the model’s parameters (weights and biases) after processing a batch of data. The number of iterations depends on the batch size. Formula: Iterations per Epoch = Total Samples / Batch Size.
In the Python implementation of the neural network, I use 1 epoch and n iterations per epoch, where n is the number of records in the dataset. The batch size is 1, giving n/1 = n iterations per epoch.
NumPy Implementation of a Single Node Perceptron Neural Network
With the help of NumPy, I create a simple neural network, a single-node perceptron, that learns the parameters of a linear function y = ax + b, where a is the slope and b is the bias.
import numpy as np
import matplotlib.pyplot as plt

input_numbers = 1          # number of input features
num_of_records = 1000      # number of training records
x_inputs = np.zeros((num_of_records, input_numbers))
y_outputs = np.zeros((num_of_records, input_numbers))
alpha = 0.8                # slope of the generating line y = alpha * x + bias
bias = 1                   # bias of the generating line

# Building dataset from points on a 2D plot
from_val = -2
to_val = 10
step = (to_val - from_val) / num_of_records
x_axis_values = np.arange(from_val, to_val, step)
random_shift = -1 + 2 * np.random.rand(len(x_axis_values))  # uniform noise in [-1, 1)
np.copyto(x_inputs[:, 0], x_axis_values)
np.copyto(y_outputs[:, 0], bias + alpha * x_axis_values + random_shift)

# Defining weights used in model training (weights[0] is the bias weight)
weights = np.random.rand(input_numbers + 1)
alpha_learn = 0.01         # learning rate

for rec_num in range(num_of_records):
    # Prepend a constant 1 so that weights[0] acts as the bias
    x_values = np.zeros(input_numbers + 1)
    x_values[0] = 1
    x_values[1:] = x_inputs[rec_num][:]

    # Apply weights: compute the weighted sum (predicted y)
    func_value = 0
    for i in range(input_numbers + 1):
        func_value += weights[i] * x_values[i]

    # Update weights: online gradient-descent step on the squared error
    weights += np.multiply(alpha_learn * (y_outputs[rec_num] - func_value), x_values)

# Plot the learned line (red) over the training data (blue dots)
x1 = [-2, 10]
y1 = [weights[0] + weights[1] * x1[0], weights[0] + weights[1] * x1[1]]
plt.plot(x1, y1, marker='o', c='r')
plt.scatter(x_inputs, y_outputs, s=2.0)
plt.show()
Experimenting with Program Output
Linear Regression
The program finds the equation of a line that best fits the data. If we plot the learned line together with the data, we get the following output:
The learned equation for the line produces the red segment with x values between -2 and 10. The line fits the input data (blue dots) by minimizing the errors between the actual y value and the predicted y value for each data record.
Linear Decision Boundary – Grouping on y by x
The model can also be trained to classify output into one of two categories: those with y > 3 and those with y <= 3:
The model attempts to classify all points with y values greater than 3 into an orange category and all points with y values smaller than or equal to 3 into a blue category. It is not perfect, and an additional step is needed in the neural network pipeline: an activation function. The function takes the predicted and actual y values, which can be any real numbers, and maps them to the set {0, 1} as follows:
def activation_function(y):
    if y > 3:
        return 1
    else:
        return 0
Then, whenever the predicted y value or the actual y value is used, pass it through the activation function:
...
    y_predicted = activation_function(y_predicted)
    # Update weights
    weights += np.multiply(alpha * (activation_function(y_outputs[rec_num]) - y_predicted), x_values)
...
The limitation of the model is that it is trained to predict the output correctly only when there is a single linear decision boundary. It is not possible to handle multiple boundaries, or non-linear boundaries, with a single-node perceptron. The model essentially maps values of x from the real number range (-Inf, +Inf) to a set of two numbers {0, 1}. Imagine a straight line moving around the x-y plane: the model tries to learn the bias and slope that produce the line partitioning the space into two parts most effectively. One part of the space is predicted to contain data values with class 0 and the other part is predicted to contain data values with class 1. The number of non-matching class values in the two parts serves as an indication of how well the line partitions the space. The line that produces the fewest errors becomes the best fit.
Linear Decision Boundary – Grouping on y by y
The accuracy of predicting y values near the boundary depends on the model's ability to predict values close to it. (The boundary separates all values greater than 3 from all values smaller than or equal to 3.) In Figure 3, it is clear that the model finds the boundary in terms of x that best matches the segmentation in terms of y. This makes sense, as we only provide x values as inputs; the y values are provided but are used only as targets, not as a source of prediction. To include y values in the prediction, the same y output values must be provided as an extra input, in addition to the x values already present.
When the y output is included as an additional input feature alongside x, the model correctly uses the boundary defined on the y axis:
The model now colors points blue when their y value is lower than or equal to 3, and orange when their y value is higher than 3.
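One way to wire this up, as a minimal sketch reusing the variable names from the implementation above (not necessarily the exact code used to produce the figure), is to rebuild the inputs with two features and re-initialize the weights before training:

# Sketch: two input features, the original x value and the known y value
input_numbers = 2
x_inputs = np.zeros((num_of_records, input_numbers))
np.copyto(x_inputs[:, 0], x_axis_values)
np.copyto(x_inputs[:, 1], y_outputs[:, 0])   # y values fed in again as a second input feature
weights = np.random.rand(input_numbers + 1)  # bias weight, weight for x, weight for y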
This concludes our experimentation with program output.
Comparison of Architectures
The Single-Node Perceptron is the simplest form of a neural network, and it also serves as a building block for more complex neural network structures. For example, the perceptron node often precedes a sigmoid function node at the end of a network, and large neural networks often deploy dozens of single-node perceptrons to create multiple hidden layers with multiple processing units in each layer. See the table below for descriptions of the most popular neural network architectures:
| Name | Network Architecture | Capabilities | Limitations |
| --- | --- | --- | --- |
| Single-Node Perceptron | Consists of a single layer of neurons with a single node per layer. | Can only solve problems that are linearly separable; draws a straight line or hyperplane to separate the classes. | Cannot solve problems that require non-linear decision boundaries; struggles with tasks like XOR, where data is not linearly separable. |
| Multi-Layer Perceptron (MLP) | Consists of multiple layers of neurons: one input layer, several hidden layers, one output layer. Each neuron in a layer is connected to every neuron in the next layer. | Can learn non-linear relationships between inputs and outputs and tackle tasks such as image recognition and speech processing. Uses non-linear activation functions like ReLU, Sigmoid, or Tanh to introduce non-linearity. | Computationally expensive and requires significant data for training. |
| Convolutional Neural Networks (CNNs) | Designed for processing grid-like data, such as images. Consist of layers of convolutional filters, pooling layers, and fully connected layers. | Effective for image classification, object detection, and segmentation; can detect edges and textures. | Less suited for sequential or time-series data. |
| Recurrent Neural Networks (RNNs) | Designed to handle sequential data by maintaining a memory of previous inputs, which allows processing of time-series data such as speech or text. | Used for tasks like natural language processing (NLP), time-series prediction, and speech recognition. | Struggle with long-range dependencies. |
Conclusion
The Single-Node Perceptron serves as the foundational building block of neural networks, demonstrating how mathematical principles and learning algorithms can be combined to enable machines to make decisions and classify data. Despite its simplicity, the perceptron effectively showcases key concepts such as weights, biases, activation functions, and gradient descent optimization. However, its limitation in solving non-linear problems highlights the need for more advanced architectures, such as multi-layer perceptrons (MLPs).
Understanding the mechanics of a single-node perceptron not only provides insight into the roots of artificial intelligence but also lays the groundwork for building more complex models. As we progress to more sophisticated neural network structures, the principles learned from the perceptron remain essential. This document marks the first step in exploring the vast field of neural networks, setting the stage for deeper dives into multi-layer architectures and advanced learning techniques.