When you train a neural network, you’re teaching it to make better predictions by adjusting its internal parameters (weights and biases) to minimize loss (how wrong it is). But how does it know what direction to adjust these weights?
Enter Gradient Descent, the engine that drives neural networks to learn.
Imagine standing on hilly terrain in thick fog, trying to find the lowest point. You take small steps downhill in the steepest direction until you reach the bottom.
Gradient Descent works the same way:
It calculates the gradient (slope) of the loss function with respect to each weight.
It updates the weights in the opposite direction of the gradient to reduce the loss.
For each parameter (weight) w, the update rule is:
w = w − α · ∂L/∂w
Where:
α = the learning rate (how big your steps are)
L = the loss function
∂L/∂w = the gradient of the loss with respect to w
Stochastic Gradient Descent (SGD): Uses one sample at a time to compute gradients. Faster but noisier.
Mini-Batch Gradient Descent: Uses small batches (e.g., 32 samples). Balances speed and stability.
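To make the update rule concrete, here is a minimal sketch of a mini-batch gradient descent loop in NumPy. The grad_loss function is a hypothetical placeholder for whatever computes ∂L/∂w on a batch, and samples are stored one per row here just for simplicity (the MNIST code later in this post stores them in columns instead).

import numpy as np

def minibatch_gradient_descent(w, X, Y, grad_loss, learning_rate=0.01, batch_size=32, epochs=10):
    # X: one sample per row, Y: matching labels; grad_loss(w, X_batch, Y_batch) -> dL/dw
    n_samples = X.shape[0]
    for epoch in range(epochs):
        indices = np.random.permutation(n_samples)       # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            batch = indices[start:start + batch_size]
            grad = grad_loss(w, X[batch], Y[batch])      # gradient on this mini-batch
            w = w - learning_rate * grad                 # step opposite the gradient
    return w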
Even if you are an absolute beginner to AI/machine learning, this blog will teach you a lot about what goes on behind the scenes. A neural network is basically a stack of fully connected layers of nodes (neurons) that take in information and pass information forward. There is an input layer, which takes in the original dataset, and an output layer, which spits out the model's predictions. In the middle are the hidden layers, the main pillar of a neural network: they do the main transformations for the model. I personally built a neural network to predict handwritten digits (the MNIST dataset).
A feedforward neural network with:
1 input layer
1 hidden layer (using ReLU)
1 output layer (using softmax)
Training using cross-entropy loss and gradient descent
import numpy as np
Assume the input X has shape (features, samples) and the one-hot encoded labels Y have shape (classes, samples).
def initialize_parameters(input_size, hidden_size, output_size):
    # Small random weights break symmetry; biases start at zero
    W1 = np.random.randn(hidden_size, input_size) * 0.01
    B1 = np.zeros((hidden_size, 1))
    W2 = np.random.randn(output_size, hidden_size) * 0.01
    B2 = np.zeros((output_size, 1))
    return W1, B1, W2, B2
ReLU and Softmax:
def ReLU(Z):
    return np.maximum(0, Z)

def ReLU_derivative(Z):
    # 1 where Z > 0, 0 elsewhere (the boolean array behaves as 0/1 in arithmetic)
    return Z > 0

def softmax(Z):
    # Subtracting the column-wise max keeps exp() numerically stable
    expZ = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)
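As a quick sanity check (with made-up numbers), each column of the softmax output should be a valid probability distribution that sums to 1:

Z_demo = np.array([[2.0, 1.0],
                   [1.0, 3.0],
                   [0.1, 0.2]])    # shape (classes, samples) = (3, 2)
probs = softmax(Z_demo)
print(probs.sum(axis=0))           # -> [1. 1.]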
def forward_propagation(W1, B1, W2, B2, X):
    Z1 = np.dot(W1, X) + B1    # hidden layer pre-activation
    A1 = ReLU(Z1)              # hidden layer activation
    Z2 = np.dot(W2, A1) + B2   # output layer pre-activation
    A2 = softmax(Z2)           # class probabilities
    return Z1, A1, Z2, A2
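If you want to check the shapes before training, you can run the forward pass on random data. The sizes below (784 features for 28×28 pixels, 64 hidden units, 10 classes, 5 samples) are just illustrative:

X_demo = np.random.randn(784, 5)                      # 5 fake "images"
W1, B1, W2, B2 = initialize_parameters(784, 64, 10)
Z1, A1, Z2, A2 = forward_propagation(W1, B1, W2, B2, X_demo)
print(A2.shape)                                       # -> (10, 5): one probability column per sample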
Cross-entropy loss:
def compute_loss(Y, A2):
    m = Y.shape[1]
    # 1e-8 avoids taking log(0) when a predicted probability is exactly zero
    loss = -np.sum(Y * np.log(A2 + 1e-8)) / m
    return loss
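To build intuition, here is the loss on two made-up predictions for a single sample: a confident correct prediction gives a small loss, while a confident wrong one gives a large loss.

Y_true = np.array([[1.0], [0.0], [0.0]])     # one sample whose true class is 0
A_good = np.array([[0.9], [0.05], [0.05]])   # confident and correct
A_bad  = np.array([[0.05], [0.9], [0.05]])   # confident and wrong
print(compute_loss(Y_true, A_good))          # ~0.105
print(compute_loss(Y_true, A_bad))           # ~3.0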
def backward_propagation(Z1, A1, Z2, A2, W2, X, Y):
    m = X.shape[1]
    dZ2 = A2 - Y   # gradient of the cross-entropy loss w.r.t. the softmax input Z2
    dW2 = (1/m) * np.dot(dZ2, A1.T)
    dB2 = (1/m) * np.sum(dZ2, axis=1, keepdims=True)
    dZ1 = np.dot(W2.T, dZ2) * ReLU_derivative(Z1)
    dW1 = (1/m) * np.dot(dZ1, X.T)
    dB1 = (1/m) * np.sum(dZ1, axis=1, keepdims=True)
    return dW1, dB1, dW2, dB2
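A common way to make sure the backprop math is right is a numerical gradient check: nudge one weight up and down by a tiny amount and compare the finite-difference slope with the analytic gradient. The helper below is optional and purely illustrative; it reuses the functions defined above on whatever small X and Y you pass in.

def gradient_check_W2(W1, B1, W2, B2, X, Y, i=0, j=0, eps=1e-5):
    # Analytic gradient from backpropagation
    Z1, A1, Z2, A2 = forward_propagation(W1, B1, W2, B2, X)
    dW1, dB1, dW2, dB2 = backward_propagation(Z1, A1, Z2, A2, W2, X, Y)

    # Finite-difference estimate for the single entry W2[i, j]
    W2_plus, W2_minus = W2.copy(), W2.copy()
    W2_plus[i, j] += eps
    W2_minus[i, j] -= eps
    loss_plus = compute_loss(Y, forward_propagation(W1, B1, W2_plus, B2, X)[3])
    loss_minus = compute_loss(Y, forward_propagation(W1, B1, W2_minus, B2, X)[3])
    numeric = (loss_plus - loss_minus) / (2 * eps)

    print(f"analytic: {dW2[i, j]:.6f}  numeric: {numeric:.6f}")   # the two should nearly match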
Using gradient descent:
def update_parameters(W1, B1, W2, B2, dW1, dB1, dW2, dB2, learning_rate):
    W1 -= learning_rate * dW1
    B1 -= learning_rate * dB1
    W2 -= learning_rate * dW2
    B2 -= learning_rate * dB2
    return W1, B1, W2, B2
def get_predictions(A2):
    return np.argmax(A2, axis=0)

def get_accuracy(predictions, Y):
    return np.mean(predictions == np.argmax(Y, axis=0))
def train(X, Y, hidden_size, learning_rate, epochs):
    input_size = X.shape[0]
    output_size = Y.shape[0]
    W1, B1, W2, B2 = initialize_parameters(input_size, hidden_size, output_size)
    for epoch in range(epochs):
        Z1, A1, Z2, A2 = forward_propagation(W1, B1, W2, B2, X)
        loss = compute_loss(Y, A2)
        dW1, dB1, dW2, dB2 = backward_propagation(Z1, A1, Z2, A2, W2, X, Y)
        W1, B1, W2, B2 = update_parameters(W1, B1, W2, B2, dW1, dB1, dW2, dB2, learning_rate)
        if epoch % 100 == 0:
            predictions = get_predictions(A2)
            acc = get_accuracy(predictions, Y)
            print(f"Epoch {epoch}, Loss: {loss:.4f}, Accuracy: {acc:.4f}")
    return W1, B1, W2, B2
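Here is an example call on synthetic random data standing in for MNIST. Real MNIST would need to be loaded, flattened to 784 features per image, and normalised first; on random labels the accuracy will hover around chance, but the shapes and the training loop are exactly the same.

np.random.seed(0)
num_features, num_classes, num_samples = 784, 10, 1000

X = np.random.rand(num_features, num_samples)            # (features, samples)
labels = np.random.randint(0, num_classes, num_samples)
Y = np.zeros((num_classes, num_samples))
Y[labels, np.arange(num_samples)] = 1                     # one-hot encode the labels

W1, B1, W2, B2 = train(X, Y, hidden_size=64, learning_rate=0.1, epochs=500)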
If you guys need more help understanding the math behind this, check out the YouTube playlist linked up top!
If you guys want to see my own code, check out my GitHub at https://github.com/KaushikRao196
Artificial Intelligence is exactly what it sounds like: technology that can simulate human intellect in tasks and bring more efficiency, whether to a business or to education. When thinking about AI, most people tend to talk about chatbots like ChatGPT and DeepSeek. These innovations fall under generative AI, which refers to models that can create new text, audio, images and even videos from the patterns they have learned in existing data.
AI is a massive field of computer science which contains a major subfield: Machine Learning. Before machine learning, computers had to be hard-coded using rigid logic and search algorithms; however, they could not handle messy, complex real-world data. ML allowed computers to learn from data instead of relying on fixed rules. ML consists of subfields like deep learning, reinforcement learning, and supervised and unsupervised learning.
When talking about generative AI we are really referring to deep learning, a type of machine learning that utilises artificial neural networks to find patterns in and analyse large amounts of complex data. These networks are inspired by the human brain and its innate ability to recognise features without having to think for long.
There are three main types of deep learning architectures: Convolutional Neural Networks (CNNs), used mainly for images; Recurrent Neural Networks (RNNs), used for text or time series; and Transformers, which power large language models like GPT. In the next blog we will discuss in more detail how neural networks work and the math behind them.