Back to basics: Neural network from scratch

For one of my classes, I had to implement a neural network from scratch. It was a great refresher on the math behind a neural network. Though, that doesn't make implementing and visualizing the math easier.

"You're good in math bro, you should learn machine learning bro"

I write this to remind my future self on the intuition behind the math in the neural network, in case I want a refresher, or if I ever have to write a neural network from scratch again (when pigs start flying). I won't be discussing what each layer does or why we use certain layers, per se. I wanted to show how we derived the forward and backward passes (especially the backward pass).

Setup

The model we'll describe will be a classification model. It takes in input(s), and outputs a probability distribution describing how likely it is to be a specific class. It will consist of the following layers, in order:

Linear
ReLU
Softmax

For our loss function, we'll use the negative log likelihood (NLL) function.

PLEASE NOTE that this is for learning purposes. In practice, depending on what you want (say for classification), you may not want ReLU as the last layer.

For classification, the cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss to be the cross entropy loss function. However, I prefer to keep them separate to make math easier to grasp.

Typically, most guides describe the input $x$ to the model as a vector with $D_{in}$ input features. However, we usually train over multiple examples (a batch size $N$ ). Thus, in this guide, we'll define our input to our model as a matrix with dimensions $N \times D_{in}$ . Each row in the input is one training example.

Similarly, our model output will be a $N \times D_{out}$ matrix, where $D_{out}$ represents the number of output features (the number of classes we can predict).

With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network.

Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning it.

Linear layer

The linear layer is defined as $h = Wx + b$ (though we may rearrange the matrices multiplication to match dimensions). We will use $h$ to represent the output of the linear layer.

I find it more intuitive to do $h = xW + b$ , rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define dimensions and whatever

If we had one training example, our matrix multiplication would look something like this:

$\begin{bmatrix} x_0 & ... & x_{D_{in}} \end{bmatrix} \begin{bmatrix} W_{0, 0} & ... & W_{D_{out}} \\ ... & ... & ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix} + \begin{bmatrix} b_0 & ... & b_{D_{out}} \end{bmatrix} = \begin{bmatrix} h_0 & ... & h_{D_{out}} \end{bmatrix}$

But remember, we typically do training in batches and have multiple examples. Thus, our input $x$ has dimensions $N \times D_{in}$ . Each row is one training example, and we have $N$ examples. Likewise, the output of the layer, $h$ , will be a matrix of dimensions $N \times D_{out}$ . Each row is the output for one example, and we have $N$ examples.

$\begin{bmatrix} h_{0, 0} & ... & h{0, D_{out}} \\ ... \\ h_{N, 0} & ... & h_{N, D_{out}} \end{bmatrix}$

Our weights $W$ will have dimensions of $D_{in} \times D_{out}$ . We use the same weights for each of our examples.

$\begin{bmatrix} W_{0, 0} & ... & W_{D_{out}} \\ ... & W_{i,j} & ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix}$

If we do the matrix multiplication $Wx$ , we get a resultant matrix of dimensions $N \times D_{out}$ , matching our $h$ . Each row in the output is the result of multipling weights to the corresponding input row.

The biases $b$ will have dimensions of $N \times D_{out}$ . Keep in mind that we use the same bias values for all training examples. So really, $b$ is just a vector of with $D_{out}$ columns (just like we showed for the one training example), and expanded along the other dimension to have $N$ rows, so that we can add our biases to each training example.

$\begin{bmatrix} b_0 & ... & b_{D_{out}} \\ ... \\ b_0 & ... & b_{D_{out}} \end{bmatrix}$

Putting it together:

$\begin{bmatrix} x_{0,0} & ... & x_{0,D_{in}} \\ ... \\ x_{N,0} & ... & x_{N,D_{in}} \end{bmatrix} \begin{bmatrix} W_{0, 0} & ... & W_{D_{out}} \\ ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix} + \begin{bmatrix} b_0 & ... & b_{D_{out}} \\ ... \\ b_0 & ... & b_{D_{out}} \end{bmatrix} = \begin{bmatrix} h_{0, 0} & ... & h_{0, D_{out}} \\ ... \\ h_{N, 0} & ... & h_{N, D_{out}} \end{bmatrix}$

ReLU layer

ReLU is pretty easy. It's defined as:

$r = max(0, h)$

Here, $h$ is the output from the linear layer being fed into the input of our ReLU layer. And we'll define $r$ as the output of our ReLU layer.

Essentially, what we're doing is taking every single input $h_{i,j}$ from our input matrix and running it through our ReLU function, e.g. ensuring every value becomes $\geq 0$ .

The dimensions of our input $h$ is $N \times D_{out}$ . ReLU doesn't do any matrix transformations, so it outputs a matrix $r$ of dimensions $N \times D_{out}$ also.

Softmax layer

Softmax is the final layer of our neural network. Typically, according to Wikipedia, softmax takes in an input vector and outputs a vector of same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. For each column $S_i$ , in the output, $S_i$ represents the probability that the training example is part of class $i$ . And since this is a probability distribution, all the values in the output add up to 1.

For one training example, the softmax is defined as:

$S_i = \frac{e^{r_i}}{\sum_k^{D_{out}} e^{r_k}}$

The input to softmax is a vector $r$ of size $1 \times D_{out}$ . The output of softmax will be a vector $S$ of size $1 \times D_{out}$ as well.

$S = \begin{bmatrix} S_0 & ... & S_j & ... & S_{D_{out}} \end{bmatrix}$

For some math clarification:

$\sum_k^{D_{out}} e^{r_k} = e^{r_0} + e^{r_1} + ... + e^{r_{D_{out}}}$

$\begin{bmatrix} r_0 & ... & r_{D_{out}} \end{bmatrix} \Rightarrow \begin{bmatrix} \frac{e^{r_0}}{\sum_k e^{r_k}} & ... & \frac{e^{r_{D_{out}}}}{\sum_k e^{r_k}} \end{bmatrix}$

For example, if softmax returned $[0.8, 0.15, 0.05]$ , this means our model predicted that there is an 80% chance that the original input $x$ (not $r$ ) belongs to class index 0, and that there's a 5% chance that the input belongs to class index 2.

However, this is just for one training example. To extend this for our model, where we have multiple training examples and $r$ has dimensions $N \times D_{out}$ , we can define our softmax as:

$S_{i, j} = \frac{e^{r_{i, j}}}{\sum_k e^{r_{i, k}}}$

We are basically taking each row in $r$ (a vector), and outputting another vector which is another row in our final output $S$ . It ends up giving us an output matrix $S$ of dimensions $N \times D_{out}$ . Each row represents the predicted probability distribution for one example.

For some math clarification:

$\sum_k e^{r_{i, k}} = e^{r_{i, 0}} + e^{r_{i, 1}} + ... + e^{r_{i, D_{out}}}$

$\begin{bmatrix} r_{0, 0} & ... & r_{0, D_{out}} \\ ... \\ r_{N, 0} & ... & r_{N, D_{out}} \end{bmatrix} \Rightarrow \begin{bmatrix} \frac{e^{r_{0, 0}}} {\sum_k e^{r_{0, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{0, k}}} \\ \\ ... \\ \\ \frac{e^{r_{N, 0}}} {\sum_k e^{r_{N, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{N, k}}} \end{bmatrix}$

Loss function

Softmax is our final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.

To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.

$S = [0.8, 0.15, 0.05]$

Typically, our true probability distribution $y$ will look something like this for classification models:

$y = [1, 0, 0]$

Interpreting all this, this means that the input $x$ actually belongs class index 0 (it has 100% chance as shown in $y$ ). Our model made a prediction that the input has an 80% chance of being in index 0. So it's almost there.

How do we measure how well our model did? We'll calculate a loss value, which is a scalar value. We'll use the NLL loss equation (for one training example):

$\mathcal{L}(y, S) = -\sum_c y_c log(S_c)$

It basically sums up the log likelihood for each output feature $c$ (each possible class), and takes the negative. Why negative? Because we want to minimize the total loss.

Not gonna explain NLL right now, that's a different story

To expand this for multiple training examples, we basically compute the loss for each training example, and output the average loss for all $N$ examples:

$\mathcal{L}(y, S) = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$

Note that here, $y$ is a matrix of size $N \times D_{out}$ , where each row is the true probability distribution for one example.

Minimizing loss

We go forward in our neural network with our initial input $x$ , get predicted distributions $S$ for each traing example, and finally calculate a final loss value $\mathcal{L}$ for each training example, and take the average over the batch.

$h = Wx + b$

$r = max(0, h)$

$S = \text{Softmax}(r)$

$\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})$

Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the gradient of our loss function with respect to our weights or biases for each of out training examples. Then, we'll take the average gradient and update our weights and biases accordingly.

$W_{new} = W_{old} - \frac{1}{N} \sum_{i=0}^N \nabla_W \mathcal{L}_i$

$b_{new} = b_{old} - \frac{1}{N} \sum_{i=0}^N \nabla_b \mathcal{L}_i$

Let's look at how we'll calculate the gradient for each training example. Recall chain rule:

$\frac {\partial \mathcal{L}}{\partial W} = \frac {\partial \mathcal{L}}{\partial S} \frac {\partial S}{\partial r} \frac {\partial r}{\partial h} \frac {\partial h}{\partial W}$

$\frac {\partial \mathcal{L}}{\partial b} = \frac {\partial \mathcal{L}}{\partial S} \frac {\partial S}{\partial r} \frac {\partial r}{\partial h} \frac {\partial h}{\partial b}$

We'll talk in detail about how we'll derive each of these deriviates for one training example, and how we'll update our parameters with respect to multiple training examples, as we described before.

Loss derivative

Recall our loss function for one training example:

$\mathcal{L} = - \sum_c^{D_{out}} y_{c} log(S_{c})$

We need to calculate $\frac {\partial \mathcal{L}}{\partial S}$ . But, let's observe what this derivative looks like.

For one training example, $S$ is the input with dimensions of $1 \times D_{out}$ . Our loss function is simply one scalar value. Thus, we need to compute the following:

$\frac {\partial \mathcal{L}}{\partial S} = \begin{bmatrix} \frac {\partial \mathcal{L}}{\partial S_0} & ... & \frac {\partial \mathcal{L}}{\partial S_i} & ... & \frac {\partial \mathcal{L}}{\partial S_{D_{out}}} \\ \end{bmatrix}$

Let's try and solve each $\frac {\partial \mathcal{L}}{\partial S_i}$ .

$\frac {\partial \mathcal{L}}{\partial S_i} = \frac {\partial \mathcal{L}}{\partial S_i} (- \sum_c^{D_{out}} y_c log(S_c))$

$= \frac {\partial \mathcal{L}}{\partial S_i} (-(y_0 log(S_0) + ... + y_i log(S_i) + ... + y_{D_{out}} log(S_{D_{out}})))$

$= -(\frac {\partial \mathcal{L}}{\partial S_{i}} (y_0 log(S_0)) + ... + \frac {\partial \mathcal{L}}{\partial S_i} (y_i log(S_i)) + ... + \frac {\partial \mathcal{L}}{\partial S_{i}} (y_{D_{out}} log(S_{D_{out}})))$

We only care about taking the derivative with respect to $S_{i}$ . This means that the derivative of all other terms is 0. Thus, we have:

$\frac {\partial \mathcal{L}}{\partial S_{i}} = \frac {\partial \mathcal{L}}{\partial S_{i}} (- y_{i} log(S_{i}))$

$= -\frac{y_{i}}{S_{i}}$

This is the derivative of the loss function with respect to its prediction (softmax) output for one training example. It is a matrix of size $1 \times D_{out}$ .

Softmax derivative

For one training example, the output of the softmax function $S$ is size $1 \times D_{out}$ . The input $r$ to our softmax function has dimensions $1 \times D_{out}$ . We would like to compute the following:

$\frac{\partial S}{\partial r} = \begin{bmatrix} \frac {\partial S}{\partial r_{0}} & ... & \frac {\partial S}{\partial r_i} & ... & \frac {\partial S}{\partial r_{D_{out}}} \end{bmatrix}$

Let's try and solve for each $\frac {\partial S}{\partial r_{i}}$ .

As stated before, $S$ has size $1 \times D_{out}$ . So really, when computing $\frac {\partial S}{\partial r_{i}}$ , we're computing another matrix:

$\frac {\partial S}{\partial r_{i}} = \begin{bmatrix} \frac {\partial S_{0}}{\partial r_{i}} & ... & \frac {\partial S_j}{\partial r_i} & ... & \frac {\partial S_{D_{out}}}{\partial r_{i}} \end{bmatrix}$

In order for us to compute $\frac {\partial S}{\partial r_{i}}$ , we need to compute how each of the components $S_j$ changes with respect to $r_{i}$ .

In other words, our $\frac {\partial S}{\partial r}$ looks like this:

$\frac{\partial S}{\partial r} = \begin{bmatrix} \frac {\partial S_0}{\partial r_{0}} & ... & \frac {\partial S_0}{\partial r_i} & ... & \frac {\partial S_0}{\partial r_{D_{out}}} \\ ... \\ \frac {\partial S_{D_{out}}}{\partial r_{0}} & ... & \frac {\partial S_{D_{out}}}{\partial r_i} & ... & \frac {\partial S_{D_{out}}}{\partial r_{D_{out}}} \end{bmatrix}$

This is also called the Jacobian matrix. I ain't a math expert so I'm not gonna explain it, go read it on Wikipedia

Let's compute $\frac {\partial S_{j}}{\partial r_{i}}$ . Recall our softmax equation for one training example:

$S_j = \frac{e^{r_j}}{\sum_k e^{r_k}}$

I replaced $i$ with $j$ for notation consistency in the current context

We solve for $\frac {\partial S_{j}}{\partial r_{i}}$ . We use quotient rule:

$\frac {\partial S_{j}}{\partial r_{i}} = \frac { (\frac{\partial S_{j}}{\partial r_{i}} e^{r_i}) (\sum_k e^{r_k}) - e^{r_i} \frac {\partial S_{j}}{\partial r_{i}} (\sum_k e^{r_k}) } {(\sum_k e^{r_k})^2}$

Observe that $\frac {\partial S_{j}}{\partial r_{i}} e^{r_i}$ can depend on whether $i = j$ .

$\frac {\partial S_{j}}{\partial r_{i}} e^{r_i} = \begin{cases} e^{r_i} & \text{if }i = j\\ 0 & \text{if }i \neq j \end{cases}$

This is because in the original equation, $S_j$ actually only depends on $e_{r_j}$ . Otherwise, its derivative is zero.

For $\frac {\partial S_{j}}{\partial r_{i}} (\sum_k e^{r_k})$ , its derivative is the same regardless of $i$ or $j$ :

$\frac {\partial S_{j}}{\partial r_{i}} (\sum_k e^{r_k}) = \frac {\partial S_{j}}{\partial r_{i}}(e^{r_0} + ... + e^{r_i} + ... + e^{r_{D_{out}}}) = \begin{cases} e^{r_j} & \text{if }i = j\\ e^{r_j} & \text{if }i \neq j \end{cases}$

After some labor (that I skipped writing about), we arrive to the derivation:

$\frac {\partial S_{j}}{\partial r_{i}} = \begin{cases} S_j \cdot (1 - S_j) & \text{if } i = j \\ -S_i S_j & \text{if }i \neq j \\ \end{cases}$

If you did this by hand, you'll notice that you can substitute $S_j$ into your derivations. That's what I did, in case you're confused.

Future me, if you don't get this, go back to calc 1 smh

In the end, our matrix looks kinda cool with a neat diagonal:

$\frac{\partial S}{\partial r} = \begin{bmatrix} S_0 \cdot (1 - S_0) & ... & -S_0 S_{D_{out}} \\ ... \\ -S_{D_{out}} S_0 & ... & S_{D_{out}} \cdot (1 - S_{D_{out}}) \\ \end{bmatrix}$

ReLU derivative

Just like forward pass, this step is pretty easy. The derivative of ReLU (with respect to each $h$ element in the input matrix) is:

$\frac {\partial \mathcal{r}}{\partial h} = \begin{cases} 0 & \text{if }h < 0 \\ 1 & \text{if }h \geq 0 \\ \end{cases}$

stackoverflow

Linear derivative

Recall the equation for our linear layer:

$h = Wx + b$

Recall that, for one training example, the output matrix $h$ has dimensions $1 \times D_{out}$ . The input matrix $x$ has dimensions $1 \times D_{in}$ , our weights have dimensions $D_{in} \times D_{out}$ , and our biases have dimensions $1 \times D_{out}$ .

First, let's compute $\frac {\partial h}{\partial W}$ . We want the following:

$\frac {\partial h}{\partial W} = \begin{bmatrix} \frac {\partial h}{\partial W_{0, 0}} & ... & \frac {\partial h}{\partial W_{0, j}} & ... & \frac {\partial h}{\partial W_{0, D_{out}}} \\ \\ ... \\ \\ \frac {\partial h}{\partial W_{i, 0}} & ... & \frac {\partial h}{\partial W_{i, j}} & ... & \frac {\partial h}{\partial W_{j, D_{out}}}\\ \\ ... \\ \\ \frac {\partial h}{\partial W_{D_{in}, 0}} & ... & \frac {\partial h}{\partial W_{D_{in}, j}} & ... & \frac {\partial h}{\partial W_{D_{in}, D_{out}}} \end{bmatrix}$

We must also compute $\frac {\partial h}{\partial W_{i, j}}$ :

$\frac {\partial h}{\partial W_{i, j}} = \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, j}} & ... & \frac {\partial h_k}{\partial W_{i, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, j}} \end{bmatrix}$

So, our Jacobian matrix looks like this:

$\frac {\partial h}{\partial W} = \begin{bmatrix} \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, D_{out}}} \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, D_{out}}} \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, D_{out}}} \end{bmatrix} \end{bmatrix}$

Now, this Jacobian matrix looks a bit weird. It appears to be a 3D matrix. When I first saw this, I had a hard time visualizing this, especially since this is only one training example (which adds another dimension to our matrix 💀). However, as we continue and simplify stuff, rest assured it will be clear.

Let's continue and compute $\frac {\partial h_k}{\partial W_{i, j}}$ . Using basic matrix multiplication knowledge, we can note the equation for a specific $h_k$ component:

$h_k = (\sum_n^{D_{in}} W_{n, k} x_{n}) + b_k$

Observe that for computing $\frac {\partial h_k}{\partial W_{i, j}}$ , when $i \neq k$ , then the derivative is zero. This is because the $h_k$ depends on only $W_{n, k}$ , as noted in the origianl equation of $h_k$ . Intuitively, it's because during matrix multiplication, only column $k$ in the weights matrix actually affect column $k$ in the final matrix $h.

Order of matrix multiplication is $xW$ rather than $Wx$ . As said earlier, this is more intuitive for me, especially when we extend it to multiple examples. I just write $Wx$ because it's more readable

Now, we solve for the other case if $i = k$ . Not gonna show the work here, the derivation is self-explanatory (unless you failed calc 1). We arrive with the final derivation:

$\frac {\partial h_k}{\partial W_{i, j}} = \begin{cases} 0 & \text{if }i \neq k \\ x_i & \text{if }i = k \end{cases}$

Very cool. We found $\frac {\partial h_k}{\partial W_{i, j}}$ . Let's zoom out a bit and look at $\frac {\partial h}{\partial W_{i, j}}$ . Say we're looking at $W_{0, 0}$ :

$\frac {\partial h}{\partial W_{0, 0}} = \begin{bmatrix} x_0 & 0 & ... & 0 \end{bmatrix}$

Another example. Let's say we're looking at $W_{0, 1}$ :

$\frac {\partial h}{\partial W_{0, 1}} = \begin{bmatrix} 0 & x_0 & ... & 0 \end{bmatrix}$

Essentially, all other values in the matrix where $i \neq k$ is zero. We knew this already though.

Our final Jacobian matrix looks something like this:

$\frac {\partial h}{\partial W} = \begin{bmatrix} \begin{bmatrix} x_0 & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_0 & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_0 \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} x_i & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_i & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_i \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} x_{D_{in}} & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_{D_{in}} & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_{D_{in}} \end{bmatrix} \end{bmatrix}$

TODO

Combining our derivatives together

TODO

Backward pass for multiple examples

TODO

Conclusion

Now we know neural network basic math 😎🎉🎉🎉

Back to basics: Neural network from scratch

Created: Feb 03, 2024

Updated: Feb 03, 2024

Setup

Linear layer

ReLU layer

Softmax layer

Loss function

Minimizing loss

Loss derivative

Softmax derivative

ReLU derivative

Linear derivative

Combining our derivatives together

Backward pass for multiple examples

Conclusion