For one of my classes, I had to implement a neural network from scratch. It was a great refresher on the math behind a neural network. Though, that doesn't make implementing and visualizing the math easier.
"You're good in math bro, you should learn machine learning bro"
I write this to remind my future self on the intuition behind the math in the neural network, in case I want a refresher, or if I ever have to write a neural network from scratch again (when pigs start flying). I won't be discussing what each layer does or why we use certain layers, per se. I wanted to show how we derived the forward and backward passes (especially the backward pass).
Setup
The model we'll describe will be a classification model. It takes in input(s), and outputs a probability distribution describing how likely it is to be a specific class. It will consist of the following layers, in order:
- Linear
- ReLU
- Softmax
For our loss function, we'll use the negative log likelihood (NLL) function.
PLEASE NOTE that this is for learning purposes. In practice, depending on what you want (say for classification), you may not want ReLU as the last layer.
For classification, the cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss to be the cross entropy loss function. However, I prefer to keep them separate to make math easier to grasp.
Typically, most guides describe the input x to the model as a vector with Din input features. However, we usually train over multiple examples (a batch size N). Thus, in this guide, we'll define our input to our model as a matrix with dimensions N×Din. Each row in the input is one training example.
Similarly, our model output will be a N×Dout matrix, where Dout represents the number of output features (the number of classes we can predict).
With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network.
Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning it.
Linear layer
The linear layer is defined as h=Wx+b (though we may rearrange the matrices multiplication to match dimensions). We will use h to represent the output of the linear layer.
I find it more intuitive to do h=xW+b, rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define dimensions and whatever
If we had one training example, our matrix multiplication would look something like this:
[x0...xDin]⎣⎡W0,0...WDin,0.........WDout...WDin,Dout⎦⎤+[b0...bDout]=[h0...hDout]
But remember, we typically do training in batches and have multiple examples. Thus, our input x has dimensions N×Din. Each row is one training example, and we have N examples. Likewise, the output of the layer, h, will be a matrix of dimensions N×Dout. Each row is the output for one example, and we have N examples.
⎣⎡h0,0...hN,0......h0,DouthN,Dout⎦⎤
Our weights W will have dimensions of Din×Dout. We use the same weights for each of our examples.
⎣⎡W0,0...WDin,0...Wi,j...WDout...WDin,Dout⎦⎤
If we do the matrix multiplication Wx, we get a resultant matrix of dimensions N×Dout, matching our h. Each row in the output is the result of multipling weights to the corresponding input row.
The biases b will have dimensions of N×Dout. Keep in mind that we use the same bias values for all training examples. So really, b is just a vector of with Dout columns (just like we showed for the one training example), and expanded along the other dimension to have N rows, so that we can add our biases to each training example.
⎣⎡b0...b0......bDoutbDout⎦⎤
Putting it together:
⎣⎡x0,0...xN,0......x0,DinxN,Din⎦⎤⎣⎡W0,0...WDin,0......WDoutWDin,Dout⎦⎤+⎣⎡b0...b0......bDoutbDout⎦⎤=⎣⎡h0,0...hN,0......h0,DouthN,Dout⎦⎤
ReLU layer
ReLU is pretty easy. It's defined as:
r=max(0,h)
Here, h is the output from the linear layer being fed into the input of our ReLU layer. And we'll define r as the output of our ReLU layer.
Essentially, what we're doing is taking every single input hi,j from our input matrix and running it through our ReLU function, e.g. ensuring every value becomes ≥0.
The dimensions of our input h is N×Dout. ReLU doesn't do any matrix transformations, so it outputs a matrix r of dimensions N×Dout also.
Softmax layer
Softmax is the final layer of our neural network. Typically, according to Wikipedia, softmax takes in an input vector and outputs a vector of same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. For each column Si, in the output, Si represents the probability that the training example is part of class i. And since this is a probability distribution, all the values in the output add up to 1.
For one training example, the softmax is defined as:
Si=∑kDouterkeri
The input to softmax is a vector r of size 1×Dout. The output of softmax will be a vector S of size 1×Dout as well.
S=[S0...Sj...SDout]
For some math clarification:
k∑Douterk=er0+er1+...+erDout
[r0...rDout]⇒[∑kerker0...∑kerkerDout]
For example, if softmax returned [0.8,0.15,0.05], this means our model predicted that there is an 80% chance that the original input x (not r) belongs to class index 0, and that there's a 5% chance that the input belongs to class index 2.
However, this is just for one training example. To extend this for our model, where we have multiple training examples and r has dimensions N×Dout, we can define our softmax as:
Si,j=∑keri,keri,j
We are basically taking each row in r (a vector), and outputting another vector which is another row in our final output S. It ends up giving us an output matrix S of dimensions N×Dout. Each row represents the predicted probability distribution for one example.
For some math clarification:
k∑eri,k=eri,0+eri,1+...+eri,Dout
⎣⎡r0,0...rN,0......r0,DoutrN,Dout⎦⎤⇒⎣⎢⎢⎢⎢⎢⎢⎢⎡∑ker0,ker0,0...∑kerN,kerN,0......∑ker0,ker0,Dout∑kerN,ker0,Dout⎦⎥⎥⎥⎥⎥⎥⎥⎤
Loss function
Softmax is our final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.
To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.
S=[0.8,0.15,0.05]
Typically, our true probability distribution y will look something like this for classification models:
y=[1,0,0]
Interpreting all this, this means that the input x actually belongs class index 0 (it has 100% chance as shown in y). Our model made a prediction that the input has an 80% chance of being in index 0. So it's almost there.
How do we measure how well our model did? We'll calculate a loss value, which is a scalar value. We'll use the NLL loss equation (for one training example):
L(y,S)=−c∑yclog(Sc)
It basically sums up the log likelihood for each output feature c (each possible class), and takes the negative. Why negative? Because we want to minimize the total loss.
Not gonna explain NLL right now, that's a different story
To expand this for multiple training examples, we basically compute the loss for each training example, and output the average loss for all N examples:
L(y,S)=−N1i=0∑Nc∑Doutyi,clog(Si,c)
Note that here, y is a matrix of size N×Dout, where each row is the true probability distribution for one example.
Minimizing loss
We go forward in our neural network with our initial input x, get predicted distributions S for each traing example, and finally calculate a final loss value L for each training example, and take the average over the batch.
h=Wx+b
r=max(0,h)
S=Softmax(r)
L=−N1i=0∑Nc∑Doutyi,clog(Si,c)
Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the gradient of our loss function with respect to our weights or biases for each of out training examples. Then, we'll take the average gradient and update our weights and biases accordingly.
Wnew=Wold−N1i=0∑N∇WLi
bnew=bold−N1i=0∑N∇bLi
Let's look at how we'll calculate the gradient for each training example. Recall chain rule:
∂W∂L=∂S∂L∂r∂S∂h∂r∂W∂h
∂b∂L=∂S∂L∂r∂S∂h∂r∂b∂h
We'll talk in detail about how we'll derive each of these deriviates for one training example, and how we'll update our parameters with respect to multiple training examples, as we described before.
Loss derivative
Recall our loss function for one training example:
L=−c∑Doutyclog(Sc)
We need to calculate ∂S∂L. But, let's observe what this derivative looks like.
For one training example, S is the input with dimensions of 1×Dout. Our loss function is simply one scalar value. Thus, we need to compute the following:
∂S∂L=⎣⎡∂S0∂L...∂Si∂L...∂SDout∂L⎦⎤
Let's try and solve each ∂Si∂L.
∂Si∂L=∂Si∂L(−c∑Doutyclog(Sc))
=∂Si∂L(−(y0log(S0)+...+yilog(Si)+...+yDoutlog(SDout)))
=−(∂Si∂L(y0log(S0))+...+∂Si∂L(yilog(Si))+...+∂Si∂L(yDoutlog(SDout)))
We only care about taking the derivative with respect to Si. This means that the derivative of all other terms is 0. Thus, we have:
∂Si∂L=∂Si∂L(−yilog(Si))
=−Siyi
This is the derivative of the loss function with respect to its prediction (softmax) output for one training example. It is a matrix of size 1×Dout.
Softmax derivative
For one training example, the output of the softmax function S is size 1×Dout. The input r to our softmax function has dimensions 1×Dout. We would like to compute the following:
∂r∂S=[∂r0∂S...∂ri∂S...∂rDout∂S]
Let's try and solve for each ∂ri∂S.
As stated before, S has size 1×Dout. So really, when computing ∂ri∂S, we're computing another matrix:
∂ri∂S=[∂ri∂S0...∂ri∂Sj...∂ri∂SDout]
In order for us to compute ∂ri∂S, we need to compute how each of the components Sj changes with respect to ri.
In other words, our ∂r∂S looks like this:
∂r∂S=⎣⎢⎢⎢⎡∂r0∂S0...∂r0∂SDout......∂ri∂S0∂ri∂SDout......∂rDout∂S0∂rDout∂SDout⎦⎥⎥⎥⎤
This is also called the Jacobian matrix. I ain't a math expert so I'm not gonna explain it, go read it on Wikipedia
Let's compute ∂ri∂Sj. Recall our softmax equation for one training example:
Sj=∑kerkerj
I replaced i with j for notation consistency in the current context
We solve for ∂ri∂Sj. We use quotient rule:
∂ri∂Sj=(∑kerk)2(∂ri∂Sjeri)(∑kerk)−eri∂ri∂Sj(∑kerk)
Observe that ∂ri∂Sjeri can depend on whether i=j.
∂ri∂Sjeri={eri0if i=jif i≠j
This is because in the original equation, Sj actually only depends on erj. Otherwise, its derivative is zero.
For ∂ri∂Sj(∑kerk), its derivative is the same regardless of i or j:
∂ri∂Sj(k∑erk)=∂ri∂Sj(er0+...+eri+...+erDout)={erjerjif i=jif i≠j
After some labor (that I skipped writing about), we arrive to the derivation:
∂ri∂Sj=⎩⎪⎨⎪⎧Sj⋅(1−Sj)−SiSjif i=jif i≠j
If you did this by hand, you'll notice that you can substitute Sj into your derivations. That's what I did, in case you're confused.
Future me, if you don't get this, go back to calc 1 smh
In the end, our matrix looks kinda cool with a neat diagonal:
∂r∂S=⎣⎢⎢⎡S0⋅(1−S0)...−SDoutS0......−S0SDoutSDout⋅(1−SDout)⎦⎥⎥⎤
ReLU derivative
Just like forward pass, this step is pretty easy. The derivative of ReLU (with respect to each h element in the input matrix) is:
∂h∂r=⎩⎪⎨⎪⎧01if h<0if h≥0
stackoverflow
Linear derivative
Recall the equation for our linear layer:
h=Wx+b
Recall that, for one training example, the output matrix h has dimensions 1×Dout. The input matrix x has dimensions 1×Din, our weights have dimensions Din×Dout, and our biases have dimensions 1×Dout.
First, let's compute ∂W∂h. We want the following:
∂W∂h=⎣⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎡∂W0,0∂h...∂Wi,0∂h...∂WDin,0∂h.........∂W0,j∂h∂Wi,j∂h∂WDin,j∂h.........∂W0,Dout∂h∂Wj,Dout∂h∂WDin,Dout∂h⎦⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎤
We must also compute ∂Wi,j∂h:
∂Wi,j∂h=[∂Wi,j∂h0...∂Wi,j∂hk...∂Wi,j∂hDout]
So, our Jacobian matrix looks like this:
∂W∂h=⎣⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎡[∂W0,0∂h0...∂W0,0∂hDout]...[∂Wi,0∂h0...∂Wi,0∂hDout]...[∂WDin,0∂h0...∂WDin,0∂hDout].........[∂W0,j∂h0...∂W0,j∂hDout][∂Wi,j∂h0...∂Wi,j∂hDout][∂WDin,j∂h0...∂WDin,j∂hDout].........[∂W0,Dout∂h0...∂W0,Dout∂hDout][∂Wi,Dout∂h0...∂Wi,Dout∂hDout][∂WDin,Dout∂h0...∂WDin,Dout∂hDout]⎦⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎤
Now, this Jacobian matrix looks a bit weird. It appears to be a 3D matrix. When I first saw this, I had a hard time visualizing this, especially since this is only one training example (which adds another dimension to our matrix 💀). However, as we continue and simplify stuff, rest assured it will be clear.
Let's continue and compute ∂Wi,j∂hk. Using basic matrix multiplication knowledge, we can note the equation for a specific hk component:
hk=(n∑DinWn,kxn)+bk
Observe that for computing ∂Wi,j∂hk, when i≠k, then the derivative is zero. This is because the hk depends on only Wn,k, as noted in the origianl equation of hk. Intuitively, it's because during matrix multiplication, only column k in the weights matrix actually affect column k in the final matrix $h.
Order of matrix multiplication is xW rather than Wx. As said earlier, this is more intuitive for me, especially when we extend it to multiple examples. I just write Wx because it's more readable
Now, we solve for the other case if i=k. Not gonna show the work here, the derivation is self-explanatory (unless you failed calc 1). We arrive with the final derivation:
∂Wi,j∂hk={0xiif i≠kif i=k
Very cool. We found ∂Wi,j∂hk. Let's zoom out a bit and look at ∂Wi,j∂h. Say we're looking at W0,0:
∂W0,0∂h=[x00...0]
Another example. Let's say we're looking at W0,1:
∂W0,1∂h=[0x0...0]
Essentially, all other values in the matrix where i≠k is zero. We knew this already though.
Our final Jacobian matrix looks something like this:
∂W∂h=⎣⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎡[x0...]...[xi...]...[xDin...].........[...x0...][...xi...][...xDin...].........[...x0][...xi][...xDin]⎦⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎤
TODO
Combining our derivatives together
TODO
Backward pass for multiple examples
TODO
Conclusion
Now we know neural network basic math 😎🎉🎉🎉