Back to basics: Neural network from scratch

Created: Feb 03, 2024
Updated: Feb 03, 2024

For one of my classes, I had to implement a neural network from scratch. It was a great refresher on the math behind a neural network. That said, knowing the math doesn't make implementing and visualizing it any easier.

"You're good in math bro, you should learn machine learning bro"

cat crazy like me fr

I'm writing this to remind my future self of the intuition behind the math in a neural network, in case I want a refresher, or if I ever have to write one from scratch again (when pigs start flying). I won't be discussing what each layer does or why we use certain layers, per se. I want to show how we derive the forward and backward passes (especially the backward pass).

Setup

The model we'll describe is a classification model. It takes in input(s) and outputs a probability distribution describing how likely the input is to belong to each class. It consists of the following layers, in order:

  1. Linear
  2. ReLU
  3. Softmax

For our loss function, we'll use the negative log likelihood (NLL) function.

PLEASE NOTE that this is for learning purposes. In practice, depending on what you want (say for classification), you may not want ReLU as the last activation before the output layer.

For classification, the cross entropy loss is equivalent to having a softmax layer and using NLL loss. So, you can simplify softmax and NLL loss to be the cross entropy loss function. However, I prefer to keep them separate to make the math easier to grasp.

Typically, most guides describe the input x to the model as a vector with D_{in} input features. However, we usually train over multiple examples (a batch size N). Thus, in this guide, we'll define our input to our model as a matrix with dimensions N \times D_{in}. Each row in the input is one training example.

Similarly, our model output will be an N \times D_{out} matrix, where D_{out} represents the number of output features (the number of classes we can predict).
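For concreteness, here's a minimal NumPy sketch of those shapes. The sizes and the random data are made-up placeholders, not anything special:

```python
import numpy as np

N, D_in, D_out = 4, 3, 2      # batch size, input features, number of classes (made-up sizes)

x = np.random.randn(N, D_in)  # input: each row is one training example
print(x.shape)                # (4, 3); the model's final output will have shape (N, D_out)
```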

With the model's core dimensions defined, let's dive into the specifics of each layer and the forward pass of the neural network.

Note, I'll define variables that are probably not very common in other tutorials. Idc. These variables made sense to me when I was first learning it.

Linear layer

The linear layer is defined as h = Wx + b (though we may rearrange the matrix multiplication to match dimensions). We will use h to represent the output of the linear layer.

I find it more intuitive to do h = xW + b, rather than having to do transposing and all that jazz. So I will do that. The actual order of matrix multiplication will depend on how you define dimensions and whatever.

If we had one training example, our matrix multiplication would look something like this:

\begin{bmatrix} x_0 & ... & x_{D_{in}} \end{bmatrix} \begin{bmatrix} W_{0, 0} & ... & W_{0, D_{out}} \\ ... & ... & ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix} + \begin{bmatrix} b_0 & ... & b_{D_{out}} \end{bmatrix} = \begin{bmatrix} h_0 & ... & h_{D_{out}} \end{bmatrix}

But remember, we typically do training in batches and have multiple examples. Thus, our input x has dimensions N \times D_{in}. Each row is one training example, and we have N examples. Likewise, the output of the layer, h, will be a matrix of dimensions N \times D_{out}. Each row is the output for one example, and we have N examples.

\begin{bmatrix} h_{0, 0} & ... & h_{0, D_{out}} \\ ... \\ h_{N, 0} & ... & h_{N, D_{out}} \end{bmatrix}

Our weights W will have dimensions of D_{in} \times D_{out}. We use the same weights for each of our examples.

\begin{bmatrix} W_{0, 0} & ... & W_{0, D_{out}} \\ ... & W_{i,j} & ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix}

If we do the matrix multiplication xW, we get a resultant matrix of dimensions N \times D_{out}, matching our h. Each row in the output is the result of multiplying the weights with the corresponding input row.

The biases b will have dimensions of N \times D_{out}. Keep in mind that we use the same bias values for all training examples. So really, b is just a vector with D_{out} columns (just like we showed for the one training example), expanded along the other dimension to have N rows, so that we can add our biases to each training example.

\begin{bmatrix} b_0 & ... & b_{D_{out}} \\ ... \\ b_0 & ... & b_{D_{out}} \end{bmatrix}

Putting it together:

\begin{bmatrix} x_{0,0} & ... & x_{0,D_{in}} \\ ... \\ x_{N,0} & ... & x_{N,D_{in}} \end{bmatrix} \begin{bmatrix} W_{0, 0} & ... & W_{0, D_{out}} \\ ... \\ W_{D_{in},0} & ... & W_{D_{in}, D_{out}} \end{bmatrix} + \begin{bmatrix} b_0 & ... & b_{D_{out}} \\ ... \\ b_0 & ... & b_{D_{out}} \end{bmatrix} = \begin{bmatrix} h_{0, 0} & ... & h_{0, D_{out}} \\ ... \\ h_{N, 0} & ... & h_{N, D_{out}} \end{bmatrix}
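As a NumPy sketch (the sizes and random init are assumptions; note that broadcasting handles the "repeat b along N rows" part for us, so b can stay a plain vector in code):

```python
import numpy as np

N, D_in, D_out = 4, 3, 2

x = np.random.randn(N, D_in)              # inputs, one example per row
W = np.random.randn(D_in, D_out) * 0.01   # weights, shared across examples (small random init)
b = np.zeros(D_out)                       # biases; broadcasting adds them to every row

h = x @ W + b      # (N, D_in) @ (D_in, D_out) -> (N, D_out), then add b to each row
print(h.shape)     # (4, 2)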

ReLU layer

ReLU is pretty easy. It's defined as:

r = max(0, h)

Here, h is the output from the linear layer being fed into the input of our ReLU layer. And we'll define r as the output of our ReLU layer.

Essentially, what we're doing is taking every single input h_{i,j} from our input matrix and running it through the ReLU function, i.e. ensuring every value becomes \geq 0.

The dimensions of our input h are N \times D_{out}. ReLU doesn't do any matrix transformations, so it outputs a matrix r of dimensions N \times D_{out} as well.
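In NumPy this is a one-liner applied elementwise (quick sketch with made-up numbers):

```python
import numpy as np

h = np.array([[ 1.5, -0.3],
              [-2.0,  0.7]])    # pretend linear-layer output, shape (N, D_out)

r = np.maximum(0.0, h)          # elementwise: negative values become 0
print(r)                        # [[1.5 0. ] [0.  0.7]]
```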

Softmax layer

Softmax is the final layer of our neural network. Typically, according to Wikipedia, softmax takes in an input vector and outputs a vector of the same dimensions. This output vector represents a probability distribution - in our classification model, it represents the predicted probabilities for each class. For each column S_i in the output, S_i represents the probability that the training example belongs to class i. And since this is a probability distribution, all the values in the output add up to 1.

For one training example, the softmax is defined as:

S_i = \frac{e^{r_i}}{\sum_k^{D_{out}} e^{r_k}}

The input to softmax is a vector r of size 1 \times D_{out}. The output of softmax will be a vector S of size 1 \times D_{out} as well.

S = \begin{bmatrix} S_0 & ... & S_j & ... & S_{D_{out}} \end{bmatrix}

For some math clarification:

\sum_k^{D_{out}} e^{r_k} = e^{r_0} + e^{r_1} + ... + e^{r_{D_{out}}}

\begin{bmatrix} r_0 & ... & r_{D_{out}} \end{bmatrix} \Rightarrow \begin{bmatrix} \frac{e^{r_0}}{\sum_k e^{r_k}} & ... & \frac{e^{r_{D_{out}}}}{\sum_k e^{r_k}} \end{bmatrix}

For example, if softmax returned [0.8, 0.15, 0.05], this means our model predicted that there is an 80% chance that the original input x (not r) belongs to class index 0, and that there's a 5% chance that the input belongs to class index 2.

However, this is just for one training example. To extend this for our model, where we have multiple training examples and r has dimensions N \times D_{out}, we can define our softmax as:

S_{i, j} = \frac{e^{r_{i, j}}}{\sum_k e^{r_{i, k}}}

We are basically taking each row in r (a vector), and outputting another vector, which becomes the corresponding row in our final output S. It ends up giving us an output matrix S of dimensions N \times D_{out}. Each row represents the predicted probability distribution for one example.

For some math clarification:

\sum_k e^{r_{i, k}} = e^{r_{i, 0}} + e^{r_{i, 1}} + ... + e^{r_{i, D_{out}}}

\begin{bmatrix} r_{0, 0} & ... & r_{0, D_{out}} \\ ... \\ r_{N, 0} & ... & r_{N, D_{out}} \end{bmatrix} \Rightarrow \begin{bmatrix} \frac{e^{r_{0, 0}}} {\sum_k e^{r_{0, k}}} & ... & \frac{e^{r_{0, D_{out}}}} {\sum_k e^{r_{0, k}}} \\ \\ ... \\ \\ \frac{e^{r_{N, 0}}} {\sum_k e^{r_{N, k}}} & ... & \frac{e^{r_{N, D_{out}}}} {\sum_k e^{r_{N, k}}} \end{bmatrix}
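Here's a row-wise NumPy sketch. One practical note: I subtract each row's max before exponentiating, a standard numerical-stability trick that doesn't change the result (the math above doesn't show it):

```python
import numpy as np

def softmax_rows(r):
    """Row-wise softmax for an (N, D_out) matrix."""
    shifted = r - r.max(axis=1, keepdims=True)  # subtract each row's max so exp() doesn't overflow
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)     # each row now sums to 1

r = np.array([[2.0, 1.0, 0.1],
              [0.5, 0.5, 3.0]])
S = softmax_rows(r)
print(S.sum(axis=1))                            # [1. 1.]
```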

Loss function

Softmax is our final layer in our model. It returns a predicted probability distribution for each of our training examples. Now, we need to measure how well our model did.

To do that, we will compute the negative log likelihood loss (NLL loss) for each training example. Let's look at only one for now.

S = [0.8, 0.15, 0.05]

Typically, our true probability distribution y will look something like this for classification models:

y = [1, 0, 0]

Interpreting all this, this means that the input x actually belongs to class index 0 (it has a 100% chance, as shown in y). Our model made a prediction that the input has an 80% chance of being in index 0. So it's almost there.

How do we measure how well our model did? We'll calculate a loss value, which is a scalar value. We'll use the NLL loss equation (for one training example):

\mathcal{L}(y, S) = -\sum_c y_c log(S_c)

It basically sums up the log likelihood for each output feature c (each possible class), and takes the negative. Why negative? Because we want to minimize the total loss.

Not gonna explain NLL right now, that's a different story

To expand this for multiple training examples, we basically compute the loss for each training example, and output the average loss for all N examples:

\mathcal{L}(y, S) = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})

Note that here, y is a matrix of size N \times D_{out}, where each row is the true probability distribution for one example.
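A sketch of the batched loss, assuming y is a one-hot matrix the same shape as S (I also clip S away from 0 with a tiny eps so log(0) doesn't blow up, which the formula glosses over):

```python
import numpy as np

def nll_loss(y, S, eps=1e-12):
    """Average negative log likelihood over N examples.
    y and S are both (N, D_out); y is one-hot, S holds predicted probabilities."""
    N = y.shape[0]
    return -np.sum(y * np.log(S + eps)) / N

y = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
S = np.array([[0.8, 0.15, 0.05],
              [0.1, 0.2,  0.7 ]])
print(nll_loss(y, S))   # ~0.29
```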

Minimizing loss

We go forward in our neural network with our initial input x, get predicted distributions S for each training example, and finally calculate a loss value \mathcal{L} for each training example and take the average over the batch.

h = Wx + b

r = max(0, h)

S = \text{Softmax}(r)

\mathcal{L} = - \frac{1}{N} \sum_{i=0}^N \sum_c^{D_{out}} y_{i, c} log(S_{i, c})

Now it's time to update our parameters (our weights and biases). To do so, we'll calculate the gradient of our loss function with respect to our weights or biases for each of our training examples. Then, we'll take the average gradient and update our weights and biases accordingly.

W_{new} = W_{old} - \frac{1}{N} \sum_{i=0}^N \nabla_W \mathcal{L}_i

b_{new} = b_{old} - \frac{1}{N} \sum_{i=0}^N \nabla_b \mathcal{L}_i
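In code, the update step might look like the sketch below. Note I scale the averaged gradient by a learning rate lr; the equations above omit it, but you almost always want one in practice (the values here are placeholders, and dW/db stand in for the gradients we derive next):

```python
import numpy as np

lr = 0.1                    # learning rate (assumed; not shown in the equations above)

W = np.random.randn(3, 2)   # current parameters (placeholder values)
b = np.zeros(2)
dW = np.random.randn(3, 2)  # stand-in for the batch-averaged gradient w.r.t. W
db = np.random.randn(2)     # stand-in for the batch-averaged gradient w.r.t. b

W -= lr * dW                # gradient descent step
b -= lr * db
```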

Let's look at how we'll calculate the gradient for each training example. Recall the chain rule:

\frac {\partial \mathcal{L}}{\partial W} = \frac {\partial \mathcal{L}}{\partial S} \frac {\partial S}{\partial r} \frac {\partial r}{\partial h} \frac {\partial h}{\partial W}

\frac {\partial \mathcal{L}}{\partial b} = \frac {\partial \mathcal{L}}{\partial S} \frac {\partial S}{\partial r} \frac {\partial r}{\partial h} \frac {\partial h}{\partial b}

We'll talk in detail about how we'll derive each of these derivatives for one training example, and how we'll update our parameters with respect to multiple training examples, as we described before.

Loss derivative

Recall our loss function for one training example:

\mathcal{L} = - \sum_c^{D_{out}} y_{c} log(S_{c})

We need to calculate \frac {\partial \mathcal{L}}{\partial S}. But, let's observe what this derivative looks like.

For one training example, S is the input with dimensions of 1 \times D_{out}. Our loss function is simply one scalar value. Thus, we need to compute the following:

\frac {\partial \mathcal{L}}{\partial S} = \begin{bmatrix} \frac {\partial \mathcal{L}}{\partial S_0} & ... & \frac {\partial \mathcal{L}}{\partial S_i} & ... & \frac {\partial \mathcal{L}}{\partial S_{D_{out}}} \end{bmatrix}

Let's try and solve each \frac {\partial \mathcal{L}}{\partial S_i}.

\frac {\partial \mathcal{L}}{\partial S_i} = \frac {\partial}{\partial S_i} (- \sum_c^{D_{out}} y_c log(S_c))

= \frac {\partial}{\partial S_i} (-(y_0 log(S_0) + ... + y_i log(S_i) + ... + y_{D_{out}} log(S_{D_{out}})))

= -(\frac {\partial}{\partial S_{i}} (y_0 log(S_0)) + ... + \frac {\partial}{\partial S_i} (y_i log(S_i)) + ... + \frac {\partial}{\partial S_{i}} (y_{D_{out}} log(S_{D_{out}})))

We only care about taking the derivative with respect to S_{i}. This means that the derivative of all other terms is 0. Thus, we have:

\frac {\partial \mathcal{L}}{\partial S_{i}} = \frac {\partial}{\partial S_{i}} (- y_{i} log(S_{i}))

= -\frac{y_{i}}{S_{i}}

This is the derivative of the loss function with respect to its prediction (softmax) output for one training example. It is a matrix of size 1 \times D_{out}.
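As a sketch, for one example (the small eps is my addition, guarding against division by zero if a predicted probability is exactly 0):

```python
import numpy as np

y = np.array([1.0, 0.0, 0.0])      # true one-hot distribution for one example
S = np.array([0.8, 0.15, 0.05])    # softmax output for that example

dL_dS = -y / (S + 1e-12)           # elementwise -y_i / S_i, shape (D_out,)
print(dL_dS)                       # [-1.25 -0.   -0.  ]
```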

Softmax derivative

For one training example, the output of the softmax function S is size 1 \times D_{out}. The input r to our softmax function has dimensions 1 \times D_{out}. We would like to compute the following:

\frac{\partial S}{\partial r} = \begin{bmatrix} \frac {\partial S}{\partial r_{0}} & ... & \frac {\partial S}{\partial r_i} & ... & \frac {\partial S}{\partial r_{D_{out}}} \end{bmatrix}

Let's try and solve for each \frac {\partial S}{\partial r_{i}}.

As stated before, S has size 1 \times D_{out}. So really, when computing \frac {\partial S}{\partial r_{i}}, we're computing another matrix:

\frac {\partial S}{\partial r_{i}} = \begin{bmatrix} \frac {\partial S_{0}}{\partial r_{i}} & ... & \frac {\partial S_j}{\partial r_i} & ... & \frac {\partial S_{D_{out}}}{\partial r_{i}} \end{bmatrix}

In order for us to compute \frac {\partial S}{\partial r_{i}}, we need to compute how each of the components S_j changes with respect to r_{i}.

In other words, our \frac {\partial S}{\partial r} looks like this:

\frac{\partial S}{\partial r} = \begin{bmatrix} \frac {\partial S_0}{\partial r_{0}} & ... & \frac {\partial S_0}{\partial r_i} & ... & \frac {\partial S_0}{\partial r_{D_{out}}} \\ ... \\ \frac {\partial S_{D_{out}}}{\partial r_{0}} & ... & \frac {\partial S_{D_{out}}}{\partial r_i} & ... & \frac {\partial S_{D_{out}}}{\partial r_{D_{out}}} \end{bmatrix}

This is also called the Jacobian matrix. I ain't a math expert so I'm not gonna explain it, go read it on Wikipedia

Let's compute \frac {\partial S_{j}}{\partial r_{i}}. Recall our softmax equation for one training example:

S_j = \frac{e^{r_j}}{\sum_k e^{r_k}}

I replaced i with j for notation consistency in the current context

We solve for \frac {\partial S_{j}}{\partial r_{i}} using the quotient rule:

\frac {\partial S_{j}}{\partial r_{i}} = \frac { (\frac{\partial}{\partial r_{i}} e^{r_j}) (\sum_k e^{r_k}) - e^{r_j} \frac {\partial}{\partial r_{i}} (\sum_k e^{r_k}) } {(\sum_k e^{r_k})^2}

Observe that \frac {\partial}{\partial r_{i}} e^{r_j} depends on whether i = j.

\frac {\partial}{\partial r_{i}} e^{r_j} = \begin{cases} e^{r_j} & \text{if }i = j\\ 0 & \text{if }i \neq j \end{cases}

This is because in the original equation, the numerator e^{r_j} only depends on r_j. If i \neq j, it's a constant with respect to r_i, so its derivative is zero.

For \frac {\partial}{\partial r_{i}} (\sum_k e^{r_k}), the derivative is the same regardless of j:

\frac {\partial}{\partial r_{i}} (\sum_k e^{r_k}) = \frac {\partial}{\partial r_{i}}(e^{r_0} + ... + e^{r_i} + ... + e^{r_{D_{out}}}) = e^{r_i}

After some labor (that I skipped writing about), we arrive at the result:

\frac {\partial S_{j}}{\partial r_{i}} = \begin{cases} S_j \cdot (1 - S_j) & \text{if } i = j \\ -S_i S_j & \text{if }i \neq j \\ \end{cases}

If you did this by hand, you'll notice that you can substitute S_j back into your derivations. That's what I did, in case you're confused.

Future me, if you don't get this, go back to calc 1 smh

In the end, our matrix looks kinda cool with a neat diagonal:

\frac{\partial S}{\partial r} = \begin{bmatrix} S_0 \cdot (1 - S_0) & ... & -S_0 S_{D_{out}} \\ ... \\ -S_{D_{out}} S_0 & ... & S_{D_{out}} \cdot (1 - S_{D_{out}}) \\ \end{bmatrix}
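The whole Jacobian can be built in one line from the softmax output (a quick sketch; np.outer gives the -S_i S_j terms and the diagonal adds back the extra S_j):

```python
import numpy as np

S = np.array([0.8, 0.15, 0.05])            # softmax output for one example

dS_dr = np.diagflat(S) - np.outer(S, S)    # diagonal: S_j*(1 - S_j); off-diagonal: -S_i*S_j
print(dS_dr)
```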

ReLU derivative

Just like the forward pass, this step is pretty easy. The derivative of ReLU (with respect to each element h in the input matrix) is:

\frac {\partial r}{\partial h} = \begin{cases} 0 & \text{if }h < 0 \\ 1 & \text{if }h \geq 0 \\ \end{cases}

stackoverflow
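In code, that's just a 0/1 mask with the same shape as h (a sketch, using the h \geq 0 \Rightarrow 1 convention above):

```python
import numpy as np

h = np.array([[ 1.5, -0.3],
              [-2.0,  0.0]])

dr_dh = (h >= 0).astype(float)   # 1 where h >= 0, 0 elsewhere
print(dr_dh)                     # [[1. 0.] [0. 1.]]
```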

Linear derivative

Recall the equation for our linear layer:

h = Wx + b

Recall that, for one training example, the output matrix h has dimensions 1 \times D_{out}. The input matrix x has dimensions 1 \times D_{in}, our weights have dimensions D_{in} \times D_{out}, and our biases have dimensions 1 \times D_{out}.

First, let's compute \frac {\partial h}{\partial W}. We want the following:

\frac {\partial h}{\partial W} = \begin{bmatrix} \frac {\partial h}{\partial W_{0, 0}} & ... & \frac {\partial h}{\partial W_{0, j}} & ... & \frac {\partial h}{\partial W_{0, D_{out}}} \\ \\ ... \\ \\ \frac {\partial h}{\partial W_{i, 0}} & ... & \frac {\partial h}{\partial W_{i, j}} & ... & \frac {\partial h}{\partial W_{i, D_{out}}}\\ \\ ... \\ \\ \frac {\partial h}{\partial W_{D_{in}, 0}} & ... & \frac {\partial h}{\partial W_{D_{in}, j}} & ... & \frac {\partial h}{\partial W_{D_{in}, D_{out}}} \end{bmatrix}

We must also compute \frac {\partial h}{\partial W_{i, j}}:

\frac {\partial h}{\partial W_{i, j}} = \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, j}} & ... & \frac {\partial h_k}{\partial W_{i, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, j}} \end{bmatrix}

So, our Jacobian matrix looks like this:

\frac {\partial h}{\partial W} = \begin{bmatrix} \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{0, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{0, D_{out}}} \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{i, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{i, D_{out}}} \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, 0}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, 0}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, j}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, j}} \end{bmatrix} & ... & \begin{bmatrix} \frac {\partial h_0}{\partial W_{D_{in}, D_{out}}} & ... & \frac {\partial h_{D_{out}}}{\partial W_{D_{in}, D_{out}}} \end{bmatrix} \end{bmatrix}

Now, this Jacobian matrix looks a bit weird. It appears to be a 3D tensor. When I first saw this, I had a hard time visualizing it, especially since this is only one training example (extending to multiple examples adds yet another dimension 💀). However, as we continue and simplify stuff, rest assured it will become clear.

Let's continue and compute \frac {\partial h_k}{\partial W_{i, j}}. Using basic matrix multiplication knowledge, we can note the equation for a specific h_k component:

h_k = (\sum_n^{D_{in}} W_{n, k} x_{n}) + b_k

Observe that for computing \frac {\partial h_k}{\partial W_{i, j}}, when j \neq k, the derivative is zero. This is because h_k depends only on the weights W_{n, k} in column k, as noted in the original equation for h_k. Intuitively, it's because during matrix multiplication, only column k in the weights matrix actually affects column k in the output h.

Order of matrix multiplication is xW rather than Wx. As said earlier, this is more intuitive for me, especially when we extend it to multiple examples. I just write Wx because it's more readable.

Now, we solve for the other case, where j = k. Not gonna show the work here, the derivation is self-explanatory (unless you failed calc 1). We arrive at the final result:

\frac {\partial h_k}{\partial W_{i, j}} = \begin{cases} 0 & \text{if }j \neq k \\ x_i & \text{if }j = k \end{cases}

Very cool. We found \frac {\partial h_k}{\partial W_{i, j}}. Let's zoom out a bit and look at \frac {\partial h}{\partial W_{i, j}}. Say we're looking at W_{0, 0}:

\frac {\partial h}{\partial W_{0, 0}} = \begin{bmatrix} x_0 & 0 & ... & 0 \end{bmatrix}

Another example. Let's say we're looking at W_{0, 1}:

\frac {\partial h}{\partial W_{0, 1}} = \begin{bmatrix} 0 & x_0 & ... & 0 \end{bmatrix}

Essentially, all other values in the vector, where j \neq k, are zero. We knew this already though.

Our final Jacobian matrix looks something like this:

\frac {\partial h}{\partial W} = \begin{bmatrix} \begin{bmatrix} x_0 & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_0 & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_0 \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} x_i & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_i & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_i \end{bmatrix} \\ \\ ... \\ \\ \begin{bmatrix} x_{D_{in}} & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_{D_{in}} & ... \end{bmatrix} & ... & \begin{bmatrix} ... & x_{D_{in}} \end{bmatrix} \end{bmatrix}
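If you don't want to take my calc on faith, a finite-difference sketch for one example checks the pattern: nudge a single W[i, j] and only h[j] should move, by roughly x[i] times the nudge (the sizes and indices below are arbitrary):

```python
import numpy as np

D_in, D_out = 3, 2
x = np.random.randn(D_in)
W = np.random.randn(D_in, D_out)
b = np.zeros(D_out)

i, j, eps = 1, 0, 1e-6
h0 = x @ W + b                 # original output

W_bumped = W.copy()
W_bumped[i, j] += eps          # nudge a single weight W[i, j]
h1 = x @ W_bumped + b

print((h1 - h0) / eps)         # ~x[i] at position j, ~0 everywhere else
print(x[i])
```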

TODO

Combining our derivatives together

TODO

Backward pass for multiple examples

TODO

Conclusion

Now we know neural network basic math 😎🎉🎉🎉

chipi chipi chapa chapa cat