Holding Your Hand Like a Small Child Through a Neural Network – Part 1
The first part of this two-part series expands upon a now-classic neural network blog post and demonstration, guiding the reader through the foundational building blocks of a simple neural network.
Calculating Error and Updating Weights
We’ve generated our first predictions. Some were right, some were wrong. Where do we go from here? As I like to say, we didn’t get this far just to get this far. We push forward.
The next step is to see how wrong our predictions were. Before your mind thinks of crazy, complicated ways to do that, I’ll tell you the (simple) answer. Subtraction.
y: [[0] [0] [1] [1]]
l1_error: [[-0.36672394] [-0.27408027] [ 0.53826471] [ 0.64131589]]
The equation to get l1_error is y minus the probability predictions. So for the first value it is: 0 - 0.3667 = -0.3667. Simple, right?
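Here is a minimal NumPy sketch of that subtraction. The names X, syn0 and sigmoid are assumptions on my part, and the starting weights are the pre-update weights quoted later in this article (rounded to eight decimals), so treat it as an illustration rather than the exact notebook code.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Four training examples, three input values each.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# True outputs, one per training example.
y = np.array([[0], [0], [1], [1]])

# Starting weights (assumed: the pre-update weights from this article).
syn0 = np.array([[0.39293837], [-0.42772133], [-0.54629709]])

# Forward pass: a probability prediction for each example.
l1 = sigmoid(X.dot(syn0))

# The error is plain subtraction: truth minus prediction.
l1_error = y - l1
print(l1_error)
```

The printed values should land very close to the l1_error numbers quoted above.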
Unfortunately, it’s going to get a little more complicated from here. But I’ll tell you upfront what our goals are so what we do makes a little more sense.
What we’re trying to do is update the weights so that the next time we make predictions, there is less error.
The first step for this is weighting the l1_error by how confident we were in our guess. Predictions close to 0 or 1 will have small update weights, and predictions closer to 0.5 will get updated more heavily. The mechanism we use to come up with these weights is the derivative of the sigmoid.
sigmoid derivative (update weight): [[ 0.23223749] [ 0.19896028] [ 0.24853581] [ 0.23002982]]
Since all of our l1 predictions were relatively unconfident, the update weights are relatively large. The most confident prediction was that the second training example was not a one (only a 27.41% chance of being a one), so notice that it has the smallest update weight.
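To see the "confident guesses get small updates" idea in numbers, here is a small sketch. The function name sigmoid_derivative is my own choice, and the l1 predictions are recovered from the article's figures as y minus l1_error.

```python
import numpy as np

def sigmoid_derivative(p):
    """Slope of the sigmoid, written in terms of its output p."""
    return p * (1.0 - p)

# A few illustrative predictions, from very sure to completely unsure.
for p in [0.01, 0.27, 0.5, 0.73, 0.99]:
    print(p, sigmoid_derivative(p))

# The four predictions from this walkthrough (y minus l1_error).
l1 = np.array([[0.36672394], [0.27408027], [0.46173529], [0.35868411]])
update_weights = sigmoid_derivative(l1)
print(update_weights)
```

The slope peaks at 0.25 when the prediction is exactly 0.5 and shrinks toward 0 as the prediction approaches 0 or 1, which is why the second example (the most confident guess) gets the smallest update weight.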
The next step is to multiply the l1_errors by these weights, which gives us the following result that Mr Trask calls the l1_delta:
l1 delta (weighted error): [[-0.08516705] [-0.05453109] [ 0.13377806] [ 0.14752178]]
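In code, this weighting step is an element-wise multiplication, not a dot product. A minimal sketch, assuming the l1_error values printed earlier and recomputing the sigmoid derivative from them:

```python
import numpy as np

# Errors quoted earlier in this walkthrough.
l1_error = np.array([[-0.36672394], [-0.27408027], [0.53826471], [0.64131589]])

# Recover the predictions (prediction = y - error), then their sigmoid slope.
y = np.array([[0], [0], [1], [1]])
l1 = y - l1_error
update_weights = l1 * (1 - l1)

# Element-wise product: one weighted error (delta) per training example.
l1_delta = l1_error * update_weights
print(l1_delta)
```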
Now we’ve reached the final step. We are ready to update the weights. Or at least I am.
We update the weights by adding the dot product of the input values and the l1_deltas.
Let’s go through this matrix multiplication manually like we did before.
The input values are a 4×3 matrix. The l1_deltas are a 4×1 matrix. In order to take the dot product, we need the number of columns in the first matrix to equal the number of rows in the second. To make that happen, we take the transpose of the input matrix, making it a 3×4 matrix.
Original inputs: [[0 0 1] [0 1 1] [1 0 1] [1 1 1]]
Transpose of inputs: [[0 0 1 1] [0 1 0 1] [1 1 1 1]] (To take the transpose of a matrix, you turn its rows into columns. Equivalently, you can flip it over the vertical axis and then rotate it 90 degrees counter-clockwise.)
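If the flip-and-rotate picture is hard to visualize, a quick check in NumPy may help. This is only a sketch: X is my name for the input matrix, and np.fliplr plus np.rot90 stand in for the "flip" and "rotate" steps.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# The usual way: rows become columns.
X_T = X.T
print(X_T.shape)  # (3, 4)

# The "flip then rotate" recipe: mirror left-to-right,
# then rotate 90 degrees counter-clockwise.
flipped_and_rotated = np.rot90(np.fliplr(X))
print(np.array_equal(X_T, flipped_and_rotated))  # True
```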
We’re multiplying a 3×4 matrix by a 4×1, so we should end up with a 3×1 matrix. This makes sense since we have 3 weights we need to update. Let’s begin the calculations:
(0 * -.085) + (0 * -.055) + (1 * .134) + (1 * .148) = 0.282
(0 * -.085) + (1 * -.055) + (0 * .134) + (1 * .148) = 0.093
(1 * -.085) + (1 * -.055) + (1 * .134) + (1 * .148) = 0.142
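The same three sums can be checked with a single dot product. In this sketch, X and weight_update are names I am assuming, and the l1_delta values are the ones printed earlier.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

l1_delta = np.array([[-0.08516705], [-0.05453109], [0.13377806], [0.14752178]])

# (3x4) dot (4x1) gives (3x1): one update per weight.
weight_update = X.T.dot(l1_delta)
print(weight_update.shape)
print(weight_update)

# The first row spelled out by hand, mirroring the manual calculation.
first_by_hand = sum(X[i, 0] * l1_delta[i, 0] for i in range(4))
print(first_by_hand)
```

At full precision the first value comes out near 0.2813 rather than 0.282, because the hand calculation rounds each delta to three decimals before adding.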
Okay! The first row must correspond to the update to the first weight, the second row to the second weight, etc. Unsurprisingly, the first weight (corresponding to the first column of inputs that is perfectly correlated with the output) gets updated the most, but let’s better understand why that is.
The first two l1 deltas are negative and the second two are positive. This is because the first two training examples have a true value of 0, and even though our guess was that they were more likely 0 than 1, we weren’t 100% sure. The more we move that guess towards 0, the better the guess will be. The converse logic holds true for the third and fourth inputs which have a true value of 1.
So what this operation does, in a very elegant way, is reward each weight according to how well its input column matches the output. A weight is penalized when its input is 1 while the true value is 0. Inputs that correctly have a 0 don’t get penalized because 0 * the penalty is 0.
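One way to see the reward and penalty per example is to skip the summing and look at each input multiplied by its delta. This is a sketch using NumPy broadcasting; the name contributions is mine.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

l1_delta = np.array([[-0.08516705], [-0.05453109], [0.13377806], [0.14752178]])

# Broadcasting: each row of X is scaled by that example's delta.
# Wherever the input is 0, the contribution is 0.
contributions = X * l1_delta
print(contributions)

# Summing down each column recovers the update for each weight.
print(contributions.sum(axis=0))
```

The first column only picks up the two positive deltas (the examples whose true value is 1), which is why the first weight gets the largest update.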
pre-update weights: [[ 0.39293837] [-0.42772133] [-0.54629709]]
post-update weights: [[ 0.67423821] [-0.33473064] [-0.40469539]]
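Putting the whole section together, one full pass might be sketched like this. The names X, syn0 and sigmoid are assumptions on my part, and the starting weights are typed in from the pre-update values above rather than generated randomly.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0], [0], [1], [1]])

# Pre-update weights, copied from the values quoted above.
syn0 = np.array([[0.39293837], [-0.42772133], [-0.54629709]])

l1 = sigmoid(X.dot(syn0))               # predictions
l1_error = y - l1                       # how wrong we were
l1_delta = l1_error * (l1 * (1 - l1))   # error weighted by sigmoid slope
syn0 = syn0 + X.T.dot(l1_delta)         # nudge the weights

print(syn0)
```

The printed weights should match the post-update weights above to about six decimal places.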
With this process, we go from the original weights to the updated weights. With these updated weights we start the process over again, but stay tuned for next time where we’ll see what happens in the second iteration!
View my handy Jupyter Notebook here.
Bio: Paul Singman is a freelance data scientist in NYC. Some of his favorite projects are building prediction models for Airbnb listings and Oscars winners (but not both). For more info check out his LinkedIn or reach him via the miracle of email: paulesingman AT gmail DOT com.
Original. Reposted with permission.