Holding Your Hand Like a Small Child Through a Neural Network – Part 2
The second of 2 posts expanding upon a now-classic neural network blog post and demonstration, guiding the reader through the workings of a simple neural network.
As we have come to expect, the weight on the first input got larger and the other two got smaller.
Let’s take a look at how the sum of the errors decreases over the first 100 iterations:
Now the first 1000 iterations:
Seems like we hit an “elbow point” around the 100th iteration. Let’s see how this same graph looks over 10,000 iterations:
Even more dramatic. So much of the effort (computational resources for those who don’t like to personify their processors) goes towards decreasing the final error by tiny, tiny amounts.
Last graph, let’s see where we end up after 100,000 iterations:
The value of the error after 10,000 iterations is 0.03182. After 100,000 it is 0.00995, so the error is certainly still decreasing. Still, the graph above makes it easy to argue that the additional training loops are not worth it, since we get most of the way there in just a few hundred iterations.
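For anyone who wants to reproduce the shape of these curves, here is a minimal sketch in plain Python. It assumes a few things the text above does not spell out: the four input rows are the ones used in the dot products later in this post, the targets are 0, 0, 1, 1, the "sum of the errors" is the sum of absolute differences between targets and predictions, and the update is the full-batch rule of error times sigmoid slope. The starting weights and the seed are arbitrary choices of mine, so the exact numbers will not match the ones quoted above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Assumed data: four input rows (third column is all ones) and targets 0, 0, 1, 1.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y = [0, 0, 1, 1]

def train(iterations, seed=1):
    """Train a single-layer network and record the summed absolute error per iteration."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1, 1) for _ in range(3)]  # arbitrary starting point
    errors = []
    for _ in range(iterations):
        # Forward pass: dot product of each row with the weights, then sigmoid.
        preds = [sigmoid(sum(w * x for w, x in zip(weights, row))) for row in X]
        diffs = [t - p for t, p in zip(y, preds)]
        errors.append(sum(abs(d) for d in diffs))
        # Scale each error by the slope of the sigmoid at that prediction.
        deltas = [d * p * (1 - p) for d, p in zip(diffs, preds)]
        # Full-batch update: each weight moves by (its input column) dot (deltas).
        for j in range(3):
            weights[j] += sum(row[j] * dl for row, dl in zip(X, deltas))
    return weights, errors

weights, errors = train(10_000)
for i in (1, 100, 1_000, 10_000):
    print(f"iteration {i:>6}: summed error = {errors[i - 1]:.5f}")
```

Printing the error at a handful of checkpoints is a cheap way to see the elbow without drawing a graph: most of the drop should show up between the first and second checkpoint.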
Where did the weights end up? Great question! Let’s have a peek:
weights (after 100,000 iterations):
[[12.0087]
 [-0.2044]
 [-5.8002]]
Not surprisingly, the first weight has grown to be the largest. What does surprise me is the relatively large magnitude of the weight on the third input (large weights, even if negative, still have an impact on the predictions).
One thing to note is that the inputs corresponding to the third weight are all ones, making it effectively like adding a bias unit to the model. Viewed in that way, it is less surprising to see the large-ish third weight.
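To make the bias comparison concrete, here is a small sketch using the weights reported above. The function names are mine, purely for illustration. It checks that multiplying a column of ones by the third weight gives the same result as dropping that column and adding the third weight as a bias term.

```python
import math

# Weights reported above after 100,000 iterations.
weights = [12.0087, -0.2044, -5.8002]

# Input rows; the third column is always 1.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]

def with_ones_column(row):
    """Three-input view: the constant 1 is treated as an ordinary input."""
    return sum(w * x for w, x in zip(weights, row))

def with_explicit_bias(x1, x2):
    """Two-input view: the third weight is added directly as a bias."""
    w1, w2, bias = weights
    return w1 * x1 + w2 * x2 + bias

matches = [
    math.isclose(with_ones_column(row), with_explicit_bias(row[0], row[1]))
    for row in X
]
print(matches)
```

Seen this way, the third weight is not really competing with the first two. It shifts every dot product down by about 5.8, so a row only gets a high prediction when the first input is on.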
One more time, let’s run through the predictions produced from these weights. We start with the dot product of the weights and the input:
dot product results:
(0 * 12.00) + (0 * -0.20) + (1 * -5.80) = -5.8
(0 * 12.00) + (1 * -0.20) + (1 * -5.80) = -6.0
(1 * 12.00) + (0 * -0.20) + (1 * -5.80) = 6.2
(1 * 12.00) + (1 * -0.20) + (1 * -5.80) = 6.0
Those results make an overwhelming amount of sense. Let’s apply the sigmoid function:
l1 probability prediction (sigmoid):
1/(1+e^-(-5.8)) = 0.003
1/(1+e^-(-6.0)) = 0.002
1/(1+e^-(6.2)) = 0.998
1/(1+e^-(6.0)) = 0.998
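The two steps above, dot product then sigmoid, can be sketched in a few lines of plain Python. This uses the full-precision weights reported earlier rather than the rounded ones, so the dot products differ slightly in the later decimals. The targets of 0, 0, 1, 1 are my assumption about the training labels, and so is the choice to sum absolute differences for the error.

```python
import math

# Weights reported above after 100,000 iterations.
weights = [12.0087, -0.2044, -5.8002]

# Input rows, plus the assumed targets for each row.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y = [0, 0, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(row):
    """Return the dot product and the sigmoid of that dot product for one row."""
    z = sum(w * x for w, x in zip(weights, row))
    return z, sigmoid(z)

results = [predict(row) for row in X]
for row, (z, p) in zip(X, results):
    print(f"{row} -> dot product = {z:+.4f}, prediction = {p:.4f}")

# Summed absolute error between the assumed targets and the predictions.
total_error = sum(abs(t - p) for t, (_, p) in zip(y, results))
print(f"summed error = {total_error:.5f}")
```

Under those assumptions, the summed error comes out close to the 0.00995 quoted earlier for 100,000 iterations.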
Hopefully that makes it a little more obvious why the error is so low. Only took 100,000 tries.
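Why so many tries? One plausible explanation is that the sigmoid flattens out as the dot product grows. This assumes the weight update is proportional to the error times the slope of the sigmoid, as in the post this series expands on. The sketch below prints that slope, along with the resulting update signal for a row whose target is 1, at a few dot-product values. The helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    """Derivative of the sigmoid, written in terms of its output p: p * (1 - p)."""
    p = sigmoid(z)
    return p * (1 - p)

def update_signal(z, target=1):
    """Error times slope: the quantity that scales the weight update for one row."""
    return (target - sigmoid(z)) * sigmoid_slope(z)

signals = []
for z in (0.0, 2.0, 4.0, 6.0):
    signals.append(update_signal(z))
    print(
        f"z = {z:.1f}: prediction = {sigmoid(z):.4f}, "
        f"slope = {sigmoid_slope(z):.6f}, update signal = {signals[-1]:.8f}"
    )
```

Both the error and the slope shrink as the prediction closes in on its target, so their product shrinks even faster. That fits the long, flat tail in the graphs above.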
Jupyter notebook for this article on GitHub.
Stay tuned next time when we add another layer and dive into the details of a more legit backprop example!
Bio: Paul Singman is a freelance data scientist in NYC. Some of his favorite projects are building prediction models for Airbnb listings and Oscars winners (but not both). For more info check out his LinkedIn or reach him via the miracle of email: paulesingman AT gmail DOT com.
Original. Reposted with permission.