What is the Role of the Activation Function in a Neural Network?
Confused as to exactly what the activation function in a neural network does? Read this overview, and check out the handy cheat sheet at the end.
Sorry if this is too trivial, but let me start at the "very beginning": linear regression.
The goal of (ordinary least-squares) linear regression is to find the optimal weights that, when linearly combined with the inputs, result in a model that minimizes the (squared) vertical offsets between the observed target values and the model's predictions. But let's not get distracted by model fitting, which is a different topic ;).
So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function").
net(x) = b + x_1 w_1 + x_2 w_2 + ... + x_n w_n = z
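As a minimal sketch, the net input function can be written in a few lines of NumPy. The function name `net_input` and the example values are assumptions chosen for illustration only.

```python
import numpy as np

def net_input(x, w, b):
    """Linear combination of inputs and weights, plus the bias b."""
    return b + np.dot(x, w)

# Hypothetical inputs, weights, and bias (illustration only)
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

z = net_input(x, w, b)  # b + x_1*w_1 + x_2*w_2
print(z)
```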
Next, let's consider logistic regression. Here, we put the net input z through a non-linear "activation function", the logistic sigmoid function, where φ(z) = 1 / (1 + e^(-z)).
Think of it as "squashing" the linear net input through a non-linear function, which has the nice property that it returns the conditional probability P(y=1 | x) (i.e., the probability that a sample x belongs to class 1).
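To make the "squashing" concrete, here is a small sketch of the logistic sigmoid in its standard form, 1 / (1 + e^(-z)). The sample net input values are arbitrary and only serve as an illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real net input into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary net inputs: strongly negative, zero, strongly positive
z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)  # values close to 0, exactly 0.5, and close to 1
```

Large negative net inputs end up near 0, large positive ones near 1, and z = 0 maps to 0.5.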
Now, if we add a step function, for instance,
- If the sigmoid output is greater than or equal to 0.5, predict class 1, and class 0 otherwise
- (Equivalently: if the net input z is greater than or equal to 0, predict class 1, and class 0 otherwise)
we get a logistic regression classifier.
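The two decision rules above can be sketched side by side to check that they agree. The function names and the sample net inputs are hypothetical and picked only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_from_sigmoid(z):
    """Class 1 if the sigmoid output is >= 0.5, class 0 otherwise."""
    return np.where(sigmoid(z) >= 0.5, 1, 0)

def predict_from_net_input(z):
    """Class 1 if the net input is >= 0, class 0 otherwise."""
    return np.where(z >= 0.0, 1, 0)

# Hypothetical net inputs, including the boundary case z = 0
z = np.array([-2.0, -0.1, 0.0, 0.3, 4.0])
labels_a = predict_from_sigmoid(z)
labels_b = predict_from_net_input(z)
print(labels_a, labels_b)
```

Both rules give the same labels because the sigmoid is monotonically increasing and equals 0.5 exactly at z = 0.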
(Maybe see this one for more details: Sebastian Raschka's answer to What is the probabilistic interpretation of regularized logistic regression? )
However, logistic regression (a generalized linear model) still remains a linear classifier in the sense that its decision surface is linear.
If the classes can be linearly separated, this works fine. However, let's consider a trickier case, where the classes are not linearly separable.
Here, a non-linear classifier may be a better choice, for example, a multi-layer neural network. Below, I trained a simple multi-layer perceptron with 1 hidden layer that consists of 200 of these logistic sigmoid activation functions. Let's see what the decision surface looks like now:
(Note that I may be overfitting a bit, but again, that's a discussion for a separate topic ;))
The architecture of this fully connected, feed-forward neural network looks essentially like this:
In this particular case, we only have 3 units in the input layer (x_0 = 1 for the bias unit, and x_1 and x_2 for the 2 features, respectively); there are 200 of these sigmoid activation functions (a_m) in the hidden layer and 1 sigmoid function in the output layer, whose output is then passed through a unit step function (not shown) to produce the predicted class label ŷ.
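A forward pass through an architecture of this shape can be sketched as follows. This is not the trained network from the figure: the weights are random placeholders, all variable names are hypothetical, and the bias is kept as a separate vector rather than as an explicit x_0 = 1 input unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_features, n_hidden = 2, 200

# Random, untrained weights (assumption: placeholder values only)
W_h = rng.normal(scale=0.1, size=(n_features, n_hidden))
b_h = np.zeros(n_hidden)
w_out = rng.normal(scale=0.1, size=n_hidden)
b_out = 0.0

def forward(X):
    """Input -> 200 sigmoid hidden units -> 1 sigmoid output -> unit step."""
    a_hidden = sigmoid(X @ W_h + b_h)          # shape (n_samples, 200)
    a_out = sigmoid(a_hidden @ w_out + b_out)  # shape (n_samples,)
    y_hat = np.where(a_out >= 0.5, 1, 0)       # predicted class labels
    return a_hidden, a_out, y_hat

# Two hypothetical samples with 2 features each
X = np.array([[0.5, -1.2], [2.0, 0.3]])
a_hidden, a_out, y_hat = forward(X)
print(a_hidden.shape, a_out.shape, y_hat)
```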
To sum it up, the logistic regression classifier has a non-linear activation function, but its net input is a linear combination of the weight coefficients and the inputs, so its decision boundary stays linear; this is why logistic regression is a "generalized" linear model. Now, the role of the activation function in a neural network is to produce a non-linear decision boundary via non-linear combinations of the weighted inputs.
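One way to see why the non-linearity matters: if the hidden layer used the identity function instead of a sigmoid, the two layers would collapse into a single linear model. The sketch below checks this numerically with random, hypothetical weights (not the trained network from this article).

```python
import numpy as np

rng = np.random.default_rng(1)

# Random placeholder weights for a 2 -> 200 -> 1 network (assumption)
W1 = rng.normal(size=(2, 200))
b1 = rng.normal(size=200)
w2 = rng.normal(size=200)
b2 = rng.normal()

X = rng.normal(size=(5, 2))  # 5 hypothetical samples, 2 features

# Two stacked layers with an identity (i.e., no) activation in between
z_two_layers = (X @ W1 + b1) @ w2 + b2

# The same mapping written as one linear model with merged weights
w_eff = W1 @ w2
b_eff = b1 @ w2 + b2
z_one_layer = X @ w_eff + b_eff

print(np.allclose(z_two_layers, z_one_layer))
```

Without a non-linear activation, adding hidden units buys nothing: the net input of the output unit is still a linear function of x_1 and x_2, so the decision boundary stays a straight line.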
(If you are interested, see Sebastian Raschka's answer to What is the best visual explanation for the back propagation algorithm for neural networks? for learning the weights in this case.)
For your convenience, I added a cheat sheet of the most common activation functions below:
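As a rough companion in code, here are a few widely used activation functions sketched in NumPy. Which functions to include is an assumption made for illustration, and the function names are hypothetical.

```python
import numpy as np

def identity(z):
    """Linear activation, as in linear regression."""
    return z

def unit_step(z):
    """Heaviside step: 1 for z >= 0, 0 otherwise."""
    return np.where(z >= 0.0, 1.0, 0.0)

def logistic_sigmoid(z):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent, with outputs in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit: max(0, z)."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
for f in (identity, unit_step, logistic_sigmoid, tanh, relu):
    print(f.__name__, f(z))
```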
Original. Reposted with permission.