What is the Role of the Activation Function in a Neural Network?
Confused as to exactly what the activation function in a neural network does? Read this overview, and check out the handy cheat sheet at the end.
Sorry if this is too trivial, but let me start at the very beginning: linear regression.
The goal of (ordinary least-squares) linear regression is to find the optimal weights that, when linearly combined with the inputs, result in a model that minimizes the vertical offsets between the observed target values and the model's predictions. But let's not get distracted by model fitting, which is a different topic ;).
So, in linear regression, we compute a linear combination of weights and inputs (let's call this function the "net input function").
net(x) = b + x_{1}w_{1} + x_{2}w_{2} + \dots + x_{n}w_{n} = z
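As a minimal sketch, the net input function above can be written in a few lines of plain Python. The function name `net_input` and the example values for `x`, `w`, and `b` are made up for illustration only.

```python
def net_input(x, w, b):
    """Return z = b + x_1*w_1 + ... + x_n*w_n for a single sample."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return b + sum(x_i * w_i for x_i, w_i in zip(x, w))


# Hypothetical example values, chosen only for illustration.
x = [2.0, -1.0, 0.5]   # inputs (features)
w = [0.4, 0.3, -0.2]   # weights
b = 0.1                # bias unit

z = net_input(x, w, b)
print(z)  # approximately 0.5 (0.1 + 0.8 - 0.3 - 0.1)
```

In linear regression, this value z is used directly as the prediction.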
Next, let's consider logistic regression. Here, we put the net input z through a nonlinear "activation function", the logistic sigmoid function, where \phi(z) = 1 / (1 + e^{-z}).
Think of it as "squashing" the linear net input through a nonlinear function, which has the nice property that it returns the conditional probability P(y=1 | x) (i.e., the probability that a sample x belongs to class 1).
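To make the "squashing" concrete, here is a small sketch of the logistic sigmoid, which maps z to 1 / (1 + e^{-z}). The two-branch form is just one common way to avoid overflow for large negative z; it is my choice for this sketch, not something the method requires.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z is negative, so exp(z) is at most 1
    return e / (1.0 + e)


# The output always lies between 0 and 1 and increases with z.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

A net input of 0 maps to a probability of exactly 0.5, which is why 0.5 is the natural threshold for class labels.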
Now, if we add a step function, for instance,
 If the sigmoid output is greater than or equal to 0.5, predict class 1, and class 0 otherwise
 (Equivalently: if the net input z is greater than or equal to 0, predict class 1, and class 0 otherwise)
we get a logistic regression classifier:
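The two thresholding rules above can be sketched as follows. The helper names and the sample net inputs are hypothetical; the point is only that thresholding the sigmoid output at 0.5 and thresholding the net input at 0 give the same labels (up to floating-point rounding extremely close to z = 0).

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_from_probability(z):
    """Class 1 if the sigmoid output is >= 0.5, else class 0."""
    return 1 if sigmoid(z) >= 0.5 else 0


def predict_from_net_input(z):
    """Class 1 if the net input is >= 0, else class 0."""
    return 1 if z >= 0.0 else 0


# Hypothetical net inputs on both sides of the threshold.
zs = [-3.0, -0.1, 0.0, 0.1, 3.0]
labels_from_probability = [predict_from_probability(z) for z in zs]
labels_from_net_input = [predict_from_net_input(z) for z in zs]
print(labels_from_probability)
print(labels_from_net_input)
```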
However, logistic regression (a generalized linear model) still remains a linear classifier in the sense that its decision surface is linear:
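To spell out why the boundary is linear, here is the short derivation, using the net input notation from above. The classifier predicts class 1 exactly when:

```latex
\begin{aligned}
\frac{1}{1 + e^{-z}} \ge 0.5
  &\iff e^{-z} \le 1 \\
  &\iff z \ge 0 \\
  &\iff b + x_{1}w_{1} + x_{2}w_{2} + \dots + x_{n}w_{n} \ge 0
\end{aligned}
```

The boundary between the two classes is therefore the set of points where b + x_{1}w_{1} + \dots + x_{n}w_{n} = 0, which is a hyperplane (a straight line in the case of two features).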
If classes can be linearly separated, this works fine. However, let's consider a trickier case:
Here, a nonlinear classifier may be a better choice, for example, a multilayer neural network. Below, I trained a simple multilayer perceptron with 1 hidden layer that consists of 200 logistic sigmoid activation units. Let's see what the decision surface looks like now:
(note that I may be overfitting a bit, but again, that's a discussion for a separate topic ;))
The architecture of this fully connected, feedforward neural network looks essentially like this:
In this particular case, we have only 3 units in the input layer (x_0 = 1 for the bias unit, and x_1 and x_2 for the 2 features, respectively); there are 200 sigmoid activation units (a_m) in the hidden layer and 1 sigmoid unit in the output layer, whose output is then passed through a unit step function (not shown) to produce the predicted class label \hat{y}.
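A forward pass through a network of this shape can be sketched as below. This is not the trained model from the figures: the weights are random placeholders, the bias is kept as a separate term rather than as an x_0 = 1 input unit (the two are equivalent), and all names are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass: 2 features -> 200 sigmoid units -> 1 sigmoid unit -> step."""
    # Hidden layer: each unit applies the sigmoid to its own net input.
    hidden = [
        sigmoid(b + sum(w_i * x_i for w_i, x_i in zip(w_row, x)))
        for w_row, b in zip(W_hidden, b_hidden)
    ]
    # Output layer: sigmoid of a weighted sum of the hidden activations.
    p = sigmoid(b_out + sum(w * a for w, a in zip(w_out, hidden)))
    # Unit step function turns the probability into a class label.
    label = 1 if p >= 0.5 else 0
    return p, label


# Random placeholder weights (assumption: untrained, for illustration only).
rng = random.Random(0)
n_features, n_hidden = 2, 200
W_hidden = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_hidden)]
b_hidden = [rng.gauss(0, 1) for _ in range(n_hidden)]
w_out = [rng.gauss(0, 1) for _ in range(n_hidden)]
b_out = 0.0

p, y_hat = forward([0.5, -1.2], W_hidden, b_hidden, w_out, b_out)
print(p, y_hat)
```

Training would adjust `W_hidden`, `b_hidden`, `w_out`, and `b_out`; the structure of the forward pass stays the same.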
To sum it up, the logistic regression classifier has a nonlinear activation function, but its net input is still a linear combination of the weight coefficients and the inputs, so its decision boundary is linear; this is why logistic regression is a "generalized" linear model. The role of the activation function in a neural network, by contrast, is to produce a nonlinear decision boundary via nonlinear combinations of the weighted inputs.
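One way to convince yourself that the nonlinearity is what matters: if the hidden units had no activation function at all (that is, the identity), two stacked layers would collapse into a single linear layer. The tiny sketch below checks this numerically; the weights are made-up values, and the helper names are hypothetical.

```python
def linear_layer(x, W, b):
    """Net inputs of one layer with no activation: W x + b."""
    return [b_j + sum(w * x_i for w, x_i in zip(row, x)) for row, b_j in zip(W, b)]


# Made-up weights: 2 inputs -> 2 hidden units -> 1 output unit.
W1 = [[1.0, -2.0], [0.5, 3.0]]
b1 = [0.5, -1.0]
W2 = [[2.0, 1.0]]
b2 = [0.25]


def two_linear_layers(x):
    """Two layers stacked without any activation function in between."""
    return linear_layer(linear_layer(x, W1, b1), W2, b2)[0]


# Equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = [[sum(W2[0][k] * W1[k][j] for k in range(2)) for j in range(2)]]
b = [sum(W2[0][k] * b1[k] for k in range(2)) + b2[0]]


def one_linear_layer(x):
    """The single linear layer that the two stacked layers collapse into."""
    return linear_layer(x, W, b)[0]


x = [1.0, 2.0]
print(two_linear_layers(x), one_linear_layer(x))  # the two values agree
```

With a sigmoid between the layers, no such single-layer rewrite exists, which is what lets the network bend its decision boundary.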
For your convenience, I added a cheat sheet of the most common activation functions below:
