
# A Primer on Logistic Regression – Part I

Gain an understanding of logistic regression - what it is, and when and how to use it - in this post.

Since Linear Regression requires the response to be measured on a continuous scale, let’s recode Age into a new categorical variable, AgeGroup, with 6 categories. Then we compute the Churn percentage (non-churning users in an AgeGroup / total users in that AgeGroup) for each AgeGroup and plot these percentages against AgeGroup. The result is a smooth curve that fits well between the bounds 0 and 1. So Age, once recoded to AgeGroup, can be summarized and fitted quite well by a non-linear curve rather than by a linear trend line, as visualized below:

The above scatterplot illustrates that the relationship between AgeGroup and Churn resembles an S-shaped curve. Here we have only 6 age groups, so the curve only roughly looks like an S. With more data and many more age groups, we would get a clear, smooth S-curve. The S-curve does a better job of “fitting” these data points than the best-fit straight line (obtained by Linear Regression).
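The recoding and per-group percentage described above can be sketched in a few lines of Python. This is only an illustration: the records, the bin start (18) and the bin width (12 years) are made-up assumptions, chosen so that one bin matches the “42-53” group mentioned later; they are not the article's dataset.

```python
from collections import defaultdict

# Hypothetical records: (age, outcome), where 1 = non-churn and 0 = churn.
records = [(23, 0), (27, 0), (35, 0), (38, 1), (45, 1), (50, 1), (51, 0), (60, 1)]

def age_group(age, start=18, width=12):
    """Map an age to a bin label such as '42-53' (bin edges are assumed)."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

# group label -> [number of non-churners, total users]
counts = defaultdict(lambda: [0, 0])
for age, outcome in records:
    group = age_group(age)
    counts[group][0] += outcome
    counts[group][1] += 1

# Share of non-churning users per AgeGroup (the "Churn percentage" of the text).
non_churn_share = {g: kept / total for g, (kept, total) in counts.items()}

for group in sorted(non_churn_share):
    print(group, round(non_churn_share[group], 2))
```

Plotting `non_churn_share` against the group labels is what would produce the scatterplot discussed above.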

Now the question arises: does this fitted curve predict Churn values as discrete outcomes (‘0’ or ‘1’)?

To answer this question, let’s assume a cut-off point, say 0.5, to differentiate users’ behavior and estimate their tendency to churn. Now let’s define the rule to achieve this:

• Above 0.5, the user’s response is recorded as Non-churn (‘1’), and
• Below 0.5, the user’s response is recorded as Churn (‘0’).

This rule ensures that the response values are only ever ‘0’ or ‘1’, which is in sync with the nature of the Churn variable.
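As a minimal sketch, the cut-off rule can be written as a small Python function. The function name `classify` and the sample values are hypothetical; the text does not say what happens at exactly 0.5, so treating that case as Churn is an assumption made here.

```python
def classify(p, cutoff=0.5):
    """Return 1 (non-churn) when p is above the cut-off, else 0 (churn).

    Assumption: a value exactly equal to the cut-off is treated as churn.
    """
    return 1 if p > cutoff else 0

# Hypothetical fitted values for three users.
labels = [classify(p) for p in (0.2, 0.7, 0.9)]
print(labels)  # prints [0, 1, 1]
```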

But this method is difficult in practice, because determining the ideal cut-off points to arrive at the AgeGroups is tedious and time-consuming. Hence, we need a methodology that handles these issues of the data more easily.

Now, in view of the Churn-Age dataset, let’s define some more terms that are frequently encountered, discuss how these terms are related to one another and how they are useful for logistic regression.

Probability is the quantitative representation of the chance that an event will occur. For instance, let’s define an event as “retaining the user for a particular AgeGroup”, that is, obtaining the desired outcome value “1”, which signifies “No Churn”. Thus, the Churn Percentage that we computed above is actually the chance (probability) of a success for the event defined here.

In short, probability is the number of times the event “occurs” divided by the total number of times the event “could occur”.

Odds are always stated in relation to an event happening. We often ask questions like: what are the odds that “It will rain today” or that “Team A will win”? Odds are defined as the ratio of the chance of the event happening to the chance of the event not happening.

For instance, in our Churn-Age example, for the AgeGroup “42-53” years, the probability of the event “user not churning” is 14/20 = 0.7 and the probability of the event “user churning” is (1 - 0.7) = 0.3. Hence the odds of a user not churning, for individuals in the “42-53” group, are 0.7 / 0.3 = 2.33. That is, a user aged 42-53 is a little more than twice as likely to stay as to churn.

Note that odds are computed from probability, and can be converted back into probability, as

$\textsl{Odds} = \frac{\textsl{Probability of Success}}{\textsl{Probability of Failure}} = \frac{\textup{p}}{\textup{1 - p}}$

$\Rightarrow \textup{Probability} = \frac{\textup{odds}}{\textup{1 + odds}}$
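The two conversions can be checked with a short Python sketch using the figures for the “42-53” AgeGroup (14 of 20 users did not churn). The function names are hypothetical, chosen only for this illustration.

```python
def odds_from_probability(p):
    """Odds = p / (1 - p), defined for 0 <= p < 1."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability = odds / (1 + odds), defined for odds >= 0."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

# Figures from the text: 14 of 20 users aged 42-53 did not churn.
p_stay = 14 / 20
odds_stay = odds_from_probability(p_stay)   # odds of not churning
p_back = probability_from_odds(odds_stay)   # round trip back to probability

print(round(odds_stay, 2), round(p_back, 2))  # prints 2.33 0.7
```

The round trip returns the original probability, which is why either quantity can be used to describe the same event.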

In everyday language, probability and odds are used interchangeably. In statistics, however, they are different quantities.

The dataset with these terms, which forms the basis for building the Logistic Regression model, is displayed below.

As in the above table, we have moved from Probability to Odds to Log of odds. But why do we need to take all the trouble to do the transformation to Log odds?

Probability ranges from 0 to 1, whereas Odds range from 0 to +∞. The odds increase as the probability increases and decrease as it decreases. For example, for the high probabilities p = 0.9, 0.999 and 0.9999, the Odds are 9, 999 and 9999 respectively.

Usually, it is difficult to model a variable that has a restricted range, such as probability, which is bounded between 0 and 1. Hence, we model the odds of the event instead, which removes the upper bound.

But why the Log transformation? There are two reasons for selecting “Log”:

• Log transformation removes the remaining lower bound of Odds at 0, so Log odds can take any value from -∞ to +∞, and it compresses the wide spread of Odds values into a manageable range. For instance, in our Churn-Age dataset, if the probability were 0.99, the odds would be 99, which is very high, while the (natural) log odds would be only 4.5951.
• Log transformation maintains the inherent nature of odds. That is, the greater the odds, the greater the log of odds and vice versa.
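To make the three scales concrete, the sketch below tabulates probability, odds and log odds for a few values, including the probabilities quoted above. The natural logarithm is assumed, since it reproduces the 4.5951 figure for p = 0.99; the helper name `log_odds` is hypothetical.

```python
import math

def log_odds(p):
    """Natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

rows = []
for p in (0.1, 0.5, 0.9, 0.99, 0.999, 0.9999):
    odds = p / (1 - p)
    rows.append((p, round(odds, 2), round(log_odds(p), 4)))

for p, odds, logit in rows:
    print(f"p={p:<7} odds={odds:<8} log odds={logit}")
```

The odds grow from 9 to 9999 as p moves from 0.9 to 0.9999, while the log odds stay within single digits. Probabilities below 0.5 give negative log odds, so the log scale has no lower bound either.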

How to build a logistic regression model with Log odds?

Let ‘y’ be the outcome variable (say, User Churn), coded 0 / 1 for failure / success, and let ‘p’ be the probability that y equals 1, i.e. p = Prob(y = 1). Let x1, x2, …, xk be a set of predictor variables (say, User Age, Gender and so on).

Then, the logistic regression model equation of y given x1, say User Churn given Age, can be written as below:

$\textup{log}\left ( \textup{odds} \right ) = \textup{log}\left (\frac{\textup{p}}{\textup{1 - p}} \right ) = \left ( \emph{a} + b_{1}x_{1} \right )$

The above model equation expresses the log-odds of an event as a linear function of its predictors.

But to arrive at an outcome value of 0 or 1, we first need the probability p back. It can be recovered by exponentiating both sides of the above equation, as shown below:

$exp\left ( \textsl{log}\left (\frac{\textup{p}}{\textup{1 - p}} \right ) \right ) = exp\left ( \emph{a} + b_{1}x_{1}\right )$

$\Rightarrow \frac{\textup{p}}{\textup{1 - p}} = exp\left ( \emph{a} + b_{1}x_{1}\right ) \textup{ [Since exp (log x) = x]}$

Solving the above equation for p (multiply both sides by 1 - p, then collect the terms containing p on one side), we get

$\emph{p} = \frac{exp\left ( \emph{a} + b_{1}x_{1} \right )}{1 + exp\left ( \emph{a} + b_{1}x_{1} \right )}$

The above function of ‘p’ is known as the “sigmoid” (or logistic) function and its plot (S-shaped curve) is as below:

Thus, with this simple example of User Churn-Age dataset, we could decipher the intuition behind the Math of Logistic Regression.
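The formula for p derived above can be sketched in Python as follows. The coefficients `a = -4.0` and `b1 = 0.1` are hypothetical values picked only to show the S-shape; they are not estimates from the Churn-Age data. The function is written in two branches purely to avoid overflow in `exp` for large inputs, and both branches equal exp(z) / (1 + exp(z)).

```python
import math

def sigmoid(z):
    """p = exp(z) / (1 + exp(z)), computed without overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x1, a, b1):
    """Probability that y = 1 given predictor x1 and coefficients a, b1."""
    return sigmoid(a + b1 * x1)

# Hypothetical coefficients, for illustration only.
a, b1 = -4.0, 0.1

# Tabulate the curve for ages 20, 30, ..., 80.
curve = [(age, round(predict_probability(age, a, b1), 3)) for age in range(20, 81, 10)]
for age, p in curve:
    print(age, p)

# Taking the log odds of a predicted probability recovers a + b1 * x1.
p = predict_probability(55, a, b1)
print(round(math.log(p / (1 - p)), 6))
```

With these assumed coefficients the predicted probability is exactly 0.5 at age 40 (where a + b1·x1 = 0) and moves toward 0 and 1 on either side, which is the S-shape described in the text.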

Closing Thoughts

Logistic Regression is simply an extension of the linear regression model, so the basic idea of prediction is the same as that of Multiple Regression Analysis. But, unlike the multiple regression model, the logistic regression model is designed for response variables that have a finite number of outcomes.

Although logistic regression does contain a few complexities and new statistical concepts, it is within reach of anyone who can use linear models. Like Linear Regression, the Logistic Regression model provides the conceptual foundation for more sophisticated statistical and machine learning approaches.

In the next part, we will delve deeper into the Assumptions, Model Interpretation and Evaluation of Logistic Regression Model.