A Primer on Logistic Regression – Part I
Gain an understanding of logistic regression - what it is, and when and how to use it - in this post.
By Pushpa Makhija, CleverTap.
In the real world, we often come across scenarios which requires to make decisions that result into finite outcomes, like the below examples,
- Will it rain today?
- Will I reach office on time today?
- Would a child graduate from his/her university?
- Does sedentary lifestyle increase the chances to get the heart disease?
- Does smoking lead to lung cancer?
- Would I wear blue, black, red outfit today?
- What grade a student would get in an exam?
All the above situations do reflect the input-output relationships. Here the output variable values are discrete & finite rather than continuous & infinite values like in Linear Regression. How could we model and analyze such data?
We could try to frame a rule which helps in guessing the outcome from the input variables. This is called a classification problem, and is an important topic in statistics and machine learning. Classification, a task of assigning objects to one of the several predefined categories, is a pervasive problem that encompasses many diverse applications in a broad array of domains. Some examples of Classification Tasks are listed below:
- In medical field, the classification task could be assigning a diagnosis to a given patient as described by observed characteristics of the patient such as age, gender, blood pressure, body mass index, presence or absence of certain symptoms, etc.
- In banking sector, one may want to categorize hundreds or thousands of applications for new cards containing information for several attributes such as annual salary, outstanding debts, age etc., into users who have good credit or bad credit for enabling a credit card company to do further analysis for decision making; OR one might want to learn to predict whether a particular credit card charge is legitimate or fraudulent.
- In social sciences, we may be interested to predict the preference of a voter for a party based on – age, income, sex, race, residence state, votes in previous elections etc.
- In finance sector, one would require to ascertain “whether a vendor is credit worthy”?
- In insurance domain, the company will need to assess “Is the submitted claim fraudulent or genuine”?
- In Marketing, the marketer would like to figure out “Which segment of consumers are likely to buy”?
Mostly, in the business world, Classification problems where the response or dependent variable have discrete and finite outcomes, are more prevalent than the Regression problems where the response variable is continuous and have infinite values. Logistic Regression is one of the most common algorithm used for modeling classification problems.
Why do we need Logistic Regression?
In case of Linear Regression Model, the predicted outcome of the dependent variable will always be a real value which could range from –ꝏ to +ꝏ. But unlike Regression problems, in Classification problems, the outcome value is either 0 or 1 or any other discrete value, as we saw in the above listed common examples. Now the question that arises is how to ensure that we only get the predicted outcome value as 0 or 1 after fitting a linear model for Classification problems.
To answer this question, let’s consider a couple of cases involving two variables:
Case I: Y is a linear function of X say
Case II: P is a non-linear function, say, an exponential function of R i.e. mathematically expressed as
While Y is a linear function of X, can we instead make some transformation to express P as some linear function of R? The appropriate transformation that could be applied in Case II is “Log” transformation.
Applying log on both sides of Case II equation, we get
Let’s leverage our understanding of Linear Regression Analysis, by using some trick – like some transformation to solve Case II scenarios, to develop methodology for solving Classification type of Problems.
So, what’s Logistic Regression?
Logistic Regression is a type of predictive model to describe the data and to explain the relationship between the dependent variable (having 2 or more finite outcomes) and a set of categorical and/or continuous explanatory / independent variables.
Suppose you have a dataset of 170 users containing the user’s age and whether or not the user tend to churn out from an app. The goal is to predict the tendency of the user to churn for a particular user’s age. The subset of 10 rows of this dataset and its summary is displayed below:
The above dataset can be visualized graphically as below:
It appears in the above graph that the target outcome variable, Churn, has a value 1 if the user is active / not churned on the app and 0 if the user churns out from the app.
Can we simply use Linear Regression to describe this relationship?
Let’s try to determine the best fit line by applying linear regression model, which is depicted in the following plot:
Upon inspecting the above graph, you notice that some things do not make sense as mentioned below:
- There are no limits on the values predicted by linear regression. So the predicted response value for user behavior – Churn might be either
- between 0 and 1 or
- less than 0 or
- greater than 1
Such values are not possible with our outcome variable (Churn) as they fall outside its legitimate and meaningful range of 0 to 1, inclusive. That implies, the line does a poor job of “describing” this kind of data.
- The response – User’s churn usually is not a linear function of the user’s age.
Then the question arises: How to overcome these problems of Linear regression model? Is it possible to then model churn by some other technique?