How to Build Your Own Logistic Regression Model in Python

A hands-on guide to logistic regression for aspiring data scientists and machine learning engineers.



The name of this algorithm can be a little confusing: the logistic regression machine learning algorithm is for classification tasks, not regression problems. The name ‘regression’ here implies that a linear model is fit in the feature space. The algorithm applies a logistic function to a linear combination of features to predict the outcome of a categorical dependent variable based on predictor variables. In other words, logistic regression estimates the probability of falling into a specific level of the categorical dependent variable given the predictor variables.
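Concretely, the model computes a weighted sum of the predictor variables and passes it through the logistic function to get a probability. Here is a minimal sketch of that idea, where the weights and feature values are made up purely for illustration:

import numpy as np

weights = np.array([0.8, -1.2, 0.5])    # one (hypothetical) weight per feature
x = np.array([1.0, 0.3, 2.0])           # predictor variables for one sample

score = np.dot(weights, x)              # linear combination of features
probability = 1 / (1 + np.exp(-score))  # logistic function
print(probability)                      # a value in (0, 1)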

Suppose you want to predict whether it will rain tomorrow in Toronto. The outcome of the prediction is not a continuous number: there will either be rain or no rain, so linear regression cannot be applied. The outcome variable is one of several categories, and this is where logistic regression helps.

 

Applications of Logistic Regression

 

  • Used in epidemiology to identify risk factors for diseases and to plan preventive measures accordingly.
  • Used to predict whether a candidate will win or lose a political election, or whether a voter will vote for a particular candidate.
  • Used in weather forecasting to predict the probability of rain.
  • Used in credit scoring systems for risk management to predict the probability that an account will default.

 

Environment and tools

 

  1. numpy
  2. matplotlib

 

Where is the code?

 
Without much ado, let’s get started with the code. The complete project on GitHub can be found here.

Let’s start with loading the libraries and dependencies.

import numpy as np
import matplotlib.pyplot as plt


The first function defines the sigmoid activation function. The plot of the sigmoid function looks like this:

[Figure: Sigmoid function]

 

def sigmoid(scores):
    # Map any real-valued score to a probability in (0, 1)
    return 1 / (1 + np.exp(-scores))


Mathematically, the sigmoid function is represented as σ(x) = 1 / (1 + e^(−x)).

The sigmoid function, also called the logistic function, gives an ‘S’-shaped curve that can take any real-valued number and map it to a value between 0 and 1. As the input goes to positive infinity, the predicted y approaches 1, and as the input goes to negative infinity, the predicted y approaches 0. If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 (yes), and if it is less than 0.5, we can classify it as 0 (no).
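As a quick illustration using the sigmoid function defined above (the scores here are made up), thresholding the output at 0.5 turns probabilities into class labels:

scores = np.array([-2.0, 0.0, 3.0])
probs = sigmoid(scores)             # roughly [0.12, 0.5, 0.95]
labels = (probs > 0.5).astype(int)  # [0, 0, 1]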

The next function returns the log-likelihood value. Its parameters are the feature vectors, the target values and the model weights.

The log-likelihood is, as the term suggests, the natural logarithm of the likelihood. The likelihood, in turn, is defined for a sample and a parametric family of distributions (i.e., a set of distributions indexed by a parameter) that could have generated the sample: it is the function that associates to each parameter value the probability of observing the given sample.

def log_likelihood(features, target, weights):
    scores = np.dot(features, weights)  # linear score for each sample
    ll = np.sum(target * scores - np.log(1 + np.exp(scores)))
    return ll
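One caveat worth noting: np.log(1 + np.exp(scores)) can overflow for large positive scores. A numerically stable variant, offered here as an optional refinement rather than part of the original code, uses np.logaddexp, which evaluates log(exp(0) + exp(scores)) without overflow:

def log_likelihood_stable(features, target, weights):
    scores = np.dot(features, weights)
    # log(1 + exp(s)) == np.logaddexp(0, s), computed without overflow
    return np.sum(target * scores - np.logaddexp(0, scores))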


The next function builds the logistic regression model. Its parameters are the feature vectors, the target values, the number of training steps, the learning rate and a flag for adding an intercept, which is set to False by default.

First, the weights are initialized to zeros, one per feature. At each step, the score is calculated as the dot product of the feature matrix and the weight vector, and the prediction is found by applying the sigmoid function to the score. The error is the difference between the target and the prediction. This error is used to find the gradient, which is the dot product of the transposed feature matrix and the error. The new weights are obtained by adding the learning rate multiplied by the gradient to the old weights, i.e., gradient ascent on the log-likelihood.

def logistic_regression(features, target, num_steps, learning_rate, add_intercept=False):
    if add_intercept:
        intercept = np.ones((features.shape[0], 1))
        features = np.hstack((intercept, features))

    weights = np.zeros(features.shape[1])  # one weight per feature (plus intercept)

    for step in range(num_steps):
        scores = np.dot(features, weights)
        predictions = sigmoid(scores)
        output_error_signal = target - predictions
        gradient = np.dot(features.T, output_error_signal)
        weights += learning_rate * gradient  # gradient ascent on the log-likelihood

        if step % 10000 == 0:
            print(log_likelihood(features, target, weights))

    return weights


NumPy’s random module is used to generate random numbers in Python. Seeding it saves the state of the random number generator, so that the same sequence of ‘random’ numbers is reproduced on every execution of the code, whether on the same machine or a different one. The seed value chosen here is 10, with 10,000 data points per class.
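A quick way to see the effect of seeding (a toy check, not part of the tutorial’s pipeline): with the same seed, the same numbers come out on every run.

np.random.seed(10)
print(np.random.rand(3))  # prints the same three numbers on every execution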

The multivariate normal is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix.

np.random.seed(10)
num_observations = 10000

x1 = np.random.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], num_observations)
x2 = np.random.multivariate_normal([1, 4], [[1, 0.5], [0.5, 1]], num_observations)


hstack appends data horizontally, while vstack appends data vertically. Here, vstack stacks the two feature arrays into a single feature matrix, and hstack concatenates the label vectors: zeros for the first class and ones for the second.

simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
simulated_labels = np.hstack((np.zeros(num_observations), np.ones(num_observations)))
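With num_observations set to 10000 per class, the stacked arrays should have 20000 rows, which is easy to verify:

print(simulated_separableish_features.shape)  # (20000, 2)
print(simulated_labels.shape)                 # (20000,)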


Let’s visualize the data by plotting the points with the scatter function, where the alpha blending value is set to 0.3. The blending value can range between 0 (transparent) and 1 (opaque).

plt.figure(figsize=(10, 8))
plt.scatter(simulated_separableish_features[:, 0],
            simulated_separableish_features[:, 1],
            c=simulated_labels, alpha=0.3)
 
plt.show()
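With the data in place, we can actually train the model defined above and check how well it separates the two classes. The step count and learning rate below are illustrative choices, not tuned values:

weights = logistic_regression(simulated_separableish_features,
                              simulated_labels,
                              num_steps=50000,
                              learning_rate=5e-5,
                              add_intercept=True)

# Rebuild the intercept-augmented features and compute training accuracy
final_features = np.hstack((np.ones((simulated_separableish_features.shape[0], 1)),
                            simulated_separableish_features))
predictions = (sigmoid(np.dot(final_features, weights)) > 0.5).astype(int)
print('Accuracy: {:.3f}'.format((predictions == simulated_labels).mean()))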


 

Results

 

[Figure: Classification of sample data points]

 

 

Conclusions

 
To conclude, I demonstrated how to build a logistic regression model from scratch in Python. Logistic regression is a widely used supervised machine learning technique and one of the best tools for statisticians, researchers and data scientists in predictive analytics. It offers several advantages: it is robust, since the independent variables need not have equal variance or a normal distribution; it does not assume a linear relationship between the dependent and independent variables, so it can capture some non-linear effects; and the resulting models are easier to inspect and less complex than many alternatives.

 


 

Before You Go

 
The corresponding source code can be found here.

abhinavsagar/Machine-learning-tutorials

 

 

Contacts

 
If you want to keep updated with my latest articles and projects, follow me on Medium. These are some of my contact details:

Happy reading, happy learning and happy coding.

 
Bio: Abhinav Sagar is a senior year undergrad at VIT Vellore. He is interested in data science, machine learning and their applications to real-world problems.

Original. Reposted with permission.
