A Brief Primer on Linear Regression – Part 1
This introduction to linear regression discusses a simple linear regression model with one predictor variable, and then extends it to the multiple linear regression model with at least two predictors.
What is Simple Linear Regression?
Simple linear regression is a statistical technique for summarizing and studying the relationship between two continuous (i.e., numeric) variables:
- The variable we are predicting is called the criterion, response, or dependent variable, and
- The variable we are basing our predictions on is called the predictor, explanatory, or independent variable.
Simple linear regression gets its adjective ‘simple’ because it concerns the study of only one predictor variable.
For example, the height-weight information of 100 randomly selected people, aged between 20 and 60, can be summarized by an equation or model that treats weight as the response variable and height as the single predictor variable. Here, the inherent assumption, though quite unrealistic, is that “weight” can be explained by a single attribute: height. The model to fit this data could be written as
Weight (continuous) ~ Height (continuous)
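As a rough illustration of fitting such a one-predictor model, the sketch below computes the least-squares line by hand. The height and weight values are invented for this example (they are not the article's data), and the helper name fit_simple_regression is hypothetical.

```python
# Hypothetical height (cm) and weight (kg) values, invented for illustration only.
heights = [150.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0, 190.0]
weights = [52.0, 58.0, 63.0, 66.0, 70.0, 75.0, 80.0, 84.0]


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = (sum of cross-deviations) / (sum of squared deviations of x).
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_regression(heights, weights)
print(f"Weight = {intercept:.2f} + {slope:.2f} * Height")

# Predicted weight for a hypothetical person who is 172 cm tall.
print(f"Predicted weight at 172 cm: {intercept + slope * 172.0:.1f} kg")
```

The closed-form calculation is used here only to keep the sketch dependency-free; in practice a statistics library would normally do this step.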
In contrast, multiple linear regression gets its adjective ‘multiple’ because it concerns the study of two or more predictor variables.
Extending our classic height-weight example, we include other predictor variables, such as calorie intake and exercise level, that would also affect a person’s weight. The model to fit this data could be written as
Weight (continuous) ~ Height (continuous) + Calorie Intake (continuous) + Exercise Level (categorical)
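One possible way to fit a model of this shape is sketched below with NumPy's least-squares solver. All values are invented for illustration, and the categorical exercise level is dummy-coded with "low" as the reference level, which is an assumption of this sketch rather than something the article specifies.

```python
import numpy as np

# Hypothetical records, invented for illustration: height (cm),
# calorie intake (kcal/day), exercise level, and weight (kg).
heights = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)
calories = np.array([1800, 2100, 2000, 2500, 2300, 2700, 2600, 3000], dtype=float)
exercise = ["low", "high", "medium", "low", "high", "medium", "low", "high"]
weights = np.array([55, 56, 62, 72, 67, 78, 85, 83], dtype=float)

# Dummy-code the categorical predictor; "low" is the reference level (assumption).
is_medium = np.array([level == "medium" for level in exercise], dtype=float)
is_high = np.array([level == "high" for level in exercise], dtype=float)

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones(len(heights)), heights, calories, is_medium, is_high])

# Least-squares estimates of the intercept and the regression coefficients.
coef, *_ = np.linalg.lstsq(X, weights, rcond=None)
fitted = X @ coef

for name, value in zip(["intercept", "height", "calories", "medium", "high"], coef):
    print(f"{name:>9}: {value: .4f}")
```

Each dummy coefficient is read as the average difference in weight relative to the reference level, with height and calorie intake held constant.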
A sample dataset of 10 rows for the height-weight example, including the other factors that affect the prediction, is displayed below:
Both height and calorie intake individually are linearly related to weight as seen below in their scatter plots.
However, height and calorie intake taken together may affect an individual’s weight linearly in a multi-dimensional cloud of data points, though not in the same way as each does on its own in the scatter plots above.
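The difference between the individual and joint effects can be illustrated with a small sketch. The data are invented, and weight is constructed as an exact linear function of height and calorie intake with made-up coefficients, so that the comparison is easy to read.

```python
import numpy as np

# Hypothetical data, invented for illustration. Height and calorie intake
# are positively correlated here, as they might be in a real sample.
height = np.array([150, 160, 165, 170, 175, 180, 185, 190], dtype=float)
calories = np.array([1800, 2100, 2000, 2500, 2300, 2700, 2600, 3000], dtype=float)

# Assumed relationship (made up): weight is an exact linear function of both.
weight = -60.0 + 0.5 * height + 0.02 * calories

# Simple regression of weight on height alone; polyfit returns [slope, intercept].
simple_slope, simple_intercept = np.polyfit(height, weight, 1)

# Multiple regression on both predictors, with a column of ones for the intercept.
X = np.column_stack([np.ones_like(height), height, calories])
coef = np.linalg.lstsq(X, weight, rcond=None)[0]
joint_intercept, joint_height, joint_calories = coef

print(f"Height slope, height alone:       {simple_slope:.3f}")
print(f"Height slope, with calorie intake: {joint_height:.3f}")
```

In this sketch the slope for height alone is larger than the joint coefficient because, when calorie intake is left out, height also absorbs part of the effect of the correlated calorie variable.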
The general mathematical models representing these linear relationships (termed regression equations) can be written as:
Simple regression: Y = a + bx
Multiple regression: Y = a + b1x1 + b2x2 + … + bnxn
Here, for simple regression, b is the slope of the line and indicates the strength of the predictor’s impact, and a is the intercept of the line. For multiple regression, the bi (i = 1, 2, …, n) are the slopes, or regression coefficients, which indicate the strength of impact of each predictor, and a is again the intercept.
Each regression coefficient estimates the change in the response variable Y per unit increase in one of the xi (i = 1, 2, …, n) when all other predictors are held constant. For our height-weight example, if x1 differed by one unit while both x2 and x3 were held constant, Y would differ by b1 units, on average.
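This "per unit, all else held constant" reading can be checked numerically. The sketch below uses a hypothetical intercept and hypothetical coefficients (not estimates from any real data) and compares two predictions that differ only in x1.

```python
# Hypothetical intercept and coefficients, chosen for illustration only.
a = -60.0
b = [0.5, 0.02, -3.0]  # b1 (height), b2 (calorie intake), b3 (exercise indicator)


def predict(x):
    """Evaluate Y = a + b1*x1 + b2*x2 + b3*x3 for one observation."""
    return a + sum(bi * xi for bi, xi in zip(b, x))


base = [170.0, 2400.0, 1.0]
bumped = [171.0, 2400.0, 1.0]  # x1 raised by one unit; x2 and x3 unchanged

difference = predict(bumped) - predict(base)
print(f"Change in predicted Y for a one-unit increase in x1: {difference:.2f}")
print(f"Coefficient b1: {b[0]:.2f}")
```

The two printed numbers agree because the contributions of x2 and x3 cancel when only x1 changes.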
The intercept, or Y-intercept, of the line is the value you would predict for Y if all predictors were 0, i.e., when all xi = 0 (i = 1, 2, …, n). In some cases the Y-intercept has no meaningful interpretation; it simply helps to anchor the regression line in the right place.
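A tiny sketch makes the point about the intercept concrete. The intercept and slope below are hypothetical values for a height-weight line, not estimates from the article's data.

```python
# Hypothetical simple-regression line for weight (kg) versus height (cm).
intercept = -72.0
slope = 0.82


def predict_weight(height_cm):
    """Evaluate the line Weight = intercept + slope * Height."""
    return intercept + slope * height_cm


# At Height = 0 the prediction is just the intercept: a negative weight,
# which has no physical meaning because nobody has a height of zero.
weight_at_zero = predict_weight(0.0)
print(f"Predicted weight at height 0 cm: {weight_at_zero:.1f} kg")

# Within a realistic range of heights the same line gives plausible values.
weight_at_170 = predict_weight(170.0)
print(f"Predicted weight at height 170 cm: {weight_at_170:.1f} kg")
```

The intercept here only positions the line so that predictions are sensible over the range of heights actually observed.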
Conclusion
In this part, we introduced the simple linear regression model with one predictor variable and then extended it to the multiple linear regression model with at least two predictors.
A sound understanding of regression analysis and modeling gives analysts a solid foundation for understanding virtually every other statistical and machine learning technique. Although regression analysis is not the fanciest learning technique, it remains a dominant and widely used statistical method for modeling the relationship between two or more variables.
In the ensuing parts, we will delve into the steps and methodology for developing a multiple linear regression model.
Original. Reposted with permission.