# Introducing Path Analysis Using R

Path analysis is an extension of multiple regression that allows for the analysis of more complicated models.

By Perceptive Analytics

Imagine that you were asked to build a model to predict the mileage of a car from its different attributes. How would you do it?

The simplest approach would be to take the one attribute (which one to choose could be a matter of endless debate) that affects mileage most and build a regression model to predict mileage from it. But is this the right approach? No, because the mileage of a car depends on multiple factors, not just a single one. So, let’s go a step further and expand our model to make it more robust by including other attributes of the car.

In the second approach, we identify various attributes of a car such as horsepower, capacity, engine type, engine variant, number of cylinders, etc. All of these become predictor variables (also known as independent variables) for our model, and mileage becomes the response variable (also known as the dependent variable).

What’s the difference between the first and second model?

In the second model, we have multiple factors or variables contributing to the final output variable. Intuitively, the accuracy of this model should be higher. Right?

The first model is called simple linear regression, while the second is called multiple linear regression. The assumption in multiple regression is that there are multiple independent variables, all of which affect the output variable. But what if one of the independent variables itself depends on other independent variables? For example, mileage depends on horsepower, capacity, engine type, engine variant and cylinders; but what if horsepower in turn depends on capacity, engine type and cylinders?
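To make the contrast concrete, here is a minimal base-R sketch using the built-in mtcars dataset (which we also use later in this article). The choice of predictors here is purely illustrative: a simple regression with one attribute versus a multiple regression with several.

```r
# Simple linear regression: mileage predicted from a single attribute
fit_simple <- lm(mpg ~ wt, data = mtcars)

# Multiple linear regression: mileage predicted from several attributes
fit_multiple <- lm(mpg ~ wt + hp + cyl, data = mtcars)

# Adding predictors can only increase (or leave unchanged) the R-squared
summary(fit_simple)$r.squared
summary(fit_multiple)$r.squared
```

As expected, the multiple regression explains more of the variance in mpg than the single-predictor model.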

In such a scenario the model becomes more complex, and this is where path analysis comes in handy. Path analysis is an extension of multiple regression that allows for the analysis of more complicated models. It is helpful for examining situations where there are multiple intermediate dependent variables, such as when Z depends on variable Y, which in turn depends on variable X. It can also compare different models to determine which one best fits the data. Path analysis was earlier known as ‘causal modeling’; however, after strong criticism, people now refrain from using that term because it is not possible to establish causal relationships using statistical techniques alone. Causal relationships can only be established through experimental designs. Path analysis can be used to disprove a model that suggests a causal relationship among variables; however, it cannot be used to prove that a causal relationship exists among them.

Let’s understand the terminology used in path analysis. We don’t refer to variables as independent or dependent here; rather, we call them exogenous or endogenous variables. Exogenous variables (the independent variables of the regression world) are variables which have arrows starting from them but none pointing towards them. Endogenous variables have at least one arrow pointing towards them. The reason for this nomenclature is that the factors that cause or influence exogenous variables lie outside the system, while the factors that cause endogenous variables lie within it. In the above image, X is an exogenous variable, while Y and Z are endogenous variables. A typical path diagram is shown below. In that figure, A, B, C, D and E are exogenous variables, while I and O are endogenous variables; ‘d’ is a disturbance term, analogous to the residual in regression.

Now, let’s go through the assumptions that we need to consider before using path analysis. Since path analysis is an extension of multiple regression, most of the assumptions of multiple regression hold true for it as well.

1. All the variables should have linear relations among each other.
2. Endogenous variables should be continuous. In the case of ordinal data, the minimum number of categories should be five.
3. There should be no interaction among variables. In case of any interaction, a separate term or variable can be added that reflects the interaction between the two variables.
4. Disturbance terms should be uncorrelated, i.e., the covariance among them should be zero.
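Assumption 3 can be handled explicitly when specifying a model. As a base-R sketch (using mtcars purely for illustration, with predictors chosen arbitrarily), an interaction is added as a separate product term and the two fits can be compared:

```r
# Main-effects-only model
fit_main <- lm(mpg ~ wt + hp, data = mtcars)

# Model with an explicit interaction term (wt * hp expands to wt + hp + wt:hp)
fit_int <- lm(mpg ~ wt * hp, data = mtcars)

# Compare the two fits to see whether the interaction term is warranted
anova(fit_main, fit_int)
```

The same idea carries over to path models: if two variables interact, a product term representing that interaction enters the model as its own variable.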

Now, let’s move a step ahead and implement path analysis in R. We will first work through a toy example and then use a standard dataset available in R.

```
install.packages("lavaan")
install.packages("OpenMx")
install.packages("semPlot")
install.packages("GGally")
install.packages("corrplot")
library(lavaan)
library(semPlot)
library(OpenMx)
library(GGally)
library(corrplot)
```

Now, let’s create our own dataset and try out path analysis. Please note that the rationale for doing this exercise is to develop intuition to understand path analysis.

```
# Let's create our own dataset and play around with it first
set.seed(11)
a = 0.5
b = 5
c = 7
d = 2.5
x1 = rnorm(20, mean = 0, sd = 1)
x2 = rnorm(20, mean = 0, sd = 1)
x3 = runif(20, min = 2, max = 5)
Y = a*x1 + b*x2
Z = c*x3 + d*Y
data1 = cbind(x1, x2, x3, Y, Z)
```

```
> head(data1, n = 10)
              x1          x2       x3           Y         Z
[1,] -0.59103110 -0.68251762 2.152597 -3.70810366  5.797922
[2,]  0.02659437 -0.01585819 3.488896 -0.06599378 24.257289
[3,] -1.51655310 -0.44260479 3.524391 -2.97130048 17.242488
[4,] -1.36265335  0.35255750 2.707776  1.08146082 21.658085
[5,]  1.17848916  0.07317058 4.441204  0.95509749 33.476170
[6,] -0.93415132  0.00715880 3.257310 -0.43128166 21.722969
[7,]  1.32360565 -0.18760011 2.574199 -0.27619773 17.328901
[8,]  0.62491779 -0.76570065 3.946699 -3.51604433 18.836781
[9,] -0.04572296 -0.22105682 4.439842 -1.12814558 28.258531
[10,] -1.00412058 -0.98358859 2.676505 -5.42000323  5.185524
```

Now, we have created this dataset. Let’s see the correlation matrix for these variables. This will tell us how strongly and which all variables are correlated to each other.

```
> cor1 = cor(data1)
> corrplot(cor1, method = 'square')
```

The above chart shows that Y is very strongly correlated with x2, while Z is strongly correlated with x2 and Y. The impact of x1 on Y is not as strong as that of x2.

```
model1 = 'Z ~ x1 + x2 + x3 + Y
Y ~ x1 + x2'
fit1 = cfa(model1, data = data1)
summary(fit1, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
```

```
> summary(fit1, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
** WARNING ** lavaan (0.6-1) did NOT converge after 90 iterations
** WARNING ** Estimates below are most likely unreliable

Number of observations                            20

Estimator                                         ML
Model Fit Test Statistic                          NA
Degrees of freedom                                NA
P-value                                           NA

Parameter Estimates:

Information                                 Expected
Information saturated (h1) model          Structured
Standard Errors                             Standard

Regressions:
Estimate Std.Err z-value  P(>|z|) Std.lv Std.all
Z ~
x1              0.721 NA                  0.721 0.072
x2              0.328 NA                  0.328 0.028
x3              1.915 NA                  1.915 0.179
Y              1.998 NA                  1.998 0.867
Y ~
x1              0.500 NA                  0.500 0.115
x2              5.000 NA                  5.000 0.968

Variances:
Estimate Std.Err z-value  P(>|z|) Std.lv Std.all
.Z               14.773 NA                  14.773 0.215
.Y                0.000 NA                 0.000 0.000

R-Square:
Estimate
Z              0.785
Y              1.000
```
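The warning is no accident: Y was generated as an exact linear combination of x1 and x2, with no disturbance term, so its residual variance is zero (hence the R-square of 1.000 for Y in the output). A quick base-R check makes this visible; adding a small rnorm() disturbance when generating Y and Z would give the estimator something to fit and remove the warning.

```r
# Reproduce the toy data generation from above (base R only)
set.seed(11)
x1 <- rnorm(20, mean = 0, sd = 1)
x2 <- rnorm(20, mean = 0, sd = 1)
Y  <- 0.5 * x1 + 5 * x2   # no disturbance term

# Y is an exact linear function of x1 and x2, so the residual
# variance from regressing Y on them is (numerically) zero
resid_var <- var(residuals(lm(Y ~ x1 + x2)))
resid_var
```

Despite the warning, the point estimates still recover the generating coefficients (0.5 and 5 for Y), which is why the example remains useful for building intuition.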

```
> semPaths(fit1, 'std', layout = 'circle')
```

The above plot shows that Z is strongly dependent on Y and weakly dependent on x3 and x1, while Y is strongly dependent on x2 and weakly dependent on x1. This matches the intuition we built earlier in this article, and it illustrates how path analysis can be used.

The values between the lines are path coefficients. Path coefficients are standardized regression coefficients, similar to beta coefficients of multiple regression. These path coefficients should be statistically significant, which can be checked from the summary output (we will see this in the next example).

Let’s move on to our second example. In this example, we will use the standard dataset ‘mtcars’ available in R.

```
# Let's take a second example where we use the standard dataset 'mtcars' available in R
data2 = mtcars
```

```
> head(data2, n = 10)
mpg cyl disp hp drat    wt qsec vs am gear carb
Mazda RX4         21.0 6 160.0 110 3.90 2.620 16.46  0 1 4 4
Mazda RX4 Wag     21.0 6 160.0 110 3.90 2.875 17.02  0 1 4 4
Datsun 710        22.8 4 108.0 93 3.85 2.320 18.61  1 1 4 1
Hornet 4 Drive    21.4 6 258.0 110 3.08 3.215 19.44  1 0 3 1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0 0 3 2
Valiant           18.1 6 225.0 105 2.76 3.460 20.22  1 0 3 1
Duster 360        14.3 8 360.0 245 3.21 3.570 15.84  0 0 3 4
Merc 240D         24.4 4 146.7 62 3.69 3.190 20.00  1 0 4 2
Merc 230          22.8 4 140.8 95 3.92 3.150 22.90  1 0 4 2
Merc 280          19.2 6 167.6 123 3.92 3.440 18.30  1 0 4 4
```

```
model2 = 'mpg ~ hp + gear + cyl + disp + carb + am + wt
hp ~ cyl + disp + carb'
fit2 = cfa(model2, data = data2)
```

```
> summary(fit2)
lavaan (0.6-1) converged normally after  62 iterations

Number of observations                            32

Estimator                                         ML
Model Fit Test Statistic                       7.901
Degrees of freedom                                 3
P-value (Chi-square)                           0.048

Parameter Estimates:

Information                                 Expected
Information saturated (h1) model          Structured
Standard Errors                             Standard

Regressions:
Estimate Std.Err z-value  P(>|z|)
mpg ~
hp              -0.022 0.016 -1.388    0.165
gear               0.586 1.247 0.470    0.638
cyl              -0.848 0.710 -1.194    0.232
disp               0.006 0.012 0.512    0.609
carb              -0.472 0.620 -0.761    0.446
am               1.624 1.542 1.053    0.292
wt              -2.671 1.267 -2.109    0.035
hp ~
cyl               7.717 6.554 1.177    0.239
disp               0.233 0.087 2.666    0.008
carb              20.273 3.405 5.954    0.000

Variances:
Estimate Std.Err z-value  P(>|z|)
.mpg                5.011 1.253 4.000    0.000
.hp               644.737 161.184  4.000 0.000
```

In the above summary output, we can see that wt is a significant predictor of mpg at the 5 percent level, while disp and carb are significant predictors of hp. hp itself is not a significant predictor of mpg. We will now examine this model with a path diagram drawn using the semPlot package.

```
> semPaths(fit2, 'std', 'est', curveAdjacent = TRUE, style = "lisrel")
```

The above plot shows that mpg is strongly dependent on wt, while hp is strongly dependent on disp and carb. There is only a weak relation between hp and mpg. The same inference was drawn from the summary output above.

The semPaths function can render this chart in multiple ways. You can go through the semPaths documentation and explore the different options.

There are a few considerations that you should keep in mind while doing path analysis. Path analysis is very sensitive to the omission or addition of variables in the model: leaving out a relevant variable or adding an extraneous one may significantly change the results. Also, path analysis is a technique for testing models, not for building them. If you were to use path analysis to build models, you could end up with an endless combination of candidate models, and choosing the right one would not be feasible. So path analysis should be used to test a specific model, or to compare multiple models and choose the best one.
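For comparing nested path models, lavaan provides standard fit indices and a likelihood-ratio test. Below is a minimal sketch on mtcars; the two model specifications are hypothetical illustrations (the reduced model simply drops the mpg ~ hp path from the full model), and sem() is used here, which for regression-only models like these behaves like the cfa() calls above.

```r
library(lavaan)

# Full model: mpg depends on hp and wt; hp depends on cyl and disp
full_model    <- 'mpg ~ hp + wt
                  hp  ~ cyl + disp'
# Reduced (nested) model: the mpg ~ hp path is constrained out
reduced_model <- 'mpg ~ wt
                  hp  ~ cyl + disp'

fit_full    <- sem(full_model, data = mtcars)
fit_reduced <- sem(reduced_model, data = mtcars)

# Fit indices for a single model
fitMeasures(fit_full, c("chisq", "df", "cfi", "rmsea", "aic"))

# Likelihood-ratio test comparing the nested models
lavTestLRT(fit_full, fit_reduced)
```

If the test is non-significant, the simpler model is usually preferred; otherwise the dropped path carries real explanatory weight.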

There are numerous other ways you can use path analysis. We would love to hear your experiences of using path analysis in different contexts. Please share your examples and experiences in the comments section below.

Perceptive Analytics provides Data Analytics, business intelligence and reporting services to e-commerce, retail, healthcare and pharmaceutical industries. Its client roster includes Fortune 500 and NYSE listed companies in the USA and India.
