# Propensity Score Matching in R

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.

**The Matching Algorithms...**

We can now proceed with the matching algorithms with our ‘pscores.model’ and the estimated propensity scores. The matching algorithms create sets of participants for treatment and control groups. A matched set will consist of at least one person from the treatment group (i.e., people who responded to the ad campaign) and one from the control group (i.e., people who did not respond to the ad campaign) with similar propensity scores. The basic goal is to approximate a random experiment, eliminating many of the problems that come with observational data analysis.

There are various matching algorithms in R, namely, *exact matching, nearest neighbor, optimal matching, full matching *and* caliper matching.* Let’s try a couple of them:

1. *Exact matching* is a technique used to match individuals based on the exact values of covariates.

> library(MatchIt) > match1 <- matchit(pscores.model, method="exact",data=Data) > summary(match1, covariates = T)Call:matchit(formula = pscores.model, data = Data, method = "exact")Sample sizes:ControlTreatedAll596404Matched 99104Discarded 497 300Matched sample sizes by subclass:TreatedControl Total Age Income1 1 12 47 952 2 13 41 893 1 12 35 684 1 12 38 785 1 12 59 69> match1.data <- match.data(match1) > View(match1.data)

Here we can see that atleast one person in the treatment group (Ad_Campaign_Response = 1) has been matched with one person in the control group (Ad_Campaign_Response = 0).

**Create the tableone for exact matching...**

> table_match1 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match1.data,test = FALSE) > print(table_match1, smd = TRUE)Stratified by Ad_Campaign_Response0 1SMDn 99104Age (mean (sd)) 46.04 (7.57)45.51 (7.33)0.071Income (mean (sd)) 79.45 (10.43)79.82 (10.23) 0.035

We can see in the above results that there are 99 control subjects matched with 104 treatment subjects. As our sample dataset is fairly balanced we don’t see much difference in doing exact matching, instead the SMD numbers are slightly higher than the pre-matching numbers. We can try the same with nearest matching and see the effect of matching.

2. *Nearest Neighbour Matching* is an algorithm that matches individuals with controls (it could be two or more controls per treated unit) based on a distance.

> match2 <- matchit(pscores.model, method="nearest", radio=1,data=Data) > plot(match2, type="jitter")

> plot(match2, type="hist")

**Create the tableone for nearest matching...**

> table_match2 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match2.data,test = FALSE) > print(table_match2, smd = TRUE)Stratified by Ad_Campaign_Response01SMDn404404Age (mean (sd)) 45.63 (10.26)45.75 (10.33) 0.012Income (mean (sd)) 79.27 (16.04)79.28 (15.55) <0.001

The first thing we see above is that we matched 404 controls to 404 treated subjects and also we can see we get very small numbers for SMD in the covariates, hence we can conclude that we did a pretty good job of matching with nearest neighbour. We have a great balance in the dataset now to proceed with our further analysis.

**Outcome Analysis (testing our hypothesis)...**

We will now do a t test to test our hypothesis that there is a higher chance of purchase of the product when people respond to the ad campaign.

Before performing a paired t-test we will have to create two subsets from the matched data, one for treatment and other for control groups.

> y_trt <- match2.data$Bought[match2.data$Ad_Campaign_Response == 1] > y_con <- match2.data$Bought[match2.data$Ad_Campaign_Response == 0]

Next we will calculate a pairwise difference of the two subsets

> difference <- y_trt - y_con

Then we will perform a paired t-test on the difference, a paired t-test is just a regular t test on the difference in the outcome of the matched pairs

> t.test(difference)One Sample t-testdata: differencet = 30.786, df = 403, p-value < 2.2e-16alternative hypothesis: true mean is not equal to 095 percent confidence interval:0.6835709 0.7768252sample estimates:mean of x0.730198

From the results above we see a very very small p-value that makes the model highly significant.

The point estimate (mean) is 0.73, that means the difference in probability of Buying the product when everyone responds to the ad campaign versus when no one responds is 0.73 ( in other words there are 73% higher chances of buying when a person responds to the ad campaign).

**References:**

- www.statisticshowto.com/propensity-score-matching/
- pareonline.net/getvn.asp?v=19&n=18
- rstudio-pubs-static.s3.amazonaws.com/284461_5fabe52157594320921fc9e4d539ebc2.html
- Research paper on “Propensity Score Matching in Observational Studies”.
- Inferring Causal effects from Observational data – Jason Roy

This article was contributed by **Perceptive Analytics**. Wyendrila Roy, Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

**Related:**

- A Primer on Web Scraping in R
- 10 Tools to Help You Learn R
- Learn Generalized Linear Models (GLM) using R