Propensity Score Matching in R
Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.
The Matching Algorithms...
We can now proceed with the matching algorithms with our ‘pscores.model’ and the estimated propensity scores. The matching algorithms create sets of participants for treatment and control groups. A matched set will consist of at least one person from the treatment group (i.e., people who responded to the ad campaign) and one from the control group (i.e., people who did not respond to the ad campaign) with similar propensity scores. The basic goal is to approximate a random experiment, eliminating many of the problems that come with observational data analysis.
There are various matching algorithms in R, namely, exact matching, nearest neighbor, optimal matching, full matching and caliper matching. Let’s try a couple of them:
1. Exact matching is a technique used to match individuals based on the exact values of covariates.
> library(MatchIt) > match1 <- matchit(pscores.model, method="exact",data=Data) > summary(match1, covariates = T) Call: matchit(formula = pscores.model, data = Data, method = "exact") Sample sizes: Control Treated All 596 404 Matched 99 104 Discarded 497 300 Matched sample sizes by subclass: Treated Control Total Age Income 1 1 1 2 47 95 2 2 1 3 41 89 3 1 1 2 35 68 4 1 1 2 38 78 5 1 1 2 59 69 > match1.data <- match.data(match1) > View(match1.data)
Here we can see that atleast one person in the treatment group (Ad_Campaign_Response = 1) has been matched with one person in the control group (Ad_Campaign_Response = 0).
Create the tableone for exact matching...
> table_match1 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match1.data,test = FALSE) > print(table_match1, smd = TRUE) Stratified by Ad_Campaign_Response 0 1 SMD n 99 104 Age (mean (sd)) 46.04 (7.57) 45.51 (7.33) 0.071 Income (mean (sd)) 79.45 (10.43) 79.82 (10.23) 0.035
We can see in the above results that there are 99 control subjects matched with 104 treatment subjects. As our sample dataset is fairly balanced we don’t see much difference in doing exact matching, instead the SMD numbers are slightly higher than the pre-matching numbers. We can try the same with nearest matching and see the effect of matching.
2. Nearest Neighbour Matching is an algorithm that matches individuals with controls (it could be two or more controls per treated unit) based on a distance.
> match2 <- matchit(pscores.model, method="nearest", radio=1,data=Data) > plot(match2, type="jitter")
> plot(match2, type="hist")
Create the tableone for nearest matching...
> table_match2 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match2.data,test = FALSE) > print(table_match2, smd = TRUE) Stratified by Ad_Campaign_Response 0 1 SMD n 404 404 Age (mean (sd)) 45.63 (10.26) 45.75 (10.33) 0.012 Income (mean (sd)) 79.27 (16.04) 79.28 (15.55) <0.001
The first thing we see above is that we matched 404 controls to 404 treated subjects and also we can see we get very small numbers for SMD in the covariates, hence we can conclude that we did a pretty good job of matching with nearest neighbour. We have a great balance in the dataset now to proceed with our further analysis.
Outcome Analysis (testing our hypothesis)...
We will now do a t test to test our hypothesis that there is a higher chance of purchase of the product when people respond to the ad campaign.
Before performing a paired t-test we will have to create two subsets from the matched data, one for treatment and other for control groups.
> y_trt <- match2.data$Bought[match2.data$Ad_Campaign_Response == 1] > y_con <- match2.data$Bought[match2.data$Ad_Campaign_Response == 0]
Next we will calculate a pairwise difference of the two subsets
> difference <- y_trt - y_con
Then we will perform a paired t-test on the difference, a paired t-test is just a regular t test on the difference in the outcome of the matched pairs
> t.test(difference) One Sample t-test data: difference t = 30.786, df = 403, p-value < 2.2e-16 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: 0.6835709 0.7768252 sample estimates: mean of x 0.730198
From the results above we see a very very small p-value that makes the model highly significant.
The point estimate (mean) is 0.73, that means the difference in probability of Buying the product when everyone responds to the ad campaign versus when no one responds is 0.73 ( in other words there are 73% higher chances of buying when a person responds to the ad campaign).
- Research paper on “Propensity Score Matching in Observational Studies”.
- Inferring Causal effects from Observational data – Jason Roy
This article was contributed by Perceptive Analytics. Wyendrila Roy, Chaitanya Sagar, Prudhvi Potuganti and Saneesh Veetil contributed to this article.
Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.
- A Primer on Web Scraping in R
- 10 Tools to Help You Learn R
- Learn Generalized Linear Models (GLM) using R