Topics: Coronavirus | AI | Data Science | Deep Learning | Machine Learning | Python | R | Statistics

KDnuggets Home » News » 2018 » Jan » Tutorials, Overviews » Propensity Score Matching in R ( 18:n04 )

# Propensity Score Matching in R

Tags: , ,

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible. comments

The Matching Algorithms...

We can now proceed with the matching algorithms with our ‘pscores.model’ and the estimated propensity scores. The matching algorithms create sets of participants for treatment and control groups. A matched set will consist of at least one person from the treatment group (i.e., people who responded to the ad campaign) and one from the control group (i.e., people who did not respond to the ad campaign) with similar propensity scores. The basic goal is to approximate a random experiment, eliminating many of the problems that come with observational data analysis.

There are various matching algorithms in R, namely, exact matching, nearest neighbor, optimal matching, full matching and caliper matching. Let’s try a couple of them:

1. Exact matching is a technique used to match individuals based on the exact values of covariates.

```> library(MatchIt)
> match1 <- matchit(pscores.model, method="exact",data=Data)
> summary(match1, covariates = T)

Call:
matchit(formula = pscores.model, data = Data, method = "exact")

Sample sizes:
Control  Treated
All                596      404
Matched        99      104

Matched sample sizes by subclass:
Treated   Control  Total  Age  Income
1        1       1      2       47     95
2        2       1      3       41     89
3        1       1      2       35     68
4        1       1      2       38     78
5        1       1      2       59     69

> match1.data <- match.data(match1)
> View(match1.data)``` Here we can see that atleast one person in the treatment group (Ad_Campaign_Response = 1) has been matched with one person in the control group (Ad_Campaign_Response = 0).

Create the tableone for exact matching...

```> table_match1 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match1.data,test = FALSE)
> print(table_match1, smd = TRUE)
0                         1              SMD
n                               99                  104
Age (mean (sd))       46.04 (7.57)   45.51 (7.33)     0.071
Income (mean (sd))  79.45 (10.43)  79.82 (10.23)   0.035```

We can see in the above results that there are 99 control subjects matched with 104 treatment subjects. As our sample dataset is fairly balanced we don’t see much difference in doing exact matching, instead the SMD numbers are slightly higher than the pre-matching numbers. We can try the same with nearest matching and see the effect of matching.

2. Nearest Neighbour Matching is an algorithm that matches individuals with controls (it could be two or more controls per treated unit) based on a distance.

```> match2 <- matchit(pscores.model, method="nearest", radio=1,data=Data)
> plot(match2, type="jitter")``` `> plot(match2, type="hist")` Create the tableone for nearest matching...

```> table_match2 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = match2.data,test = FALSE)
> print(table_match2, smd = TRUE)
0              1                        SMD
n                        404            404
Age (mean (sd))        45.63 (10.26)  45.75 (10.33)      0.012
Income (mean (sd))  79.27 (16.04)  79.28 (15.55)     <0.001```

The first thing we see above is that we matched 404 controls to 404 treated subjects and also we can see we get very small numbers for SMD in the covariates, hence we can conclude that we did a pretty good job of matching with nearest neighbour. We have a great balance in the dataset now to proceed with our further analysis.

Outcome Analysis (testing our hypothesis)...

We will now do a t test to test our hypothesis that there is a higher chance of purchase of the product when people respond to the ad campaign.

Before performing a paired t-test we will have to create two subsets from the matched data, one for treatment and other for control groups.

```> y_trt <- match2.data\$Bought[match2.data\$Ad_Campaign_Response == 1]
> y_con <- match2.data\$Bought[match2.data\$Ad_Campaign_Response == 0]
```

Next we will calculate a pairwise difference of the two subsets

`> difference <- y_trt - y_con`

Then we will perform a paired t-test on the difference, a paired t-test is just a regular t test on the difference in the outcome of the matched pairs

```> t.test(difference)

One Sample t-test

data:  difference
t = 30.786, df = 403, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.6835709 0.7768252
sample estimates:
mean of x
0.730198 ```

From the results above we see a very very small p-value that makes the model highly significant.
The point estimate (mean) is 0.73, that means the difference in probability of Buying the product when everyone responds to the ad campaign versus when no one responds is 0.73 ( in other words there are 73% higher chances of buying when a person responds to the ad campaign).

References: Get KDnuggets, a leading newsletter on AI, Data Science, and Machine Learning