Propensity Score Matching in R

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible.



By Perceptive Analytics

The concept of Propensity score matching (PSM) was first introduced by Rosenbaum and Rubin (1983) in a paper entitled “The Central Role of the Propensity Score in Observational Studies for Casual Effects.”

Statistically it means...

Propensity scores are an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible. PSM refers to the pairing of treatment and control units with similar values on the propensity score; and possibly other covariates (the characteristics of participants); and the discarding of all unmatched units.

What is PSM in simple terms...

PSM is done on observational studies. It is done to remove the selection bias between the treatment and the control groups.

For example say a researcher wants to test the effect of a drug on lab rats. He divides the rats in two groups and tests the effects of the drug in one of the groups, which is the treatment group. The other group is known as the control group.  As it is an experiment everything is controlled by the experimenter, like all these rats are genetically the same and grow up in the same environment, so he knows that any differences between them will be solely due to the drug that he has given them.

However this cannot be done with people because people are different from each other, they come in different shapes and sizes, ages, ethnicities, etc., However, at the same time people aren't so different that we can't find any similarities between them. This is when we can use propensity score matching.

To give an example, if a marketer wants to observe the effect of a marketing campaign on the buyers; to judge if the campaign is the only reason which influenced them to buy; he cannot deduce this as he does not know whether the people who participated in the campaign are equivalent to the people who did not participate in the campaign. There might be other influential factors that led the buyers to buy and it might not be the effect of the campaign at all. In that case he can go for a propensity score matching estimation to observe how much impact the campaign had on the buyers/non-buyers.

PSM using R...

I will now demonstrate a simple program on how to do Propensity Score matching in R, with the use of two packages: Tableone and MatchIt.

The dataset...

#Reading the raw data 
> Data <- read.csv("Campaign_Data.csv", header = TRUE)
> dim(Data)
[1] 1000    4


The dataset contains randomly simulated 1000 records of people with their demographic profiles, age and income, the Ad_Campaign_Response, whether they have responded to the campaign or not , 1 = Responded and 0 = Not Responded, and the Bought column, 1 = Purchased and 0 = Not Purchased.

# To view the first few records in the dataset
> head(Data)
  Age  Income  Ad_Campaign_Response   Bought
1  41    107                    1       1
2  56     75                    0      0
3  50     88                    0       0
4  22     94                    1       0
5  51     74                    1       0
6  45     51                    1       1


The Treatment and Control Population...

# The treatment population
> Treats <- subset(Data,  Ad_Campaign_Response == 1)
> colMeans(Treats)
                 Age               Income  Ad_Campaign_Response        Bought 
          45.7500000           79.2772277            1.0000000             0.8391089

# The control population
> Control <- subset(Data, Ad_Campaign_Response == 0)
> colMeans(Control)
                 Age               Income  Ad_Campaign_Response                  Bought 
          45.0755034           79.4832215            0.0000000             0.1107383

> dim(Treats)
[1] 404   4
> dim(Control)
[1] 596   4

#Here we see we have 404 treated records and 596 control records.


If we want to find out the real impact of the campaign on the purchase

> model_1 <- lm(Bought ~ Ad_Campaign_Response + Age + Income,data=Data)
> model_1

Call:
lm(formula = Bought ~ Ad_Campaign_Response + Age + Income, data = Data)

Coefficients:
         (Intercept)  Ad_Campaign_Response                   Age                Income  
           0.2201474             0.7291988             -0.0014048            -0.0005798  

> Effect <- model_1$coeff[2] 
> Effect
Ad_Campaign_Response 
           0.7291988


The model above shows that the ad campaign had a 72.9% effect on the purchase.

The Propensity Scores Model...

Now let’s prepare a Logistic Regression model to estimate the propensity scores. That is, the probability of responding to the ad campaign.

> pscores.model <- glm(Ad_Campaign_Response ~ Age + Income,family = binomial("logit"),data = Data)
> summary(pscores.model)
Call:
glm(formula = Ad_Campaign_Response ~ Age + Income, family = binomial("logit"), 
    data = Data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1095  -1.0208  -0.9926   1.3373   1.4521  

Coefficients:
              Estimate         Std. Error     z value   Pr(>|z|)
(Intercept) -0.6332998  0.4505631   -1.406    0.160
Age           0.0065209   0.0063705     1.024    0.306
Income     -0.0006507  0.0041133    -0.158    0.874

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1349.2  on 999  degrees of freedom
Residual deviance: 1348.1  on 997  degrees of freedom
AIC: 1354.1

Number of Fisher Scoring iterations: 4

> Propensity_scores <- pscores.model
> Data$PScores <- pscores.model$fitted.values
> hist(Data$PScores[Ad_Campaign_Response==1],main = "PScores of Response = 1")
> hist(Data$PScores[Ad_Campaign_Response==0], ,main = "PScores of Response = 0")


Creating a Tableone pre-matching table...

#covariates we are using
>xvars <- c("Age", "Income")
>library(tableone)
> table1 <- CreateTableOne(vars = xvars,strata = "Ad_Campaign_Response",data = Data, test = FALSE)


We will use the tableone package to summarize the data using the covariates that we stored in xvars. We are going to stratify the response variable, i.e., we will check the balance in the dataset among the people who responded (treatment group) and who did not respond (control group) to the campaign. The ‘test = False’ states that we don’t require a significance test; instead we just want to see the mean and standard deviation of the covariates in the results.

Then we are going to print the statistics and also see the Standardized Mean Differences (SMD) in the variables.

> print(table1, smd = TRUE)
                    Stratified by Ad_Campaign_Response
                             0              1              SMD   
  n                            596            404               
  Age (mean (sd))       45.08 (10.02)  45.75 (10.33)   0.066
  Income (mean (sd))  79.48 (15.80)  79.28 (15.55)   0.013


Here we see the summary statistics of the covariates, we can see that in the control group there are 596 subjects and in the treatment group there are 404 subjects. It also shows the mean and standard deviations of these variables.

In the last column we can see the SMD, here we should be careful about SMDs which are greater than 0.1 because those are the variables which shows imbalance in the dataset and that is where we actually need to do propensity score matching.

In our example here, as it is a simulated dataset of just 1000 records and containing only 2 covariates we don’t see any imbalance but we will still go ahead with the matching and see the difference in the results.