Entry X-1: Covariate Imbalance (Propensity Score Matching)

Introduction (PDF & R-Code)

Conducting a randomized controlled trial is the foremost strategy for generating causal inferences in the social sciences. The required random assignment of the treatment condition, however, limits our ability to use experimental designs when random assignment would raise ethical concerns. Specifically, we cannot randomly assign participants to be born into an impoverished or middle-class family (even though psychologists have tried; see Neubauer and Neubauer, 1996), due to the potential social, physical, and psychological ramifications of the assignment. When we cannot randomly assign participants to the treatment condition, we rely on quasi-experimental methods to emulate experimental research. The most common quasi-experimental technique is matching.

Matching is the process of taking an individual who was exposed to the treatment and placing an individual who resembles the treatment case, but was not exposed to the treatment, into the control group. This matching process can be conducted by design – when the experiment is planned around exact matching – or statistically. Given that this series is focused on statistical assumptions, the next five entries will focus on the most widely used statistical approach to conducting a quasi-experiment: propensity score matching.

Propensity Score Matching

Before propensity score matching was developed, exact matching was the most common way to conduct a quasi-experiment. For example, if we had a respondent in the treatment group who was a 20-year-old male, we would find a 20-year-old male to place into the control group. Although simple when matching on only two confounders (age and biological sex), exact matching becomes exponentially more difficult as the number of confounders increases. For instance, try to find a control case for an individual who is 20 years old, has a 3.8 GPA, has high-school-educated parents, and has a family income of $80,000. It would be quite difficult. What if we add a continuous covariate like an individual’s score on an intelligence test? These difficulties associated with exact matching encouraged Rosenbaum and Rubin (1983) to develop propensity score matching.

Propensity score matching is – at its core – a data reduction technique designed to permit matching treatment and control cases on numerous confounders without reducing the ability to find matches. This is achieved by combining the variation in the treatment predicted by potential confounders into a single score for each individual – a propensity score. The propensity scores can then be used to match treatment cases to similar control cases and to conduct post-matching comparisons. Post-matching comparisons employ the matched subsample (the treatment and control cases with similar propensity scores) to examine the effects of a treatment on an outcome of interest.

Propensity Scores

A propensity score is a predicted probability. That is, a propensity score is the probability that a case was exposed to the treatment, conditional on the independent variables used to predict the treatment. Propensity scores range from zero (i.e., the case had no chance of being exposed to the treatment) to one (i.e., the case was certain to be exposed to the treatment) and capture the variation in the treatment predicted by the independent variables.

Let’s work through an example. Ethically, we cannot randomly assign individuals to be involved in a car accident. We can, however, sample individuals who have been in a car accident (i.e., the treatment) and individuals who have not (i.e., the counterfactual). From this sample, we can use key covariates to predict whether individuals in our sample were in a car accident or not. For instance, average driving speed, history of accidents, and risky driving behaviors can be used to predict whether an individual experienced a car accident. After obtaining the effects of these key covariates, we can use the unstandardized slope coefficients to calculate the probability of each individual being in a car accident. Individuals with a higher predicted probability likely have characteristics that increased the risk of being in a car accident, while individuals with a lower predicted probability likely have characteristics that decreased that risk.[i]
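To make the calculation concrete, below is a minimal sketch of how such propensity scores could be estimated in R. The data frame (Drivers) and variable names (Accident, Speed, PriorAccidents, RiskyBehaviors) are hypothetical stand-ins for the example above; the propensity scores are simply the predicted probabilities from a binary logistic regression (consistent with footnote [i]).

> # Hypothetical example: predicting a car accident (0/1) from key covariates
> PSModel<-glm(Accident~Speed+PriorAccidents+RiskyBehaviors,
+              data = Drivers, family = binomial(link = "logit"))
> Drivers$PScore<-predict(PSModel, type = "response") # predicted probabilities (0 to 1)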

The Value Of Propensity Score Matching (Covariate Balance)[ii]

The logic of matching participants using a propensity score is derived from the assumption of covariate balance. Within the quasi-experimental framework, causal inferences cannot be made without first satisfying the assumption of covariate balance. Explicitly, the process of emulating a true experiment requires us to establish an appropriate counterfactual condition and balance confounders across the treatment and control cases. That is why propensity score matching is inherently designed to increase balance on observed confounders. Actually, if you do not increase balance on observed confounders, your matching procedure failed, and you should probably go back to the drawing board.

The perceived value of any matching procedure does not come from achieving covariate balance on observed confounders, but rather from the assumption that matching will also balance some unmeasured confounders. For example, if we are interested in the effects of G.R.E.A.T. on future gang involvement, matching treatment and control cases on biological sex could balance an unmeasured confounder like involvement in delinquency (biological males tend to be more involved in delinquent acts than biological females).[iii] Consistent with the example, balance on unknown confounders could be achieved when the independent variables used to match participants are correlated with unmeasured constructs that confound the association between the treatment and the outcome of interest. The postulation that matching increases balance on unobserved confounders, as well as the ability to include numerous observed constructs in the matching procedure, is the reason propensity score matching remains a useful technique in certain circumstances. Nevertheless, when the assumption of covariate balance is violated, the estimates derived from post-matching analyses can become biased, substantially limiting the value of propensity score matching as a quasi-experimental technique. Let’s explore the bias created by violating the assumption of covariate balance in post-propensity score matching analyses.

Balancing Observed Confounders

The data simulation for a non-randomized controlled trial with a dichotomous treatment variable proceeds in a logical order: we create the independent variables (confounders) first, then the treatment variable, and finally the outcome of interest. Six independent variables are simulated in the code below: DV1 & DV2 are specified to emulate categorical variables (DV1 by truncating normal draws, DV2 by truncating draws from a noncentral t-distribution), SCV1 & SCV2 are specified to emulate normally distributed semi-continuous constructs, and CV1 & CV2 are specified to emulate normally distributed continuous constructs. All of the seeds are set to different values, chosen with a random number generator, to help ensure that the independent variables are uncorrelated. After creating all of the constructs, we aggregate them into a data frame using the data.frame command.

To simulate a dichotomous treatment variable influenced by continuous independent variables, we must first simulate a continuous treatment variable (TCon in the R-Code). To simulate the continuous treatment variable, I prefer to start by normalizing the independent variables. Normalizing the independent variables makes the distribution of each of the six constructs range between zero and one. The function in the code below normalizes every measure in the data frame using the formula (x – min(x)) / (max(x) – min(x)).

The normalization of the independent variables makes it easy to create a continuous treatment variable that is bounded between 0 and 1. Specifically, by having the modifiers (i.e., the constants in the R-code multiplied by the variables) sum to 1, we can simulate scores that emulate the respondents’ probability of being exposed to the treatment. In this case, the modifier for each of the independent variables was set to .10 (six variables = .60 total) and the modifier for the random term was set to .40. The random term ensures that exposure to the treatment condition isn’t perfectly determined by the respondents’ values on the independent variables. The modifiers can also be interpreted as the percentage of the variation in the treatment variable contributed by each independent variable and the random term.

> # Loading Packages ####
> library(MatchIt) # matchit() and match.data()
> library(psych)   # skew() and kurtosi()
> 
> # Simulating Data ####
> N<-500
> set.seed(23733)
> DV1<-trunc(rnorm(N,2.5,.75)) # Categorical Variable
> set.seed(88988)
> SCV1<-trunc(rnorm(N,35,5)) # Semi-continuous Variable
> set.seed(29204)
> CV1<-trunc(rnorm(N,35000,5000)) # Continuous Variable
> set.seed(81852)
> DV2<-trunc(rt(N,20,9)) # Categorical Variable
> set.seed(7802)
> SCV2<-rnorm(N,100,15) # Semi-continuous Variable
> set.seed(80528)
> CV2<-trunc(rnorm(N,2000,500)) # Continuous Variable
> Data<-data.frame(DV1,SCV1,CV1,DV2,SCV2,CV2)

> # Normalizing Confounding Variables ####
> 
> min_max_norm <- function(x) {
+   (x - min(x)) / (max(x) - min(x))
+ }
> 
> NData <- as.data.frame(lapply(Data, min_max_norm))
> 
> # Creating Continuous Treatment Variable ####
> 
> set.seed(1992)
> NData$TCon<-.10*NData$DV1+.10*NData$SCV1+.10*NData$CV1+
+             .10*NData$DV2+.10*NData$SCV2+.10*NData$CV2+.40*runif(N,0,1)

As can be observed in the figure below, the scores on the continuous treatment variable (TCon) range from approximately .15 to .80, with the mass of the distribution centering around the mean of .48.
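(The figure is an image in the original post. A similar view of the distribution can be reproduced with a quick histogram; a minimal sketch, assuming the NData object created above:)

> hist(NData$TCon, xlab = "TCon", main = "") # distribution of the continuous treatment
> range(NData$TCon) # approximately .15 to .80
> mean(NData$TCon)  # approximately .48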

The continuous treatment variable (TCon) is then dichotomized (TDI), where cases that scored above .50 received the treatment (i.e., had a “1” on TDI) and cases at or below .50 did not receive the treatment (i.e., had a “0” on TDI). The value of .50 was selected because it is a logical break that creates a dichotomous variable with more control cases than treatment cases.[iv] The resulting dichotomy (TDI) had 223 cases exposed to the treatment and 277 cases not exposed to the treatment.

> # Creating Dichotomous Treatment Variable ####
> 
> table(NData$TCon>.50)

FALSE  TRUE 
  277   223 
> NData$TDI[NData$TCon>.50]<-1
> NData$TDI[NData$TCon<=.50]<-0
> table(NData$TDI)

  0   1 
277 223 
>

After specifying the dichotomous treatment variable (TDI), we have to simulate the dependent variable (Y). In this example, the effect of the treatment on the dependent variable was set to .50. That is, a 1-point increase in the treatment (TDI) – the transition from 0 to 1 – resulted in a .50 increase in the dependent variable (Y); stated differently, the mean difference between the treatment and control cases on the outcome of interest was .50. The independent variables – the confounders – influenced the dependent variable (Y) equally, where a 1-point increase in each independent variable corresponded with a 1-point increase in the dependent variable. To make sure everything worked as intended, we can estimate two models: (1) the bivariate association between Y and TDI and (2) the association between Y and TDI adjusted for the confounding independent variables. Consistent with the data specification, the bivariate association between Y and TDI is upwardly biased (b = .718), while the model adjusting for the confounding independent variables recovers the relationship perfectly (b = .50).

> # Creating Dependent Variable ####
> 
> 
> NData$Y<-.50*NData$TDI+1.00*NData$DV1+1.00*NData$SCV1+1.00*NData$CV1+
+           1.00*NData$DV2+1.00*NData$SCV2+1.00*NData$CV2
> 
> # Checking Simulations ####
> 
> lm(Y~TDI, data = NData)

Call:
lm(formula = Y ~ TDI, data = NData)

Coefficients:
(Intercept)          TDI  
      2.782        0.718  

> lm(Y~TDI+DV1+SCV1+CV1+DV2+SCV2+CV2, data = NData)

Call:
lm(formula = Y ~ TDI + DV1 + SCV1 + CV1 + DV2 + SCV2 + CV2, data = NData)

Coefficients:
       (Intercept)                 TDI                 DV1                SCV1  
-0.000000000000005   0.500000000000000   1.000000000000003   1.000000000000000  
               CV1                 DV2                SCV2                 CV2  
 0.999999999999999   0.999999999999999   1.000000000000002   1.000000000000000  

>

As observed below, the scores on the independent variables are not balanced across the treatment and control cases.

> # Balance in Population ####
> 
> TMC<-NData[which(NData$TDI == 1),]
> CMC<-NData[which(NData$TDI == 0),]
> 
> ## Categorical Variable 1
> mean(TMC$DV1)-mean(CMC$DV1)
[1] 0.01168
> sd(TMC$DV1)-sd(CMC$DV1)
[1] -0.01794
> skew(TMC$DV1)-skew(CMC$DV1)
[1] -0.2145
> kurtosi(TMC$DV1)-kurtosi(CMC$DV1)
[1] 0.4656
> 
> ## Semi-continuous Variable 1
> mean(TMC$SCV1)-mean(CMC$SCV1)
[1] 0.05725
> sd(TMC$SCV1)-sd(CMC$SCV1)
[1] 0.02151
> skew(TMC$SCV1)-skew(CMC$SCV1)
[1] -0.08513
> kurtosi(TMC$SCV1)-kurtosi(CMC$SCV1)
[1] -0.3513
> 
> ## Continuous Variable 1
> mean(TMC$CV1)-mean(CMC$CV1)
[1] 0.0396
> sd(TMC$CV1)-sd(CMC$CV1)
[1] -0.004282
> skew(TMC$CV1)-skew(CMC$CV1)
[1] 0.1054
> kurtosi(TMC$CV1)-kurtosi(CMC$CV1)
[1] -0.7627
> 
> ## Categorical Variable 2
> mean(TMC$DV2)-mean(CMC$DV2)
[1] 0.04431
> sd(TMC$DV2)-sd(CMC$DV2)
[1] 0.001218
> skew(TMC$DV2)-skew(CMC$DV2)
[1] 0.000575
> kurtosi(TMC$DV2)-kurtosi(CMC$DV2)
[1] -0.1748
> 
> ## Semi-continuous Variable 2
> mean(TMC$SCV2)-mean(CMC$SCV2)
[1] 0.04289
> sd(TMC$SCV2)-sd(CMC$SCV2)
[1] 0.01045
> skew(TMC$SCV2)-skew(CMC$SCV2)
[1] -0.2325
> kurtosi(TMC$SCV2)-kurtosi(CMC$SCV2)
[1] 0.7448
> 
> ## Continuous Variable 2
> mean(TMC$CV2)-mean(CMC$CV2)
[1] 0.02181
> sd(TMC$CV2)-sd(CMC$CV2)
[1] 0.008385
> skew(TMC$CV2)-skew(CMC$CV2)
[1] 0.1802
> kurtosi(TMC$CV2)-kurtosi(CMC$CV2)
[1] -0.2985
>

Balanced Observed Confounders

Now that we have our simulated data, we can run our analyses. First, we will estimate a single matching procedure using all of the independent variables to match cases across the treatment and control groups. The propensity scores were estimated using a binary logistic regression model where TDI was regressed on DV1, DV2, SCV1, SCV2, CV1, and CV2. The matching procedure was specified as nearest neighbor one-to-one matching without replacement, in a random order, with a caliper of .05. Although inefficient (regarding both time and the number of matches established), nearest neighbor matching with a caliper is the easiest propensity score matching procedure to understand: the computer selects treatment cases in a random order and, for each one, finds a control case with a propensity score within .05 of the selected treatment case. The treatment case and the selected control case are then matched and removed from the list. When conducting propensity score matching, a seed is required to replicate the matching procedure.

To briefly review the summary of the results: the section titled summary of balance for all data provides the balance in the data before matching participants, while the section titled summary of balance for matched data provides the balance in the data after matching participants. The section titled percent balance improvement indicates how well the matching procedure improved balance across the specified predictors, and the section titled sample sizes reports how many treatment cases were and were not successfully matched. Within the balance sections, the means treated column provides the mean of the treatment group for the propensity score (labeled distance) and each specified predictor, and the means control column provides the corresponding means for the control group. The std. mean diff. column is the standardized difference (Cohen’s D) between the treatment and control groups, the var. ratio is the ratio of the treatment group’s variance to the control group’s variance, the eCDF mean and eCDF max are the mean and maximum differences between the weighted empirical cumulative distribution functions of the treatment and control groups, and the std. pair dist. is the average absolute difference of a variable within matched pairs.
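To make one of these statistics concrete, the standardized mean difference for a single covariate can be computed by hand. This is a minimal sketch, assuming the TMC (treatment) and CMC (control) subsets created earlier; note that MatchIt, by default, standardizes using the standard deviation of the treatment group.

> # Standardized mean difference for SCV1, computed manually
> (mean(TMC$SCV1)-mean(CMC$SCV1))/sd(TMC$SCV1)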


Important Note: Although you have to conduct t-tests to publish any propensity score matching paper, t-values and p-values DO NOT provide any indication of how well your matching procedure worked. Specifically, t-values and p-values do not illustrate how well the matching procedure improved the balance between the treatment and control cases on the key confounders. Cohen’s D, the mean and maximum differences between the weighted empirical cumulative distribution functions, and the differences in SD, skew, and kurtosis provide the best indication. With this said, I will not be producing t-values or p-values for any of the PSM simulations.


As demonstrated by the results below, 175 treatment cases were matched to 175 control cases, while 150 cases (48 treatment; 102 control) were not successfully matched. The matching procedure appeared to drastically improve balance – evaluated using all of the key statistics described above – between the treatment and control cases on the confounding variables. After the matching procedure is completed, the matched subsample is extracted using the match.data command.

> # Matching (Nearest Neighbor, Random, 1 to 1) one time all variables ####
> 
> set.seed(1992)
> M1<-matchit(TDI~DV1+SCV1+CV1+DV2+SCV2+CV2, data = NData,
+         method = "nearest", m.order = "random", ratio = 1, replace = FALSE, caliper = .05)
> 
> summary(M1)

Call:
matchit(formula = TDI ~ DV1 + SCV1 + CV1 + DV2 + SCV2 + CV2, 
    data = NData, method = "nearest", replace = FALSE, m.order = "random", 
    caliper = 0.05, ratio = 1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance         0.486         0.413           0.537      1.198     0.150    0.235
DV1              0.509         0.497           0.061      0.836     0.019    0.071
SCV1             0.499         0.442           0.318      1.289     0.057    0.168
CV1              0.520         0.480           0.270      0.944     0.071    0.128
DV2              0.401         0.357           0.232      1.013     0.040    0.101
SCV2             0.563         0.520           0.244      1.130     0.076    0.157
CV2              0.507         0.485           0.115      1.095     0.026    0.066


Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance         0.443         0.443           0.003      1.003     0.004    0.023
DV1              0.500         0.504          -0.022      0.830     0.024    0.051
SCV1             0.466         0.461           0.026      1.127     0.018    0.057
CV1              0.495         0.503          -0.053      0.856     0.024    0.097
DV2              0.375         0.377          -0.012      0.772     0.017    0.029
SCV2             0.545         0.537           0.044      1.322     0.023    0.069
CV2              0.489         0.493          -0.016      0.966     0.029    0.097
         Std. Pair Dist.
distance           0.013
DV1                0.994
SCV1               0.888
CV1                1.051
DV2                0.997
SCV2               0.866
CV2                1.051

Percent Balance Improvement:
         Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance            99.5       98.3      97.3     90.3
DV1                 63.3       -4.1     -26.4     27.4
SCV1                91.7       52.8      69.2     66.0
CV1                 80.3     -169.7      65.7     23.9
DV2                 94.8    -1928.1      58.7     71.6
SCV2                82.0     -128.0      69.0     56.3
CV2                 86.0       61.5     -12.5    -46.2

Sample Sizes:
          Control Treated
All           277     223
Matched       175     175
Unmatched     102      48
Discarded       0       0

>  
> M1DF<-match.data(M1)
>

Using the matched subsample, a post-matching analysis was conducted in which the bivariate association between Y and TDI was estimated. Given the drastic improvement in covariate balance in the matched subsample, it should come as no surprise that the estimated association between Y and TDI was very close to the true association (estimated b = .4950; true b = .50).

> ## Post matching Analysis ####
> 
> PMRM<-lm(Y~TDI, data = M1DF)
> summary(PMRM)

Call:
lm(formula = Y ~ TDI, data = M1DF)

Residuals:
   Min     1Q Median     3Q    Max 
-0.975 -0.256  0.015  0.257  1.055 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   2.8744     0.0288   100.0 <0.0000000000000002 ***
TDI           0.4950     0.0407    12.2 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.38 on 348 degrees of freedom
Multiple R-squared:  0.299,	Adjusted R-squared:  0.297 
F-statistic:  148 on 1 and 348 DF,  p-value: <0.0000000000000002

>

Below is an illustration of the distribution of the b estimates – unstandardized slope coefficients – after conducting the matching procedure 1000 times. The mean of the distribution is .4956, and the distribution of b estimates is extremely narrow around the true slope coefficient.

Notes: “b” represents the estimated effect of the treatment on the dependent variable. The red dashed line represents the mean effect size of the treatment on the dependent variable across the 1000 replications of the propensity score matching procedure (true causal estimate in the population is .50). The blue, yellow, and red areas represent 1, 2, and 3 standard deviations (respectively) from the mean.
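The replication syntax is not included in the original code. Below is a minimal sketch of how such a loop could be implemented, assuming the NData object created above; the seeds used for each replication (here, simply the iteration number) are an assumption.

> # Replicating the matching procedure 1000 times (a sketch)
> BEst<-numeric(1000)
> for(i in 1:1000){
+   set.seed(i) # a different seed for each replication
+   MRep<-matchit(TDI~DV1+SCV1+CV1+DV2+SCV2+CV2, data = NData,
+                 method = "nearest", m.order = "random", ratio = 1,
+                 replace = FALSE, caliper = .05)
+   BEst[i]<-coef(lm(Y~TDI, data = match.data(MRep)))["TDI"]
+ }
> mean(BEst) # mean of the b estimates across replications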

Unbalanced Observed Confounders

Considering that there is covariate imbalance in the population (the initial dataset), removing some variables from the matching procedure will effectively violate the assumption of covariate balance. As such, we replicated the matching process using only SCV1 and CV1. As can be observed in the results below, the balance between the treatment and control groups on SCV1 and CV1 was substantively improved.

> # Matching (Nearest Neighbor, Random, 1 to 1) one time two variables####
> 
> set.seed(1992)
> M1<-matchit(TDI~SCV1+CV1, data = NData,
+             method = "nearest", m.order = "random", ratio = 1, replace = FALSE, caliper = .05)
> 
> summary(M1)

Call:
matchit(formula = TDI ~ SCV1 + CV1, data = NData, method = "nearest", 
    replace = FALSE, m.order = "random", caliper = 0.05, ratio = 1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance         0.469         0.427           0.405      1.115     0.109    0.175
SCV1             0.499         0.442           0.318      1.289     0.057    0.168
CV1              0.520         0.480           0.270      0.944     0.071    0.128


Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance         0.447         0.447           0.003      1.003     0.004    0.022
SCV1             0.470         0.463           0.041      1.434     0.024    0.083
CV1              0.500         0.508          -0.054      0.894     0.029    0.083
         Std. Pair Dist.
distance           0.011
SCV1               0.601
CV1                0.858

Percent Balance Improvement:
         Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance            99.4       97.2      96.8     87.4
SCV1                87.0      -42.0      58.3     50.7
CV1                 80.0      -94.2      58.6     35.1

Sample Sizes:
          Control Treated
All           277     223
Matched       181     181
Unmatched      96      42
Discarded       0       0

> 
> M1DF<-match.data(M1)
>

Nevertheless, balance on the remaining confounders did not improve, which violates the assumption of covariate balance. Specifically, by not improving balance across the treatment and control groups on these key confounders, we have not adjusted the post-matching analysis for their influence. The bias introduced by violating the covariate balance assumption is demonstrated in the slope coefficient, which is now slightly further from the true slope coefficient than in the previous demonstration (b = .4849 in the imbalanced sample; b = .4950 in the balanced sample; true b = .50).

> ## Balance in matched sample ####
> 
> TMC<-M1DF[which(M1DF$TDI == 1),]
> CMC<-M1DF[which(M1DF$TDI == 0),]
> 
> ## Categorical Variable 1
> mean(TMC$DV1)-mean(CMC$DV1)
[1] 0.05056
> sd(TMC$DV1)-sd(CMC$DV1)
[1] -0.03642
> skew(TMC$DV1)-skew(CMC$DV1)
[1] -0.201
> kurtosi(TMC$DV1)-kurtosi(CMC$DV1)
[1] 0.6245
> 
> ## Semi-continuous Variable 1
> mean(TMC$SCV1)-mean(CMC$SCV1)
[1] 0.005037
> sd(TMC$SCV1)-sd(CMC$SCV1)
[1] 0.0202
> skew(TMC$SCV1)-skew(CMC$SCV1)
[1] 0.02275
> kurtosi(TMC$SCV1)-kurtosi(CMC$SCV1)
[1] -0.6576
> 
> ## Continuous Variable 1
> mean(TMC$CV1)-mean(CMC$CV1)
[1] -0.005054
> sd(TMC$CV1)-sd(CMC$CV1)
[1] -0.01193
> skew(TMC$CV1)-skew(CMC$CV1)
[1] -0.3436
> kurtosi(TMC$CV1)-kurtosi(CMC$CV1)
[1] -0.03494
> 
> ## Categorical Variable 2
> mean(TMC$DV2)-mean(CMC$DV2)
[1] -0.02978
> sd(TMC$DV2)-sd(CMC$DV2)
[1] -0.01268
> skew(TMC$DV2)-skew(CMC$DV2)
[1] -0.1079
> kurtosi(TMC$DV2)-kurtosi(CMC$DV2)
[1] 0.1544
> 
> ## Semi-continuous Variable 2
> mean(TMC$SCV2)-mean(CMC$SCV2)
[1] 0.005853
> sd(TMC$SCV2)-sd(CMC$SCV2)
[1] 0.0135
> skew(TMC$SCV2)-skew(CMC$SCV2)
[1] -0.1512
> kurtosi(TMC$SCV2)-kurtosi(CMC$SCV2)
[1] 0.8488
> 
> ## Continuous Variable 2
> mean(TMC$CV2)-mean(CMC$CV2)
[1] -0.0417
> sd(TMC$CV2)-sd(CMC$CV2)
[1] -0.002408
> skew(TMC$CV2)-skew(CMC$CV2)
[1] 0.2165
> kurtosi(TMC$CV2)-kurtosi(CMC$CV2)
[1] 0.2266

> ## Post matching Analysis ####
> 
> PMRM<-lm(Y~TDI, data = M1DF)
> summary(PMRM)

Call:
lm(formula = Y ~ TDI, data = M1DF)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.3068 -0.2533  0.0008  0.2781  1.3506 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   2.8917     0.0304    95.1 <0.0000000000000002 ***
TDI           0.4849     0.0430    11.3 <0.0000000000000002 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.405 on 354 degrees of freedom
Multiple R-squared:  0.264,	Adjusted R-squared:  0.262 
F-statistic:  127 on 1 and 354 DF,  p-value: <0.0000000000000002

>

The matching procedure described above was replicated 1000 times, and the results are presented in the figure below. Distinct from the previous figure, the distribution widened slightly, and the mean of the distribution is .501. As these results suggest, conducting numerous matching procedures – with different seeds – could potentially reduce the bias related to violations of the covariate balance assumption under certain conditions. This finding illustrates the importance of conducting replications when propensity score matching.

Notes: “b” represents the estimated effect of the treatment on the dependent variable. The red dashed line represents the mean effect size of the treatment on the dependent variable across the 1000 replications of the propensity score matching procedure (true causal estimate in the population is .50). The blue, yellow, and red areas represent 1, 2, and 3 standard deviations (respectively) from the mean.

Balancing Unobserved Confounders

While we are here, let’s spend a brief moment talking about the perceived value of propensity score matching. The matching procedures estimated above – contrary to popular belief – will not produce slope coefficients closer to the true estimate than a statistical control model (Y regressed on the treatment plus the independent variables). This is because the independent variables in the previous illustrations were treated as observed confounders. Balancing the treatment and control groups on unobserved confounders, however, is a fundamental component of the covariate balance assumption.

To explore how propensity score matching can balance unobserved confounders, we will replicate our previous data simulation with three key differences: (1) DV1 will be correlated with DV2, (2) SCV1 will be correlated with SCV2, and (3) CV1 will be correlated with CV2. Although a matrix-based simulation approach could have been implemented to specify these correlations, we want to fully replicate the previous procedure. As such, the correlations were specified by creating three normally distributed random variables – Cor1, Cor2, & Cor3 – and specifying a modifier consistent with the scale of each correlated measure. At the end of the syntax, it can be observed that the correlation between DV1 and DV2 is .349, between SCV1 and SCV2 is .615, and between CV1 and CV2 is .500. The remaining syntax for simulating this dataset is identical to the syntax reviewed above.

> # Simulating Data ####
> N<-500
> 
> set.seed(190329)
> Cor1<-rnorm(N,10,5)
> set.seed(475030)
> Cor2<-rnorm(N,10,5)
> set.seed(2839)
> Cor3<-rnorm(N,10,5)
> 
> set.seed(12)
> DV1<-trunc(rnorm(N,2.5,.75))+.20*Cor1 # Categorical Variable
> set.seed(88988)
> SCV1<-trunc(rnorm(N,35,5))+3.00*Cor2 # Semi-continuous Variable
> set.seed(29204)
> CV1<-trunc(rnorm(N,35000,5000))+600.00*Cor3 # Continuous Variable
> set.seed(81852)
> DV2<-trunc(rt(N,20,9))+.20*Cor1 # Categorical Variable
> set.seed(7802)
> SCV2<-rnorm(N,100,15)+3.00*Cor2 # Semi-continuous Variable
> set.seed(80528)
> CV2<-trunc(rnorm(N,2000,500))+600.00*Cor3 # Continuous Variable
> 
> Data<-data.frame(DV1,SCV1,CV1,DV2,SCV2,CV2)
> 
> cor(Data)
            DV1      SCV1        CV1      DV2     SCV2      CV2
DV1   1.0000000 -0.002795  0.0001513  0.34948  0.05366 0.003465
SCV1 -0.0027953  1.000000  0.0769116 -0.01702  0.61509 0.102824
CV1   0.0001513  0.076912  1.0000000 -0.00435  0.08313 0.500791
DV2   0.3494757 -0.017017 -0.0043496  1.00000 -0.01592 0.013729
SCV2  0.0536636  0.615095  0.0831273 -0.01592  1.00000 0.094371
CV2   0.0034648  0.102824  0.5007906  0.01373  0.09437 1.000000
>

Let’s briefly look at the covariate balance between the treatment and control group in the population.

> # Balance in Population ####
> 
> TMC<-NData[which(NData$TDI == 1),]
> CMC<-NData[which(NData$TDI == 0),]
> 
> ## Categorical Variable 1
> mean(TMC$DV1)-mean(CMC$DV1)
[1] 0.0337416438866
> sd(TMC$DV1)-sd(CMC$DV1)
[1] 0.00853147356446
> skew(TMC$DV1)-skew(CMC$DV1)
[1] 0.226833498225
> kurtosi(TMC$DV1)-kurtosi(CMC$DV1)
[1] 0.204367895678
> 
> ## Semi-continuous Variable 1
> mean(TMC$SCV1)-mean(CMC$SCV1)
[1] 0.0388165397047
> sd(TMC$SCV1)-sd(CMC$SCV1)
[1] -0.0131419623772
> skew(TMC$SCV1)-skew(CMC$SCV1)
[1] -0.0680779620419
> kurtosi(TMC$SCV1)-kurtosi(CMC$SCV1)
[1] -0.139721297574
> 
> ## Continuous Variable 1
> mean(TMC$CV1)-mean(CMC$CV1)
[1] 0.0456296040414
> sd(TMC$CV1)-sd(CMC$CV1)
[1] -0.00314906781317
> skew(TMC$CV1)-skew(CMC$CV1)
[1] 0.109999196363
> kurtosi(TMC$CV1)-kurtosi(CMC$CV1)
[1] -0.655826749024
> 
> ## Categorical Variable 2
> mean(TMC$DV2)-mean(CMC$DV2)
[1] 0.0212280287683
> sd(TMC$DV2)-sd(CMC$DV2)
[1] -0.00177946428811
> skew(TMC$DV2)-skew(CMC$DV2)
[1] 0.0767152882382
> kurtosi(TMC$DV2)-kurtosi(CMC$DV2)
[1] -0.337059434997
> 
> ## Semi-continuous Variable 2
> mean(TMC$SCV2)-mean(CMC$SCV2)
[1] 0.0406959963591
> sd(TMC$SCV2)-sd(CMC$SCV2)
[1] -0.00850165461959
> skew(TMC$SCV2)-skew(CMC$SCV2)
[1] -0.165084736323
> kurtosi(TMC$SCV2)-kurtosi(CMC$SCV2)
[1] 1.32308210618
> 
> ## Continuous Variable 2
> mean(TMC$CV2)-mean(CMC$CV2)
[1] 0.0420568552517
> sd(TMC$CV2)-sd(CMC$CV2)
[1] -0.00995314397575
> skew(TMC$CV2)-skew(CMC$CV2)
[1] -0.205429775993
> kurtosi(TMC$CV2)-kurtosi(CMC$CV2)
[1] -0.0619160541187
>

After simulating the data, we estimated the propensity scores using a binary logistic regression model where TDI was regressed on DV1, SCV1, and CV1, and we matched treatment to control cases using nearest neighbor one-to-one matching without replacement, in a random order, with a caliper of .05. For this example, we are imagining that DV2, SCV2, and CV2 are unobserved confounding variables. Similar to the previous matching procedures, the balancing statistics suggest that covariate balance between the treatment and control cases was improved through the matching procedure. In total, 176 treatment cases were matched to 176 control cases (53 treatment and 95 control cases were unmatched).

> # Matching (Nearest Neighbor, Random, 1 to 1) one time three variables ####
> 
> set.seed(1992)
> M1<-matchit(TDI~DV1+SCV1+CV1, data = NData,
+             method = "nearest", m.order = "random", ratio = 1, replace = FALSE, caliper = .05)
> 
> summary(M1)

Call:
matchit(formula = TDI ~ DV1 + SCV1 + CV1, data = NData, method = "nearest", 
    replace = FALSE, m.order = "random", caliper = 0.05, ratio = 1)

Summary of Balance for All Data:
         Means Treated Means Control Std. Mean Diff.  Var. Ratio   eCDF Mean    eCDF Max
distance   0.484152161   0.435900941     0.452302809 0.966748123 0.116784028 0.198069579
DV1        0.475533517   0.441791874     0.226537436 1.125210884 0.062082212 0.126283053
SCV1       0.555831785   0.517015245     0.259989751 0.844740443 0.069482202 0.129554134
CV1        0.520322834   0.474693230     0.309115298 0.958661108 0.085498703 0.154949322


Summary of Balance for Matched Data:
         Means Treated Means Control Std. Mean Diff.  Var. Ratio   eCDF Mean    eCDF Max
distance   0.461535412   0.461486889     0.000454846 0.998662028 0.003488636 0.022727273
DV1        0.462195936   0.453935359     0.055460541 1.295554532 0.025215909 0.085227273
SCV1       0.538088081   0.543085688    -0.033473528 1.027932685 0.012920455 0.051136364
CV1        0.498229792   0.501335847    -0.021041807 0.969276671 0.019238636 0.051136364
         Std. Pair Dist.
distance     0.011314482
DV1          0.959773730
SCV1         0.934464144
CV1          0.801981589

Percent Balance Improvement:
         Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
distance            99.9       96.0      97.0     88.5
DV1                 75.5     -119.5      59.4     32.5
SCV1                87.1       83.7      81.4     60.5
CV1                 93.2       26.1      77.5     67.0

Sample Sizes:
          Control Treated
All           271     229
Matched       176     176
Unmatched      95      53
Discarded       0       0

> 
> M1DF<-match.data(M1)

Now let’s compare the balance of DV2, SCV2, and CV2 across the treatment and control cases in the matched sample to the balance of DV2, SCV2, and CV2 in the population (presented above). As can be observed, the balance on DV2, SCV2, and CV2 was slightly – in the most lenient sense – improved even though these variables were not included in the matching algorithm. Specifically, by increasing covariate balance on DV1, SCV1, and CV1 through matching, we appear to have slightly increased the balance on unobserved confounders correlated with DV1, SCV1, and CV1.

> ## Balance in Matched Sample ####
> 
> TMC<-M1DF[which(M1DF$TDI == 1),]
> CMC<-M1DF[which(M1DF$TDI == 0),]
> 
> ## Categorical Variable 2
> mean(TMC$DV2)-mean(CMC$DV2)
[1] 0.00977063687439
> sd(TMC$DV2)-sd(CMC$DV2)
[1] -0.000645176717544
> skew(TMC$DV2)-skew(CMC$DV2)
[1] 0.321357846055
> kurtosi(TMC$DV2)-kurtosi(CMC$DV2)
[1] 0.0505753524938
> 
> ## Semi-continuous Variable 2
> mean(TMC$SCV2)-mean(CMC$SCV2)
[1] 0.0166459364913
> sd(TMC$SCV2)-sd(CMC$SCV2)
[1] -0.00293303275852
> skew(TMC$SCV2)-skew(CMC$SCV2)
[1] -0.208054692417
> kurtosi(TMC$SCV2)-kurtosi(CMC$SCV2)
[1] 1.37491680911
> 
> ## Continuous Variable 2
> mean(TMC$CV2)-mean(CMC$CV2)
[1] 0.0166663112137
> sd(TMC$CV2)-sd(CMC$CV2)
[1] -0.00614379797724
> skew(TMC$CV2)-skew(CMC$CV2)
[1] -0.20177342645
> kurtosi(TMC$CV2)-kurtosi(CMC$CV2)
[1] -0.29430151485
>

The slope coefficient from the post-matching regression of Y on TDI is nearly identical to the coefficient from the regression of Y on TDI while controlling for DV1, SCV1, and CV1 in the population (b = .5432 vs. b = .5434, a difference of less than .0002). Moreover, the SE is larger in the regression model estimated with the post-matching subsample due to the decreased sample size. Considering these findings, further exploration is needed to evaluate when propensity score matching increases balance on unobserved confounding variables correlated with observed confounding variables.

> ## Model In population
> summary(lm(Y~TDI+DV1+SCV1+CV1, data = NData))

Call:
lm(formula = Y ~ TDI + DV1 + SCV1 + CV1, data = NData)

Residuals:
          Min            1Q        Median            3Q           Max 
-0.5747297753 -0.1548270468 -0.0085725037  0.1500198213  0.7137669521 

Coefficients:
                Estimate   Std. Error  t value               Pr(>|t|)    
(Intercept) 0.6356451823 0.0569089938 11.16950 < 0.000000000000000222 ***
TDI         0.5433987554 0.0207573756 26.17859 < 0.000000000000000222 ***
DV1         1.3918260135 0.0700242584 19.87634 < 0.000000000000000222 ***
SCV1        1.5794027031 0.0646811223 24.41829 < 0.000000000000000222 ***
CV1         1.5450598064 0.0678146192 22.78358 < 0.000000000000000222 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.225554668 on 495 degrees of freedom
Multiple R-squared:  0.850719883,	Adjusted R-squared:  0.849513579 
F-statistic: 705.228452 on 4 and 495 DF,  p-value: < 0.0000000000000002220446

>

> ## Post matching Analysis ####
> 
> PMRM<-lm(Y~TDI, data = M1DF)
> summary(PMRM)

Call:
lm(formula = Y ~ TDI, data = M1DF)

Residuals:
          Min            1Q        Median            3Q           Max 
-1.0052264904 -0.2888802211 -0.0054558864  0.2895231483  1.2788161150 

Coefficients:
                Estimate   Std. Error  t value               Pr(>|t|)    
(Intercept) 2.9000252853 0.0314203802 92.29759 < 0.000000000000000222 ***
TDI         0.5432397987 0.0444351278 12.22546 < 0.000000000000000222 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.416838447 on 350 degrees of freedom
Multiple R-squared:  0.299245749,	Adjusted R-squared:  0.297243594 
F-statistic: 149.461829 on 1 and 350 DF,  p-value: < 0.0000000000000002220446

>

Conclusion

While a lot was discussed in this entry, I want to summarize everything with three points: (1) Propensity score matching should always increase adherence to the assumption of covariate balance, because that is what the technique is designed to do. If this does not occur, the matching procedure must be adjusted until covariate balance is substantively improved from the sample to the matched subsample. (2) A violation of the assumption of covariate balance in a propensity score matching analysis means that a confounder was left unadjusted during the matching procedure. (3) Propensity score matching – in theory – is valuable because it has the potential to adjust for unknown confounders by increasing balance across the treatment and control cases. Considering these takeaways, this entry was really intended to begin our introduction to propensity score matching by exploring an assumption violation that should rarely occur. As stated above, if you do not increase balance on observed confounders, your matching procedure failed, and you should probably go back to the drawing board. We will continue our exploration of propensity score matching by talking about the assumption of common support.


[i] This example is consistent with the estimation of a predicted probability in a binary logistic regression model. Additional complexity is added if the treatment includes more than two possible values.

[ii] Covariate balance is achieved when the distribution of key constructs is identical across the treatment and control groups.

[iii] A brief note on G.R.E.A.T.: it does not reduce involvement in gangs… D.A.R.E. doesn’t reduce drug use either (it might even increase drug use, honestly).

[iv] This is an assumption of propensity score matching that we will address in future entries.

License: Creative Commons Attribution 4.0 International (CC By 4.0)
