Entry 7: The Exclusion of Confounders (Confounder Bias)

The results of the looped simulations were updated on 04/04/21. An oversight occurred in the initial looped simulation where all of the sample sizes equaled 10,000 cases. The results did not change by permitting the sample size to vary between 100 and 1,000. The interpretations made when the association between X and Y was specified to equal 1.00 were updated on 04/21/2021 to better represent the information in the figures.

Introduction (PDF & R-Code)

Unlike research with experimental data, research with observational data requires us to consider the potential mechanisms confounding the association between two variables of interest. Briefly, a confounder is a variable that causes variation in two or more distinct constructs. For example, height can cause variation in partners’ height, while simultaneously causing variation in the likelihood of playing professional sports. Considering that we cannot randomly assign partners’ height (unless you somehow create a love potion), if we were interested in the association between playing professional sports and partners’ height we would have to adjust our model – or condition our analysis – upon the height of the participant. This is because height in the current scenario represents a confounder and can bias the association between partners’ height and playing professional sports.

Numerous observed and unobserved confounders can exist when studying any association with observational data. When these confounders are not adjusted for, they can upwardly bias the slope coefficient – generate a slope coefficient further from zero than reality – and make us believe an association exists when in reality the association is spurious. Spurious in the current context describes a statistical association that only exists due to an unadjusted confounder. Nevertheless, when an association does exist (e.g., playing professional sports does influence partner’s height), not adjusting for a confounder can upwardly or downwardly (generate a slope coefficient closer to zero than reality) bias the slope coefficient. Importantly, whether the slope coefficient is upwardly or downwardly biased, as well as the direction – positive or negative – and magnitude of the bias, is conditional upon the true slope coefficient of the association of interest and the direction and magnitude of the effects of the confounder on the constructs being examined.

Visualizing The Directionality of Slope Coefficient Confounder Bias

Although defined above, the best method for identifying and determining the bias generated by an unadjusted confounder is through visualization, or more formally a Directed Acyclic Graph (DAG). Panel A of Figure 1 provides an illustration of a confounder, or C, causally influencing (indicated by the solid single-headed arrow) variation in X and variation in Y. When the influence of the confounder is left unadjusted, a covariance – illustrated using the double-headed dashed arrow – between X and Y exists, when in reality X and Y are unrelated. Briefly, this covariance exists because X and Y share variation, but the shared variation is only attributable to the common cause (i.e., C) rather than X causing variation in Y or Y causing variation in X. In this scenario, a linear regression model will produce upwardly biased slope coefficients for the association between X and Y when the model does not adjust for the influence of the confounder on Y. That is, the estimates will be further from zero than reality, in turn increasing the likelihood of committing a Type 1 error and rejecting the null hypothesis when in reality we should have retained the null hypothesis.

Nevertheless, let’s consider that the slope coefficient of the true association between X and Y is not zero, but rather 1. In this scenario, not adjusting for the effects of a confounder in a statistical model could downwardly or upwardly bias the slope coefficients. Similar to the previous scenario, upward bias is generated by a confounder increasing the covariation between X and Y, where a portion of the covariation is attributable to the causal pathway between X and Y and a portion of the covariation is attributable to the common cause (i.e., C). Focusing on Panel B of Figure 1, an unadjusted for confounder can upwardly bias slope coefficients – in this scenario make the slope coefficient larger than 1 – when increased scores on the confounder cause scores on both X and Y to increase or decrease simultaneously. The type of confounder described above acts as an amplifier, where the confounded association is stronger than the unconfounded association.

Confounders, however, can also upwardly or downwardly bias the observed slope coefficients by differentially influencing scores on X and Y. Specifically, when a confounder (C) negatively influences scores on X but positively influences scores on Y (or the reverse), the confounded association will possess a slope coefficient smaller than 1. Under certain conditions, the slope coefficient can be downwardly biased (closer to zero than 1 in absolute value) and range between -.99 and .99, or upwardly biased in the opposite direction (i.e., b < -1.00; further from zero than 1 in absolute value). The downward bias is generated by the confounder reducing the covariation between X and Y, while the upward bias in the opposite direction is generated by the confounder increasing the covariation between X and Y in the opposite direction. Whether a downwardly or upwardly biased slope coefficient is observed is conditional upon the magnitude of the differential effects the confounder has on X and Y.
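The dependence on magnitude described above can be made concrete with the standard omitted-variable result for a linear model. The sketch below is an illustration rather than part of the original analysis: the function name expected_slope and its arguments are hypothetical, and it assumes X = a*C + e_x and Y = b*X + g*C + e_y with C and the error terms mutually independent.

```r
# Expected slope from regressing Y on X alone, assuming
#   X = a*C + e_x   and   Y = b*X + g*C + e_y
# with C, e_x and e_y mutually independent:
#   cov(X, Y) / var(X) = b + g * a * var(C) / (a^2 * var(C) + var(e_x))
expected_slope <- function(b, a, g, var_c, var_ex) {
  b + (g * a * var_c) / (a^2 * var_c + var_ex)
}

# True slope of 1, confounder variance of 6.25 (SD 2.5), residual variance in X of 25 (SD 5)
expected_slope(b = 1, a = 2,  g = 2,  var_c = 6.25, var_ex = 25)  # amplified: 1.5
expected_slope(b = 1, a = -2, g = 2,  var_c = 6.25, var_ex = 25)  # attenuated: 0.5
expected_slope(b = 1, a = -2, g = 6,  var_c = 6.25, var_ex = 25)  # sign flipped: -0.5
expected_slope(b = 1, a = -2, g = 10, var_c = 6.25, var_ex = 25)  # beyond -1: -1.5
```

Under these assumptions, only the size of the confounder's effect on Y changes across the last three calls, yet the expected slope moves from downwardly biased, to the wrong sign, to upwardly biased in the opposite direction.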

This, contrary to popular belief, means that not adjusting for a confounder in a statistical model can increase the likelihood of committing a Type 2 error and/or produce an estimate resulting in a distinct interpretation from the true association between X and Y. Considering these effects, it is extremely important to understand how confounders could bias the results of a statistical model. As such, let’s conduct a detailed exploration of the bias generated by not adjusting for confounders in a linear regression model.

True Association Between X and Y = 0

The examples below are specified in a manner where X has no causal influence on Y. As such, the true slope coefficient between X and Y is equal to zero. Nevertheless, we specified that the association between X and Y will be confounded by C, where the confounder will cause variation in both X and Y. Importantly, the direction of influence (e.g., positive or negative specification) of the confounder on X and/or Y will vary across our simulations. Although not adjusting for the confounder will inevitably generate an upward bias in the slope coefficient (slope coefficients further from zero than reality [the true b = 0]), the direction of the bias will vary depending upon the direction of the effects of the confounder on X and Y.

Confounder (+,+)

Let’s start with the confounder having a positive influence on both X and Y, where increased (or decreased) scores on C cause increased (or decreased) scores on X and Y simultaneously. In this scenario, as specified in the code below, we simulated the confounder or c to be a normally distributed variable with a mean of 0 and a SD of 2.5. Our sample size for this simulation was 500 cases. After simulating the confounder, X and Y were specified in an identical manner (excluding the different set.seed), where a 1 point increase (or decrease) in C corresponded to a 2 point increase (or decrease) in X and a 2 point increase (or decrease) in Y. The confounder was specified to explain ~ 50 percent of the variation in both X and Y. The remaining – or residual – variation was drawn from a normal distribution with a mean of 0 and a SD of 2.5 and multiplied by 2 in the code, so that it contributes as much variation as the confounder. After simulating the three constructs, a dataframe was created and titled Data (creative, I know).

## Simulating a Confounder (+,+) ####

n<-500 # Sample size
set.seed(1001) # Seed
c<-rnorm(n,0,2.5) # Confounder: mean 0, SD 2.5
set.seed(42125) # Seed
x<-2*c+2*rnorm(n,0,2.5) # Specification of the independent variable
set.seed(31) # Seed
y<-2*c+2*rnorm(n,0,2.5) # Specification of the dependent variable (no x term: true slope = 0)

Data<-data.frame(c,x,y) # Combine the three constructs

Using the data, we then estimated the unconfounded association (Panel A; i.e., the model adjusted for the influence of the confounder) and confounded association (Panel B; i.e., the model not adjusted for the influence of the confounder). As demonstrated, the unconfounded model produced a slope coefficient of .017 (Panel A) and suggested that X has no influence on Y (p = .701). The slope coefficient is not exactly zero because of sampling variability in the random draws produced by the selected seeds. Unsurprisingly, the confounded model produced a slope coefficient of .463 (Panel B) and suggested that X has a strong and statistically significant positive influence on Y (p < .001).
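The two models described above can be sketched as follows. The model calls themselves are not shown in the entry, so lm(y ~ x + c) for the adjusted model and lm(y ~ x) for the unadjusted model are assumptions, as are the object names M_adjusted and M_confounded. The data are re-created with the same seeds so the block runs on its own.

```r
# Re-create the (+,+) data with the seeds used in the simulation above
n <- 500
set.seed(1001)
c <- rnorm(n, 0, 2.5)
set.seed(42125)
x <- 2 * c + 2 * rnorm(n, 0, 2.5)
set.seed(31)
y <- 2 * c + 2 * rnorm(n, 0, 2.5)
Data <- data.frame(c, x, y)

# Unconfounded model: conditions on the confounder
M_adjusted <- lm(y ~ x + c, data = Data)
# Confounded model: omits the confounder
M_confounded <- lm(y ~ x, data = Data)

# Slope of y on x from each model
b_adjusted <- unname(coef(M_adjusted)["x"])
b_confounded <- unname(coef(M_confounded)["x"])

# Estimates, standard errors, and p-values for both models
summary(M_adjusted)$coefficients
summary(M_confounded)$coefficients
```

Comparing the row for x across the two coefficient tables shows how much of the unadjusted slope is attributable to the common cause.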

These findings, however, only represent the selected scenario. As such, using the code below, we replicated the simulation 10,000 times. In this looped simulation, we allow the number of cases to randomly vary (on a uniform distribution) between 100 and 1000. Additionally, we permit the strength of the association between the confounder and X, and the confounder and Y, to be randomly specified as any value between 1 and 100. Moreover, the influence of the residual variation in X and Y was randomly specified as any value between 1 and 100. The residual variation was specified to be normally distributed with a mean randomly selected from any value between -5 and 5 and a standard deviation randomly selected from any value between 1 and 5. All of the random specifications were conducted by drawing a single value from a uniform distribution – where all values have an equal likelihood of being selected – using the runif command in R. Importantly, the specified association between X and Y remained zero across all of the simulations.

The loop was specified using the foreach package, which permits parallel processing and the aggregation of data from each loop into a single dataframe. Briefly, the foreach package is extremely useful for running a simulation analysis like the one specified below. For the current loops, we recorded the run number (represented by i) and the estimated slope coefficient of the confounded association. Conducting a simulation in the manner below permits us to develop a comprehensive understanding of the bias generated by not adjusting for confounders that have a positive influence on both X and Y across randomly specified scenarios.

Brief Note: I just want to take a second to briefly note that this analysis might appear complex, but we are simply replacing the values in the code above (i.e., 2) with values generated by the computer between thresholds we set. Moreover, the loop just replicates the process as many times as we decide.

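library(foreach)     # foreach() and %dopar%
library(doParallel)  # Parallel backend for %dopar%
registerDoParallel() # Register the default parallel backend

n<-10000 # Number of simulated scenarios

# lm.beta is loaded on each worker, so it must be installed
DATA1 = foreach (i=1:n, .packages='lm.beta', .combine=rbind) %dopar%
  {
    N<-sample(100:1000, 1) # Sample size for this scenario
    c<-rnorm(N,runif(1,-5,5),runif(1,1,5)) # Confounder
    x<-runif(1,1,100)*c+runif(1,1,100)*rnorm(N,runif(1,-5,5),runif(1,1,5))
    y<-runif(1,1,100)*c+runif(1,1,100)*rnorm(N,runif(1,-5,5),runif(1,1,5))
    Data<-data.frame(c,x,y)
    
    M<-lm(y~x, data = Data) # Confounded model (c omitted)
    
    bXY<-M$coefficients[2] # Slope of y on x

    Distance<-0-bXY # Distance from the true slope of 0 (not returned)
    
    data.frame(i,bXY) # Row combined into DATA1 by rbind
    
  }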

The findings of the simulation loop are provided in the figure below and overwhelmingly indicated that the confounded slope coefficient will be a positive value. This suggests that confounders that have a positive influence on both the independent and dependent variable will commonly produce an upwardly biased, positive slope coefficient when the influence of the confounder is not adjusted for in the model. Moreover, the magnitude of the bias that exists in the estimate depends upon the magnitude of the effects the confounder has on the variation in both X and Y.
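For readers who want a numeric summary to go with the figure, the sketch below re-runs the (+,+) scenario sequentially in base R. It is not the author's loop: the helper one_run, the seed, and the reduced count of 1,000 replications are assumptions made to keep the example quick and free of package dependencies.

```r
# Hypothetical helper: one randomly specified (+,+) scenario with a true slope of 0
one_run <- function() {
  N <- sample(100:1000, 1)                        # sample size for this scenario
  c <- rnorm(N, runif(1, -5, 5), runif(1, 1, 5))  # confounder
  x <- runif(1, 1, 100) * c +
    runif(1, 1, 100) * rnorm(N, runif(1, -5, 5), runif(1, 1, 5))
  y <- runif(1, 1, 100) * c +
    runif(1, 1, 100) * rnorm(N, runif(1, -5, 5), runif(1, 1, 5))
  unname(coef(lm(y ~ x))[2])                      # confounded slope of y on x
}

set.seed(2021)  # arbitrary seed, chosen only for reproducibility
bXY <- replicate(1000, one_run())

mean(bXY > 0)                    # share of confounded slopes that are positive
quantile(bXY, c(.05, .5, .95))   # spread of the confounded slopes
```

The share of positive slopes is the quantity the figure conveys visually, and the quantiles give a rough sense of how far from zero the confounded estimates fall.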

Now let’s replicate the analysis a couple of times, but alter the direction of the influence of the confounder on X and Y.

Confounder (-,+)

For this replication, everything remained the same excluding the influence of the confounder (C) on X. Specifically, now every one-point increase (or decrease) in the confounder was associated with a two-point decrease (or increase) in X. The code for this replication (and the subsequent replications) is provided above. Although the unconfounded model produced a slope coefficient identical to the previous example (b = .017; p = .701), the confounded model produced a negative slope coefficient (b = -.446; p < .001). Considering that the current example is identical to the previous example except for one difference, the findings suggest that the direction of the influence of the confounder on X influences the direction of the upward bias observed in the slope coefficient of the confounded association.
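As a hedged sketch of this replication, the block below flips only the sign of the confounder's effect on X and assumes the seeds and all other values are reused from the (+,+) example. The object names b_adj and b_conf are hypothetical.

```r
n <- 500
set.seed(1001)
c <- rnorm(n, 0, 2.5)
set.seed(42125)
x <- -2 * c + 2 * rnorm(n, 0, 2.5)  # only change: the sign on c is now negative
set.seed(31)
y <- 2 * c + 2 * rnorm(n, 0, 2.5)

b_adj <- unname(coef(lm(y ~ x + c))["x"])  # adjusted for the confounder
b_conf <- unname(coef(lm(y ~ x))["x"])     # confounder omitted
c(adjusted = b_adj, confounded = b_conf)
```

The adjusted slope is unchanged by the sign flip because, once C is held constant, the only variation left in X is its residual term, which is the same in both specifications.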

That interpretation was reconfirmed by the looped analysis. Specifically, distinct from the previous example, the findings of the simulation loop – provided in the figure below – overwhelmingly indicated that the confounded slope coefficient will be a negative value. This suggests that confounders that have a negative influence on the independent variable, but a positive influence on the dependent variable, will commonly produce an upwardly biased, negative slope coefficient when the influence of the confounder is not adjusted for in the model.

Confounder (+,-)

To observe the effects of the opposite specification, the current replication was specified where a one-point increase (or decrease) in the confounder (C) was associated with a two-point decrease (or increase) in Y. Similar to the previous example, the negative association between the confounder and Y generated a negative confounded association between X and Y (Panel B). Specifically, the findings suggested that a one-point increase (or decrease) in X was associated with a .497 decrease (or increase) in Y (p < .001). The unconfounded association (Panel A) produced a slope coefficient identical to the previous examples (b = .017; p = .701).

Switching to the looped simulation analysis, the findings indicated that the confounded slope coefficient will be a negative value. This suggests that confounders that have a negative influence on the dependent variable, but a positive influence on the independent variable, will commonly produce an upwardly biased, negative slope coefficient when the influence of the confounder is not adjusted for in the model.

Confounder (-,-)

Focusing on a confounder with a negative causal influence on both X (the independent variable) and Y (the dependent variable), the results of the single simulation and the looped simulation demonstrated that the confounded slope coefficient will commonly be upwardly biased – further from zero – and a positive value.

This suggests that confounders that have a negative influence on both the independent and dependent variable will commonly produce an upwardly biased – further from zero – and positive slope coefficient when the influence of the confounder is not adjusted for in the model.
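The pattern across the four specifications can be checked side by side. The sketch below is an illustration only: the helper confounded_slope and the table layout are hypothetical, and it assumes each specification reuses the seeds and magnitudes of the single-simulation example, changing nothing but the signs.

```r
# Hypothetical helper: confounded slope for given signs of C -> X and C -> Y
confounded_slope <- function(sign_x, sign_y, n = 500) {
  set.seed(1001)
  c <- rnorm(n, 0, 2.5)
  set.seed(42125)
  x <- sign_x * 2 * c + 2 * rnorm(n, 0, 2.5)
  set.seed(31)
  y <- sign_y * 2 * c + 2 * rnorm(n, 0, 2.5)  # no x term: true slope = 0
  unname(coef(lm(y ~ x))["x"])
}

# All four sign combinations: (+,+), (-,+), (+,-), (-,-)
signs <- expand.grid(sign_x = c(1, -1), sign_y = c(1, -1))
signs$bXY <- mapply(confounded_slope, signs$sign_x, signs$sign_y)
signs
```

In this sketch the sign of the confounded slope follows the product of the two signs: matching signs give a positive slope and opposing signs give a negative one.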

Intermission Discussion

So …  you might be asking yourself why the directionality of the bias is important when the association between X and Y is confounded (the model does not adjust for the influence of the confounder or confounders). When X has no influence on Y, not adjusting for the influence of a confounder will upwardly bias the association of interest because – logically – you cannot have an estimate closer to zero than the true association (b = 0). Nevertheless, when X does influence Y, not adjusting for the influence of a confounder can upwardly or downwardly bias the slope coefficient. This means that under certain conditions we can actually commit a Type 2 error – retaining the null hypothesis, when in reality we should have rejected the null hypothesis – by not adjusting for the influence of a confounder. Moreover, we can also commit a Type S error (estimating a slope with the wrong sign) and generate interpretations distinct from the true association when a confounder is not adjusted for in our statistical models. Let’s demonstrate using looped simulations.
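Before turning to the loops, a single data set can illustrate how a confounder might mask a real association. The block below is a sketch under stated assumptions: the coefficients -2 (C on X) and 4 (C on Y) are illustrative choices that do not appear in the entry, picked so that the confounding roughly cancels a true slope of 1, and the seeds are borrowed from the earlier single-simulation example.

```r
# Single (-,+) example with a true X -> Y slope of 1.
# The coefficients -2 and 4 are illustrative, not values from the entry.
n <- 500
set.seed(1001)
c <- rnorm(n, 0, 2.5)
set.seed(42125)
x <- -2 * c + 2 * rnorm(n, 0, 2.5)
set.seed(31)
y <- 1 * x + 4 * c + 2 * rnorm(n, 0, 2.5)

b_adjusted <- unname(coef(lm(y ~ x + c))["x"])  # expected to sit near the true slope of 1
b_confounded <- unname(coef(lm(y ~ x))["x"])    # expected to be pulled toward 0
c(adjusted = b_adjusted, confounded = b_confounded)
```

With these values the expected confounded slope is 1 + (4 * -2 * 6.25) / (4 * 6.25 + 25) = 0, which is the kind of scenario in which a Type 2 error becomes likely.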

True Association Between X and Y = 1

Following the simulations above, we have conducted four looped simulations. Importantly, X was specified to be positively associated with Y, with a slope coefficient of 1.00. This slope coefficient is statistically significant for continuous constructs with a mean of 5 (or -5) and a SD = 5 at a sample size of 100, which are the extreme values of our simulation thresholds. Moreover, the only difference between the simulations is the direction of the influence of the confounder on X and Y across the 10,000 randomly specified scenarios. Previous analyses demonstrated that the standard error (SE) for the confounded association between X and Y was on average .05 (this is conservative at N = 100, but generous at N = 1,000). As such, in each figure we identified the Region of Nullification, otherwise known as the scenarios where we could potentially commit a Type 2 error. As demonstrated below, the direction of influence the confounder has on X and Y dictates whether the confounded association between X and Y crosses into the Region of Nullification.

Confounder (+,+)

For the current example, the confounder (C) was specified to have a positive influence on variation in both Y and X, while the slope coefficient between X and Y was specified to be 1. As demonstrated by the findings, when C has a positive influence on the variation in both Y and X, the confounded slope coefficient will generally be upwardly biased and a positive value. On some occasions the slope coefficient can be closer to zero – increasing the likelihood of a Type 2 error – or a negative value. But generally, the interpretation of the estimated association – confounded or unconfounded – will be that X has a positive influence on Y, but the magnitude of the association will vary conditional upon the magnitude of the effects of the confounder on X and Y.

Confounder (-,+)

Altering the specification, the confounder (C) was now designated to have a positive influence on the variation in Y, but a negative influence on the variation in X. The slope coefficient between X and Y remained 1. As demonstrated by the findings, when the confounder has a positive influence on the variation in Y but a negative influence on the variation in X, the confounded slope coefficient could be upwardly or downwardly biased, as well as a positive or negative value. Moreover, a subset of the random scenarios produced slope coefficients that fell within the Region of Nullification. Considering this evidence, the interpretation of the confounded association could vary depending upon the magnitude of the negative effects of the confounder on X and the magnitude of the positive effects of the confounder on Y. Three distinct interpretations can be drawn from the confounded association: X has a statistically significant negative influence on Y, X and Y are unrelated, and X has a statistically significant positive influence on Y. Only the last interpretation captures the true association between X and Y. 
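The three interpretations can be tallied in a small sequential sketch. This is not the author's loop: the helper one_run, the seed, the count of 500 replications, and the use of p < .05 from each fitted model (rather than the Region of Nullification drawn in the figure) are all assumptions, as is the way the true slope of 1 enters the equation for y.

```r
# Hypothetical helper: one (-,+) scenario with a true X -> Y slope of 1,
# classified by the sign and significance of the confounded slope
one_run <- function() {
  N <- sample(100:1000, 1)
  c <- rnorm(N, runif(1, -5, 5), runif(1, 1, 5))
  x <- -runif(1, 1, 100) * c +
    runif(1, 1, 100) * rnorm(N, runif(1, -5, 5), runif(1, 1, 5))
  y <- 1 * x + runif(1, 1, 100) * c +
    runif(1, 1, 100) * rnorm(N, runif(1, -5, 5), runif(1, 1, 5))
  fit <- summary(lm(y ~ x))$coefficients
  b <- fit["x", "Estimate"]
  p <- fit["x", "Pr(>|t|)"]
  if (p >= .05) "null" else if (b < 0) "negative" else "positive"
}

set.seed(2021)  # arbitrary seed, chosen only for reproducibility
counts <- table(replicate(500, one_run()))
counts
```

Only the "positive" category matches the true association; the other two correspond to the Type S and Type 2 errors discussed above.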

Confounder (+,-)

Opposite of the previous specification, the confounder (C) was now designated to have a positive influence on the variation in X, but a negative influence on the variation in Y. The slope coefficient between X and Y remained 1. The findings of the randomly specified 10,000 simulations were almost identical to the previous specification. That is, when the confounder had a positive influence on the variation in X but a negative influence on the variation in Y, the confounded slope coefficient could be upwardly or downwardly biased, as well as a positive or negative value. Moreover, a subset of the random scenarios produced slope coefficients that fell within the Region of Nullification. Similar to the previous example, three distinct interpretations can be drawn from the confounded association: X has a statistically significant negative influence on Y, X and Y are unrelated, and X has a statistically significant positive influence on Y. Again, only the last interpretation captures the true association between X and Y. 

Confounder (-,-)

Finally, when the confounder was specified to have a negative influence on the variation in both X and Y, the results were identical to the simulated loops where the confounder was specified to have a positive influence on the variation in both X and Y. Specifically, while on some occasions the slope coefficient can be closer to zero – increasing the likelihood of a Type 2 error – or a negative value, the interpretation of the association – confounded or unconfounded – will generally be that X has a positive influence on Y, but the magnitude of the association will vary conditional upon the magnitude of the effects of the confounder on X and Y.

Conclusion

Confounder bias is taught in statistics courses across a wide variety of fields due to the increased likelihood of committing a Type 1 error. While this is correct (not adjusting for the influence of a confounder can only upwardly bias the slope coefficient of null associations), it is important to remember that confounders can upwardly or downwardly bias the slope coefficient of variables that are causally associated. Moreover, under some circumstances we can commit a Type 2 error (retain the null hypothesis, when in reality we should have rejected the null hypothesis) or observe a slope coefficient that is negative (or positive) when the true association is positive (or negative; a Type S error). These effects are conditional upon the direction and the magnitude of the influence of a confounder on both the independent and the dependent variables of interest. While the conclusion remains the same as we have been previously taught (always make efforts to adjust for confounders in our methodological or statistical designs), the intricacies of confounder bias should be taught more broadly. If there is nothing else taken away from this entry: not adjusting for confounders can generate statistically significant slope coefficients when in reality no association exists between the variables of interest, but it can also generate null slope coefficients or slope coefficients in the opposite direction when in reality an association exists between the variables of interest!

License: Creative Commons Attribution 4.0 International (CC By 4.0)
