Why you shouldn’t control for post-treatment variables in your regression

This is a slight variation on a theme I blogged about some time ago. But I recently had a discussion with a colleague and thought it would be worthwhile to share my notes here. So what might go wrong if you control for post-treatment variables in your statistical model? From economists you might hear something like “a post-treatment variable is endogenous, and you shouldn’t control for endogenous variables”. But that’s too vague an answer. Take, for example, the following setting, which is inspired by my own research.


There is a large literature documenting that firms which are predominantly owned by single families often invest less in innovation and R&D. One reason is that family owners often want to keep tight control over the company, which leads to conservatism in their decision-making and a reluctance to involve outside investors. This makes it harder for them to finance risky and expensive R&D projects. In the long run, this will obviously affect performance if family firms invest too little in new product development and in optimizing their production processes.

Now imagine I want to analyze the relationship between family ownership and firm performance, but I’m not really interested in R&D expenditures. Maybe I want to investigate another phenomenon related to family owners, and the innovation aspect is already too well researched to be of interest. At the same time, I know that R&D expenditures are an important factor and that I had better account for them. So if the graph above is indeed the correct model, I could include R&D expenditures in my regression to hold them constant across firms. Any effect of family ownership that runs through innovation spending would then be shut off, and I could focus on whatever else I’m interested in. This works even though R&D expenditures are an endogenous, post-treatment variable in the model (you can see this by the arrow that points into it).
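To make this concrete, here is a minimal simulation sketch of the situation without any unobserved confounder. Everything in it is my own assumption for illustration: the linear functional form, the coefficient values, and the idea that family ownership is as good as randomly assigned. None of it is estimated from real data.

```python
import random

random.seed(0)
N = 50_000

# Hypothetical structural coefficients, chosen only for illustration.
A_FAMILY_TO_RD = -1.0     # family ownership lowers R&D spending
B_RD_TO_PERF = 2.0        # R&D spending raises performance
C_FAMILY_TO_PERF = 0.5    # all other channels from family ownership to performance


def demean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ols_one(y, x):
    """Slope of y on a single regressor (intercept handled by demeaning)."""
    y, x = demean(y), demean(x)
    return dot(x, y) / dot(x, x)


def ols_two(y, x1, x2):
    """Slopes of y on two regressors, solving the 2x2 normal equations."""
    y, x1, x2 = demean(y), demean(x1), demean(x2)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Simulate the graph: family -> rd -> perf, plus a direct arrow family -> perf.
family = [float(random.random() < 0.5) for _ in range(N)]
rd = [A_FAMILY_TO_RD * f + random.gauss(0.0, 1.0) for f in family]
perf = [
    C_FAMILY_TO_PERF * f + B_RD_TO_PERF * r + random.gauss(0.0, 1.0)
    for f, r in zip(family, rd)
]

total_effect = ols_one(perf, family)                   # R&D left out
direct_effect, rd_effect = ols_two(perf, family, rd)   # R&D held constant

print(f"family ownership, R&D not controlled: {total_effect:.2f}")
print(f"family ownership, R&D controlled:     {direct_effect:.2f}")
```

With these made-up numbers, the first regression should land close to the total effect of 0.5 + 2.0 × (−1.0) = −1.5, and the second close to 0.5, the part of the effect that does not run through R&D.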

So far no problem. But the situation changes dramatically if we add an unobserved confounder.

Now R&D expenditures become what we call a collider variable, because two arrows, one emitted from family ownership and the other from U, meet in it. Colliders are a tricky business because they open up biasing paths if we control for them (here: family ownership → R&D expenditures ⇠ U ⇢ firm performance). They are almost like landmines in empirical research. Left alone they do no harm, but once you condition on them you end up with estimation bias.

Unfortunately, it’s quite likely that there are unobserved variables that affect both R&D expenditures and firm performance, such as general management quality or a particularly well-trained workforce. Therefore, collider bias is a huge risk when controlling for post-treatment variables.
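A small simulation sketch shows how this plays out once an unobserved U enters the picture. As before, the linear model, the coefficient values, and the random assignment of family ownership are assumptions I make purely for illustration. U could stand for something like management quality, which we simulate here but pretend we cannot observe.

```python
import random

random.seed(1)
N = 50_000

# Hypothetical structural coefficients, chosen only for illustration.
A_FAMILY_TO_RD = -1.0     # family ownership lowers R&D spending
B_RD_TO_PERF = 2.0        # R&D spending raises performance
C_FAMILY_TO_PERF = 0.5    # all other channels from family ownership to performance
G_U_TO_RD = 1.0           # unobserved U raises R&D spending
D_U_TO_PERF = 1.5         # unobserved U raises performance


def demean(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ols_one(y, x):
    """Slope of y on a single regressor (intercept handled by demeaning)."""
    y, x = demean(y), demean(x)
    return dot(x, y) / dot(x, x)


def ols_two(y, x1, x2):
    """Slopes of y on two regressors, solving the 2x2 normal equations."""
    y, x1, x2 = demean(y), demean(x1), demean(x2)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


family = [float(random.random() < 0.5) for _ in range(N)]
u = [random.gauss(0.0, 1.0) for _ in range(N)]  # never used as a regressor
rd = [
    A_FAMILY_TO_RD * f + G_U_TO_RD * ui + random.gauss(0.0, 1.0)
    for f, ui in zip(family, u)
]
perf = [
    C_FAMILY_TO_PERF * f + B_RD_TO_PERF * r + D_U_TO_PERF * ui + random.gauss(0.0, 1.0)
    for f, r, ui in zip(family, rd, u)
]

total_effect = ols_one(perf, family)                       # collider left alone
controlled_family, controlled_rd = ols_two(perf, family, rd)  # collider conditioned on

print(f"family ownership, R&D not controlled: {total_effect:.2f}")
print(f"family ownership, R&D controlled:     {controlled_family:.2f}")
```

Leaving R&D out still recovers the total effect of about −1.5, because family ownership is independent of U in this setup. Controlling for R&D, however, no longer returns the 0.5 that does not run through R&D. With these particular numbers the coefficient should end up near 1.25 instead, since conditioning on the collider lets U leak into the family ownership coefficient.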

Sometimes people operate under the vague notion that if they include post-treatment variables in a regression they could get at a causal effect “net of this variable”. Or put differently, they think they could say something about the different causal mechanisms that are at play. This view is tricky, if not mistaken, because if there are unobserved confounders between the outcome and the intermediate variable, like in the example above, it’s impossible to keep the two causal mechanisms apart. That’s exactly why mediation analysis is so hard. Kosuke Imai and coauthors have developed a sensitivity analysis, which allows you to check whether your mediation results remain robust to unobserved confounding between the mediator and the outcome. If you’re interested in mediation analysis, you can also check out this previous post of mine.
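For a linear version of the problem, the bias can be written down explicitly. The following is only a sketch under assumptions of my own: the symbols are introduced just for this illustration, with T for the treatment (family ownership), M for the intermediate variable (R&D expenditures), Y for the outcome, and U for the unobserved confounder. T, U, and the error terms are taken to be mutually uncorrelated.

```latex
\begin{align*}
M &= aT + gU + \varepsilon_M, \\
Y &= cT + bM + dU + \varepsilon_Y .
\end{align*}
% Since M - aT = gU + \varepsilon_M, the best linear predictor of U given (T, M) is
\begin{align*}
\lambda (M - aT), \qquad
\lambda = \frac{g\,\sigma_U^2}{g^2 \sigma_U^2 + \sigma_{\varepsilon_M}^2} .
\end{align*}
% Substituting this for U in the outcome equation gives the OLS limits of
% the regression of Y on T and M:
\begin{align*}
\operatorname{plim} \hat\beta_T &= c - a\,d\,\lambda, \\
\operatorname{plim} \hat\beta_M &= b + d\,\lambda .
\end{align*}
```

Both coefficients are off by a term proportional to dλ, so neither the “net of M” effect c nor the mechanism through M is recovered. The bias disappears only if U does not affect the outcome (d = 0) or does not affect the intermediate variable (g = 0, so that λ = 0).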

In any case, stay safe and watch out for collider bias!


Microsoft Releases New Python Library for Causal Inference

A while ago I blogged about Facebook’s causal inference group. Now Microsoft has followed suit and released a Python library for graph-based methods of causal inference.

The Origins of Graphical Causal Models

Here is an interesting bit of intellectual history. In his 2000 book “Causality”, Judea Pearl describes how he got to the initial idea that sparked the development of causal inference based on directed acyclic graphs.

No Free Lunch in Causal Inference

Last week I was teaching about graphical models of causation at a summer school in Montenegro. You can find my slides and accompanying R code in the teaching section of this page. It was lots of fun and I got great feedback from students. After the workshop we had stimulating discussions about the usefulness of this new approach to causal inference in economics and business. I’d like to pick up one of those points here, as this is an argument I frequently hear when talking to people with classical econometrics training.

Becoming More Different Over Time

In my class we recently discussed a paper by Higgins and Rodriguez (2006), published in the Journal of Financial Economics, that contains an important lesson for researchers who want to apply the difference-in-differences (DiD) method in competition analysis and merger control.

A plea for simple theories

[This is the second part of a fair copy of a recent Twitter thread of mine. I suggest you have a look at part 1 about nonlinear mediation analysis first. Otherwise, it might be hard to follow this post.]

Understanding causal effects is tough, but understanding causal mechanisms is even tougher. When we try to understand mechanisms, we move beyond the question of whether a certain causal effect exists and ask instead how the effect comes about.