Sample Selection Vs. Selection Into Treatment

This is an issue that has bothered me for quite some time, so I finally decided to settle it with a blog post. I constantly see people confusing two of the most common threats to causal inference: sample selection and endogeneity. This happens quite often in management research, for example, where it is common to recommend a sample selection model to deal with endogenous treatments. But the two concepts are far from equivalent. Have a look at the following graph, which describes a typical case of endogeneity.

[Figure: endogeneity]

We’re interested in the causal effect of X on Y. At the same time, we’re aware that there are confounding factors Z that affect both the treatment and the outcome. Unfortunately, controlling for Z in a regression won’t be enough, because there is at least one unobserved factor U that we can’t account for. In this case, the most common remedy would be to resort to an instrumental variable estimator.

Sample selection is a very different problem though. A classic example in economics is by Jim Heckman, who studied the relationship between hours worked (X) and wages (Y) in a sample of women in 1967. The empirical challenge he addressed, which later got him the Nobel prize, is that we can only observe wages for those women who actually choose to work.¹ A corresponding graph would look like the following.

[Figure: sample selection]

Note the selection node S with the double border. It represents a special variable that takes on just two values, indicating whether we observe Y for a particular unit in our sample. In Heckman’s example, S would be equal to one for women who choose to work, and zero for those who don’t.

In the graph, the decision to work is affected by the same kind of socio-economic factors (Z) that also determine wages (Y). A woman who chooses not to work does so because the wage she would be able to earn on the market wouldn’t compensate her for the opportunity costs of entering the labor force. According to economic theory, her reservation wage is then higher than her market wage. It’s very likely, though, that this reservation wage is driven by unobservables U (e.g., personal wealth, preferences for leisure, family situation), which might also affect salaries. This is denoted by the dashed bidirected arc in the graph. Therefore, the sample of women who work (S=1) is a selected sample that isn’t representative of the entire population.²

The important difference here is that in the presence of sample selection we’re not able to observe some variables for part of the population. In the endogeneity case, on the other hand, we do have full records; it’s just that a simple regression of Y on X (even when controlling for Z) won’t suffice. My hunch is that the confusion between the two concepts partly stems from imprecise language. Endogeneity is often a result of self-selection, meaning that individuals choose X based on unobservable factors U (see the first graph). U then confounds the relationship between X and Y and creates bias. Nevertheless, this self-selection isn’t equivalent to sample selection, because we can still observe Y for all individuals.
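To make the contrast concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the linear equations, the standard normal noise, the true effect of 1.0, and in particular the choice to let selection S depend on X as well as on U. In the first case Y is recorded for every unit but X is confounded by U. In the second case X is exogenous, but Y is only recorded when S = 1.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n, beta = 20000, 1.0  # beta is the (assumed) true causal effect of X on Y

# Case 1: endogeneity. U drives both X and Y; Y is observed for everyone.
u = [rng.gauss(0, 1) for _ in range(n)]
x_endo = [ui + rng.gauss(0, 1) for ui in u]
y_endo = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x_endo, u)]
slope_endo = ols_slope(x_endo, y_endo)

# Case 2: sample selection. X is exogenous, but Y is recorded only if S = 1.
# Assumption: S depends on X and on the same unobservable U that drives Y.
u2 = [rng.gauss(0, 1) for _ in range(n)]
x_sel = [rng.gauss(0, 1) for _ in range(n)]
y_sel = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x_sel, u2)]
s = [xi + ui + rng.gauss(0, 1) > 0 for xi, ui in zip(x_sel, u2)]

# Regression on the full population (infeasible in practice, Y is missing for S = 0).
slope_full = ols_slope(x_sel, y_sel)
# Regression on the units we actually get to see.
slope_selected = ols_slope(
    [xi for xi, keep in zip(x_sel, s) if keep],
    [yi for yi, keep in zip(y_sel, s) if keep],
)

print(f"endogeneity, all units observed: {slope_endo:.2f}")
print(f"selection, full population:      {slope_full:.2f}")
print(f"selection, S = 1 only:           {slope_selected:.2f}")
```

Under these assumptions the first regression overstates the effect although no observation is missing, while the third understates it although X is as good as randomly assigned. The bias has a different source in each case, which is the point of keeping the two problems apart.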

Recently, there was an entire paper in Strategic Management Journal devoted to Heckman selection models. But, if you ask me, it doesn’t do much to resolve the confusion either. The example research question the authors discuss is whether mergers and acquisitions (X) affect stock market reactions (Y). They claim that sample selection models are appropriate in this situation “since stock market reactions are only available for firms that actually complete acquisitions”. But is that correct? We also observe stock prices for firms that don’t acquire another company. Of course, we should be aware of the endogeneity of M&A decisions. But that is more appropriately dealt with by an IV approach than by a Heckman selection model.

One last word on this. It’s actually true that parametric models dealing with either sample selection or endogeneity look quite similar in the end. That’s because they propose similar solutions (i.e., control functions) for the two problems. But, for the sake of clarity, we should keep these concepts apart. So here we have another example where, in my opinion, causal graphs make things very clear, whereas parametric models tend to obscure matters.
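The similarity of the two control function solutions can be sketched in a few lines. This is an illustration under strong assumptions, not a recipe: the data-generating process, the instrument z, and the excluded selection shifter w are all made up, all errors are normal, and the selection index is plugged in at its true value, whereas in practice it would come from a first-stage probit. In both cases the fix has the same shape: add one extra regressor that stands in for the part of U correlated with the regressors.

```python
import math
import random

def slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

def residuals(x, y):
    """Residuals from a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def slope_given_control(x, y, c):
    """Coefficient on x when regressing y on x and c (Frisch-Waugh partialling)."""
    return slope(residuals(c, x), residuals(c, y))

def inverse_mills(a):
    """phi(a) / Phi(a) for the standard normal distribution."""
    pdf = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(a / math.sqrt(2)))
    return pdf / cdf

rng = random.Random(1)
n, beta = 20000, 1.0  # beta is the (assumed) true effect of X on Y

# Endogeneity: control function = first-stage residual of X on a hypothetical instrument z.
u = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]
naive_endo = slope(x, y)
v_hat = residuals(z, x)  # first stage: regress x on z, keep the residual
cf_endo = slope_given_control(x, y, v_hat)

# Sample selection: control function = inverse Mills ratio of the selection index.
u2 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
w = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical variable that shifts S but not Y
y2 = [beta * xi + ui + rng.gauss(0, 1) for xi, ui in zip(x2, u2)]
keep = [xi + wi + ui + rng.gauss(0, 1) > 0 for xi, wi, ui in zip(x2, w, u2)]
xs = [xi for xi, k in zip(x2, keep) if k]
ys = [yi for yi, k in zip(y2, keep) if k]
# True probit index: the selection error u + noise has variance 2, so scale by sqrt(2).
lam = [inverse_mills((xi + wi) / math.sqrt(2)) for xi, wi, k in zip(x2, w, keep) if k]
naive_sel = slope(xs, ys)
cf_sel = slope_given_control(xs, ys, lam)

print(f"endogeneity: naive {naive_endo:.2f}, with control function {cf_endo:.2f}")
print(f"selection:   naive {naive_sel:.2f}, with control function {cf_sel:.2f}")
```

The code for the two cases is nearly interchangeable, which may be part of why the problems get mixed up. What differs is the input each one needs: an instrument for X in the first case, and a model of who ends up in the sample in the second.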


¹ Think of wages as being defined as the salary you would earn if you were forced to work. So wages aren’t necessarily zero for unemployed women.

² Heckman solved this problem by making very specific assumptions about the error structure in his model (joint normality) and including a function of the selection propensity score in the final estimation equation that controls for the preferential selection of units into the sample (Angrist 1997).

Why you shouldn’t control for post-treatment variables in your regression

This is a slight variation on a theme I already blogged about some time ago. But I recently had a discussion with a colleague and thought it would be worthwhile to share my notes here. So what might go wrong if you control for post-treatment variables in your statistical model?

Microsoft Releases New Python Library for Causal Inference

A while ago I blogged about Facebook’s causal inference group. Now Microsoft has followed suit and released a Python library for graph-based methods of causal inference.

The Origins of Graphical Causal Models

Here is an interesting bit of intellectual history. In his 2000 book “Causality”, Judea Pearl describes how he got to the initial idea that sparked the development of causal inference based on directed acyclic graphs.

No Free Lunch in Causal Inference

Last week I was teaching about graphical models of causation at a summer school in Montenegro. You can find my slides and accompanying R code in the teaching section of this page. It was lots of fun and I got great feedback from students. After the workshop we had stimulating discussions about the usefulness of this new approach to causal inference in economics and business. I’d like to pick up one of those points here, as this is an argument I frequently hear when talking to people with a classical econometrics training.

Becoming More Different Over Time

In my class we recently discussed a paper by Higgins and Rodriguez (2006), published in the Journal of Financial Economics, that contains an important lesson for researchers who want to apply the difference-in-differences (DiD) method in competition analysis and merger control.