This is an issue that has bothered me for quite some time, so I finally decided to settle it with a blog post. I see people constantly confusing the two most common threats to causal inference: sample selection and endogeneity. This happens quite often in management research, for example, where it is common to recommend a sample selection model in order to deal with endogenous treatments. But the two concepts are far from equivalent. Have a look at the following graph, which describes a typical case of endogeneity.
We’re interested in the causal effect of X on Y. At the same time, we’re aware that there are confounding factors Z that affect both the treatment and the outcome. Unfortunately, controlling for Z in a regression won’t be enough, because there is at least one unobserved factor U that we can’t account for. In this case, the most common remedy would be to resort to an instrumental variable estimator.
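To make the endogeneity case concrete, here is a minimal simulation sketch. Everything numeric in it is an assumption of mine: the coefficients, the true effect of 1.0, and in particular the instrument `w`, which is a hypothetical variable that shifts X but has no direct path to Y. It only illustrates the logic of the graph and is not a recipe for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 1.0  # true effect of X on Y, chosen for this simulation

z = rng.normal(size=n)  # observed confounder Z
u = rng.normal(size=n)  # unobserved confounder U
w = rng.normal(size=n)  # hypothetical instrument W: moves X, no direct path to Y

x = 0.8 * w + 0.5 * z + u + rng.normal(size=n)
y = beta * x + 0.5 * z + u + rng.normal(size=n)


def ols(target, *regressors):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    return np.linalg.lstsq(design, target, rcond=None)[0]


# Regressing Y on X and Z ignores U, so the X coefficient is biased.
ols_beta = ols(y, x, z)[1]

# Two-stage least squares: replace X by its projection on W and Z.
a0, a_w, a_z = ols(x, w, z)
x_hat = a0 + a_w * w + a_z * z
iv_beta = ols(y, x_hat, z)[1]

print(f"OLS: {ols_beta:.2f}  2SLS: {iv_beta:.2f}  truth: {beta:.2f}")
```

In this setup the OLS estimate should land noticeably above the true value, while the 2SLS estimate should be close to it. Note that Y is observed for every unit here; nothing is missing.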
Sample selection is a very different problem though. A classic example in economics is by Jim Heckman, who studied the relationship between hours worked (X) and wages (Y) in a sample of women in 1967. The empirical challenge he addressed, which later got him the Nobel prize, is that we can only observe wages for those women who actually choose to work.¹ A corresponding graph would look like the following.
Note the selection node S with the double border. It represents a special variable that takes on just two values, indicating whether we observe Y for a particular unit in our sample. In Heckman’s example, S would be equal to one for women who choose to work, and zero for women who don’t.
In the graph, the decision to work is affected by the same kind of socio-economic factors (Z) that also determine wages (Y). A woman who chooses not to work does so because the wage she would be able to earn on the market wouldn’t compensate her for the opportunity costs of entering the labor force. According to economic theory, her reservation wage is then higher than her market wage. It’s very likely that this reservation wage is driven by unobservables U though (e.g., personal wealth, preferences for leisure, family situation, etc.), which might also affect salaries. This is denoted by the dashed bidirected arc in the graph. Therefore the sample of women who work (S=1) is a selected sample that isn’t representative of the entire population.²
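A small simulation sketch shows what "not representative" means here. To keep it short I leave X out and only look at how wages relate to one observed factor Z. All numbers are my own assumptions, including the correlation of 0.7 between the unobservable behind the work decision and the one behind wages.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
rho = 0.7  # assumed correlation between the two unobservables

z = rng.normal(size=n)    # observed socio-economic factor Z
u_s = rng.normal(size=n)  # unobservable behind the decision to work
u_y = rho * u_s + math.sqrt(1 - rho**2) * rng.normal(size=n)  # wage unobservable

wage = 1.0 + 1.0 * z + u_y  # potential wage, defined for everyone
works = (z + u_s) > 0       # S = 1: wage is actually observed


def slope(target, regressor):
    """Slope from a simple least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), regressor])
    return np.linalg.lstsq(design, target, rcond=None)[0][1]


full_slope = slope(wage, z)                      # infeasible: needs all wages
selected_slope = slope(wage[works], z[works])    # what the data allow

print(f"mean wage, everyone: {wage.mean():.2f}, workers only: {wage[works].mean():.2f}")
print(f"slope on Z, everyone: {full_slope:.2f}, workers only: {selected_slope:.2f}")
```

In this setup both the average wage and the slope on Z in the working subsample should differ clearly from their population counterparts, even though Z itself is fully observed.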
The important difference here is that in the presence of sample selection we’re not able to observe some variables for part of the population. In the endogeneity case, on the other hand, we do have full records; it’s just that a simple regression of Y on X (even when controlling for Z) won’t suffice. My hunch is that the confusion between the two concepts partly stems from imprecise language. Endogeneity is often a result of self-selection, meaning that individuals choose X based on unobservable factors U (see the first graph). U then confounds the relationship between X and Y and creates bias. Nevertheless, this self-selection isn’t equivalent to sample selection, because we can still observe Y for all individuals.
Recently, there was an entire paper in Strategic Management Journal devoted to Heckman selection models. But, if you ask me, it doesn’t help much to resolve the confusion either. The example research question the authors discuss is whether mergers and acquisitions (X) affect stock market reactions (Y). They claim that sample selection models are appropriate in this situation “since stock market reactions are only available for firms that actually complete acquisitions”. But is that correct? We also observe stock prices for firms that don’t acquire another company. Of course, we should be aware of the endogeneity of M&A decisions. But that is more appropriately dealt with by an IV approach than by a Heckman selection model.
One last word on this. It’s actually true that parametric models dealing with either sample selection or endogeneity look quite similar in the end. That’s because they propose similar solutions (i.e., control functions) for the two problems. But, for the sake of clarity, we had better keep these concepts apart. So here we have another example where, in my opinion, causal graphs make things very clear, whereas parametric models rather obscure matters.
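For the endogeneity side, the control function idea can be sketched in a few lines: estimate a first stage, then add its residual as an extra regressor in the outcome equation. As before, the instrument `w` and all coefficients are hypothetical choices of mine for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
beta = 1.0  # true effect of X on Y, chosen for this simulation

z = rng.normal(size=n)  # observed confounder Z
u = rng.normal(size=n)  # unobserved confounder U
w = rng.normal(size=n)  # hypothetical instrument W

x = 0.8 * w + 0.5 * z + u + rng.normal(size=n)
y = beta * x + 0.5 * z + u + rng.normal(size=n)


def ols(target, *regressors):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    return np.linalg.lstsq(design, target, rcond=None)[0]


# First stage: the residual carries the part of X that W and Z don't explain,
# which is where the influence of U sits.
g0, g_w, g_z = ols(x, w, z)
v_hat = x - (g0 + g_w * w + g_z * z)

# Second stage: the residual enters as a control function next to X and Z.
cf_beta = ols(y, x, z, v_hat)[1]

print(f"control function estimate: {cf_beta:.2f}  truth: {beta:.2f}")
```

The added regressor plays the same role that the selection correction term plays in a Heckman-type model, which is why the two estimators end up looking alike on paper even though the underlying problems differ.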
¹ Think of wages being defined as the salary you would make if you were forced to work. So wages aren’t necessarily zero for unemployed women.
² Heckman solved this problem by making very specific assumptions about the error structure in his model (joint normality) and including a function of the selection propensity score in the final estimation equation that controls for the preferential selection of units into the sample (Angrist 1997).
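A rough sketch of this two-step logic on simulated data, using only NumPy and the standard library. The data-generating process, the variable `k` (a hypothetical factor that affects the work decision but not wages), and the hand-rolled probit fit are all my own assumptions for illustration; this is not Heckman's original specification.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
rho = 0.7  # assumed correlation of the jointly normal unobservables

z = rng.normal(size=n)  # observed factor Z
k = rng.normal(size=n)  # hypothetical variable affecting selection only
u_s = rng.normal(size=n)
u_y = rho * u_s + math.sqrt(1 - rho**2) * rng.normal(size=n)

wage = 1.0 + 1.0 * z + u_y   # potential wage
works = (z + k + u_s) > 0    # S = 1: wage observed


def norm_pdf(t):
    return np.exp(-0.5 * t**2) / math.sqrt(2 * math.pi)


# Standard normal CDF via erfc, which stays accurate in the lower tail.
norm_cdf = np.vectorize(lambda t: 0.5 * math.erfc(-t / math.sqrt(2)))


def fit_probit(design, outcome, iters=25):
    """Probit coefficients by Fisher scoring."""
    coef = np.zeros(design.shape[1])
    for _ in range(iters):
        index = design @ coef
        p = np.clip(norm_cdf(index), 1e-10, 1 - 1e-10)
        dens = norm_pdf(index)
        weight = dens / (p * (1 - p))
        score = design.T @ (weight * (outcome - p))
        info = (design * (weight * dens)[:, None]).T @ design
        coef = coef + np.linalg.solve(info, score)
    return coef


def ols(target, *regressors):
    design = np.column_stack([np.ones(len(target)), *regressors])
    return np.linalg.lstsq(design, target, rcond=None)[0]


# Step 1: probit for selection, then the inverse Mills ratio per unit.
sel_design = np.column_stack([np.ones(n), z, k])
gamma = fit_probit(sel_design, works.astype(float))
sel_index = sel_design @ gamma
mills = norm_pdf(sel_index) / norm_cdf(sel_index)

# Step 2: wage regression on the selected sample, without and with the correction.
naive_z = ols(wage[works], z[works])[1]
_, heckman_z, mills_coef = ols(wage[works], z[works], mills[works])

print(f"slope on Z, naive: {naive_z:.2f}, corrected: {heckman_z:.2f}, truth: 1.00")
```

Under these assumptions the naive slope should be attenuated, while the corrected slope should be close to the true value of 1.0. The correction leans heavily on joint normality; with a different error distribution the same code could give misleading results.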