Sample Selection Vs. Selection Into Treatment

This is an issue that bothered me for quite some time, so I finally decided to settle it with a blog post. I see people constantly confusing the two most common threats to causal inference: sample selection and endogeneity. This happens quite often in management research, for example, where it is common to recommend a sample selection model in order to deal with endogenous treatments. But the two concepts are far from equivalent. Have a look at the following graph, which describes a typical case of endogeneity.

[Figure: endogeneity]

We're interested in the causal effect of X on Y. At the same time, we're aware that there are confounding factors Z that affect both the treatment and the outcome. Unfortunately, however, controlling for Z in a regression won't be enough, because there is at least one unobserved factor U that we can't account for. In this case, the most common remedy is to resort to an instrumental variable estimator.
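To see what the IV remedy buys us, here is a minimal simulation sketch. All numbers are made up, and W is a hypothetical instrument (it isn't part of the graph above): it shifts X but has no direct path to Y.

```python
# Toy illustration of endogeneity and the IV fix (illustrative numbers).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                        # unobserved confounder U
w = rng.normal(size=n)                        # instrument: affects X, not Y directly
x = 1.0 * w + 1.0 * u + rng.normal(size=n)    # treatment, endogenous via U
y = 2.0 * x + 1.0 * u + rng.normal(size=n)    # true causal effect of X is 2

# Naive OLS of Y on X picks up the confounding through U.
ols = np.cov(x, y)[0, 1] / np.var(x)

# IV/Wald estimate: ratio of reduced-form to first-stage covariances
# (equivalent to 2SLS with a single instrument here).
iv = np.cov(w, y)[0, 1] / np.cov(w, x)[0, 1]

print(f"OLS estimate: {ols:.2f}")   # biased upward, well above 2
print(f"IV  estimate: {iv:.2f}")    # close to the true effect of 2
```

Note that controlling for Z is beside the point here: even with Z fully observed, the open path through U is what biases OLS, and only the instrument removes it.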

Sample selection is a very different problem though. A classic example in economics comes from Jim Heckman, who studied the relationship between hours worked (X) and wages (Y) in a sample of women in 1967. The empirical challenge he addressed (which later earned him the Nobel prize) is that we can only observe wages for those women who actually choose to work.¹ A corresponding graph would look like the following.

[Figure: sample_selection]

Note the selection node S with the double border. It represents a special variable that takes on just two values, indicating whether we observe Y for a particular unit in our sample. In Heckman's example, S would be equal to one for women who choose to work, and zero for unemployed individuals.

In the graph, the decision to work is affected by the same kind of socio-economic factors (Z) that also determine wages (Y). A woman who chooses not to work does so because the wage she would be able to earn on the market wouldn't compensate her for the opportunity costs of staying out of the labor force. According to economic theory, her reservation wage is then higher than her market wage. It's very likely, though, that this reservation wage is driven by unobservables U (e.g., personal wealth, preferences for leisure, family situation, etc.), which might also affect salaries. This is denoted by the dashed bidirected arc in the graph. Therefore, the sample of women who work (S=1) is a selected sample that isn't representative of the entire population.²

The important difference here is that in the presence of sample selection we're not able to observe some variables for part of the population. In the endogeneity case, on the other hand, we do have full records; it's just that a simple regression of Y on X (even when controlling for Z) won't suffice. My hunch is that the confusion between the two concepts partly stems from imprecise language. Endogeneity is often a result of self-selection, meaning that individuals choose X based on unobservable factors U (see the first graph). U then confounds the relationship between X and Y and creates bias. Nevertheless, this self-selection isn't equivalent to sample selection, because we can still observe Y for all individuals.
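The sample selection case can be sketched in a few lines as well. All coefficients and the error correlation below are made-up assumptions: Y is only recorded for units with S = 1, and the selection error v is correlated with the outcome error (the analogue of the dashed arc in the graph), so OLS on the observed sample is biased even though we could otherwise estimate the slope without trouble.

```python
# Toy simulation of sample selection bias (illustrative numbers).
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)                       # regressor of interest
z = rng.normal(size=n)                       # shifts selection only
v = rng.normal(size=n)                       # selection error
eps = 0.8 * v + 0.6 * rng.normal(size=n)     # outcome error, corr(eps, v) = 0.8
y = 1.0 + 2.0 * x + eps                      # true slope is 2
s = z + 0.7 * x + v > 0                      # S = 1: Y is observed

# OLS slope on the full population (infeasible with real data) vs. on the
# selected sample (the only regression we can actually run):
slope_full = np.cov(x, y)[0, 1] / np.var(x)
slope_sel = np.cov(x[s], y[s])[0, 1] / np.var(x[s])
print(f"full sample:     {slope_full:.2f}")   # close to 2
print(f"selected sample: {slope_sel:.2f}")    # noticeably below 2
```

The contrast with the endogeneity graph is exactly the point: here X is exogenous in the population, and the bias enters purely through which rows of the data we get to see.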

Recently, an entire paper in the Strategic Management Journal was devoted to Heckman selection models. But, if you ask me, it doesn't help much to resolve the confusion either. The example research question the authors discuss is whether mergers and acquisitions (X) affect stock market reactions (Y). They claim that sample selection models are appropriate in this situation “since stock market reactions are only available for firms that actually complete acquisitions”. But is that correct? We also observe stock prices for firms that don't acquire another company. Of course, we should be aware of the endogeneity of M&A decisions. But that is more appropriately dealt with by an IV approach than by a Heckman selection model.

One last word on this. It's actually true that parametric models dealing with either sample selection or endogeneity end up looking quite similar. That's because they propose similar solutions (i.e., control functions) for the two problems. But, for the sake of clarity, we'd better keep these concepts apart. So here we have another example where, in my opinion, causal graphs make things very clear, whereas parametric models rather obscure matters.


¹ Think of wages being defined as the salary you would make if you were forced to work. So wages aren’t necessarily zero for unemployed women.

² Heckman solved this problem by making very specific assumptions about the error structure in his model (joint normality) and including a function of the selection propensity score in the final estimation equation that controls for the preferential selection of units into the sample (Angrist 1997).
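A sketch of the two-step procedure described in this footnote, under illustrative assumptions: joint-normal errors, made-up coefficients, and a variable Z that shifts selection but not the outcome. Step 1 fits a probit for selection on the full sample; step 2 adds the inverse Mills ratio as a control function in the outcome regression on the selected sample.

```python
# Heckman-style two-step correction on simulated data (illustrative numbers).
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 50_000
x = rng.normal(size=n)
z = rng.normal(size=n)                       # exclusion restriction
v = rng.normal(size=n)                       # selection error
eps = 0.8 * v + 0.6 * rng.normal(size=n)     # outcome error, corr(eps, v) = 0.8
y = 1.0 + 2.0 * x + eps                      # true outcome coefficients (1, 2)
Xs = np.column_stack([np.ones(n), x, z])
s = (Xs @ np.array([0.2, 0.5, 1.0]) + v > 0).astype(float)   # selection equation

# Step 1: probit of S on (1, X, Z) by maximum likelihood, full sample.
def nll(g):
    idx = Xs @ g
    return -np.sum(s * norm.logcdf(idx) + (1 - s) * norm.logcdf(-idx))

ghat = minimize(nll, np.zeros(3), method="BFGS").x
imr = norm.pdf(Xs @ ghat) / norm.cdf(Xs @ ghat)    # inverse Mills ratio

# Step 2: OLS of Y on (1, X, IMR) using only the selected observations.
sel = s == 1
X2 = np.column_stack([np.ones(sel.sum()), x[sel], imr[sel]])
beta = np.linalg.lstsq(X2, y[sel], rcond=None)[0]
print(beta)   # roughly (1, 2, 0.8); the IMR coefficient absorbs corr(eps, v)
```

Dropping the IMR column from the second stage reproduces the biased slope from a naive regression on the selected sample, which is the whole point of the control function.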

Why you shouldn’t control for post-treatment variables in your regression

This is a slight variation on a theme I was already blogging about some time ago. But I recently had a discussion with a colleague and thought it would be worthwhile to share my notes here. So what might go wrong if you control for post-treatment variables in your statistical model? Continue reading Why you shouldn’t control for post-treatment variables in your regression

Econometrics and the “not invented here” syndrome: suggestive evidence from the causal graph literature

[This post requires some knowledge of directed acyclic graphs (DAG) and causal inference. Providing an introduction to the topic goes beyond the scope of this blog though. But you can have a look at a recent paper of mine in which I describe this method in more detail.]

Graphical models of causation, most notably associated with the name of computer scientist Judea Pearl, received a lot of pushback from the grandees of econometrics. Heckman had his famous debate with Pearl, arguing that economics can look back on its own tradition of causal inference, dating back to Haavelmo, and that we don’t need DAGs. Continue reading Econometrics and the “not invented here” syndrome: suggestive evidence from the causal graph literature

Why Tobit models are overused

In my field of research we’re often running regressions with innovation expenditures or sales from new products on the left-hand side. Usually we observe many zeros for these variables, because many firms do not invest in R&D at all and therefore also do not come up with new products. Many researchers then feel inclined to use Tobit models. But frankly, I never understood why. Continue reading Why Tobit models are overused

Follow-up on “IV regressions without instruments” (technical)

Some time ago I wrote about a paper by Arthur Lewbel in the Journal of Business & Economic Statistics in which he develops a method to do two-stage least squares regressions without actually having an exclusion restriction in the model. The approach relies on higher-moment restrictions on the error matrix and works well for linear or partly linear models. Back then, I expressed concerns that the estimator does not seem to work when an endogenous regressor is binary; at least, not in the simulations I carried out.

After a bit of email back-and-forth, we were now able to settle the debate. Continue reading Follow-up on “IV regressions without instruments” (technical)

IV regressions without instruments (technical)

Arthur Lewbel published a very interesting paper back in 2012 in the Journal of Business & Economic Statistics (ungated version here). The paper attracted quite some attention because it lays out a method to do two-stage least squares regressions (in order to identify causal effects) without the need for an outside instrumental variable. Continue reading IV regressions without instruments (technical)

Econometrics: When Everybody is Different

Nowadays everybody is talking about heterogeneous treatment effects, that is, responses to an economic stimulus that vary across individuals in a population. So far, however, the discussion has concentrated on the instrumental variable setting, in which a randomized (natural or administered) experiment affects the treatment status of a so-called complier population. An average of the individual treatment effects can only be estimated for this group of compliers; for the always-takers and never-takers, we cannot say anything. But if individual treatment responses differ for everybody in the population, how can we be sure that what we’re estimating for the compliers is representative of the whole population? Continue reading Econometrics: When Everybody is Different
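The complier logic in this teaser can be illustrated with a short simulation. All type shares and effect sizes below are made up; the point is only that the Wald/IV estimand recovers the average effect for compliers, which need not equal the population-wide average effect.

```python
# LATE vs. population ATE under heterogeneous effects (illustrative numbers).
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
# compliance types: 20% always-takers, 30% never-takers, 50% compliers
typ = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.integers(0, 2, size=n)                  # randomized instrument
d = np.where(typ == "always", 1,
     np.where(typ == "never", 0, z))            # treatment take-up

# individual treatment effects differ by type
tau = np.where(typ == "always", 3.0,
       np.where(typ == "never", 1.0, 2.0))
y = tau * d + rng.normal(size=n)

# Wald estimator: reduced form divided by the first stage
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(f"Wald/IV estimate: {wald:.2f}")   # ~2.0, the complier effect
print(f"Population ATE:   {tau.mean():.2f}")   # ~1.9, a different number
```

Here the instrument never moves the always- and never-takers, so their (larger and smaller) effects simply drop out of the Wald ratio; nothing in the data tells us whether the complier effect generalizes to them.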