Econometrics: When Everybody is Different

Nowadays everybody is talking about heterogeneous treatment effects, that is, responses to an economic stimulus that vary across individuals in a population. So far, however, the discussion has concentrated on the instrumental variable setting, where a randomized (natural or administered) experiment affects the treatment status of a so-called complier population. An average of the individual treatment effects can only be estimated for this group of compliers; for the always-takers and never-takers we cannot say anything. But if individual treatment responses differ for everybody in the population, how can we be sure that what we’re estimating for the compliers is representative of the whole population?

That is the standard local average treatment effect (LATE) debate, which played out in the literature some years ago (here is a short, not too technical introduction). But so far the implications of heterogeneous treatment effects have been ignored for the most basic and widely used econometric technique, linear regression. A new paper by Aronow and Samii picks up the slack*. They remind us that a linear regression does not, in fact, produce an estimate of the average treatment effect for the population we’re studying. Rather, each individual’s treatment response is weighted, and the resulting average effect is only representative of an “effective sample”. This sample, as the paper shows, can be spectacularly different from the population we’re actually interested in. Consequently, regression methods have no advantage over instrumental variable techniques in terms of generalizability or external validity.

Damn, that hits hard! Even with a representative sample at hand and no omitted variables or other sources of endogeneity, we’re not able to identify a representative average effect when using the workhorse model of econometrics. What’s happening here? The result becomes less frightening when we remind ourselves of the ordinary least squares (OLS) arithmetic that we all learned in our first econometrics class. Take the model

Y_i = \alpha + \beta X_i + \varepsilon_i.    (1)

The fitted regression line always passes through the point of means (\bar{X}, \bar{Y}); see Figure 1. The OLS estimate of \beta is given by

\hat{\beta} = \frac{Cov(X_i, Y_i)}{Var(X_i)}.
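As a quick numerical check of this formula, here is a minimal sketch in Python with NumPy. The simulated data and the helper names `ols_slope` and `ols_intercept` are illustrative assumptions, not taken from the paper or the figures; `np.polyfit` serves only as an independent reference fit.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the simple regression of y on x, computed as Cov(x, y) / Var(x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors in the sample covariance and variance cancel.
    return (dx * dy).sum() / (dx ** 2).sum()

def ols_intercept(x, y):
    """Intercept chosen so that the fitted line passes through (mean(x), mean(y))."""
    return np.mean(y) - ols_slope(x, y) * np.mean(x)

# Hypothetical data: a homogeneous slope of 0.5 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(size=200)

slope = ols_slope(x, y)
intercept = ols_intercept(x, y)

# Reference fit: polyfit returns coefficients from highest degree down.
slope_ref, intercept_ref = np.polyfit(x, y, 1)
print(slope, slope_ref)
print(intercept, intercept_ref)
```

The two pairs of numbers should agree up to floating-point error, and the fitted line passes through the point of means by construction.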

The crucial detail is already hidden in this formula. The numerator weights each observation by its deviation X_i - \bar{X}, and the result is scaled by the variance of X. That means that points whose X-value lies further away from \bar{X} exert more influence on the slope coefficient estimate. I tried to illustrate this fact in Figure 2, where I decreased the Y-value of one data point at a time: once for a point on the boundary of the support of X and once for an interior point. In the first case the slope coefficient changes drastically, flipping from positive to negative. In the second case only the intercept’s estimate is affected significantly.
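The same experiment can be sketched numerically. The data below are hypothetical (they are not the points behind Figure 2), and `slope_weights` and `fit_after_drop` are illustrative helper names. The sketch uses the fact that the OLS slope is a linear combination of the Y-values, with weights (X_i - \bar{X}) / \sum_j (X_j - \bar{X})^2, so a point at \bar{X} has zero weight in the slope.

```python
import numpy as np

def slope_weights(x):
    """Weights w_i such that the OLS slope equals sum_i w_i * y_i."""
    x = np.asarray(x, dtype=float)
    dx = x - x.mean()
    return dx / (dx ** 2).sum()

# Hypothetical data lying exactly on the line y = 2 + 0.5 x.
x = np.arange(1.0, 12.0)          # 1, 2, ..., 11 with mean 6
y = 2.0 + 0.5 * x
w = slope_weights(x)

base_slope = w @ y
base_intercept = y.mean() - base_slope * x.mean()

def fit_after_drop(k, delta):
    """Slope and intercept after lowering the k-th Y-value by delta."""
    y_new = y.copy()
    y_new[k] -= delta
    slope = w @ y_new
    intercept = y_new.mean() - slope * x.mean()
    return slope, intercept

# Lower a boundary point (x = 11) and an interior point (x = 6, the mean).
edge_slope, edge_intercept = fit_after_drop(len(x) - 1, 20.0)
mid_slope, mid_intercept = fit_after_drop(5, 20.0)

print(base_slope, base_intercept)
print(edge_slope, edge_intercept)   # slope turns negative
print(mid_slope, mid_intercept)     # slope unchanged, intercept shifts down
```

Moving the boundary point flips the sign of the slope, while moving the point at the mean of X leaves the slope untouched and only shifts the intercept.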

The same logic, which ultimately stems from minimizing squared residuals, carries over to the multiple regression context with a binary treatment, which Aronow and Samii are interested in. Even after adjusting for covariates, the least squares coefficient is still influenced more by data points further away from the center of the data. Hence the weighting by the conditional variance of the treatment variable in equation (7) of Aronow and Samii’s paper. All this is of course no problem if you’re willing to assume, as I did in equation (1) and most textbooks do, that treatment responses \beta are the same for every individual. The weighting induced by OLS then poses no problems and even renders the estimator efficient in the mean-squared-error sense.
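To make the weighting concrete, here is a minimal sketch for a special case in which it can be checked exactly in-sample: a binary treatment and a single discrete covariate entered as a full set of stratum dummies. In that case the OLS coefficient on the treatment equals an average of the stratum-level treated-minus-control differences, weighted by n_g p_g (1 - p_g), where p_g is the treated share in stratum g. The strata, treatment shares, and effects below are hypothetical values chosen for illustration, and this is not a reproduction of the paper's general result.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: three equally sized strata with different treatment
# shares and different (heterogeneous) treatment effects.
n_per_group = np.array([400, 400, 400])
p_treat = np.array([0.5, 0.1, 0.9])
tau = np.array([1.0, 3.0, 5.0])

g = np.repeat(np.arange(3), n_per_group)
d = (rng.uniform(size=g.size) < p_treat[g]).astype(float)
y = 0.5 * g + tau[g] * d + rng.normal(size=g.size)

# OLS of y on the treatment and a full set of stratum dummies.
dummies = (g[:, None] == np.arange(3)).astype(float)
X = np.column_stack([d, dummies])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0][0]

# Stratum sizes, treated shares, and treated-minus-control mean differences.
n_g = np.bincount(g)
p_g = np.bincount(g, weights=d) / n_g
diff_g = np.array([
    y[(g == k) & (d == 1)].mean() - y[(g == k) & (d == 0)].mean()
    for k in range(3)
])

# Weights implied by OLS versus plain population shares.
w_ols = n_g * p_g * (1 - p_g)
w_ols = w_ols / w_ols.sum()
w_pop = n_g / n_g.sum()

beta_weighted = w_ols @ diff_g   # matches beta_ols up to rounding
ate_pop = w_pop @ diff_g         # population-share-weighted average

print(w_ols, w_pop)
print(beta_ols, beta_weighted, ate_pop)
```

In this setup the stratum with a treated share near one half dominates the OLS estimate even though all strata are equally large, so the regression coefficient is pulled toward that stratum's effect and away from the population-share-weighted average.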

But the research community these days is eager to relax the homogeneous treatment response assumption and, I’d say, rightly so. We then should bear in mind that the difference between a \beta and a \beta_i (note the subscript!) in the population equation is not neutral. And we should either investigate the properties of our “effective sample”, as Aronow and Samii propose, or resort to estimation methods with less restrictive functional form assumptions, such as inverse probability weighting or matching.
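For comparison, inverse probability weighting can be sketched in a few lines. The example assumes a binary treatment, a single discrete covariate, and a propensity score estimated as the treated share within each stratum; the data and the helper name `ipw_ate` are hypothetical. With these choices the estimator reduces exactly to the stratum differences in means weighted by the strata's sample shares.

```python
import numpy as np

def ipw_ate(y, d, pscore):
    """Horvitz-Thompson style IPW estimate of the average treatment effect."""
    y, d, pscore = (np.asarray(a, dtype=float) for a in (y, d, pscore))
    return np.mean(d * y / pscore - (1 - d) * y / (1 - pscore))

rng = np.random.default_rng(2)

# Hypothetical data: three equally sized strata, so the true average effect
# is the simple mean of tau, i.e. 3.0.
g = np.repeat(np.arange(3), 500)
p_true = np.array([0.5, 0.2, 0.8])
tau = np.array([1.0, 3.0, 5.0])
d = (rng.uniform(size=g.size) < p_true[g]).astype(float)
y = 0.5 * g + tau[g] * d + rng.normal(size=g.size)

# Propensity score estimated as the treated share within each stratum.
n_g = np.bincount(g)
p_hat = (np.bincount(g, weights=d) / n_g)[g]
ate_ipw = ipw_ate(y, d, p_hat)

# Equivalent stratified estimator: differences in means, sample-share weights.
diff_g = np.array([
    y[(g == k) & (d == 1)].mean() - y[(g == k) & (d == 0)].mean()
    for k in range(3)
])
ate_strat = (n_g / n_g.sum()) @ diff_g

print(ate_ipw, ate_strat)
```

Each stratum here counts according to its sample share rather than the variance of the treatment within it. The price is that strata with treated shares close to zero or one make the estimate noisy.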

* Some might disagree that their results are new because they’re building heavily on Angrist and Krueger (1999) and Angrist and Pischke (2009, chapter 3). I think their real contribution lies in highlighting the problem to practitioners in a very illustrative manner.
