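Arthur Lewbel published a very interesting paper back in 2012 in the Journal of Business & Economic Statistics (ungated version here). The paper attracted quite some attention because it lays out a method to run two-stage least squares regressions (in order to identify causal effects) without the need for an outside instrumental variable. Consider a triangular model

$$Y_1 = X'\beta_1 + Y_2\gamma_1 + \varepsilon_1, \qquad \varepsilon_1 = \alpha_1 U + V_1 \tag{1}$$

$$Y_2 = X'\beta_2 + \varepsilon_2, \qquad \varepsilon_2 = \alpha_2 U + V_2 \tag{2}$$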
The common factor $U$ (think about the textbook example of unobserved ability in a wage regression) creates a correlation between the errors that leads to an endogeneity problem when estimating (1). You can see that there is no exclusion restriction available in equation (2), because $X$ appears in both lines. Nevertheless, it is possible to estimate the parameters in (1) consistently when the following two assumptions are fulfilled

$$\mathrm{Cov}(Z, \varepsilon_2^2) \neq 0 \tag{A1}$$

$$\mathrm{Cov}(Z, \varepsilon_1 \varepsilon_2) = 0 \tag{A2}$$
$Z$ is an observed random vector, which can be (but doesn’t have to be) a subset of the regressor vector $X$. (A2) places restrictions on the covariance matrix of the model errors, which are satisfied in the above case of a common unobserved factor. In addition, the method requires heteroskedasticity in $\varepsilon_2$ (in both $\varepsilon_1$ and $\varepsilon_2$ for non-triangular models), which arises frequently in applied work.
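Lewbel’s method works like a charm in simulation studies. However, it was developed for linear models (footnote 1). But what happens if you have a binary endogenous variable? Let’s consider $Y_2$ being Probit

$$Y_2 = 1[X'\beta_2 + \varepsilon_2 > 0]$$

with 1[…] being the indicator function and $\varepsilon_2 = \alpha_2 U + V_2$ as before. $\varepsilon_2$ has to be standard normal such that, for independent $U$ and $V_2$, it has to hold that $\alpha_2^2\,\mathrm{Var}(U) + \mathrm{Var}(V_2) = 1$. Note that

$$E[Y_2 \mid X] = \Phi(X'\beta_2)$$

and we can rewrite equation (4) with additive error

$$Y_2 = \Phi(X'\beta_2) + \tilde{\varepsilon}_2$$

with $\tilde{\varepsilon}_2 = Y_2 - \Phi(X'\beta_2)$, which is a function of $X$! Intuitively, the additive $\tilde{\varepsilon}_2$ cannot vary freely for binary $Y_2$. It has to be smaller when $\Phi(X'\beta_2)$ is either small or large, otherwise we would not stay within the supposed bounds of zero and one. This means that there is heteroskedasticity in (4) by construction since

$$\mathrm{Var}(\tilde{\varepsilon}_2 \mid X) = \Phi(X'\beta_2)\,[1 - \Phi(X'\beta_2)]$$

is clearly not constant.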
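The heteroskedasticity-by-construction argument can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper names `norm_cdf` and `simulated_error_variance` and the three index values are hypothetical choices for illustration. It compares the simulated variance of the additive error $Y_2 - \Phi(X'\beta_2)$ at a fixed index value with the Bernoulli variance $\Phi(1-\Phi)$.

```python
import math
import numpy as np

def norm_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulated_error_variance(index, n=200_000, seed=0):
    """Variance of e = Y2 - Phi(index) when Y2 = 1[index + eps > 0], eps ~ N(0, 1).

    `index` plays the role of X'beta_2 held fixed (illustrative assumption).
    """
    rng = np.random.default_rng(seed)
    y2 = (index + rng.standard_normal(n) > 0).astype(float)
    return float(np.var(y2 - norm_cdf(index)))

# Compare the theoretical variance Phi * (1 - Phi) with the simulated one.
for index in (-2.0, 0.0, 2.0):
    p = norm_cdf(index)
    print(index, round(p * (1.0 - p), 4), round(simulated_error_variance(index), 4))
```

The variance is largest when the index is zero (probability one half) and shrinks as the probability approaches zero or one, which is the sense in which the error "cannot vary freely".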
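Initially, I thought this was great because it should mean that Lewbel’s method is always applicable with a binary endogenous regressor. What about the second assumption though? Inserting the additive error $Y_2 - \Phi(X'\beta_2)$ for $\varepsilon_2$ in (A2) gives

$$\mathrm{Cov}\big(Z,\; \varepsilon_1\,[Y_2 - \Phi(X'\beta_2)]\big)$$

When $Z$ (footnote 2) is a subset of $X$ this covariance is not zero. (A2) is violated!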
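To see why the covariance does not vanish, here is a short sketch of the calculation. It assumes, beyond what is stated in the text, that $U$, $V_1$ and $V_2$ are jointly normal, mutually independent and independent of $X$. It also writes the errors as $\varepsilon_j = \alpha_j U + V_j$ with $\mathrm{Var}(\varepsilon_2) = 1$; this notation is an assumption made for the illustration.

```latex
\begin{aligned}
E\big[\varepsilon_1 (Y_2 - \Phi(X'\beta_2)) \mid X\big]
  &= E\big[\varepsilon_1 \, 1[\varepsilon_2 > -X'\beta_2] \mid X\big]
     - \Phi(X'\beta_2)\, E[\varepsilon_1 \mid X] \\
  &= \mathrm{Cov}(\varepsilon_1, \varepsilon_2)\,
     E\big[\varepsilon_2 \, 1[\varepsilon_2 > -X'\beta_2] \mid X\big] \\
  &= \alpha_1 \alpha_2 \, \mathrm{Var}(U)\, \phi(X'\beta_2), \\[4pt]
\mathrm{Cov}\big(Z,\; \varepsilon_1 (Y_2 - \Phi(X'\beta_2))\big)
  &= \alpha_1 \alpha_2 \, \mathrm{Var}(U)\, \mathrm{Cov}\big(Z,\; \phi(X'\beta_2)\big).
\end{aligned}
```

Here $\phi$ is the standard normal density. The second line uses $E[\varepsilon_1 \mid \varepsilon_2] = \mathrm{Cov}(\varepsilon_1, \varepsilon_2)\,\varepsilon_2$ under joint normality with unit variance of $\varepsilon_2$, and the third uses $E[\varepsilon_2 \, 1[\varepsilon_2 > -a]] = \phi(a)$. With $Z$ a subset of $X$, $\phi(X'\beta_2)$ is a nonlinear function of $Z$, so the last covariance is in general nonzero whenever the common factor matters ($\alpha_1 \alpha_2 \neq 0$). It can still vanish in special symmetric cases, for example a scalar $X$ distributed symmetrically around zero with no intercept in the index.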
To get a feeling for the problem, I created simulated data with 2,000 observations and the following parametrization (notation is a bit sloppy due to the limited LaTeX capabilities of WordPress)
, , , and
Using the user-written Stata command ivreg2h gave the following output
Estimates are far off the true coefficients (which are all equal to one). And this wasn’t just an unlucky draw. The average estimate of $\gamma_1$ in a small Monte Carlo study with 200 repetitions was equal to 1.83.
You might object that in order to construct the instruments Lewbel suggests, $(Z - \bar{Z})\hat{\varepsilon}_2$, you have to estimate the exact $\varepsilon_2$. By contrast, ivreg2h assumes a linear equation for $Y_2$. But things don’t improve much if you estimate equation (6) by Probit and construct the instruments manually.
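For readers who want to experiment without Stata, a Monte Carlo of this kind can be sketched in Python. This is a rough illustration under stated assumptions, not a reproduction of the results above: the original parametrization is not available here, so the distributions, the scaling of the errors and the function names (`simulate`, `lewbel_2sls`, `monte_carlo`) are all hypothetical, with only "all true coefficients equal one" taken from the text. The estimator uses $Z = X$ and a linear first stage, and a linear heteroskedastic design is included as a benchmark.

```python
import numpy as np

def simulate(n, rng, binary):
    """Draw one sample from a triangular model with a common factor U.

    All parameter values are illustrative assumptions, not the post's.
    """
    x = rng.standard_normal(n)
    u = rng.standard_normal(n)
    v1 = rng.standard_normal(n)
    v2 = rng.standard_normal(n)
    eps1 = u + v1
    if binary:
        # Probit first stage: eps2 is scaled to be standard normal.
        eps2 = (u + v2) / np.sqrt(2.0)
        y2 = (1.0 + x + eps2 > 0).astype(float)
    else:
        # Linear first stage with heteroskedastic eps2 (benchmark case).
        eps2 = u + np.exp(0.5 * x) * v2
        y2 = 1.0 + x + eps2
    y1 = 1.0 + x + y2 + eps1  # all true coefficients equal one
    return y1, y2, x

def lewbel_2sls(y1, y2, x):
    """Heteroskedasticity-based IV with Z = X and a linear first stage.

    Returns the coefficients on (constant, X, Y2).
    """
    n = len(y1)
    exog = np.column_stack([np.ones(n), x])
    # Step 1: linear first stage of Y2 on (1, X); keep the residuals.
    b2, *_ = np.linalg.lstsq(exog, y2, rcond=None)
    e2 = y2 - exog @ b2
    # Step 2: generated instrument (Z - mean(Z)) * residual.
    iv = (x - x.mean()) * e2
    w = np.column_stack([exog, iv])   # instruments
    r = np.column_stack([exog, y2])   # regressors
    # Step 3: just-identified IV estimator (W'R)^{-1} W'y.
    return np.linalg.solve(w.T @ r, w.T @ y1)

def monte_carlo(reps, n, binary, seed=0):
    """Average estimate of the coefficient on Y2 over `reps` samples."""
    rng = np.random.default_rng(seed)
    draws = [lewbel_2sls(*simulate(n, rng, binary))[2] for _ in range(reps)]
    return float(np.mean(draws))

if __name__ == "__main__":
    print("linear first stage:", monte_carlo(200, 2000, binary=False))
    print("probit first stage:", monte_carlo(200, 2000, binary=True))
```

Comparing the two averages with the true value of one gives a feeling for how the binary case behaves relative to the linear benchmark. The size of any discrepancy depends on the assumed parameter values, so the numbers will not match those reported above.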
To conclude: Be careful when applying the method in a situation with a binary endogenous regressor. There is at least one case ($Y_2$ being Probit) where the estimator is inconsistent. It might still work for other structural specifications. And it would be great if somebody worked out the conditions under which it does. Until then, however, I would refrain from using Lewbel’s method in the binary case. It’s not robust to misspecification of the $Y_2$-equation and we don’t know yet when it works and when it doesn’t.
(1) He also presents an extension to partly linear systems which, however, does not capture the limited dependent variable case.
(2) If, on the other hand, $Z$ is restricted to be an outside variable, not contained in $X$, then I don’t see how you can satisfy the requirement of heteroskedasticity (A1). Maybe with some sort of heteroskedastic Probit specification. But I haven’t worked that out. Especially introducing the common factor, which leads to endogeneity in the triangular model, seems to be non-trivial.
Update: Fixed an error and added some clarifying remarks. Thanks to Arthur Lewbel for the pointer!