Beyond Curve Fitting

Last week I attended the AAAI spring symposium on “Beyond Curve Fitting: Causation, Counterfactuals, and Imagination-based AI”, held at Stanford University. Since Judea Pearl and Dana Mackenzie published “The Book of Why”, the topic of causal inference has been gaining momentum in the machine learning and artificial intelligence community. If we want to build truly intelligent machines that are able to interact with us in a meaningful way, we have to teach them the concept of causality. Otherwise, our future robots will never be able to understand that forcing the rooster to crow at 3am won’t make the sun appear.

Causal inference has always been somewhat of a niche topic in AI. All of the cutting-edge machine learning tools (you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on) remain purely correlational and therefore cannot discern whether the rooster’s crow causes the sunrise or the other way round. This seems to be changing, though: more and more big shots are starting to recognize the limits of prediction methods and acknowledge a need for major retooling in the community.

Yoshua Bengio from the University of Montreal, one of the pioneers of deep learning, attended the symposium too. He spoke about transfer learning and causal discovery (the slides are available on the website). One funny anecdote of the event was that nobody during his talk (apart from himself, maybe) knew yet that Yoshua would be awarded the 2018 Turing Award (together with Geoffrey Hinton and Yann LeCun) for his contributions to neural networks and AI.

After his presentation, Yoshua excused himself from lunch because he had to take an “important phone call”. That’s when the news broke. So together with Judea Pearl’s keynote on the first day, that already made two Turing Award winners at the symposium.

A personal highlight for me: meeting the father of causal inference in AI — Judea Pearl

Based on Pearl’s seminal work on graph-theoretic causal models (directed acyclic graphs), tremendous progress has been made in the field of causal AI during the last 30 years. But causal inference is obviously also super important in other fields that rely on empirical work. And they have all developed their own idiosyncratic methods for approaching causal questions. The symposium program was thus divided into several “Causality + X” sessions, where X referred to one of the scientific disciplines in which causal inference plays a role:

  • Machine learning and AI
  • Computer vision
  • Social sciences
  • Health sciences / epidemiology

This format created a great opportunity for sharing different perspectives and stimulated learning beyond narrow disciplinary silos.

My session was about causal inference in the social sciences, together with Kosuke Imai from Harvard. I was responsible for representing the economics view.


Session: Causality + social sciences

If you’re interested in my slides, you can have a look here. Soon, Elias Bareinboim (who organized the event, thanks Elias!) and I will also release a working paper, in which we’ll go into much more detail on the subject.

To quickly summarize my main message: Having spent considerable time studying the methods for causal inference developed in computer science, I came to the conclusion that economists can learn a lot from engaging with that literature. Of course, that also goes the other way round. So I think we could all benefit tremendously from mutual knowledge exchange, which (I must admit) hasn’t happened to a satisfactory extent so far. But I see many promising signs of improvement. More and more economists express interest in DAG methodology and what it has to offer.

One thing became clear to me when attending the symposium. The field of causal AI is developing rapidly in so many directions, and a lot of different fields are currently adopting graph-based approaches to causality. Econ should keep pace if we don’t want to lose touch with these developments. That doesn’t mean that we need to abandon our own unique perspective on causal inference, which is tailored to our specific needs. But coordinating on one basic framework for causal inference can have huge potential for cross-fertilization between disciplines. Something that we’re not nurturing nearly enough at the moment, if you ask me.

Why so much hate against propensity score matching?

I’ve seen several variants of this meme on Twitter recently.

This is just one example, so nothing against @HallaMartin. But his tweet got me thinking. Apparently, in the year 2019 it’s not possible anymore to convince people in an econ seminar with a propensity score matching (or any other matching on observables, for that matter). But why is that?

Here’s what I think. The typical matching setup looks somewhat like this:

You’re interested in estimating the causal effect of X on Y. But in order to do so, you will need to adjust for the confounders W, otherwise you’ll end up with biased results. If you’re able to measure W, this adjustment can be done in a propensity score matching, which is actually an efficient way of dealing with a large set of covariates.

The problem though is to be sure that you’ve adjusted for all possible confounding factors. How can you be certain that there are no unobserved variables left that affect both X and Y? Because if the picture looks like the one below (where the unobserved confounders are depicted by the dashed bidirected arc), matching will only give you biased estimates of the causal effect you’re after.
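To make the point concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the data-generating process, the coefficients, and the function names are hypothetical, and exact matching on a single binary W stands in for propensity score matching (with one binary covariate, the propensity score is just a function of W). It compares the matched estimate when W is the only confounder with the matched estimate when an unobserved U also affects both X and Y.

```python
import random


def simulate(n, tau, hidden, seed):
    """Simulate (w, x, y) rows with a true effect tau of X on Y.

    W is an observed binary confounder. If hidden is True, an unobserved
    binary confounder U also affects both X and Y (the dashed arc).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = int(rng.random() < 0.5)
        u = int(rng.random() < 0.5) if hidden else 0
        p_treat = 0.2 + 0.3 * w + 0.3 * u       # treatment depends on W and U
        x = int(rng.random() < p_treat)
        y = tau * x + 2.0 * w + 2.0 * u + rng.gauss(0.0, 1.0)
        rows.append((w, x, y))                  # U is never recorded
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Raw treated-minus-control difference in mean outcomes."""
    treated = [y for _, x, y in rows if x == 1]
    control = [y for _, x, y in rows if x == 0]
    return mean(treated) - mean(control)


def matched_on_w(rows):
    """Exact matching on W: within-stratum differences, weighted by stratum share."""
    estimate = 0.0
    for level in (0, 1):
        stratum = [(x, y) for w, x, y in rows if w == level]
        treated = [y for x, y in stratum if x == 1]
        control = [y for x, y in stratum if x == 0]
        estimate += (mean(treated) - mean(control)) * len(stratum) / len(rows)
    return estimate


TAU = 1.0
observed_only = simulate(100_000, TAU, hidden=False, seed=1)
with_hidden = simulate(100_000, TAU, hidden=True, seed=1)

naive = naive_difference(observed_only)
matched_ok = matched_on_w(observed_only)
matched_biased = matched_on_w(with_hidden)

print(f"true effect:                        {TAU:.2f}")
print(f"naive difference (W ignored):       {naive:.2f}")
print(f"matched on W, no hidden confounder: {matched_ok:.2f}")
print(f"matched on W, hidden confounder U:  {matched_biased:.2f}")
```

In this toy setup, matching on W should land close to the true effect only in the first scenario. Nothing in the observed data distinguishes the two scenarios, which is exactly why the assumption has to be defended with domain knowledge.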

Presumably, the Twitter meme is alluding to exactly this problem. And I agree that it’s hard to make the claim that you’ve accounted for all confounding factors in a matching. But what about economists’ preferred alternative, the instrumental variable (IV) estimator? Here the setup looks like this:

Now, unobserved confounders between X and Y are allowed, as long as you’re able to find an instrument Z that affects X but is otherwise unrelated to Y, i.e., it influences Y only through X. In that case, Z creates exogenous variation in X that can be leveraged to estimate X’s causal effect. (Because of the exogenous variation in X induced by Z, we also call this IV setup a surrogate experiment, by the way.)

Great, so we have found a way forward if we’re not 100% sure that we’ve accounted for all unobserved confounders. Instead of a propensity score matching, we can simply resort to an IV estimator.

But if you think about this a bit more, you’ll realize that we face a very similar situation here. The whole IV strategy breaks down if there are unobserved confounders between Z and Y (see again the dashed arc below). How can we be sure to rule out all influence factors that jointly affect the instrument and the outcome? It’s the same problem all over again.
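The same kind of toy simulation illustrates the IV case. Again, this is only a sketch under assumptions that are mine, not part of any real study: a linear model with a constant effect, made-up coefficients, hypothetical function names, and the simple ratio estimator cov(Z, Y) / cov(Z, X). U is the unobserved confounder between X and Y. V is an unobserved confounder between Z and Y that is switched on only in the second scenario.

```python
import random


def simulate_iv(n, tau, z_y_confounding, seed):
    """Simulate (z, x, y) from a linear model with true effect tau of X on Y.

    z_y_confounding scales how strongly the unobserved V drives both Z and Y;
    0.0 means Z is a valid instrument.
    """
    rng = random.Random(seed)
    zs, xs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unobserved confounder of X and Y
        v = rng.gauss(0.0, 1.0)  # unobserved confounder of Z and Y
        z = rng.gauss(0.0, 1.0) + z_y_confounding * v
        x = 0.8 * z + u + rng.gauss(0.0, 1.0)
        y = tau * x + 1.5 * u + z_y_confounding * v + rng.gauss(0.0, 1.0)
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys


def covariance(a, b):
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def ols_slope(xs, ys):
    """Slope from regressing Y on X alone (ignores confounding)."""
    return covariance(xs, ys) / covariance(xs, xs)


def iv_estimate(zs, xs, ys):
    """Ratio (Wald-type) IV estimator: cov(Z, Y) / cov(Z, X)."""
    return covariance(zs, ys) / covariance(zs, xs)


TAU = 1.0
z1, x1, y1 = simulate_iv(100_000, TAU, z_y_confounding=0.0, seed=2)
z2, x2, y2 = simulate_iv(100_000, TAU, z_y_confounding=1.0, seed=2)

ols_valid = ols_slope(x1, y1)
iv_valid = iv_estimate(z1, x1, y1)
iv_broken = iv_estimate(z2, x2, y2)

print(f"true effect:                  {TAU:.2f}")
print(f"OLS of Y on X:                {ols_valid:.2f}")
print(f"IV, valid instrument:         {iv_valid:.2f}")
print(f"IV, Z and Y confounded by V:  {iv_broken:.2f}")
```

With a valid instrument, the IV estimate should sit near the true effect even though the plain regression is biased by U. Once V links Z and Y, the IV estimate is off as well, and no amount of data fixes that.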

So in that sense, matching and IV are not very different. In both cases we need to carefully justify our identifying assumptions based on the domain knowledge we have. Whether ruling out X \dashleftarrow\dashrightarrow Y is more plausible than ruling out Z \dashleftarrow\dashrightarrow Y depends on the specific context under study. But on theoretical grounds, there’s no difference in strength or quality between the two assumptions. So I don’t really get why, as a rule, economists shouldn’t trust a propensity score matching, but an IV approach is fine.

Now you might say that this is just Twitter babble. But my impression is that most economists nowadays would indeed be very suspicious of “selection on observables”-type identification strategies.* Even though there’s nothing inherently implausible about them.

In my view, the opaqueness of the potential outcome (PO) framework is partly to blame for this. Let me explain. In PO, your starting point is to assume unconfoundedness of the treatment variable

(Y^1, Y^0) \perp X | W.

This assumption requires the treatment X to be independent of the potential outcomes of Y, conditional on a vector of covariates W (as in the first picture above). But what is this magic vector W that can make all your causal effect estimation dreams come true? Nobody will tell you.
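For reference, this is what the assumption buys you if such a W exists. The sketch below assumes a binary treatment X and overlap (every value of W occurs among both treated and untreated units). The notation e(W) for the propensity score is introduced here and is not used elsewhere in this post. Under these assumptions, the average treatment effect is identified by the standard adjustment formula, and conditioning on the scalar propensity score is enough, which is what makes propensity score matching work:

```latex
E[Y^1 - Y^0] = E_W\big[\, E[Y \mid X = 1, W] - E[Y \mid X = 0, W] \,\big],
\qquad
(Y^1, Y^0) \perp X \mid e(W), \quad e(W) = P(X = 1 \mid W).
```

The formula itself is mechanical. The hard part, choosing W, is entirely outside of it.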

And if the context you’re studying is a bit more complicated than in the graphs I’ve shown you (with several causally connected variables in a model), it becomes very complex to even properly think this through. So in the end, deciding whether unconfoundedness holds becomes more of a guessing game.

My hunch is that after having seen too many failed attempts at dealing with this sort of complexity, people have developed a general mistrust of unconfoundedness and strong exogeneity type assumptions. But we still don’t want to give up on causal inference altogether. So we move over to the next best thing: IV, RDD, Diff-in-Diff, you name it.

It’s not that these methods have weaker requirements. They all rely on untestable assumptions about unobservables. But maybe they seem more credible because you’ve jumped through more hoops with them?

I don’t know. And I don’t want to get too much into armchair psychology here. I just know that the PO framework makes it incredibly hard to justify crucial identification assumptions, because it’s so much of a black box. And I think there are better alternatives out there, based on the causal graphs I used in this post (see also here). Who knows, maybe by adopting them we might one day be able to appreciate a well carried out propensity score matching again.

* Interestingly though, this only seems to be the case for reduced-form analyses. Structural folks mostly get away with controlling for observables; presumably because structural models make causal assumptions much more explicit than the potential outcome framework.

Causal Inference for Policymaking

I just submitted an extended abstract of an upcoming paper to a conference that will discuss new analytical tools and techniques for policymaking. The abstract contains a brief discussion about the importance of causal inference for taking informed policy decisions. And I would like to share these thoughts here.

Graphs and Occam’s Razor

One argument / point of criticism I often hear from people who start exploring Directed Acyclic Graphs (DAG) is that graphical models can quickly become very complex. When you read about the methodology for the first time you get walked through all these toy models – small, well-behaved examples with nice properties, in which causal inference works like a charm.


Sample Selection Vs. Selection Into Treatment

This is an issue that bothered me for quite some time. So I finally decided to settle it with a blog post. I see people constantly confusing the two most common threats to causal inference—sample selection and endogeneity. This happens, for example, quite often in management research, where it is common to recommend a sample selection model in order to deal with endogenous treatments. But the two concepts are far from being equivalent. Have a look at the following graph, which describes a typical case of endogeneity.

Why you shouldn’t control for post-treatment variables in your regression

This is a slight variation of a theme I blogged about some time ago. But I recently had a discussion with a colleague and thought it would be worthwhile to share my notes here. So what might go wrong if you control for post-treatment variables in your statistical model?