Causal inference has always been somewhat of a niche topic in AI. All of the cutting-edge machine learning tools—you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on—remain purely correlational, and therefore cannot discern whether the rooster’s crow causes the sunrise, or the other way round. This seems to be changing though: more and more big shots are starting to recognize the limits of prediction methods and acknowledge a need for major retooling in the community.

Yoshua Bengio from the University of Montreal, one of the pioneers of deep learning, attended the symposium too, speaking about transfer learning and causal discovery (the slides are available on the website). One funny anecdote of the event was that nobody during his talk—apart from himself, maybe—knew yet that Yoshua would be awarded the 2018 Turing Award (together with Geoffrey Hinton and Yann LeCun) for his contributions to neural networks and AI.

After his presentation, Yoshua excused himself for not attending lunch because he had to take an “important phone call”. That’s when the news broke. So together with Judea Pearl’s keynote on the first day, that already made two Turing Award winners at the symposium.

Based on Pearl’s seminal work on graph-theoretic causal models (directed acyclic graphs), tremendous progress has been made in the field of causal AI during the last 30 years. But causal inference is obviously also super important in other fields that rely on empirical work, and they have all developed their own idiosyncratic methods for approaching causal questions. The symposium program was thus divided into several *“Causality + X”* sessions, where *X* referred to many of the scientific disciplines in which causal inference plays a role:

- Machine learning and AI
- Computer vision
- Social sciences
- Health sciences / epidemiology

This format created a great opportunity for sharing different perspectives and stimulated learning beyond narrow disciplinary silos.

My session, which I shared with Kosuke Imai from Harvard, was about causal inference in the social sciences. I was responsible for representing the economics view.

If you’re interested in my slides, you can have a look here. Soon, Elias Bareinboim (who organized the event—thanks, Elias!) and I will also release a working paper in which we go into much more detail on the subject.

To quickly summarize my main message: having spent considerable time studying the methods for causal inference developed in computer science, I came to the conclusion that economists can learn a lot from engaging with that literature. Of course, that goes the other way round too. So I think we could all benefit tremendously from mutual knowledge exchange, which—I must admit—hasn’t happened to a satisfactory extent so far. But I see many promising signs of improvement. More and more economists are expressing interest in DAG methodology and what it has to offer.

One thing became clear to me when attending the symposium. The field of causal AI is developing rapidly in many directions, and a lot of different fields are currently adopting graph-based approaches to causality. Econ should keep pace if we don’t want to lose touch with these developments. That doesn’t mean we need to abandon our own unique perspective on causal inference, which is tailored to our specific needs. But coordinating on one basic framework for causal inference has huge potential for cross-fertilization between disciplines—something we’re not nurturing nearly enough at the moment, if you ask me.

This is just one example, so nothing against @HallaMartin. But his tweet got me thinking. Apparently, in the year 2019 it’s not possible anymore to convince people in an econ seminar with a propensity score matching (or any other matching on observables, for that matter). But why is that?

Here’s what I think. The typical matching setup looks somewhat like this:

You’re interested in estimating the causal effect of *X* on *Y*. But in order to do so, you need to adjust for the confounders *W*, otherwise you’ll end up with biased results. If you’re able to measure *W*, this adjustment can be done via propensity score matching, which is actually an efficient way of dealing with a large set of covariates.
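
To make the logic concrete, here’s a minimal simulation sketch (all numbers are made up for illustration): a binary confounder *W* drives both treatment take-up and the outcome, so the naive comparison is biased, while stratifying on *W*—a coarse stand-in for full propensity score matching, since the propensity score here is just a function of *W*—recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

w = rng.integers(0, 2, n)                      # binary confounder W
p_treat = np.where(w == 1, 0.7, 0.3)           # W raises the treatment probability
x = rng.random(n) < p_treat                    # treatment X
y = 2.0 * x + 3.0 * w + rng.normal(0, 1, n)    # true causal effect of X on Y: 2

naive = y[x].mean() - y[~x].mean()             # confounded comparison

# Stratify on W and reweight the within-stratum contrasts by P(W)
effects = [y[x & (w == v)].mean() - y[~x & (w == v)].mean() for v in (0, 1)]
shares = [(w == v).mean() for v in (0, 1)]
adjusted = sum(e * s for e, s in zip(effects, shares))
print(round(naive, 2), round(adjusted, 2))     # naive ≈ 3.2, adjusted ≈ 2.0
```

The naive contrast absorbs the fact that treated firms disproportionately have *W* = 1; the stratified estimate removes exactly that imbalance.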

The problem though is to be sure that you’ve adjusted for all possible confounding factors. How can you be certain that there are no unobserved variables left that affect both *X* and *Y*? Because if the picture looks like the one below (where the unobserved confounders are depicted by the dashed bidirected arc), matching will only give you biased estimates of the causal effect you’re after.

Presumably, the Twitter meme is alluding to exactly this problem. And I agree that it’s hard to make the claim that you’ve accounted for all confounding influence factors in a matching. But how’s that with economists’ most preferred alternative—the instrumental variable (IV) estimator? Here the setup looks like this:

Now, unobserved confounders between *X* and *Y* are allowed, as long as you’re able to find an instrument *Z* that affects *X* but is unrelated to *Y*. In that case, *Z* creates exogenous variation in *X* that can be leveraged to estimate *X*’s causal effect. (Because of the exogenous variation in *X* induced by *Z*, we also call this IV setup a *surrogate experiment*, by the way.)

Great, so we seem to have found a way forward for situations in which we’re not 100% sure that we’ve accounted for all confounders. Instead of a propensity score matching, we can simply resort to an IV estimator.
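
Here’s a toy sketch of that idea (simulated data; the coefficients are arbitrary): with an unobserved confounder *U*, the OLS slope is biased, while the simple Wald/IV ratio based on *Z* recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

u = rng.normal(0, 1, n)                    # unobserved confounder
z = rng.normal(0, 1, n)                    # instrument: shifts X, unrelated to U
x = 0.8 * z + u + rng.normal(0, 1, n)
y = 2.0 * x + u + rng.normal(0, 1, n)      # true causal effect of X on Y: 2

ols = np.cov(x, y)[0, 1] / np.var(x)            # biased upward by U
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]    # Wald/IV ratio: ≈ 2
print(round(ols, 2), round(iv, 2))
```

The IV ratio works precisely because *Z* only reaches *Y* through *X*—which is the assumption the rest of this post is about.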

But if you think about this a bit more, you’ll realize that we face a very similar situation here. The whole IV strategy breaks down if there are unobserved confounders between *Z* and *Y* (see again the dashed arc below). How can we be sure to rule out all influence factors that jointly affect the instrument and the outcome? It’s the same problem all over again.

So in that sense, matching and IV are not very different. In both cases we need to carefully justify our identifying assumptions based on the domain knowledge we have. Whether ruling out unobserved confounding between *X* and *Y* is more plausible than ruling it out between *Z* and *Y* depends on the specific context under study. But on theoretical grounds, there’s no difference in strength or quality between the two assumptions. So I don’t really get why—as a rule—economists shouldn’t trust a propensity score matching, while an IV approach is fine.

Now you might say that this is just Twitter babble. But my impression is that most economists nowadays are indeed very suspicious of “selection on observables”-type identification strategies,* even though there’s nothing inherently implausible about them.

In my view, the opaqueness of the potential outcome (PO) framework is partly to blame for this. Let me explain. In PO, your starting point is to assume unconfoundedness of the treatment variable

*(Y(1), Y(0)) ⊥ X | W*.

This assumption requires that the treatment *X* needs to be independent of the potential outcomes of *Y*, when controlling for a vector of covariates *W* (as in the first picture above). But what is this magic vector *W* that can make all your causal effect estimation dreams come true? Nobody will tell you.

And if the context you’re studying is a bit more complicated than in the graphs I’ve shown you—with several causally connected variables in a model—it becomes very complex to even properly think this through. So in the end, deciding whether unconfoundedness holds becomes more of a guessing game.

My hunch is that after having seen too many failed attempts of dealing with this sort of complexity, people have developed a general mistrust against unconfoundedness and strong exogeneity type assumptions. But we still don’t want to give up on causal inference altogether. So we move over to the next best thing: IV, RDD, Diff-in-Diff, you name it.

It’s not that these methods have weaker requirements. They all rely on untestable assumptions about unobservables. But maybe they seem more credible because you’ve jumped through more hoops with them?

I don’t know. And I don’t want to get too much into kitchen sink psychology here. I just know that the PO framework makes it incredibly hard to justify crucial identification assumptions, because it’s so much of a black box. And I think there are better alternatives out there, based on the causal graphs I used in this post (see also here). Who knows, maybe by adopting them we might one day be able to appreciate a well carried out propensity score matching again.

* Interestingly though, this only seems to be the case for reduced-form analyses. Structural folks mostly get away with controlling for observables; presumably because structural models make causal assumptions much more explicit than the potential outcome framework.

Causal inference lies at the heart of policy-making, since every policy measure aims at actively manipulating certain economic variables in order to achieve a desired goal. To make an informed decision about which measures to implement, policy makers need to have knowledge about the likely impact of their actions. Newly emerging approaches in machine learning and predictive analytics are inherently inadequate to supply this kind of knowledge though, as they remain purely correlation-based and are thus not able to address causal questions.

Based on the seminal work by Judea Pearl (2000), the literature on causal inference in computer science and artificial intelligence (AI) has developed unique tools to tackle causal prediction problems, which go well beyond the standard approaches in econometrics. Areas in which this literature has made important contributions are as diverse as:

- Estimating causal effects with observational data
- Learning from surrogate experiments (“encouragement designs”)
- Dealing with selection bias
- External validity of policy experiments
- Transporting experimental results across heterogeneous populations

This paper synthesizes recent advances in the field of causal AI and gives an overview of how these techniques add to the existing econometric toolbox. We show how—in particular combined with the large data sets that are increasingly becoming available—these approaches provide entirely new avenues for policy research. Since other disciplines, such as epidemiology, sociology, and political science, were much quicker than economics in adopting these tools, our hope is that our paper will help economics catch up in this direction.

Pearl, J. (2000): *Causality: Models, Reasoning, and Inference*, New York, NY: Cambridge University Press.

But then you go out to apply the methods to your own particular problem and soon realize that it’s very hard to keep the model at a manageable size. Because how can you be sure that two variables aren’t related to each other? So you better keep a link between them. But suddenly everything depends on everything, and all hope of getting at the desired *P(y|do(x))* is lost.

Indeed, in complete networks like this one

estimating causal effects will be nearly impossible. Leaving aside the fact that the graph is obviously not acyclic: if all variables in the model are causally related, you would need to observe all of them at just the right frequency to get anywhere with identification.

Does that ultimately undermine the usefulness of graphical approaches? Well, I wouldn’t say so. If anything, it shows you how under-specified the implicit models we usually work with in the currently prevalent potential outcome (PO) framework are. Because if you really believe that “everything causes everything”, then good luck justifying your unconfoundedness assumption or exclusion restriction. PO and DAG folks sit – as they say – in exactly the same boat here.

DAGs have one big advantage over PO though. Namely that they disclose crucial identifying assumptions very transparently. In PO, by contrast, you only have an implicit model of the specific context you’re studying in mind. This gives you no guidance whatsoever on how to justify the conditional independence assumptions involving counterfactuals that PO techniques rely on. Whether you’re allowed to call your estimates “causal” is then solely decided by the gut feeling of your seminar audience and reviewers. As long as they have a hunch that “your treatment is still endogenous” there’s not much you can do – apart from resorting to an argument by authority, maybe.

DAGs have yet another advantage to offer to the ambitious empiricist. Every graph that you specify gives rise to testable implications, due to the d-separation relationships between the variables in your model. That way it actually becomes possible to check whether the graph is consistent with the joint distribution of the data, which will lend further credibility to your analysis.
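
To illustrate what such a check can look like, here’s a small sketch (simulated linear-Gaussian data, with partial correlation standing in as a proxy for a general conditional independence test): the chain graph X → M → Y implies, via d-separation, that X ⟂ Y given M, and we can verify that implication in the data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Postulated graph: X -> M -> Y, so d-separation implies X ⟂ Y | M
x = rng.normal(0, 1, n)
m = x + rng.normal(0, 1, n)
y = m + rng.normal(0, 1, n)

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out `given`."""
    ra = a - np.polyfit(given, a, 1)[0] * given
    rb = b - np.polyfit(given, b, 1)[0] * given
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]       # clearly nonzero
conditional = partial_corr(x, y, m)      # ≈ 0: the implication holds
print(round(marginal, 2), round(conditional, 2))
```

Had the conditional correlation come out clearly nonzero, that would have falsified this particular graph—exactly the kind of ex-post sanity check discussed below.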

If we want to bring DAG methodology forward and achieve a wider diffusion in the community, we clearly need to develop best-practice standards for model building.* And it’s quite evident that models will need to be sufficiently sparse (unless you want to go back to the huge general equilibrium models of the 70s with ten thousand or more equations). In other words, we’ll need to apply a fine Occam’s razor.

I actually expect some kind of convergence with established approaches in economic theory and “structural econometrics” to occur, where the goal is usually to model a couple of key mechanisms in full detail, while leaving the less relevant things “for the error term” in order to keep models tractable.

The good thing is that the testable implications of DAGs always provide the opportunity for an ex-post sanity check. If you realize that the postulated graph doesn’t comply with the data (because some of its d-separation relations are violated), there’s always the possibility to go back to the drawing board and refine the model. Even better, d-separation will guide you exactly to the point where the graph doesn’t fit. So you’re not left in the dark about where to start improving the model, like with other diagnostic tools based on global goodness of fit.

Taking this program seriously also offers a unique opportunity to finally bring the two competing econometrics camps – PO and “structural” – closer together again. Making your assumptions explicit – clearly visible for everybody to see in the graph – renders causal inference less of a black box than it currently is under PO. At the same time, you don’t need to be a “structural geek”, who solves systems of equations as a distraction before bedtime, in order to work with graphs and do good empirical work. If you ask me, DAGs offer a perfect middle ground, with just the right balance between complexity and tractability. They’re worth a look!

* There’s no such thing as “model-free causal inference” – in case you were wondering.

Here, the relationship between X and Y is confounded by unobservable influence factors (denoted by the dashed bidirected arrow). Therefore we cannot estimate the causal effect of X on Y by a simple regression. But since the instrument Z induces variation in X that is unrelated to the unobserved confounders, we can use Z as an auxiliary experiment that allows us to identify the so-called *local average treatment effect* (or *LATE*) of X on Y.¹

For this to work it’s crucial that Z doesn’t directly affect Y (i.e., no arrow from Z to Y). Moreover, there shouldn’t be any unobservable confounders (i.e., other dashed bidirected arcs) between Z and Y, otherwise the identification argument breaks down. These two assumptions need to be justified purely on the basis of theoretical reasoning and cannot be tested with the help of data.

Unfortunately, however, you will frequently come across people who don’t accept that the assumption of instrument validity isn’t testable. Usually, these folks then ask you to do one of the following two things in order to convince them:

- Show that Z is uncorrelated with Y (conditional on the other control variables in your study), or;
- Show that Z is uncorrelated with Y when adjusting for X (again, conditional on the other controls).

Both of these requests are wrong. The first one is particularly moronic. In order to not run into a weak instruments problem, we want *Z* to exert a strong influence on *X*. If *X* also affects *Y*, there will be a correlation between *Z* and *Y* by construction, through the causal chain *Z → X → Y*.

The second request is likewise mistaken, because adjusting for *X* doesn’t d-separate *Z* and *Y*. On the contrary: since *X* is a collider on the path *Z → X ↔ Y*, conditioning on *X* opens up the path and thus creates a correlation between *Z* and *Y*.²

So both “tests” won’t tell you anything about whether the causal structure in the graph above is correct. *Z* and *Y* can be significantly correlated (also conditional on *X*) even though the instrument is perfectly valid. These tests have no discriminating power whatsoever. Instead, all you can do is argue on theoretical grounds that the IV assumptions are fulfilled.
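
A quick simulation makes the point vivid (toy data; the instrument is valid *by construction*): both “tests” come back significant anyway.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

u = rng.normal(0, 1, n)                  # unobserved confounder of X and Y
z = rng.normal(0, 1, n)                  # valid instrument by construction
x = z + u + rng.normal(0, 1, n)
y = 2.0 * x + u + rng.normal(0, 1, n)

# "Test" 1: is Z correlated with Y? Yes -- through the chain Z -> X -> Y.
t1 = np.corrcoef(z, y)[0, 1]

# "Test" 2: is Z correlated with Y given X? Yes -- conditioning on the
# collider X opens the path Z -> X <-> Y.
rz = z - (np.cov(z, x)[0, 1] / np.var(x)) * x
ry = y - (np.cov(y, x)[0, 1] / np.var(x)) * x
t2 = np.corrcoef(rz, ry)[0, 1]
print(round(t1, 2), round(t2, 2))        # both clearly nonzero, valid IV
```

Neither number tells you anything about instrument validity, which is exactly the argument above.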

In general, there is no such thing as purely data-driven causal inference. At one point, you will always have to rely on untestable assumptions that need to be substantiated by expert knowledge about the empirical setting at hand. Causal graphs are of great help here though, because they make these assumptions super transparent and tractable. I see way too many people — all across the ranks — who are confused about the untestability of IV assumptions. If we taught causal graph methodology more thoroughly, I’m sure this would be less of a problem.

¹ Identification of the LATE additionally requires that the effect of Z on X is monotone. If you want to know more about these and other details of IV estimation, you can have a look at my lecture notes on causal inference here.

² I explain the terms *d-separation* and *colliders* both here and here (the latter source is more technical).

Sample selection is a very different problem though. A classic example in economics comes from Jim Heckman, who studied the relationship between hours worked (X) and wages (Y) in a sample of women in 1967. The empirical challenge he addressed—which later got him the Nobel prize—is that we can only observe wages for those women who actually chose to work.¹ A corresponding graph would look like the following. Note the selection node S with the double border. It represents a special variable that takes on just two values indicating whether we observe Y for a particular unit in our sample. In Heckman’s example, S would be equal to one for women who choose to work, and zero for unemployed individuals.

In the graph, the decision to work is affected by the same kind of socio-economic factors (Z) that also determine wages (Y). A woman who chooses not to work does so because the wage she would be able to earn on the market wouldn’t compensate her for the opportunity costs of staying outside of the labor force. According to economic theory, her reservation wage is then higher than her market wage. It’s very likely that this reservation wage is driven by unobservables U though (e.g., personal wealth, preferences for leisure, family situation, etc.), which might also affect salaries. This is denoted by the dashed bidirected arc in the graph. Therefore the sample of women who work (S=1) is a *selected sample* that isn’t representative of the entire population.²

The important difference here is that in the presence of sample selection we’re not able to observe some variables for part of the population. In the endogeneity case, on the other hand, we do have full records; it’s just that a simple regression of Y on X (even when controlling for Z) won’t suffice. My hunch is that the confusion between the two concepts partly stems from imprecise language. Endogeneity is often a result of *self-selection*, meaning that individuals *choose* X based on unobservable factors U (see the first graph). U then confounds the relationship between X and Y and creates bias. Nevertheless, this self-selection isn’t equivalent to sample selection, because we can still observe Y for all individuals.
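
A toy simulation in the spirit of Heckman’s setting shows what the selected sample does to an estimate (all structural numbers are invented for illustration): wages are defined for everyone, but we only see them when the woman works, and selection operates on the same unobservables U that drive wages.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

u = rng.normal(0, 1, n)              # unobservables: wealth, preferences, ...
z = rng.normal(0, 1, n)              # observed socio-economic factors
wage = z + u + rng.normal(0, 1, n)   # market wage (defined for everyone)
works = (z + 2.0 * u) > 0            # S = 1 only if wage beats reservation wage

slope_all = np.polyfit(z, wage, 1)[0]                # population slope: ≈ 1
slope_sel = np.polyfit(z[works], wage[works], 1)[0]  # attenuated under S = 1
print(round(slope_all, 2), round(slope_sel, 2))
```

Within the S=1 sample, women with low Z only work if their U is high, which induces a negative Z–U dependence and attenuates the wage–Z slope; that’s the bias a Heckman correction is designed to undo.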

Recently, there was an entire paper in Strategic Management Journal devoted to Heckman selection models. But, if you ask me, it doesn’t help much to resolve the confusion either. The example research question the authors discuss is whether mergers and acquisitions (X) affect stock market reactions (Y). They claim that sample selection models are appropriate in this situation *“since stock market reactions are only available for firms that actually complete acquisitions”*. But is that correct? We also observe stock prices for firms that don’t acquire another company. Of course, we should be aware of the endogeneity of M&A decisions. But that’s more appropriately dealt with by an IV approach than by a Heckman selection model.

One last word on this. It’s actually true that parametric models dealing with either sample selection or endogeneity look quite similar in the end. That’s because they propose similar solutions (i.e., control functions) for the two problems. But, for the sake of clarity, we’d better keep these concepts apart. So here we have another example where, in my opinion, causal graphs make things very clear, whereas parametric models rather obscure matters.

¹ Think of wages being defined as the salary you would make if you were forced to work. So wages aren’t necessarily zero for unemployed women.

² Heckman solved this problem by making very specific assumptions about the error structure in his model (joint normality) and including a function of the selection propensity score in the final estimation equation that controls for the preferential selection of units into the sample (Angrist 1997).

There is a large literature documenting that firms which are predominantly owned by single families often invest less in innovation and R&D. Reasons for this are that family owners often want to keep tight control over the company, which leads to conservatism in their decision-making and a reluctance to involve outside investors. This makes it harder for them to finance risky and expensive R&D projects. Obviously, in the long-run, this will have an effect on performance, if family firms invest too little in new product development and optimizing their production processes.

Now imagine I want to analyze the relationship between family ownership and firm performance, but I’m not really interested in R&D expenditures. Maybe I want to investigate another phenomenon related to family owners, and the innovation aspect is already too well researched to care about anymore. At the same time, though, I know that R&D expenditures are an important factor and I’d better account for them. So if the graph above is indeed the correct model, I could include R&D expenditures in my regression to hold them constant across firms. Any effect of innovation spending on performance would then be eliminated and I could focus on whatever else I’m interested in. This works even though R&D expenditures are an endogenous, post-treatment variable in the model (you can see this by the arrow that points into it).

So far no problem. But the situation changes dramatically if we add an unobserved confounder.

Now R&D expenditures become what we call a *collider* variable, because two arrows, one emitted from family ownership and the other from *U*, meet in it. Colliders are a tricky business, because they open up biasing paths if we control for them (here: *family ownership → R&D expenditures ← U → firm performance*). They are almost like landmines in empirical research. Left alone they do no harm, but once you condition on them you end up with estimation bias.

Unfortunately, it’s quite likely that there are unobserved variables that affect both R&D expenditures and firm performance, such as general management quality or a particularly well-trained workforce. Therefore, collider bias is a huge risk when controlling for post-treatment variables.
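
A short simulation of exactly this story (invented coefficients, with *U* playing the role of unobserved management quality) shows the landmine going off:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

fam = rng.integers(0, 2, n).astype(float)    # family ownership
u = rng.normal(0, 1, n)                      # unobserved management quality
rnd = -fam + u + rng.normal(0, 1, n)         # R&D spending: the collider
perf = u + rng.normal(0, 1, n)               # true direct effect of fam: 0

b_raw = np.polyfit(fam, perf, 1)[0]          # leaves the collider alone: ≈ 0

# Controlling for R&D opens fam -> rnd <- u -> perf and biases the estimate
X = np.column_stack([fam, rnd, np.ones(n)])
b_ctrl = np.linalg.lstsq(X, perf, rcond=None)[0][0]
print(round(b_raw, 2), round(b_ctrl, 2))
```

Even though family ownership has zero direct effect on performance here, the regression that “controls for” R&D reports a clearly positive coefficient—pure collider bias.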

Sometimes people operate under the vague notion that if they include post-treatment variables in a regression they could get at a causal effect “net of this variable”. Or put differently, they think they could say something about the different causal mechanisms that are at play. This view is tricky, if not mistaken! Because if there are unobserved confounders between the outcome and the intermediate variable, like in the example above, it’s impossible to keep the two causal mechanisms apart. That’s exactly why mediation analysis is so hard. Kosuke Imai and coauthors have developed a sensitivity analysis, which allows you to check whether your mediation analysis remains robust to a mediator–outcome correlation. If you’re interested in mediation analysis, you can also check out this previous post of mine.

In any case, stay safe and watch out for collider bias!


“For decades, causal inference methods have found wide applicability in the social and biomedical sciences. As computing systems start intervening in our work and daily lives, questions of cause-and-effect are gaining importance in computer science as well. To enable widespread use of causal inference, we are pleased to announce a new software library, DoWhy. Its name is inspired by Judea Pearl’s do-calculus for causal inference. In addition to providing a programmatic interface for popular causal inference methods, DoWhy is designed to highlight the critical but often neglected assumptions underlying causal inference analyses.”

Source: https://www.microsoft.com/en-us/research/blog/dowhy-a-library-for-causal-inference/

Thanks to Matt Ranger (@vhranger) for the pointer!

At the moment the library’s functionality is still limited (mostly to backdoor adjustment and covariate stratification). But the team seems to be committed to extending its scope in the near future.

“In the future, we look forward to adding more features to the library, including support for more estimation and sensitivity methods and interoperability with available estimation software. We welcome your feedback and contributions as we develop the library. You can check out the DoWhy Python library on Github. We include a couple of examples to get you started through Jupyter notebooks here.”

Great to see that causal inference—once a purely academic endeavor—is finding more and more applications in business and that leading tech firms are investing in these capabilities. It’s becoming one of the hottest topics in data science right now. So don’t miss out!

As you can see, the idea of thinking about interventions in quasi-deterministic systems is strongly rooted in econometrics (so no need for the not-invented-here syndrome). In a sense, this story is also typical for the modern literature on machine learning, where smart computer scientists discover (some say “reinvent”, but that’s too disparaging for my taste) approaches from statistics and econometrics and take them to the next level. Because of his prior, Turing-award-worthy work on Bayesian networks, Pearl was able to adapt the idea of structural causal models and equip it with a powerful symbolic language that allows us to solve problems far beyond what has been possible with traditional econometric techniques. Clearly, we have much to learn from each other and can only benefit from the convergence of interest from both disciplines.

If you want to know more about the history of graphical causal models and some of the amusing anecdotes around their origin (involving, for example, guinea pigs, but I shouldn’t spoil it), I can highly recommend Pearl’s newest book *“The Book of Why”*, written together with Dana Mackenzie. It’s both an easily accessible introduction to the topic and an entertaining account of the last 25 years of Pearl’s research. On top of that you get some more funny rants about Karl Pearson—“causality’s worst adversary”. Definitely worth a read! :)

PhD students in economics are usually well-trained in the potential outcome framework. Therefore, I mostly frame directed acyclic graphs (DAGs) as a useful complement to the standard treatment effects estimators, in order to conceal my true revolutionary motives. ;) One concern with DAGs I sometimes encounter though is that they require so many strong assumptions about the presence (and absence) of causal relationships between the variables in your model. By contrast, so the argument goes, for treatment effect estimators, such as nearest-neighbor matching, you only have to justify the exogeneity of your treatment and that’s it. No need to specify a full causal model.

This argument is misguided. You always need a “full” causal model in order to do proper causal inference. But let me specify in more detail what I mean by this. In matching (or inverse probability weighting, or regression, or any other method that relies on unconfoundedness) you encounter a situation like the following.

You would like to estimate the effect of a treatment *T* (e.g., an R&D subsidy) on an outcome variable *Y* (e.g., firm growth). The problem is that there are other variables, *X*, out there that create a correlation between the treatment and the outcome. You first need to control for these confounding factors in order to get at the true causal effect of *T* on *Y*.

In the potential outcome framework this means that you need to justify the *unconfoundedness* assumption

*(Y(1), Y(0)) ⊥ T | X*.

If the treatment is independent of potential outcomes conditional on *X*—and you’re able to measure all these influence factors *X*—then you’re fine. The crux though is: what is *X*? Which variables do you need to control for? And which other influence factors can you safely leave uncontrolled? To make these claims you need to have a causal model—at least in your mind. And here the circle closes.

Every time you estimate something that entails the unconfoundedness assumption, you imply that your data is generated by a causal process such as the one depicted above. So treatment effects estimators don’t require fewer assumptions than graphical approaches, they just apply for one very specific causal model. If that model fits reality, great! Then you can go out and apply treatment effects methods “off the shelf”. But if it doesn’t you need to think harder about an appropriate model. And DAGs offer you a tremendously useful tool set to handle these types of situations.

Here’s the causal model that applies for the second most prominent estimator from the treatment effects literature—the non-parametric IV estimator.

In this situation there is no possibility of ever controlling for all confounding influences, because some of them remain unobserved (denoted by the dashed bidirected arc between *T* and *Y*). As a result, unconfoundedness will be violated. But you can do something else instead: use variation in a third variable *Z* to get at the causal effect of *T* on *Y*.¹ In order for that to work, you have to satisfy a very similar condition to unconfoundedness for the instrument though.

Your instrument has to be *excludable*, or independent of potential outcomes given a vector of control variables *X*. You see that you’re basically left with the same problem. How do you decide what is in *X*, and what isn’t?

In sum, treatment effects estimators such as matching and IV simply hand you a template of a causal model. If this template describes reality accurately, you can easily find causal effect estimates with the help of standard techniques. Graphical models capture the same standard cases, but on top of that provide you with a much more versatile toolbox for causal inference. The impression that matching and IV require fewer assumptions is a misconception. I admit it’s probably still easier to convince reviewers with the standard methods, simply because we’re so used to them. But that’s just a sign of an imperfection in the scientific process and says nothing about any substantive differences between the two approaches. Causal inference requires strong assumptions, one way or the other. There is no such thing as a free lunch in econometrics either.

¹ In addition, you will need to assume monotonicity, i.e., a monotone influence of *Z* on *T* for all members of the population. And even then, you can only identify an effect for the subgroup of compliers, who change their treatment status due to the instrument (for binary *Z*). These details are of secondary importance for the argument here. If you’re interested, you can check the seminal paper by Imbens and Angrist (1994) on nonparametric IV.