I’ve written about this problem before, where I argue that DAGs actually don’t have to be that complex if we look, for example, at the models we work with in structural econometrics or economic theory. But Jason Abaluck, professor at the Yale School of Management, brought up an interesting example that might be useful for illustrating what I have in mind.

Here is my reply:

It’s a good point that mapping out what we know in a DAG – especially in uncharted territory – can be complex. Regarding the specific example of the college wage premium, I would advise a grad student who studies this question to first do a thorough literature review. That’s the basis for synthesizing what we’ve learned about the topic over the last 50 years or so. The DAG then serves as a perfect tool for organizing this body of knowledge. Now, for some arrows the decision to include or omit them might be ambiguous. But these are exactly the cases where there is a need for future research. A great opportunity for a fresh grad student.

This process is of course quite tedious, but there isn’t really an alternative to it. When we justify the exogeneity of our instruments, we also need to know all possible confounders that might play a role. The same goes for arguing that there is no self-selection around the discontinuity threshold or that common trends hold. We can only justify these assumptions by synthesizing the prior knowledge we have about the subject under study.

That some people think this would be different with potential outcome methods is because we’ve accepted loose standards for arguing verbally about ignorability, exogeneity, and causal mechanisms in our papers and seminars. This process is highly non-transparent and prone to arguments by authority.

Going through the entire body of knowledge about a specific problem and casting it into a DAG is cumbersome, I realize. Once we start to make our assumptions more explicit, though, others will be able to build on our work. They can then test the proposed model against the available data or look for experimental evidence for ambiguous causal relationships. This process of knowledge curation is not something one paper can achieve alone; it has to be a truly collaborative exercise. I don’t see how we can have real progress in a field without it.

I have a couple of comments on specific points in the paper though, which I wrote down in several Twitter threads over the last weeks. I chose Twitter because we had many, quite extensive, discussions about DAGs there in recent months (Guido even cites some of our tweets in his paper) and because many economists seem to be active on this platform these days. Transferring all these threads into blog posts would – frankly – require too much time. But for archiving purposes, I will link to the start tweets of the individual threads here.

(I have also saved the full threads as text files in my documents. So if you don’t like going through them on Twitter, or in case they get deleted one day, you can always shoot me a message and I will send them to you.)

**Round 1:**

**Round 2:**

**Round 3:**

**Round 4:**

**Round 5:**

**Round 6:**

**Round 7:**

**Round 8:**

**Round 9:**

The topic of causal inference seems to be booming at the moment—and for good reasons.

Causal knowledge is crucial for decision-making. Take the example of an advertiser who wants to know how effective her company’s social media marketing campaign on Instagram is. Unfortunately, our current workhorse tools in machine learning are not capable of answering such a question.

A decision tree classifier might give you a very precise estimate that ads which use blue colors and sans-serif fonts are associated with 12% higher click-through rates. But does that mean that every advertising campaign should switch to that combination in order to boost user engagement? Not necessarily. It might just reflect the fact that a majority of Fortune-500 firms—the ones with great products—happen to use blue and sans-serif in their corporate designs.
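To make this concrete, here is a toy simulation (the “top firm” confounder and all numbers are invented for illustration): the observed association between design and click-through rate is driven entirely by a confounder, so intervening on the design would change nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: is the advertiser a top-tier firm with a great product?
top_firm = rng.random(n) < 0.3
# Top firms happen to prefer blue/sans-serif designs ...
blue_sans = np.where(top_firm, rng.random(n) < 0.8, rng.random(n) < 0.3)
# ... and their ads get clicked more because of the product, not the design:
# the design coefficient is exactly zero by construction.
ctr = 0.05 + 0.04 * top_firm + 0.00 * blue_sans + rng.normal(0, 0.001, n)

# "Seeing": blue/sans-serif ads are associated with higher click-through rates.
seen_diff = ctr[blue_sans].mean() - ctr[~blue_sans].mean()
print(f"observed CTR gap: {seen_diff:.4f}")  # clearly positive
# "Doing": forcing every ad to blue/sans-serif would change nothing here,
# because the design has zero causal effect in this simulation.
```

Any pattern-finding tool, including a decision tree, would faithfully report the positive association; it just cannot tell you that it is not actionable.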

This is what Judea Pearl—father of causality in artificial intelligence—calls the difference between “seeing” and “doing”. Standard machine learning tools are designed for seeing: observing, discerning patterns. And they’re pretty good at it! But management decisions very often involve “doing”, since the goal is to manipulate a variable *X* (e.g., ad design, team diversity, R&D spending, etc.) in order to achieve an effect on another variable *Y* (click-through rate, creativity, profits, etc.).

In my group we recently won a grant for a research project in which we want to learn more about how this crucial difference affects business practices. In particular, we want to know what kind of questions companies are trying to answer with their data science efforts, and whether these questions require causal knowledge. We also want to understand better whether firms are using appropriate tools for their respective business applications, or whether there’s a need for major retooling in the data science community. After all, there might be important questions that currently remain unanswered, because companies lack the causal inference skills to address them. That’s certainly another issue we would like to explore.

So, if you’re working in the field of data science and machine learning, and you’re interested in causality, please come talk to us! We would love to hear about your experiences. Slowly but surely, causal inference seems to be developing into one of the hottest trends in the tech sector, and our goal is to shed more light on this phenomenon with our research.

[…] since the researcher’s goal is to estimate the causal effect of D on Y, usually Z is required only to, along with X, block the back-door paths from D to Y (Pearl 2009), or equivalently, make the treatment assignment conditionally ignorable. In this case, [the coefficient on Z] could reflect not only its causal effect on Y, if any, but also other spurious associations not eliminated by standard assumptions.

It’s commonplace in regression analyses to not only interpret the effect of the regressor of interest, *D*, on an outcome variable, *Y*, but also to discuss the coefficients of the control variables. Researchers then often use lines such as: *“effects of the controls have expected signs”*, etc. And it has probably happened more than once that authors ran into trouble during peer review because some regression coefficients were not in line with what reviewers expected.

Cinelli and Hazlett remind us that this is shortsighted, at best, because coefficients of control variables do not necessarily have a *structural* interpretation. Take the following simple example: If we’re interested in estimating the causal effect of *X* on *Y*, *P(Y|do(X))*, it’s entirely sufficient to adjust¹ for *W1* in this graph. That’s because *W1* closes all backdoor paths between *X* and *Y*, and thus the causal effect can be identified as:
*P(Y|do(X)) = Σw1 P(Y|X, W1 = w1) P(W1 = w1)*

However, if we estimate the right-hand side, for example by linear regression, the coefficient of *W1* will not represent its effect on *Y*. It partly picks up the effect of *W2* too, since *W1* and *W2* are correlated.

If we also included *W2* in the regression, then the coefficients of the control variables could be interpreted structurally and would represent genuine causal effects. But in practice it’s very unlikely that we’ll be able to measure all causal parents of *Y*; the required data collection effort might simply be too large in a real-world setting.

Luckily, that’s not necessary. We only need to make sure that the treatment variable *X* is *unconfounded*, or *conditionally ignorable*, and a smaller set of control variables can do the job just fine. But that also implies that the coefficients of the controls lose their substantive meaning, because they now represent a complicated weighting of several causal influence factors. Therefore, it doesn’t make much sense to try to put them into context. And if they don’t have the expected signs, that’s not a problem.
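A quick simulation illustrates the point (the graph and all coefficients here are invented for this sketch): regressing *Y* on *X* and *W1* recovers the causal effect of *X* just fine, while the coefficient on *W1* mixes its own structural effect with that of the omitted *W2*.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative structural model: W2 and W1 are correlated,
# only W1 confounds X, but both W1 and W2 affect Y.
w2 = rng.normal(size=n)
w1 = 0.8 * w2 + 0.6 * rng.normal(size=n)
x = 0.5 * w1 + rng.normal(size=n)
y = 1.0 * x + 0.7 * w1 + 0.5 * w2 + rng.normal(size=n)  # structural effect of W1: 0.7

# OLS of Y on X and W1 (adjusting for W1 closes the backdoor path).
design = np.column_stack([np.ones(n), x, w1])
beta = np.linalg.lstsq(design, y, rcond=None)[0]

print(f"coefficient on X:  {beta[1]:.2f}")  # ~1.0, the true causal effect
print(f"coefficient on W1: {beta[2]:.2f}")  # ~1.1, NOT its structural effect 0.7
```

The *W1* coefficient absorbs part of *W2*’s influence (0.7 + 0.5 × 0.8 = 1.1 in this setup), even though the estimate for *X* is perfectly fine.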

¹ The term *control variable* is actually a bit of a misnomer, because *W1* isn’t *controlled* in the sense of an intervention. It’s rather *adjusted for* or *conditioned on*, in terms of taking conditional probabilities. But since the term is so ubiquitous, I’ll use it here too.

Causal inference has always been somewhat of a niche topic in AI. All of the cutting-edge machine learning tools—you know, the ones you’ve heard about, like neural nets, random forests, support vector machines, and so on—remain purely correlational and therefore cannot discern whether the rooster’s crow causes the sunrise, or the other way round. This seems to be changing, though, as more and more big shots start to recognize the limits of prediction methods and acknowledge a need for major retooling in the community.

Yoshua Bengio from the University of Montreal, who’s one of the pioneers in deep learning, was attending the symposium too. He spoke about transfer learning and causal discovery (the slides are available on the website). One funny anecdote of the event was that nobody during his talk—apart from himself, maybe—knew yet that Yoshua would be awarded the 2018 Turing Award (together with Geoffrey Hinton and Yann LeCun) for his contributions to neural networks and AI.

After his presentation, Yoshua excused himself for not attending lunch because he had to take an “important phone call”. That’s when the news broke. So together with Judea Pearl’s keynote on the first day, that made two Turing Award winners at the symposium.

Based on Pearl’s seminal work on graph-theoretic causal models (directed acyclic graphs), tremendous progress has been made in the field of causal AI during the last 30 years. But causal inference is obviously also super important in other fields that rely on empirical work, and they have all developed their own idiosyncratic methods for approaching causal questions. The symposium program was thus divided into several *“Causality + X”* sessions, where *X* referred to one of the many scientific disciplines in which causal inference plays a role:

- Machine learning and AI
- Computer vision
- Social sciences
- Health sciences / epidemiology

This format created a great opportunity for sharing different perspectives and stimulated learning beyond narrow disciplinary silos.

My session was about causal inference in the social sciences, together with Kosuke Imai from Harvard. I was responsible for representing the economics view.

If you’re interested in my slides, you can have a look here. Soon, Elias Bareinboim (who organized the event—thanks, Elias!) and I will also release a working paper in which we’ll go into much more detail on the subject.

To quickly summarize my main message: having spent considerable time studying the methods for causal inference developed in computer science, I came to the conclusion that economists can learn a lot from engaging with that literature. Of course, that goes the other way round too. So I think we could all benefit tremendously from mutual knowledge exchange, which—I must admit—hasn’t happened to a satisfactory extent so far. But I see many promising signs of improvement. More and more economists express interest in DAG methodology and what it has to offer.

One thing became clear to me when attending the symposium. The field of causal AI is developing rapidly in so many directions, and a lot of different fields are currently adopting graph-based approaches to causality. Econ should keep pace if we don’t want to lose touch with these developments. That doesn’t mean that we need to abandon our own unique perspective on causal inference, which is tailored to our specific needs. But coordinating on one basic framework for causal inference can have huge potential for cross-fertilization between disciplines. Something that we’re not nurturing nearly enough at the moment, if you ask me.

This is just one example, so nothing against @HallaMartin. But his tweet got me thinking. Apparently, in the year 2019 it’s no longer possible to convince people in an econ seminar with a propensity score matching (or any other matching on observables, for that matter). But why is that?

Here’s what I think. The typical matching setup looks somewhat like this:

You’re interested in estimating the causal effect of *X* on *Y*. But in order to do so, you need to adjust for the confounders *W*; otherwise you’ll end up with biased results. If you’re able to measure *W*, this adjustment can be done with propensity score matching, which is actually an efficient way of dealing with a large set of covariates.
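Here is a minimal sketch of the underlying logic (with a single made-up binary confounder, so simple stratification stands in for a full propensity score matching):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Binary confounder W affects both treatment take-up and the outcome.
w = rng.random(n) < 0.5
x = rng.random(n) < np.where(w, 0.7, 0.2)    # treatment probability depends on W
y = 2.0 * x + 3.0 * w + rng.normal(0, 1, n)  # true treatment effect: 2.0

# Naive comparison of treated vs. untreated is confounded by W.
naive = y[x].mean() - y[~x].mean()

# Matching on observables in its simplest form: compare treated and
# untreated units within each stratum of W, then average over strata.
effects = [y[x & (w == v)].mean() - y[~x & (w == v)].mean() for v in (True, False)]
weights = [(w == v).mean() for v in (True, False)]
adjusted = sum(e * p for e, p in zip(effects, weights))

print(f"naive difference:   {naive:.2f}")     # biased upward by the confounder
print(f"adjusted estimate:  {adjusted:.2f}")  # close to the true effect 2.0
```

With many covariates you would estimate a propensity score and match on that instead, but the identifying assumption is exactly the same: all confounders are in *W*.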

The problem, though, is being sure that you’ve adjusted for all possible confounding factors. How can you be certain that there are no unobserved variables left that affect both *X* and *Y*? Because if the picture looks like the one below (where the unobserved confounders are depicted by the dashed bidirected arc), matching will only give you biased estimates of the causal effect you’re after.

Presumably, the Twitter meme alludes to exactly this problem. And I agree that it’s hard to claim that you’ve accounted for all confounding influence factors in a matching. But how does that compare with economists’ most preferred alternative—the instrumental variable (IV) estimator? Here the setup looks like this:

Now, unobserved confounders between *X* and *Y* are allowed, as long as you’re able to find an instrument *Z* that affects *X* but is unrelated to *Y*. In that case, *Z* creates exogenous variation in *X* that can be leveraged to estimate *X*‘s causal effect. (Because of the exogenous variation in *X* induced by *Z*, we also call this IV setup a *surrogate experiment*, by the way.)

Great, so we have found a way forward if we’re not 100% sure that we’ve accounted for all unobserved confounders. Instead of a propensity score matching, we can simply resort to an IV estimator.
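For illustration, here is a minimal simulated version of that strategy (all coefficients invented; the simple Wald ratio stands in for a full 2SLS):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

u = rng.normal(size=n)  # unobserved confounder of X and Y
z = rng.normal(size=n)  # instrument: affects X, unrelated to U
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 1.5 * u + rng.normal(size=n)  # true causal effect of X on Y: 2.0

# OLS is biased because U confounds X and Y.
ols = np.cov(x, y)[0, 1] / np.var(x)

# Simple IV (Wald) estimator: cov(Z, Y) / cov(Z, X).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate: {ols:.2f}")  # biased away from 2.0
print(f"IV estimate:  {iv:.2f}")   # close to 2.0
```

Note that the simulation works only because *Z* is, by construction, independent of the unobserved confounder *U*; that is precisely the assumption at stake below.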

But if you think about this a bit more, you’ll realize that we face a very similar situation here. The whole IV strategy breaks down if there are unobserved confounders between *Z* and *Y* (see again the dashed arc below). How can we be sure to rule out all influence factors that jointly affect the instrument and the outcome? It’s the same problem all over again.

So in that sense, matching and IV are not very different. In both cases we need to carefully justify our identifying assumptions based on the domain knowledge we have. Whether ruling out unobserved confounders between *X* and *Y* is more plausible than ruling them out between *Z* and *Y* depends on the specific context under study. But on theoretical grounds, there’s no difference in strength or quality between the two assumptions. So I don’t really get why—as a rule—economists shouldn’t trust a propensity score matching, while an IV approach is fine.

Now you might say that this is just Twitter babble. But my impression is that most economists nowadays would indeed be very suspicious of “selection on observables”-type identification strategies.* Even though there’s nothing inherently implausible about them.

In my view, the opaqueness of the potential outcome (PO) framework is partly to blame for this. Let me explain. In PO, your starting point is to assume unconfoundedness of the treatment variable,

*(Y(0), Y(1)) ⊥ X | W*.

This assumption requires the treatment *X* to be independent of the potential outcomes of *Y*, conditional on a vector of covariates *W* (as in the first picture above). But what is this magic vector *W* that can make all your causal effect estimation dreams come true? Nobody will tell you.

And if the context you’re studying is a bit more complicated than in the graphs I’ve shown you—with several causally connected variables in a model—it becomes very complex to even properly think this through. So in the end, deciding whether unconfoundedness holds becomes more of a guessing game.

My hunch is that after having seen too many failed attempts of dealing with this sort of complexity, people have developed a general mistrust against unconfoundedness and strong exogeneity type assumptions. But we still don’t want to give up on causal inference altogether. So we move over to the next best thing: IV, RDD, Diff-in-Diff, you name it.

It’s not that these methods have weaker requirements. They all rely on untestable assumptions about unobservables. But maybe they seem more credible because you’ve jumped through more hoops with them?

I don’t know. And I don’t want to get too much into kitchen sink psychology here. I just know that the PO framework makes it incredibly hard to justify crucial identification assumptions, because it’s so much of a black box. And I think there are better alternatives out there, based on the causal graphs I used in this post (see also here). Who knows, maybe by adopting them we might one day be able to appreciate a well carried out propensity score matching again.

* Interestingly though, this only seems to be the case for reduced-form analyses. Structural folks mostly get away with controlling for observables; presumably because structural models make causal assumptions much more explicit than the potential outcome framework.

Causal inference lies at the heart of policy-making, since every policy measure aims at actively manipulating certain economic variables in order to achieve a desired goal. To make an informed decision about which measures to implement, policy makers need to have knowledge about the likely impact of their actions. Newly emerging approaches in machine learning and predictive analytics are inherently inadequate for supplying this kind of knowledge, though, as they remain purely correlation-based and thus unable to address causal questions.

Based on the seminal work by Judea Pearl (2000), the literature on causal inference in computer science and artificial intelligence (AI) has developed unique tools to tackle causal prediction problems, which go well beyond the standard approaches in econometrics. Areas in which this literature has made important contributions are as diverse as:

- Estimating causal effects with observational data
- Learning from surrogate experiments (“encouragement designs”)
- Dealing with selection bias
- External validity of policy experiments
- Transporting experimental results across heterogeneous populations

This paper synthesizes recent advances in the field of causal AI and gives an overview of how these techniques add to the existing econometric toolbox. We show how—in particular when combined with the large data sets that are increasingly becoming available—these approaches open entirely new avenues for policy research. Since other disciplines, such as epidemiology, sociology, and political science, were much quicker than economics in adopting these tools, our hope is that our paper will help economics catch up in this regard.

Pearl, J. (2000): *Causality: Models, Reasoning, and Inference*. New York, NY: Cambridge University Press.

But then you go out to apply the methods to your own particular problem and soon realize that it’s very hard to keep the model at a manageable size. Because how can you be sure that two variables aren’t related to each other? So you’d better keep a link between them. But suddenly everything depends on everything, and all hope of getting at the desired *P(y|do(x))* is lost.

Indeed, in complete networks like this one

estimating causal effects will be nearly impossible. Besides the fact that the graph is obviously not acyclic, if all variables in the model are causally related, you would need to observe all of them at just the right frequency to get anywhere with identification.

Does that ultimately undermine the usefulness of graphical approaches? Well, I wouldn’t say so. If anything, it shows you how under-specified the implicit models are that we usually work with in the currently prevalent potential outcome (PO) framework. Because if you really believe that “everything causes everything”, then good luck justifying your unconfoundedness assumption or exclusion restriction. PO and DAG folks sit – as they say – in exactly the same boat here.

DAGs have one big advantage over PO though. Namely that they disclose crucial identifying assumptions very transparently. In PO, by contrast, you only have an implicit model of the specific context you’re studying in mind. This gives you no guidance whatsoever on how to justify the conditional independence assumptions involving counterfactuals that PO techniques rely on. Whether you’re allowed to call your estimates “causal” is then solely decided by the gut feeling of your seminar attendants and reviewers. As long as they have a hunch that “your treatment is still endogenous” there’s not much you can do – apart from resorting to an argument by authority, maybe.

DAGs have yet another advantage to offer the ambitious empiricist. Every graph you specify gives rise to testable implications, due to the d-separation relationships between the variables in your model. That way it actually becomes possible to check whether the graph is consistent with the joint distribution of the data, which lends further credibility to your analysis.
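As a small illustration (using an invented three-variable chain), one can check such an implication directly: the chain *A → B → C* implies that *A* and *C* are independent given *B*, so their partial correlation should vanish in data generated from that graph.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Illustrative chain graph A -> B -> C (no other arrows).
a = rng.normal(size=n)
b = 0.8 * a + rng.normal(size=n)
c = 0.5 * b + rng.normal(size=n)

def partial_corr(u, v, cond):
    """Correlation of u and v after linearly adjusting both for cond."""
    design = np.column_stack([np.ones_like(cond), cond])
    ru = u - design @ np.linalg.lstsq(design, u, rcond=None)[0]
    rv = v - design @ np.linalg.lstsq(design, v, rcond=None)[0]
    return np.corrcoef(ru, rv)[0, 1]

# d-separation in the chain implies A ⊥ C | B: a vanishing partial correlation.
print(f"corr(A, C):     {np.corrcoef(a, c)[0, 1]:.3f}")  # clearly nonzero
print(f"corr(A, C | B): {partial_corr(a, c, b):.3f}")    # ~0, as the DAG predicts
```

If the postulated graph were wrong—say, there were also a direct arrow *A → C*—this check would fail, pointing you to exactly the part of the model that needs revision.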

If we want to bring DAG methodology forward and achieve a wider diffusion in the community, we clearly need to develop best-practice standards for model building.* And it’s quite evident that models will need to be sufficiently sparse (unless you want to go back to the huge general equilibrium models of the 70s with ten thousand and more equations). In other words, we’ll need to apply a fine Occam’s razor.

I actually expect some kind of convergence with established approaches in economic theory and “structural econometrics” to occur, where the goal is usually to model a couple of key mechanisms in full detail, while leaving the less relevant things “for the error term” in order to keep models tractable.

The good thing is that the testable implications of DAGs always provide the opportunity for an ex-post sanity check. If you realize that the postulated graph doesn’t comply with the data (because some of its d-separation relations are violated), there’s always the possibility to go back to the drawing board and refine the model. Even better, d-separation will guide you exactly to the point where the graph doesn’t fit. So you’re not left in the dark about where to start improving the model, like with other diagnostic tools based on global goodness of fit.

Taking this program seriously also offers a unique opportunity to finally bring the two competing econometrics camps – PO and “structural” – closer together again. Making your assumptions explicit – clearly visible for everybody to see in the graph – renders causal inference less of a black box than it currently is under PO. At the same time, you don’t need to be a “structural geek” who solves systems of equations as a distraction before bedtime in order to work with graphs and do good empirical work. If you ask me, DAGs offer a perfect middle ground, with just the right balance between complexity and tractability. It’s worth having a look into them!

* There’s no such thing as “model-free causal inference” – in case you were wondering.

Here, the relationship between X and Y is confounded by unobservable influence factors (denoted by the dashed bidirected arrow). Therefore, we cannot estimate the causal effect of X on Y by a simple regression. But since the instrument Z induces variation in X that is unrelated to the unobserved confounders, we can use Z as an auxiliary experiment that allows us to identify the so-called *local average treatment effect* (or *LATE*) of X on Y.¹

For this to work, it’s crucial that Z doesn’t directly affect Y (i.e., no arrow from Z to Y). Moreover, there shouldn’t be any unobservable confounders (i.e., other dashed bidirected arcs) between Z and Y; otherwise the identification argument breaks down. These two assumptions need to be justified purely on theoretical grounds and cannot be tested with the help of data.

Unfortunately, however, you will frequently come across people who don’t accept that the assumption of instrument validity isn’t testable. Usually, these folks then ask you to do one of the following two things in order to convince them:

- Show that Z is uncorrelated with Y (conditional on the other control variables in your study), or;
- Show that Z is uncorrelated with Y when adjusting for X (again, conditional on the other controls).

Both of these requests are wrong. The first one is particularly moronic: in order not to run into a weak instruments problem, we want Z to exert a strong influence on X. If X also affects Y, there will be a correlation between Z and Y by construction, through the causal chain Z → X → Y.

The second request is likewise mistaken, because adjusting for X doesn’t d-separate Z and Y. On the contrary: X is a collider on the path connecting Z to the unobserved confounders of Y, so conditioning on X opens up that path and thus creates a correlation between Z and Y.²

So neither “test” will tell you anything about whether the causal structure in the graph above is correct. Z and Y can be significantly correlated (also conditional on X) even though the instrument is perfectly valid. These tests have no discriminating power whatsoever. Instead, all you can do is argue on theoretical grounds that the IV assumptions are fulfilled.
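A small simulation drives this home (all coefficients are invented for this sketch): even with a perfectly valid instrument, both “tests” come back significant.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

u = rng.normal(size=n)  # unobserved confounder of X and Y
z = rng.normal(size=n)  # perfectly valid instrument by construction
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 1.0 * x + 1.0 * u + rng.normal(size=n)

# "Test" 1: Z and Y are correlated -- of course, through the chain Z -> X -> Y.
raw = np.corrcoef(z, y)[0, 1]

# "Test" 2: adjust for X. Conditioning on the collider X links Z to U,
# so a spurious (here: negative) partial correlation between Z and Y appears.
design = np.column_stack([np.ones(n), x])
rz = z - design @ np.linalg.lstsq(design, z, rcond=None)[0]
ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
partial = np.corrcoef(rz, ry)[0, 1]

print(f"corr(Z, Y):     {raw:.3f}")      # nonzero despite a valid instrument
print(f"corr(Z, Y | X): {partial:.3f}")  # also nonzero, due to collider bias
```

Neither correlation tells you anything about instrument validity, which holds exactly in this simulation.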

In general, there is no such thing as purely data-driven causal inference. At one point, you will always have to rely on untestable assumptions that need to be substantiated by expert knowledge about the empirical setting at hand. Causal graphs are of great help here though, because they make these assumptions super transparent and tractable. I see way too many people — all across the ranks — who are confused about the untestability of IV assumptions. If we would teach them causal graph methodology more thoroughly, I’m sure this would be less of a problem.

¹ Identification of the LATE additionally requires that the effect of Z on X is monotone. If you want to know more about these and other details of IV estimation, you can have a look at my lecture notes on causal inference here.

² I explain the terms *d-separation* and *colliders* both here and here (the latter source is more technical).