Mapping Unchartered Territory

A frequent point of criticism against Directed Acyclic Graphs is that writing them down for a real-world problem can be a difficult task. There are numerous possible variables to consider and it’s not clear how we can determine all the causal relationships between them. We recently had a Twitter discussion where exactly this argument popped up again.

I’ve written about this problem before, arguing that DAGs don’t actually have to be that complex if we look at, for example, the models we work with in structural econometrics or economic theory. But Jason Abaluck, professor at the Yale School of Management, brought up an interesting example that might be useful for illustrating what I have in mind.

Here is my reply:

It’s a good point that mapping out what we know in a DAG – especially for uncharted territory – can be complex. Regarding the specific example of the college wage premium, I would advise a grad student who studies this question to first do a thorough literature review. That’s the basis for synthesizing what we’ve learned in 50 years or so about the topic. The DAG then serves as a perfect tool for organizing this body of knowledge. Now, for some arrows the decision to include or omit them might be ambiguous. But these are exactly the cases where there is a need for future research. A great opportunity for a fresh grad student.

This process is of course quite tedious, but there isn’t really an alternative to it. When we justify the exogeneity of our instruments, we also need to know all possible confounders that might play a role. The same goes for arguing that there is no self-selection around the discontinuity threshold or that common trends hold. We can only justify these assumptions by synthesizing the prior knowledge we have about the subject under study.

Some people think this would be different with potential outcome methods, but that is only because we’ve accepted loose standards for arguing verbally about ignorability, exogeneity, and causal mechanisms in our papers and seminars. This process is highly non-transparent and prone to arguments from authority.

Going through the entire body of knowledge about a specific problem and casting it into a DAG is cumbersome, I realize. Once we start to make our assumptions more explicit, though, others will be able to build on our work. They can then test the proposed model against the available data or look for experimental evidence on ambiguous causal relationships. This process of knowledge curation is not something one paper can achieve alone; it has to be a truly collaborative exercise. I don’t see how we can have real progress in a field without it.
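To give a rough idea of what “making assumptions explicit” could look like in practice, here is a minimal Python sketch. The four variables and the arrows between them are purely hypothetical placeholders that I made up for illustration; they are not a summary of the literature on the college wage premium. The sketch stores each variable’s assumed direct causes, checks that the graph is acyclic, and lists every pair of variables without a direct arrow, since each omitted arrow is an assumption that others can challenge or test.

```python
from graphlib import TopologicalSorter
from itertools import combinations

# Hypothetical DAG (illustration only): each variable maps to its assumed direct causes.
parents = {
    "family_background": [],
    "ability": [],
    "college": ["family_background", "ability"],
    "wage": ["college", "ability"],
}

# A causal diagram has to be acyclic; static_order() raises CycleError if it is not.
causal_order = list(TopologicalSorter(parents).static_order())

# Omitted arrows are assumptions too, so list every pair without a direct link.
missing_arrows = [
    (a, b)
    for a, b in combinations(sorted(parents), 2)
    if a not in parents[b] and b not in parents[a]
]

print("One valid causal ordering:", causal_order)
print("Pairs with no direct arrow:", missing_arrows)
```

In this toy graph, the missing arrow between family background and wage is exactly the kind of ambiguous case that would deserve a closer look.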

PO vs. DAGs – Comments on Guido Imbens’ New Paper

Guido Imbens published a new working paper in which he develops a detailed comparison of the potential outcomes framework (PO) and directed acyclic graphs (DAG) for causal inference in econometrics. I really appreciate this paper, because it introduces a broader audience in economics to DAGs and highlights the complementarity of both approaches for applied econometric work.

Causal Data Science in Business

A while back I posted about Facebook’s causal inference group and how causal data science tools are slowly finding their way from academia into business. Since then I have come across many more examples of well-known companies investing in their causal inference (CI) capabilities: Microsoft released its DoWhy library for Python, providing CI tools based on Directed Acyclic Graphs (DAGs); I recently met people from IBM Research interested in the topic; Zalando is constantly looking for people to join their CI/ML team; and Lufthansa, Uber, and Lyft have research units working on causal AI applications too.

The topic of causal inference seems to be booming at the moment—and for good reasons.

Causal knowledge is crucial for decision-making. Take the example of an advertiser who wants to know how effective her company’s social media marketing campaign on Instagram is. Unfortunately, our current workhorse tools in machine learning are not capable of answering such a question.

A decision tree classifier might give you a very precise estimate that ads which use blue colors and sans-serif fonts are associated with 12% higher click-through rates. But does that mean that every advertising campaign should switch to that combination in order to boost user engagement? Not necessarily. It might just reflect the fact that a majority of Fortune-500 firms—the ones with great products—happen to use blue and sans-serif in their corporate designs.

This is what Judea Pearl, father of causality in artificial intelligence, calls the difference between “seeing” and “doing”. Standard machine learning tools are designed for seeing, observing, discerning patterns. And they’re pretty good at it! But management decisions very often involve “doing”, as long as the goal is to manipulate a variable X (e.g., ad design, team diversity, R&D spending, etc.) in order to achieve an effect on another variable Y (click-through rate, creativity, profits, etc.).
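The gap between seeing and doing can be sketched with a small simulation. Everything in it is an assumption made for illustration: firms with great products are more likely to use a blue design, product quality raises the click probability, and the design itself has no effect at all. The function name `simulate` and all probabilities are hypothetical.

```python
import random

def simulate(n, intervene=None, seed=0):
    """Draw (blue, click) pairs; if `intervene` is set, the design is forced."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        quality = rng.random() < 0.5                    # great product or not
        if intervene is None:
            # Seeing: firms with great products tend to pick blue designs.
            blue = rng.random() < (0.8 if quality else 0.2)
        else:
            # Doing: we set the design ourselves, regardless of quality.
            blue = intervene
        click = rng.random() < (0.05 + 0.10 * quality)  # design has no effect
        rows.append((blue, click))
    return rows

def click_rate(rows, blue_value):
    clicks = [click for blue, click in rows if blue == blue_value]
    return sum(clicks) / len(clicks)

observed = simulate(200_000)
seeing = click_rate(observed, True) - click_rate(observed, False)

forced_blue = simulate(100_000, intervene=True, seed=1)
forced_other = simulate(100_000, intervene=False, seed=2)
doing = click_rate(forced_blue, True) - click_rate(forced_other, False)

print(f"seeing: {seeing:.3f}, doing: {doing:.3f}")
```

Under these assumptions, the observational comparison shows a clearly positive gap in click-through rates, while forcing the design on everyone leaves the click-through rate essentially unchanged.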

In my group we recently won a grant for a research project in which we want to learn more about how this crucial difference affects business practices. In particular, we want to know what kind of questions companies are trying to answer with their data science efforts, and whether these questions require causal knowledge. We also want to understand better whether firms are using appropriate tools for their respective business applications, or whether there’s a need for major retooling in the data science community. After all, there might be important questions that currently remain unanswered, because companies lack the causal inference skills to address them. That’s certainly another issue we would like to explore.

So, if you’re working in the field of data science and machine learning, and you’re interested in causality, please come talk to us! We would love to hear about your experiences. Slowly but surely, causal inference seems to be developing into one of the hottest trends in the tech sector right now, and our goal is to shed more light on this phenomenon with our research.

Don’t Put Too Much Meaning Into Control Variables

I’m currently reading this great paper by Carlos Cinelli and Chad Hazlett: “Making Sense of Sensitivity: Extending Omitted Variable Bias”. They develop a full suite of sensitivity analysis tools for the omitted variable problem in linear regression, which everyone interested in causal inference should have a look at. While kind of a side topic, they make an important point on page 6 (footnote 6):

[…] since the researcher’s goal is to estimate the causal effect of D on Y, usually Z is required only to, along with X, block the back-door paths from D to Y (Pearl 2009), or equivalently, make the treatment assignment conditionally ignorable. In this case, \hat{\gamma} could reflect not only its causal effect on Y, if any, but also other spurious associations not eliminated by standard assumptions.

It’s commonplace in regression analyses to not only interpret the effect of the regressor of interest, D, on an outcome variable, Y, but also to discuss the coefficients of the control variables. Researchers then often use lines such as: “effects of the controls have expected signs”, etc. And it has probably happened more than once that authors ran into trouble during peer review because some regression coefficients were not in line with what reviewers expected.

Cinelli and Hazlett remind us that this is shortsighted, at best, because coefficients of control variables do not necessarily have a structural interpretation. Take the following simple example:

If we’re interested in estimating the causal effect of X on Y, P(Y|do(X)), it’s entirely sufficient to adjust¹ for W1 in this graph. That’s because W1 closes all backdoor paths between X and Y, and thus the causal effect can be identified as:

P(Y|do(X)) = \sum_{W_1} P(Y|X, W_1)P(W_1).

However, if we estimate the right-hand side, for example, by linear regression, the coefficient of W1 will not represent its effect on Y. It partly picks up the effect of W2 too, since W1 and W2 are correlated.

If we also included W2 in the regression, the coefficients of the control variables could be interpreted structurally and would represent genuine causal effects. But in practice it’s very unlikely that we’ll be able to measure all causal parents of Y. The data collection effort could simply be too large in a real-world situation.

Luckily, that’s not necessary. We only need to make sure that the treatment variable X is unconfounded or conditionally ignorable. And a smaller set of control variables could do the job just fine. But that also implies that the coefficients of controls lose their substantive meaning, because they now represent a complicated weighting of several causal influence factors. Therefore, it doesn’t make much sense to try to put them into context. And if they don’t have expected signs, that’s not a problem.
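A small simulation can illustrate the point. The data-generating process below is my own assumption, not taken from Cinelli and Hazlett: W1 and W2 are correlated through a shared component, W1 drives X, and Y depends on X (true effect 1.0), W1 (0.5), and W2 (2.0). The helper `ols` is a bare-bones least-squares routine written for this sketch.

```python
import random

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented system [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(k):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(0)
n = 20_000
common = [rng.gauss(0, 1) for _ in range(n)]      # makes W1 and W2 correlated
w1 = [c + rng.gauss(0, 1) for c in common]
w2 = [c + rng.gauss(0, 1) for c in common]
x = [a + rng.gauss(0, 1) for a in w1]             # W1 is the only cause of X
y = [1.0 * xi + 0.5 * a + 2.0 * b + rng.gauss(0, 1)
     for xi, a, b in zip(x, w1, w2)]

short = ols([x, w1], y)        # intercept, X, W1
full = ols([x, w1, w2], y)     # intercept, X, W1, W2

print("adjusting for W1 only:  X =", round(short[1], 2), " W1 =", round(short[2], 2))
print("adjusting for W1 and W2: X =", round(full[1], 2), " W1 =", round(full[2], 2))
```

In both regressions the coefficient of X should land close to the true effect of 1.0. The coefficient of W1, however, only comes close to its structural value of 0.5 once W2 is included; in the short regression it absorbs part of W2’s influence and ends up far larger.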


¹ The term control variable is actually a bit outdated, because W1 isn’t controlled in the sense of an intervention. Rather, it is adjusted for or conditioned on, in the sense of taking conditional probabilities. But since the term is so ubiquitous, I’ll use it here too.

Beyond Curve Fitting

Last week I attended the AAAI spring symposium on “Beyond Curve Fitting: Causation, Counterfactuals, and Imagination-based AI”, held at Stanford University. Since Judea Pearl and Dana Mackenzie published “The Book of Why”, the topic of causal inference has been gaining momentum in the machine learning and artificial intelligence community. If we want to build truly intelligent machines, which are able to interact with us in a meaningful way, we have to teach them the concept of causality. Otherwise, our future robots will never be able to understand that forcing the rooster to crow at 3 a.m. won’t make the sun appear.

Why so much hate against propensity score matching?

I’ve seen several variants of this meme on Twitter recently.

This is just one example, so nothing against @HallaMartin. But his tweet got me thinking. Apparently, in the year 2019 it’s not possible anymore to convince people in an econ seminar with a propensity score matching (or any other matching on observables, for that matter). But why is that?

Here’s what I think. The typical matching setup looks somewhat like this:

You’re interested in estimating the causal effect of X on Y. But in order to do so, you will need to adjust for the confounders W, otherwise you’ll end up with biased results. If you’re able to measure W, this adjustment can be done in a propensity score matching, which is actually an efficient way of dealing with a large set of covariates.

The problem though is to be sure that you’ve adjusted for all possible confounding factors. How can you be certain that there are no unobserved variables left that affect both X and Y? Because if the picture looks like the one below (where the unobserved confounders are depicted by the dashed bidirected arc), matching will only give you biased estimates of the causal effect you’re after.
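Here is a minimal sketch of that problem in Python. As a stand-in for a full propensity score matching, it simply stratifies on a single binary W: with one binary covariate the propensity score takes only two values, so matching on it amounts to comparing treated and untreated units within each level of W. All parameter values, and the function name `stratified_effect`, are assumptions for illustration; the true effect of X on Y is set to 1.0.

```python
import random

def stratified_effect(n, hidden_confounder, seed=0):
    """Estimate the effect of X on Y by stratifying on the observed binary W."""
    rng = random.Random(seed)
    # sums[w][x] and counts[w][x] accumulate Y within each (W, X) cell.
    sums = [[0.0, 0.0], [0.0, 0.0]]
    counts = [[0, 0], [0, 0]]
    for _ in range(n):
        w = int(rng.random() < 0.5)                 # observed confounder
        u = int(rng.random() < 0.5)                 # unobserved variable
        u_effect = u if hidden_confounder else 0    # switch the dashed arc on or off
        x = int(rng.random() < 0.2 + 0.4 * w + 0.3 * u_effect)
        y = 1.0 * x + 2.0 * w + 2.0 * u_effect + rng.gauss(0, 1)
        sums[w][x] += y
        counts[w][x] += 1
    effect = 0.0
    for w in (0, 1):
        treated = sums[w][1] / counts[w][1]
        control = sums[w][0] / counts[w][0]
        # Weight each stratum by its share of the sample.
        effect += (treated - control) * (counts[w][0] + counts[w][1]) / n
    return effect

only_observed = stratified_effect(50_000, hidden_confounder=False)
with_hidden = stratified_effect(50_000, hidden_confounder=True)

print(f"W captures all confounding: {only_observed:.2f}")
print(f"unobserved confounder left: {with_hidden:.2f}")
```

When W is the only confounder, the stratified estimate should sit close to 1.0. As soon as the unobserved variable affects both X and Y, the same procedure overstates the effect, and nothing in the observed data flags it.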

Presumably, the Twitter meme is alluding to exactly this problem. And I agree that it’s hard to make the claim that you’ve accounted for all confounding influence factors in a matching. But what about economists’ preferred alternative, the instrumental variable (IV) estimator? Here the setup looks like this:

Now, unobserved confounders between X and Y are allowed, as long as you’re able to find an instrument Z that affects X but is unrelated to Y except through X. In that case, Z creates exogenous variation in X that can be leveraged to estimate X’s causal effect. (Because of the exogenous variation in X induced by Z, we also call this IV setup a surrogate experiment, by the way.)

Great, so we have found a way forward if we’re not 100% sure that we’ve accounted for all unobserved confounders. Instead of a propensity score matching, we can simply resort to an IV estimator.

But if you think about this a bit more, you’ll realize that we face a very similar situation here. The whole IV strategy breaks down if there are unobserved confounders between Z and Y (see again the dashed arc below). How can we be sure to rule out all influence factors that jointly affect the instrument and the outcome? It’s the same problem all over again.
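The following sketch makes the parallel concrete, using the simple ratio-of-covariances IV estimator for a single instrument in a linear model. The model and all coefficients are assumptions of mine: U confounds X and Y, the true effect of X on Y is 1.0, and a second unobserved variable V can optionally affect both Z and Y.

```python
import random

def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def simulate_iv(n, instrument_confounded, seed=0):
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)      # unobserved confounder of X and Y
        v = rng.gauss(0, 1)      # potential unobserved confounder of Z and Y
        link = v if instrument_confounded else 0.0
        zi = link + rng.gauss(0, 1)
        xi = zi + u + rng.gauss(0, 1)
        yi = 1.0 * xi + 2.0 * u + 1.5 * link + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y

z, x, y = simulate_iv(50_000, instrument_confounded=False)
ols_slope = cov(x, y) / cov(x, x)      # naive regression of Y on X
iv_valid = cov(z, y) / cov(z, x)       # IV estimate with a clean instrument

z2, x2, y2 = simulate_iv(50_000, instrument_confounded=True)
iv_confounded = cov(z2, y2) / cov(z2, x2)

print(f"naive OLS: {ols_slope:.2f}")
print(f"IV, valid instrument: {iv_valid:.2f}")
print(f"IV, Z and Y confounded: {iv_confounded:.2f}")
```

With a clean instrument, the IV estimate should recover the true effect of 1.0 even though the naive regression is biased. Once V links Z and Y, the IV estimate is off as well, and again the data alone won’t tell you.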

So in that sense, matching and IV are not very different. In both cases we need to carefully justify our identifying assumptions based on the domain knowledge we have. Whether ruling out X \dashleftarrow\dashrightarrow Y is more plausible than ruling out Z \dashleftarrow\dashrightarrow Y depends on the specific context under study. But on theoretical grounds, there’s no difference in strength or quality between the two assumptions. So I don’t really get why, as a rule, economists shouldn’t trust a propensity score matching, but an IV approach is fine.

Now you might say that this is just Twitter babble. But my impression is that most economists nowadays would indeed be very suspicious of “selection on observables”-type identification strategies.* Even though there’s nothing inherently implausible about them.

In my view, the opaqueness of the potential outcome (PO) framework is partly to blame for this. Let me explain. In PO your starting point is to assume unconfoundedness of the treatment variable:

(Y^1, Y^0) \perp X | W.

This assumption requires the treatment X to be independent of the potential outcomes of Y, when controlling for a vector of covariates W (as in the first picture above). But what is this magic vector W that can make all your causal effect estimation dreams come true? Nobody will tell you.

And if the context you’re studying is a bit more complicated than in the graphs I’ve shown you, with several causally connected variables in a model, it becomes very complex to even properly think this through. So in the end, deciding whether unconfoundedness holds becomes more of a guessing game.

My hunch is that after having seen too many failed attempts at dealing with this sort of complexity, people have developed a general mistrust of unconfoundedness and strong exogeneity type assumptions. But we still don’t want to give up on causal inference altogether. So we move over to the next best thing: IV, RDD, Diff-in-Diff, you name it.

It’s not that these methods have weaker requirements. They all rely on untestable assumptions about unobservables. But maybe they seem more credible because you’ve jumped through more hoops with them?

I don’t know. And I don’t want to get too much into armchair psychology here. I just know that the PO framework makes it incredibly hard to justify crucial identification assumptions, because it’s so much of a black box. And I think there are better alternatives out there, based on the causal graphs I used in this post (see also here). Who knows, maybe by adopting them we might one day be able to appreciate a well carried out propensity score matching again.

* Interestingly though, this only seems to be the case for reduced-form analyses. Structural folks mostly get away with controlling for observables; presumably because structural models make causal assumptions much more explicit than the potential outcome framework.

Causal Inference for Policymaking

I just submitted an extended abstract of an upcoming paper to a conference that will discuss new analytical tools and techniques for policymaking. The abstract contains a brief discussion about the importance of causal inference for taking informed policy decisions. And I would like to share these thoughts here.