Causal Inference is More than Fitting the Data Well

This post first appeared on February 1, 2021, on

Causal inference is becoming an increasingly important topic in industry. Several big players have already taken notice and started to invest in the causal data science skills of their people. One piece of evidence was the huge success of the first Causal Data Science Meeting last year. Our own research supports this point. Over the course of last year, we talked to many data scientists working in the tech sector and related industries, and all of them reported that interest in causality, and frustration with the limits of classical machine learning, are rising. Especially when you tackle complex problems related to the strategic direction of your company, the ability to forecast the effects of your actions, and thus causal inference, becomes highly significant.

Yet we also learned that applying causal inference methods poses a number of significant challenges for practitioners. There is an educational gap, as many data scientists still do not have much experience with these tools. Beyond that, cleanly identifying the root causes behind relationships in your data and ruling out alternative explanations can be time-consuming. Data science teams often simply do not have the time to run an elaborate study because of pressure to bring models to production quickly.

Need for a cultural change

Another important bottleneck we have encountered in our research is cultural, though. Classical machine learning is all about minimizing prediction error. The more accurately your model is able to, e.g., classify X-ray images or forecast future stock market prices, the better. This simple target gives you an objective standard of evaluation that is easy for everyone to understand. ML research has made great progress by running competitions on which methods and algorithms provide the best out-of-sample fit in problem domains ranging from image recognition to natural language processing. Such an objective and simple evaluation criterion is missing in causal inference.

Causal inference (CI) is much harder than simply optimizing a loss function, and context-specific domain knowledge plays a crucial role. Unless you can benchmark your model predictions against actual experiments, there is no simple criterion to judge the accuracy of a particular estimate. Such benchmarks are pretty rare in practice, and even then you will only be able to tell how well you did ex post. The quality of causal inferences depends on several crucial assumptions, which are not easily testable with the data at hand. This forces people to completely rethink the way they approach their data science and ML problems.
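
To illustrate why fit alone is not a reliable yardstick, here is a minimal simulation sketch. Everything in it is a hypothetical assumption made for illustration: the variable names, the coefficients, and the causal structure X -> M -> Y, in which X affects Y only through a mediator M and the total effect of X on Y is 2. The second regression is computed by partialling out M (the Frisch-Waugh-Lovell approach), and fit is measured in-sample for simplicity.

```python
import random


def slope(xs, ys):
    """OLS slope from regressing ys on xs (with an intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx


def residuals(xs, ys):
    """Residuals from regressing ys on xs (with an intercept)."""
    b = slope(xs, ys)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(xs, ys)]


def mse(res):
    return sum(r * r for r in res) / len(res)


# Hypothetical data-generating process: X -> M -> Y, total effect of X on Y is 2.
rng = random.Random(0)
n = 20_000
x = [rng.gauss(0, 1) for _ in range(n)]
m = [xi + rng.gauss(0, 0.5) for xi in x]
y = [2 * mi + rng.gauss(0, 1) for mi in m]

# Model 1: regress Y on X only.
coef_short = slope(x, y)
mse_short = mse(residuals(x, y))

# Model 2: regress Y on X and M, by first partialling M out of both X and Y.
x_res = residuals(m, x)
y_res = residuals(m, y)
coef_long = slope(x_res, y_res)
mse_long = mse(residuals(x_res, y_res))

print(f"Y ~ X:     coefficient on X = {coef_short:.2f}, MSE = {mse_short:.2f}")
print(f"Y ~ X + M: coefficient on X = {coef_long:.2f}, MSE = {mse_long:.2f}")
```

Under these assumptions, the second model fits better, yet its coefficient on X lands near zero rather than near the total effect of 2. Which model answers the causal question depends on the assumed structure, not on the prediction error.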

In fact, there is an important theoretical reason why causal data science is challenging in that regard. It is called the Pearl causal hierarchy (PCH). The PCH, also known as the ladder of causation, states that any data analysis can be mapped to one of three distinct layers of an information hierarchy. On the lowest rung are associations, which refer to simple conditional probability statements between variables in the data. They remain purely correlational (“How does X relate to Y?”) and therefore do not have any causal meaning. The second rung relates to interventions (“What happens to Y if I manipulate X?”), and here we already enter the world of causality. On the third rung we finally have counterfactuals (“What would Y be if X had been x?”), which represent the highest form of causal reasoning.
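
The gap between the first two rungs can be sketched with a small simulation. The confounder Z, all probabilities, and the function names below are hypothetical choices for illustration only. In this toy model, Z raises the chance of both X and Y, while X itself raises the chance of Y by 0.1.

```python
import random


def simulate(n, seed, do_x=None):
    """Draw n samples of (z, x, y). If do_x is given, X is set by intervention."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5  # hypothetical confounder
        if do_x is None:
            x = rng.random() < (0.8 if z else 0.2)  # X depends on Z
        else:
            x = bool(do_x)  # intervention: X no longer depends on Z
        y = rng.random() < (0.3 + 0.4 * z + 0.1 * x)  # Y depends on Z and X
        rows.append((z, x, y))
    return rows


def mean_y(rows, x_value=None):
    """Share of rows with Y = 1, optionally restricted to rows where X == x_value."""
    ys = [y for _, x, y in rows if x_value is None or x == x_value]
    return sum(ys) / len(ys)


n = 200_000

# Rung 1 (association): P(Y=1 | X=1) - P(Y=1 | X=0) in observational data.
observed = simulate(n, seed=0)
association = mean_y(observed, True) - mean_y(observed, False)

# Rung 2 (intervention): P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)).
intervention = mean_y(simulate(n, seed=1, do_x=1)) - mean_y(simulate(n, seed=2, do_x=0))

print(f"association:  {association:.3f}")
print(f"intervention: {intervention:.3f}")
```

With these parameters, the associational difference works out to about 0.34 in expectation, while the interventional difference is about 0.10. The two questions have different answers even though they are asked of the same system.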

Causal inference cannot be purely data-driven

The PCH tells us that to climb the ladder of causation and infer causal effects from data, we need to be willing to make at least some causal assumptions in the first place. “No causes in, no causes out!” This fact can be proven mathematically. No CI method is entirely data-driven. You always need that extra ingredient in the form of specific domain knowledge, which is introduced to the problem from outside the data and can only be judged based on experience and theoretical reasoning. This is the way causal diagrams work, for example, but other causal assumptions such as conditional independence, instrument validity, or parallel trends in difference-in-differences fall into the same category.
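
A minimal sketch of why the data alone cannot settle the question: the two hypothetical models below (the names, the 0.8 agreement probability, and the sampling code are all assumptions for illustration) generate the same observational distribution over X and Y, but they disagree about what happens when X is manipulated.

```python
import random


def sample(model, n, seed, do_x=None):
    """Draw n samples of (x, y) from one of two hypothetical causal models."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        agree = rng.random() < 0.8  # effect copies its cause with probability 0.8
        if model == "x_causes_y":
            x = (rng.random() < 0.5) if do_x is None else bool(do_x)
            y = x if agree else not x
        else:  # "y_causes_x"
            y = rng.random() < 0.5
            # Under intervention, X is set from outside and no longer listens to Y.
            x = (y if agree else not y) if do_x is None else bool(do_x)
        data.append((x, y))
    return data


def p_y(data, x_value):
    """Estimate P(Y=1 | X == x_value) from the sample."""
    ys = [y for x, y in data if x == x_value]
    return sum(ys) / len(ys)


n = 100_000
association, effect = {}, {}
for i, model in enumerate(["x_causes_y", "y_causes_x"]):
    obs = sample(model, n, seed=10 * i)
    association[model] = p_y(obs, True) - p_y(obs, False)
    on = sample(model, n, seed=10 * i + 1, do_x=1)
    off = sample(model, n, seed=10 * i + 2, do_x=0)
    effect[model] = p_y(on, True) - p_y(off, False)

for model in association:
    print(f"{model}: association = {association[model]:.3f}, "
          f"effect of do(X) = {effect[model]:.3f}")
```

Both models show roughly the same association between X and Y, so no fit criterion applied to observational data can tell them apart. Only the assumed direction of the arrow, which comes from domain knowledge, determines whether manipulating X changes Y.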

Because these causal assumptions are necessarily context-specific, they are more complex and multidimensional than a simple fit criterion based on squared loss. That does not mean that they are in any way arbitrary, though. The theoretical requirements for causal inference imposed by the PCH call for an entirely new way of thinking about data science, which also introduces non-trivial organizational challenges. We need to put domain experts such as clients, engineers, and sales partners in the loop, who can tell us whether our assumptions make sense and whether the way we model a certain problem is accurate. This will lead to a much more holistic approach to data science and the way teams are structured. Some first steps going in that direction are described here in a post by Patrick Doupe, principal economist at Zalando. In the coming months we plan to publish more content of that sort, creating a dialogue between industry and academia on how to advance causal inference applications in industry practice.