
DoubleML: Machine Learning Meets Causal Inference


In my previous article on causal inference fundamentals, I explored how data scientists can construct “parallel worlds” to answer the eternal question: “What would have happened if…?” I discussed the critical challenges (confounding variables, self-selection bias, and the counterfactual problem) that make causal analysis so formidable (and super interesting!). Most importantly, I highlighted scenarios where A/B testing, our gold standard for establishing causation, becomes impossible: network effects, broad interventions, and historical analysis.

Today, we’re diving deeper into one of the most powerful solutions to these challenges: DoubleML, a revolutionary approach that fundamentally transforms how we extract causal and actionable insights from observational data. We’ll first briefly introduce what DoubleML is, then elaborate on how it excels compared to other causal inference methodologies, and finally discuss my firsthand experience using DoubleML and its Python implementation.

[Figure] (Source: DoubleML)

What is Causal Inference and DoubleML?

Causal inference provides the scientific framework to move beyond “what happened” to “what caused it to happen.” As I explained in my earlier piece, it’s about building parallel worlds with data, i.e., reconstructing the counterfactual scenarios we can never directly observe.

DoubleML represents a significant step forward in this field. By combining machine learning’s predictive prowess with econometrics’ statistical rigor, it addresses a fundamental tension:

How can we use sophisticated ML models to capture complex patterns while still obtaining valid statistical inference for causal effects?

[Figure: DoubleML = ML + Econometrics] (Source: DoubleML Tutorial)

The Core Concepts: Connect A/B Testing Principles to Observational Data

Remember how A/B testing works? Randomization creates balanced groups, ensuring user characteristics (e.g., the proportion of female customers) are balanced across groups in expectation, so any difference in outcomes stems from the treatment itself. DoubleML achieves something similar with observational data through a clever application of the Frisch-Waugh-Lovell (FWL) theorem.

Think of it this way: In an ideal A/B test, randomization “automatically” removes the influence of confounders. DoubleML creates this same effect mathematically by:

  1. Using ML to predict outcomes
    • Regress Y on the confounders X
  2. Using ML to predict treatment assignment
    • Regress T on the confounders X
  3. Analyzing the “residuals” (the unpredicted parts) to isolate the causal effect
    • Regress the residual of Y on the residual of T

This process essentially creates “as-if randomized” data from observational reality: everything the confounders X can explain is stripped out of both the outcome and the treatment, and the treatment residual that remains is, by construction, uncorrelated with X, just as a randomized assignment would be.
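To make the three steps concrete, here is a minimal sketch of the partialling-out idea using scikit-learn on simulated data (the data-generating process and learner choice are my own illustration, not the official DoubleML implementation):

```python
# FWL-style partialling out with ML nuisance models (toy illustration).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                       # confounders
T = X[:, 0] + rng.normal(size=n)                  # treatment depends on X
y = 2.0 * T + X[:, 0] ** 2 + rng.normal(size=n)   # true causal effect = 2.0

# Step 1: ML prediction of the outcome from X (out-of-fold predictions)
y_hat = cross_val_predict(RandomForestRegressor(), X, y, cv=5)
# Step 2: ML prediction of the treatment from X
t_hat = cross_val_predict(RandomForestRegressor(), X, T, cv=5)

# Step 3: regress the outcome residual on the treatment residual
y_res, t_res = y - y_hat, T - t_hat
theta = (t_res @ y_res) / (t_res @ t_res)
print(f"Estimated causal effect: {theta:.2f}")    # close to 2.0
```

Note that `cross_val_predict` returns out-of-fold predictions, which already gives us a simple form of the cross-fitting we will discuss below.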

Challenges of Traditional Causal Inference

So, DoubleML sounds cool. The question is: why DoubleML, instead of other causal inference methods?

Of course, there are plenty of “traditional” causal inference methods, e.g., instrumental variables, propensity score matching, and regression discontinuity, which were revolutionary in their time. But many of them were designed for simpler data environments:

  • Rely on parametric assumptions (often linearity)
  • Limited to relatively few variables
  • Struggle with complex, non-linear relationships

Examples include linear regression with controls, 2SLS (two-stage least squares), and matching methods. You can learn more about these in Causal Inference for the Brave and True.
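For contrast, the traditional workhorse looks something like this: a linear regression with controls (a quick sketch on simulated data; the variable names are hypothetical):

```python
# Traditional baseline: linear regression with controls (statsmodels).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"age": rng.normal(40, 10, n), "income": rng.normal(60, 15, n)})
df["treatment"] = (df["age"] / 40 + rng.normal(size=n) > 1).astype(int)
df["outcome"] = 0.5 * df["treatment"] + 0.05 * df["age"] + rng.normal(size=n)

# The effect estimate is only trustworthy if the linear specification is right.
ols = smf.ols("outcome ~ treatment + age + income", data=df).fit()
print(ols.params["treatment"])          # point estimate
print(ols.conf_int().loc["treatment"])  # 95% confidence interval
```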

Why are they “traditional”? These methods emerged when computational power was limited and datasets were small. More importantly, they made strong assumptions to make problems tractable. Those assumptions often don’t hold in modern applications.

Here are the key challenges these traditional causal inference methods face:

The Bias Problem

When we apply predictive machine learning models directly to causal inference, we often introduce bias through regularization or overfitting. This bias “contaminates” our causal estimates, preventing them from achieving the statistical properties we need for reliable inference.

Think of it like trying to measure the effect of a new fertilizer on plant growth, but your measuring instrument adds its own unpredictable errors to every measurement.
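Here is a toy demonstration of regularization bias (my own sketch, not from the DoubleML docs): even with no confounding at all, a penalized regression shrinks the treatment coefficient away from the truth.

```python
# Regularization bias in one picture: Lasso shrinks the treatment coefficient.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.normal(size=(n, p))                  # covariates
T = rng.normal(size=n)                       # treatment (independent, for clarity)
y = 1.0 * T + X[:, 0] + rng.normal(size=n)   # true treatment effect = 1.0

model = Lasso(alpha=0.2).fit(np.column_stack([T, X]), y)
print(f"Penalized estimate: {model.coef_[0]:.2f}")   # roughly 0.8: biased low
```

The penalty that makes the prediction stable is exactly what drags the causal coefficient away from its true value; DoubleML’s orthogonalization is designed to neutralize this.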

The Flexibility Trap

Classical methods typically require strong parametric assumptions, such as linear relationships between variables. But real-world data is messy and complex. You have probably seen it firsthand: the linearity assumption rarely holds.

The relationship between exercise and mental health, for instance, might involve complex interactions with sleep, diet, social factors, and genetics. When we force these relationships into simple linear models, we get biased results that don’t reflect reality.

The Inference Challenge

While machine learning excels at prediction, it struggles to provide valid statistical inference for causal effects. You might predict depression risk with 90% accuracy, but can you confidently say that increasing exercise by 30 minutes daily will reduce depression risk by 15% ± 5%? Off-the-shelf ML methods can’t give you the confidence intervals you need for policy or medical decisions.

[Figure: “I need Confidence Intervals to make myself confident in my data. Do you?”] (Source: Reddit)

Traditional Methods vs. DoubleML: A Paradigm Shift

Now, how does DoubleML solve these challenges?

DoubleML elegantly addresses these limitations by combining the predictive power of machine learning with the rigorous inference framework of econometrics. Here’s how:

| Challenge | Traditional Problem | DoubleML Solution | Impact |
| --- | --- | --- | --- |
| Regularization Bias | ML predictions contaminate causal estimates | Neyman Orthogonality | Bias-immune estimates even with regularized ML |
| Complex Relationships | Linear assumptions fail in real data | Flexible ML Models | Captures any non-linear pattern |
| High Dimensions | Curse of dimensionality | Built-in Regularization | Handles thousands of variables |
| Statistical Inference | ML gives predictions, not confidence intervals | Sample Splitting | Valid p-values and confidence intervals |
| Model Risk | Wrong model = wrong answer | Double Robustness | Consistent if either model is correct |

Let’s elaborate on the key terms in this cheat sheet:

Neyman Orthogonality

Neyman Orthogonality: a property that ensures the robustness and reliability of our estimates by making them insensitive, to first order, to small errors in the estimated nuisance parameters (the ML predictions of the outcome and the treatment).

This is DoubleML’s secret sauce and its central innovation. Even if your machine learning models for predicting outcomes and treatment have some bias (and they always do), Neyman orthogonality ensures these biases don’t contaminate your final causal estimates. It’s like having a self-correcting mechanism that filters out the noise while preserving the signal.
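Formally, for the partially linear model that DoubleML commonly targets, the orthogonal score can be sketched as follows (standard notation from the DoubleML literature; a sketch, not a full derivation):

```latex
% Partially linear model:
%   Y = \theta_0 T + g_0(X) + \varepsilon, \qquad T = m_0(X) + v
% Orthogonal score using BOTH nuisance functions \eta = (g, m):
\[
  \psi(W; \theta, \eta) \;=\; \bigl(Y - \theta T - g(X)\bigr)\,\bigl(T - m(X)\bigr)
\]
% Neyman orthogonality: the derivative with respect to the nuisances
% vanishes at the truth,
\[
  \partial_\eta\, \mathbb{E}\bigl[\psi(W; \theta_0, \eta)\bigr]\Big|_{\eta = \eta_0} \;=\; 0,
\]
% so first-order errors in \hat{g} and \hat{m} do not propagate into \hat{\theta}.
```

By contrast, the naive score $(Y - \theta T - g(X))\,T$ is not orthogonal: errors in $\hat{g}$ leak directly into the estimate of $\theta$.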

Sample Splitting (Cross-fitting)

It’s like DoubleML’s debiasing engine. DoubleML splits your data cleverly. It trains ML models on one part of the data to learn patterns, then estimates causal effects on a completely different part. This prevents overfitting from polluting your causal estimates. The result? Asymptotically unbiased estimates with valid confidence intervals. This is something pure ML approaches can’t provide.
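Mechanically, cross-fitting looks like this (a minimal sketch of my own; the fold count and learner are arbitrary choices):

```python
# Cross-fitting: every observation's nuisance prediction comes from a model
# that never saw that observation during training.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def crossfit_residuals(X, v, n_splits=5, seed=0):
    """Out-of-fold residuals of v ~ X."""
    resid = np.empty(len(v))
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = GradientBoostingRegressor().fit(X[train_idx], v[train_idx])
        resid[test_idx] = v[test_idx] - model.predict(X[test_idx])
    return resid

# Usage: residualize outcome and treatment, then regress residual on residual:
#   y_res = crossfit_residuals(X, y); t_res = crossfit_residuals(X, T)
#   theta = (t_res @ y_res) / (t_res @ t_res)
```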

[Figure: Cross-fitting] (Source: Mohamed Hmamouch)

ML Flexibility Without the Cost

DoubleML lets you use any machine learning algorithm (random forests, neural networks, XGBoost) to capture complex patterns in your data. But unlike traditional ML applications, it does this without sacrificing the statistical rigor needed for causal inference.
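For example, with the official doubleml Python package (a sketch assuming its DoubleMLData / DoubleMLPLR API and one of its built-in simulated datasets), swapping in a different learner is a one-line change:

```python
# DoubleML with interchangeable nuisance learners (sketch).
import doubleml as dml
from doubleml.datasets import make_plr_CCDDHNR2018
from sklearn.ensemble import RandomForestRegressor

df = make_plr_CCDDHNR2018(n_obs=500, return_type='DataFrame')
data = dml.DoubleMLData(df, y_col='y', d_cols='d')

# Any scikit-learn-style regressor works here: random forests, boosting, ...
plr = dml.DoubleMLPLR(data,
                      ml_l=RandomForestRegressor(),   # outcome nuisance model
                      ml_m=RandomForestRegressor())   # treatment nuisance model
plr.fit()
print(plr.summary)  # coefficient, std. error, confidence interval
```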

Real-World Case Study: Google Pixel Ecosystem Strategy

As I mentioned in my introduction to causal inference article, our Google Pixel team faced a classic causal question: Does Pixel Watch ownership increase Pixel Phone loyalty? We couldn’t randomly assign watches. That would be impractical (maybe also unethical?).

DoubleML allowed us to:

  • Control for confounders like tech enthusiasm, spending patterns, and phone device preferences
  • Use gradient boosting (i.e., XGBoost) to capture complex user behavior patterns
  • Obtain confidence intervals for executive decision-making

The results? We quantified the causal effect with statistical precision, enabling data-driven ecosystem strategy decisions. This exemplifies how companies like Netflix, Uber, and Microsoft routinely apply DoubleML for similar challenges.

Warnings: What DoubleML Can’t Fix

During my time applying DoubleML at Google’s Pixel team, I learned a sobering truth: even the most sophisticated methods have their Achilles’ heel.

DoubleML is powerful, but it’s not invincible.

The Invisible Assumptions

DoubleML assumes you’ve captured all relevant confounders, but how can you know what you don’t know? We agonized over whether hidden variables were silently undermining our watch-to-phone loyalty estimates. No algorithm can detect the ghosts of missing confounders.

The Domain Knowledge Dependency

While DoubleML flexibly captures complex patterns, it can’t transcend the limitations of its underlying models. Choosing the wrong variables doesn’t just weaken results. It can introduce collider bias that completely invalidates conclusions. I witnessed brilliant ML engineers produce pristine models with terrible causal estimates because they lacked deep product understanding.

DoubleML amplifies expertise; it doesn’t replace it.

The Data Quality Wall

I know it’s a cliché: Garbage in, garbage out.

Sophisticated math can’t resurrect meaning from flawed data. Missing values, measurement errors, selection biases: these problems persist regardless of methodological brilliance.

The lesson? DoubleML is a powerful telescope, but you still need to know where to point it and accept that some stars remain forever hidden.

Getting Started with DoubleML in Python

For practitioners ready to implement DoubleML, the Python ecosystem offers excellent tools:

Recommended Stack: DoWhy + EconML

```python
# The power of modern causal inference
from dowhy import CausalModel  # EconML must also be installed; it provides LinearDML

# Define your causal question
model = CausalModel(
    data=your_data,  # a pandas DataFrame containing the columns below
    treatment='intervention',
    outcome='business_metric',
    common_causes=['confounder1', 'confounder2']
)

# Identify the causal estimand implied by the model's assumptions
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)

# Apply DoubleML via EconML's LinearDML estimator
dml_estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.econml.dml.LinearDML"
)
print(dml_estimate)
```

These two packages integrate seamlessly: DoWhy structures the causal question and identifies the estimand, while EconML supplies the DoubleML estimators under the hood.

Conclusion: The Future of Data-Driven Decisions

DoubleML isn’t just another statistical method. It’s a paradigm shift in how we think about causation in complex systems. By bridging the gap between correlation and causation, it empowers us to make decisions based on what will happen, not just what has happened.

The journey from correlation to causation is challenging but essential. Whether you’re optimizing product features, evaluating policy interventions, or understanding customer behavior, DoubleML provides the tools to answer the fundamental question: “What should we do?”

Ready to master DoubleML? Two resources already linked in this article will accelerate your journey: the official DoubleML documentation and tutorials, and Causal Inference for the Brave and True.

In our data-rich world, those who can distinguish correlation from causation don’t just analyze the past. They shape the future.