An Introduction to Causal Inference

Your model predicts that customers who receive a discount are more likely to buy. Your stakeholder sees the correlation and concludes the discount program is driving revenue. You deploy a broader discount campaign. Revenue does not move.

What went wrong? The model was correct — discount recipients do buy more often. But the marketing team was already sending discounts to customers with high purchase intent: recent browsers, cart abandoners, loyalty members. The discount was not causing the purchase. The purchase intent was causing both the purchase and the discount. Your model captured a correlation. Your stakeholder acted on a causal claim. The gap between those two things cost real money.

This is confounding, the central practical obstacle in causal inference, and it shows up everywhere:

Patients who take a drug are healthier afterward — but sicker patients were less likely to take the drug.
Students who attend tutoring score higher — but motivated students self-select into tutoring.
Users who see a new feature retain better — but the feature was rolled out to power users first.
Companies that adopt a new tool grow faster — but fast-growing companies are more likely to adopt new tools.

In every case, the treatment group and control group differ before the treatment is applied, and that pre-existing difference contaminates any naive comparison.
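The discount story can be reproduced in a few lines. The sketch below is a simulation under assumed numbers (the intent distribution and both probability formulas are made up for illustration): the discount has zero effect by construction, yet the naive comparison shows a clear gap.

```python
import random
from statistics import fmean

random.seed(0)
n = 100_000

with_discount, without_discount = [], []
for _ in range(n):
    # Hypothetical latent purchase intent in [0, 1].
    intent = random.random()
    # Marketing targets high-intent customers: discount probability rises with intent.
    got_discount = random.random() < 0.1 + 0.8 * intent
    # By construction the discount has NO effect: buying depends on intent only.
    bought = random.random() < 0.05 + 0.5 * intent
    (with_discount if got_discount else without_discount).append(bought)

naive_gap = fmean(with_discount) - fmean(without_discount)
print(f"naive gap in purchase rate: {naive_gap:.3f} (true effect is 0)")
```

The gap comes entirely from intent: discount recipients were already more likely to buy before anyone sent them anything.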

The Potential Outcomes Framework

The cleanest way to think about causation is through potential outcomes, also called the Rubin causal model. For each individual, there are two potential outcomes: the outcome if treated ($Y_1$) and the outcome if not treated ($Y_0$). The causal effect for that individual is $Y_1 - Y_0$.

The fundamental problem: you can only observe one of these outcomes. A customer either received the discount or did not. You never see both versions of reality for the same person. This is why causal inference is hard — you are trying to estimate a quantity that is, by definition, partially unobservable.

What you can estimate is the Average Treatment Effect (ATE): the average of $Y_1 - Y_0$ across all individuals. In a randomized experiment, treatment assignment is independent of potential outcomes, so a simple difference in means gives you an unbiased estimate of the ATE. In observational data, treatment assignment depends on covariates, and those covariates also affect the outcome, so the same difference in means mixes the treatment effect with pre-existing differences between the groups. The entire field of observational causal inference is about removing that bias.
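To see where the bias comes from, write $T$ for the treatment indicator and $Y$ for the observed outcome (notation assumed here for illustration). Adding and subtracting $E[Y_0 \mid T=1]$ splits the naive comparison into two pieces:

$$
\begin{aligned}
E[Y \mid T=1] - E[Y \mid T=0] &= E[Y_1 \mid T=1] - E[Y_0 \mid T=0] \\
&= \underbrace{E[Y_1 - Y_0 \mid T=1]}_{\text{effect among the treated}} + \underbrace{E[Y_0 \mid T=1] - E[Y_0 \mid T=0]}_{\text{selection bias}}
\end{aligned}
$$

Under randomization, $T$ is independent of $(Y_0, Y_1)$, so the selection-bias term is zero and the first term equals the ATE. In the discount example the selection-bias term is positive: targeted customers would have bought more even without the discount.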

There are two other estimands you will encounter. The Average Treatment Effect on the Treated (ATT) answers: among those who were treated, what was the average effect? This is the relevant quantity when you are evaluating a past intervention — did the discount help the people who actually received it? The Conditional Average Treatment Effect (CATE) answers: what is the treatment effect for individuals with specific characteristics $X$? CATE estimation is the foundation of uplift modeling, which we cover in Section 9.2.
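A simulation makes the three estimands concrete, because simulated data lets you see both potential outcomes at once. The sketch below uses made-up numbers: a binary covariate `x`, an effect of 3 when `x` is true and 1 otherwise, and assignment that favors `x`.

```python
import random
from statistics import fmean

random.seed(1)
n = 100_000

rows = []
for _ in range(n):
    x = random.random() < 0.3                   # hypothetical covariate, e.g. loyalty member
    y0 = random.gauss(10.0, 1.0) + 2.0 * x      # outcome without treatment
    y1 = y0 + (3.0 if x else 1.0)               # outcome with treatment
    t = random.random() < (0.8 if x else 0.2)   # non-random assignment favors x
    rows.append((x, t, y0, y1))

# Only possible in a simulation: both potential outcomes are visible.
ate = fmean([y1 - y0 for x, t, y0, y1 in rows])
att = fmean([y1 - y0 for x, t, y0, y1 in rows if t])
cate_x1 = fmean([y1 - y0 for x, t, y0, y1 in rows if x])
cate_x0 = fmean([y1 - y0 for x, t, y0, y1 in rows if not x])

# What an analyst actually sees: one outcome per unit.
naive = (fmean([y1 for x, t, y0, y1 in rows if t])
         - fmean([y0 for x, t, y0, y1 in rows if not t]))

print(f"ATE={ate:.2f}  ATT={att:.2f}  CATE(x=1)={cate_x1:.2f}  CATE(x=0)={cate_x0:.2f}")
print(f"naive difference in observed means={naive:.2f}")
```

In this setup the ATT exceeds the ATE because high-effect units are treated more often, and the naive difference exceeds both because treated units also start from a higher baseline.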

Why Randomized Experiments Are the Gold Standard

Randomization solves the causal inference problem by construction. When you randomly assign treatment, the treatment and control groups are balanced in expectation on all covariates, both observed and unobserved. Any difference in outcomes beyond chance variation can be attributed to the treatment, because nothing else differs systematically between the groups.
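The balance claim is easy to check in a simulation. The sketch below assumes two covariates, one of which the analyst never records, and a true effect of 2.0; all numbers are illustrative.

```python
import random
from statistics import fmean

random.seed(2)
n = 100_000

treated_rows, control_rows = [], []
for _ in range(n):
    observed = random.gauss(0.0, 1.0)     # a covariate you measured
    unobserved = random.gauss(0.0, 1.0)   # a covariate you never recorded
    treated = random.random() < 0.5       # coin flip, ignores both covariates
    # True effect is 2.0; both covariates also move the outcome.
    outcome = 2.0 * treated + 1.5 * observed + unobserved + random.gauss(0.0, 1.0)
    (treated_rows if treated else control_rows).append((observed, unobserved, outcome))

def gap(column):
    """Treated-minus-control difference in means for one column."""
    return (fmean([r[column] for r in treated_rows])
            - fmean([r[column] for r in control_rows]))

gap_observed, gap_unobserved, estimate = gap(0), gap(1), gap(2)
print(f"covariate gaps: observed={gap_observed:.3f}, unobserved={gap_unobserved:.3f}")
print(f"difference in means: {estimate:.2f} (true effect is 2.0)")
```

Both covariate gaps sit near zero even though the assignment mechanism never looked at either variable, which is why the simple difference in means recovers the effect.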

This is why A/B tests are so powerful and why you should run one whenever you can. But “whenever you can” has real limits:

Ethical constraints. You cannot randomly deny patients a treatment that might save their lives. You cannot randomly assign students to bad schools.
Business constraints. Your CEO will not let you withhold discounts from a random 50% of high-value customers for three months to measure incrementality.
Structural constraints. The treatment already happened. A policy was already enacted. A product was already launched to specific markets. You are analyzing the aftermath, not designing the experiment.
Cost and time. Running a properly powered experiment takes weeks or months. The business wants an answer now.
Contamination and spillover. Even when you can randomize, treatment effects may spill over between groups. If you discount 50% of riders on a ride-sharing platform, the discounted riders request more rides and compete for the same pool of drivers, so the other 50% experience longer wait times, contaminating your control group. Network effects and marketplace dynamics make clean randomization harder than it looks.

These are not edge cases. In most organizations, the majority of business decisions — pricing changes, policy rollouts, marketing campaigns, product launches — happen without randomization. The data lands on your desk after the fact, and you are asked to determine whether the intervention worked. That is the domain of observational causal inference.

When randomization is not feasible, you need observational causal methods — techniques that attempt to reconstruct the conditions of a randomized experiment from non-random data. Every one of these methods requires assumptions, and every assumption is a potential failure point. The goal of this chapter is to make those assumptions explicit, testable where possible, and to give you code that implements each method correctly.

There is a common misconception that “more data” solves the causal inference problem. It does not. A confounder that biases your estimate with 1,000 observations biases it identically with 10 million observations. More data gives you more precision — a tighter confidence interval around a biased answer. The methods in this chapter address the bias, not the variance. They change the structure of the comparison, not the volume of data.
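The point about sample size can be checked directly. The sketch below assumes a single hidden confounder and a true effect of 1.0, and compares 1,000 observations with 200,000 (a smaller stand-in for 10 million so that pure Python runs quickly).

```python
import math
import random
from statistics import fmean

def sample_variance(values):
    m = fmean(values)
    return math.fsum((v - m) ** 2 for v in values) / (len(values) - 1)

def naive_estimate(n, seed):
    """Difference in means, and its standard error, under hidden confounding."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        confounder = rng.gauss(0.0, 1.0)
        # Higher confounder values make treatment more likely.
        is_treated = rng.random() < 1.0 / (1.0 + math.exp(-confounder))
        # True treatment effect is 1.0.
        outcome = 1.0 * is_treated + 2.0 * confounder + rng.gauss(0.0, 1.0)
        (treated if is_treated else control).append(outcome)
    diff = fmean(treated) - fmean(control)
    se = math.sqrt(sample_variance(treated) / len(treated)
                   + sample_variance(control) / len(control))
    return diff, se

small = naive_estimate(1_000, seed=3)
large = naive_estimate(200_000, seed=3)
for label, (diff, se) in (("n=1,000", small), ("n=200,000", large)):
    print(f"{label:>10}: estimate={diff:.2f}  95% CI half-width={1.96 * se:.3f}")
```

The interval shrinks as the sample grows, but it shrinks around the same wrong number, far from the true effect of 1.0.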

The Landscape of Observational Methods

Observational causal inference methods differ in their assumptions, data requirements, and failure modes. Choosing the right method depends on the structure of your problem:

[Figure: Causal Inference Methods]

Propensity Score Methods (matching, weighting) work when you believe you have measured all the important confounders. You model the probability of receiving treatment given covariates, then use that probability to create balanced comparisons. The key assumption — no unmeasured confounders — is untestable, which means domain expertise matters more than statistical technique.
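As a minimal sketch of the weighting idea, assume a single measured confounder with three levels (a hypothetical engagement tier) so that the propensity score is just the treated share within each tier. Real problems typically need a fitted model such as logistic regression for this step.

```python
import random
from statistics import fmean

random.seed(4)
n = 100_000
treat_prob = {0: 0.2, 1: 0.5, 2: 0.8}   # hypothetical targeting rule by tier

rows = []
for _ in range(n):
    tier = random.randrange(3)                        # measured confounder
    treated = random.random() < treat_prob[tier]
    outcome = 1.0 * treated + 2.0 * tier + random.gauss(0.0, 1.0)  # true effect 1.0
    rows.append((tier, treated, outcome))

naive = (fmean([y for _, t, y in rows if t])
         - fmean([y for _, t, y in rows if not t]))

# Propensity score: estimated P(treated | tier), here the treated share in each tier.
propensity = {k: fmean([t for tier, t, _ in rows if tier == k]) for k in range(3)}

# Weight each unit by the inverse probability of the treatment it actually received,
# then take a weighted (Hajek-style) difference in means.
num1 = den1 = num0 = den0 = 0.0
for tier, treated, outcome in rows:
    if treated:
        w = 1.0 / propensity[tier]
        num1 += w * outcome
        den1 += w
    else:
        w = 1.0 / (1.0 - propensity[tier])
        num0 += w * outcome
        den0 += w
ipw = num1 / den1 - num0 / den0

print(f"naive={naive:.2f}  IPW={ipw:.2f}  (true effect is 1.0)")
```

The weighting only works here because `tier` is the sole confounder and it is measured. Add a hidden confounder to the simulation and the IPW estimate will be biased too.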

Difference-in-Differences (DiD) works when you have panel data: observations on the same units before and after an intervention, with some units treated and others not. The key assumption — parallel trends — is partially testable with pre-treatment data. DiD is the workhorse of policy evaluation and is increasingly used in tech for measuring the impact of product launches and policy changes.
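The two-group, two-period case reduces to simple arithmetic. The sketch below simulates it with assumed values: treated units start 10 points higher, every unit gains 3 between periods, and treatment adds 2 more.

```python
import random
from statistics import fmean

random.seed(5)
n_units = 10_000
common_trend = 3.0   # change every unit experiences between periods
true_effect = 2.0    # extra change for treated units only

changes = {True: [], False: []}
post_levels = {True: [], False: []}
for i in range(n_units):
    treated = i < n_units // 2
    # Treated units start from a higher baseline, so levels are not comparable.
    baseline = random.gauss(50.0, 5.0) + (10.0 if treated else 0.0)
    pre = baseline + random.gauss(0.0, 1.0)
    post = (baseline + common_trend + (true_effect if treated else 0.0)
            + random.gauss(0.0, 1.0))
    changes[treated].append(post - pre)
    post_levels[treated].append(post)

# Difference of the two before/after changes.
did = fmean(changes[True]) - fmean(changes[False])
# For contrast: comparing post-period levels alone.
post_only_gap = fmean(post_levels[True]) - fmean(post_levels[False])

print(f"DiD estimate={did:.2f}  post-period gap={post_only_gap:.2f}  (true effect is 2.0)")
```

Differencing within each unit removes the fixed baseline gap, and differencing across groups removes the common trend. The simulation satisfies parallel trends by construction, which real data may not.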

Synthetic Control extends DiD to the case where you have a single treated unit (one country, one city, one product line). It constructs a weighted combination of untreated units that mimics the treated unit’s pre-treatment trajectory, then measures the post-treatment divergence. It works in situations where DiD’s parallel trends assumption is implausible because no single control unit is a good match.
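A toy version shows the mechanics. The sketch below assumes three donor units and finds the weights by brute-force grid search over non-negative weights that sum to one. The grid search is a simplification chosen for readability; practical implementations use a constrained optimizer and many more donors.

```python
import random

random.seed(6)
pre_periods, post_periods = 20, 10
periods = pre_periods + post_periods

# Three hypothetical untreated donor units with different levels and trends.
donors = [
    [20.0 + 0.5 * t + random.gauss(0.0, 0.3) for t in range(periods)],
    [35.0 + 1.0 * t + random.gauss(0.0, 0.3) for t in range(periods)],
    [50.0 + 0.2 * t + random.gauss(0.0, 0.3) for t in range(periods)],
]

# Treated unit: a blend of donors 0 and 1, plus an effect of 5.0 after treatment.
true_effect = 5.0
treated = [
    0.6 * donors[0][t] + 0.4 * donors[1][t] + random.gauss(0.0, 0.3)
    + (true_effect if t >= pre_periods else 0.0)
    for t in range(periods)
]

def synthetic(weights, t):
    """Weighted combination of donor outcomes at time t."""
    return sum(w * d[t] for w, d in zip(weights, donors))

def pre_fit_error(weights):
    """Squared error between treated and synthetic unit before treatment."""
    return sum((treated[t] - synthetic(weights, t)) ** 2 for t in range(pre_periods))

# Candidate weights on a 0.05 grid: non-negative and summing to one.
steps = 20
candidates = [
    (a / steps, b / steps, (steps - a - b) / steps)
    for a in range(steps + 1) for b in range(steps + 1 - a)
]
weights = min(candidates, key=pre_fit_error)

# Post-treatment divergence between the treated unit and its synthetic twin.
effect = sum(treated[t] - synthetic(weights, t)
             for t in range(pre_periods, periods)) / post_periods

print(f"weights={weights}  estimated effect={effect:.2f}  (true effect is 5.0)")
```

Only pre-treatment periods are used to choose the weights. The post-treatment gap is then read off as the effect estimate.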

Uplift Modeling shifts the question from “what is the average treatment effect?” to “which individuals benefit most from treatment?” Instead of estimating one number, you estimate a conditional treatment effect for each individual and use it to target interventions. This is where causal inference meets machine learning at its most practical — optimizing who to treat, not just whether treatment works.
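The simplest uplift estimate is a two-model approach: predict the outcome separately for treated and untreated units, then subtract. The sketch below assumes training data from a randomized campaign and uses per-segment means as the "models"; the segment names, rates, and the 0.05 break-even threshold are all hypothetical.

```python
import random
from collections import defaultdict
from statistics import fmean

random.seed(7)

# Hypothetical purchase rates and true uplift by customer segment.
base_rate = {"new": 0.05, "lapsed": 0.10, "loyal": 0.40}
true_uplift = {"new": 0.02, "lapsed": 0.15, "loyal": 0.00}
segments = list(base_rate)

# Assumes the training data comes from a randomized campaign (coin-flip treatment).
outcomes = defaultdict(list)   # (segment, treated) -> list of purchase flags
for _ in range(150_000):
    segment = random.choice(segments)
    treated = random.random() < 0.5
    p = base_rate[segment] + (true_uplift[segment] if treated else 0.0)
    outcomes[(segment, treated)].append(random.random() < p)

# Two-model estimate: one outcome model per arm (here just a mean), then subtract.
uplift = {
    s: fmean(outcomes[(s, True)]) - fmean(outcomes[(s, False)])
    for s in segments
}

# Target only segments whose estimated uplift clears the break-even threshold.
targets = sorted(s for s, u in uplift.items() if u > 0.05)

for s in segments:
    print(f"{s:>7}: estimated uplift={uplift[s]:+.3f}")
print("target:", targets)
```

The loyal segment has the highest purchase rate but no uplift, so a model that ranks customers by purchase probability would spend the budget on exactly the wrong people.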

How to Choose

The decision tree is shorter than you might expect:

Can you run a randomized experiment? Do it. Nothing in this chapter is a substitute for randomization. These methods exist for when experimentation is not feasible.
Do you have rich covariate data and believe you have measured all confounders? Use propensity score matching or IPW (Section 9.1). This is common in marketing and product analytics where user-level features are abundant.
Do you have panel data — the same units observed before and after treatment? Use difference-in-differences (Section 9.2). This is the standard in policy evaluation and geo-experiments. It handles time-invariant unobserved confounders that propensity scores cannot.
Do you have panel data but only one treated unit? Use synthetic control (Section 9.2). Common when a policy rolls out to one state, one market, or one product.
Do you need to optimize who to treat, not just whether to treat? Use uplift modeling (Section 9.2). This bridges causal inference and targeting, and it is where the budget impact of causal thinking becomes most concrete.
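The checklist above can be written down as a small helper. This is a hypothetical function for illustration, not part of any library, and it returns every method whose precondition holds because the questions are not mutually exclusive.

```python
def choose_method(can_randomize, confounders_measured, has_panel_data,
                  n_treated_units, need_targeting):
    """Return the methods suggested by the checklist (hypothetical helper)."""
    if can_randomize:
        # Nothing else is a substitute for randomization.
        return ["randomized experiment"]
    suggestions = []
    if confounders_measured:
        suggestions.append("propensity score matching or IPW")
    if has_panel_data:
        # A single treated unit calls for synthetic control instead of DiD.
        suggestions.append("synthetic control" if n_treated_units == 1
                           else "difference-in-differences")
    if need_targeting:
        suggestions.append("uplift modeling")
    return suggestions or ["no listed method applies; revisit the study design"]

# Example: a policy rolled out to one market, with before/after data.
print(choose_method(can_randomize=False, confounders_measured=False,
                    has_panel_data=True, n_treated_units=1, need_targeting=False))
```

The inputs are judgment calls rather than facts you can read off the data. In particular, `confounders_measured` encodes an untestable belief.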

These methods are not mutually exclusive. You might use DiD to show a policy worked on average, then use uplift models to decide which customer segments to target in the next campaign. The methods answer different questions, and production systems often chain them.

A Warning About Causal Language

Language matters in causal inference more than in any other area of data science. The phrase “X is associated with Y” is a statement about data. The phrase “X causes Y” is a claim about the world. You will see data scientists, journalists, and executives swap between these statements as if they were interchangeable. They are not.

Be precise in your reports and presentations. If you ran an observational analysis, say “we estimate that X is associated with a Y-unit increase in Z, after controlling for W.” If you ran a randomized experiment, you can say “X caused a Y-unit increase in Z.” If you used DiD with plausible assumptions, say “we estimate that X caused a Y-unit increase, under the parallel trends assumption.” The assumptions belong in the sentence — not buried in an appendix that nobody reads.

What You Need Before Starting

Causal inference is not a statistical trick you apply after the fact. It requires thinking carefully about the data-generating process before you write any code. The most important step — one that most practitioners skip — is drawing a causal graph: a Directed Acyclic Graph (DAG) representing your assumptions about which variables cause which. The DAG tells you which variables to control for, which to leave out, and whether your problem is even identifiable from the data you have.

A DAG is not a formality. It is a commitment to a set of scientific claims about how the world works, and different DAGs lead to different analysis strategies. Two equally competent analysts who disagree about the DAG will reach different conclusions from the same data, and that disagreement is legitimate: the data alone cannot tell you which variables are confounders and which are mediators. That requires domain knowledge, and domain knowledge should be explicit, debatable, and recorded, which is exactly what a DAG provides.
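A DAG can be recorded in code as plainly as on a whiteboard. The sketch below encodes one hypothetical graph for the discount example (the `site_visit` and `support_ticket` variables are invented for illustration) and applies a simplified rule: adjust for variables that cause the treatment and also reach the outcome by a path avoiding the treatment, and leave out anything downstream of the treatment. This is a heuristic for small graphs, not a full implementation of the backdoor criterion.

```python
# Hypothetical DAG for the discount example: each key lists its direct causes.
parents = {
    "purchase_intent": [],
    "discount": ["purchase_intent"],
    "site_visit": ["discount"],                   # mediator on the causal path
    "purchase": ["purchase_intent", "discount", "site_visit"],
    "support_ticket": ["discount", "purchase"],   # caused by treatment and outcome
}

def ancestors(node, skip=None):
    """Variables with a directed path into `node` that avoids `skip`."""
    found = set()
    stack = [p for p in parents[node] if p != skip]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(p for p in parents[current] if p != skip)
    return found

def descendants(node):
    """Variables that `node` has a directed path into."""
    return {other for other in parents if node in ancestors(other)}

treatment, outcome = "discount", "purchase"

# Common causes: upstream of the treatment, and upstream of the outcome
# through at least one path that does not pass through the treatment.
adjust_for = ancestors(treatment) & ancestors(outcome, skip=treatment)
# Downstream of the treatment: mediators and colliders stay out of the adjustment set.
do_not_adjust = descendants(treatment) - {outcome}

print("adjust for:", sorted(adjust_for))
print("do not adjust for:", sorted(do_not_adjust))
```

An analyst who instead believes `site_visit` causes `discount` would edit one line of `parents` and get a different adjustment set from the same code, which is the disagreement the DAG is meant to surface.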

The next two sections will make all of this concrete. Section 9.1 covers confounders, colliders, and propensity score methods — the toolkit for when you have rich covariate data. Section 9.2 covers difference-in-differences, synthetic control, and uplift modeling — the toolkit for when you have panel data or need individual-level targeting.

Every method in this chapter can produce precise, confident, wrong answers if its assumptions are violated. The code will be straightforward. The hard part is knowing when to trust the results.