Confounders and Observational Methods

9.1 — Correlation vs. Causation: The Confounder Problem

Ice cream sales and drowning deaths are positively correlated. Should you ban ice cream? Of course not — summer causes both. The temperature is a confounder: a variable that influences both the treatment (ice cream sales) and the outcome (drowning deaths), creating a spurious association between them.

This is obvious with ice cream. It is far less obvious when your data shows that customers who receive a promotional email spend 40% more, or that users who adopt a premium feature have 3× lower churn. In both cases, a confounder — purchase intent, engagement level — is driving both the treatment and the outcome, and the “effect” you are measuring is partly or entirely an artifact.

Confounders, Colliders, and the Causal Graph

A Directed Acyclic Graph (DAG) is the most important tool in causal inference. It makes your assumptions about the data-generating process explicit and machine-readable. Three structures appear repeatedly:

Confounder (fork): $X \leftarrow Z \rightarrow Y$. The variable $Z$ causes both $X$ and $Y$. If you do not control for $Z$, the association between $X$ and $Y$ is biased. You must condition on confounders.

Mediator (chain): $X \rightarrow Z \rightarrow Y$. The variable $Z$ is on the causal path from $X$ to $Y$. If you condition on $Z$, you block the causal effect you are trying to measure. You must not condition on mediators (unless you want the direct effect only).

Collider: $X \rightarrow Z \leftarrow Y$. The variable $Z$ is caused by both $X$ and $Y$. Conditioning on a collider creates a spurious association between $X$ and $Y$ that did not exist before. This is the counter-intuitive one — controlling for more variables does not always reduce bias. Sometimes it introduces bias.

The classic collider example: talent and attractiveness are independent in the general population. But among Hollywood actors (the collider — you need either talent or attractiveness to get in), they become negatively correlated. Conditioning on “is an actor” creates a spurious association.

The lesson: you cannot blindly throw all variables into your model and assume more controls = less bias. The DAG tells you which variables to condition on and which to leave alone. Drawing the DAG is not optional.

Simpson’s Paradox: When Aggregation Lies

Simpson’s paradox occurs when a trend that appears in aggregated data reverses when you split by a confounding variable. It is not a curiosity — it is a direct consequence of confounders, and it shows up in real business data more often than you think.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def simulate_simpsons_paradox(n: int = 2000) -> pd.DataFrame:
    """Simulate a dataset where the treatment appears harmful overall
    but is beneficial within each subgroup.

    Scenario: A company runs a promotional discount (treatment).
    High-value customers are more likely to receive the discount
    AND more likely to buy regardless. The confounder is customer segment.
    """
    # Customer segment: 0 = low-value, 1 = high-value
    segment = rng.binomial(1, 0.4, size=n)

    # Treatment assignment depends on segment (confounder!)
    # High-value customers are MORE likely to get the discount
    treatment_prob = np.where(segment == 1, 0.7, 0.3)
    treatment = rng.binomial(1, treatment_prob)

    # Outcome depends on both segment AND treatment
    # Treatment has a POSITIVE effect (+10 pp) within each segment
    # But high-value customers have higher baseline purchase rates
    base_rate = np.where(segment == 1, 0.6, 0.2)
    treatment_effect = 0.10  # True causal effect: +10 percentage points
    purchase_prob = base_rate + treatment * treatment_effect
    purchase = rng.binomial(1, purchase_prob)

    return pd.DataFrame({
        "segment": segment,
        "treatment": treatment,
        "purchase": purchase,
    })


df = simulate_simpsons_paradox()

# Aggregated view: treatment looks LESS effective (or even harmful)
agg = df.groupby("treatment")["purchase"].mean()
print("=== Aggregated (ignoring segment) ===")
print(f"  No treatment: {agg[0]:.3f}")
print(f"  Treatment:    {agg[1]:.3f}")
print(f"  Naive 'effect': {agg[1] - agg[0]:+.3f}")

# Stratified view: treatment is beneficial in BOTH segments
print("\n=== Stratified by segment ===")
for seg in [0, 1]:
    sub = df[df["segment"] == seg]
    rates = sub.groupby("treatment")["purchase"].mean()
    print(f"  Segment {seg}: no treatment={rates[0]:.3f}, "
          f"treatment={rates[1]:.3f}, effect={rates[1] - rates[0]:+.3f}")

The aggregated comparison shows the treatment has a small or negative effect. The stratified comparison shows a positive effect in both segments. The aggregated result is wrong because it mixes two groups with different baseline rates and different treatment probabilities. The confounder (segment) determines both who gets treated and who buys, and ignoring it produces a misleading result.

The fix is not always “stratify.” The fix is to identify the confounders using a causal graph and adjust for them appropriately. Sometimes that means stratification, sometimes regression, sometimes the methods we cover next.

9.2 — Propensity Score Methods

The fundamental challenge: treated and untreated groups differ on observed covariates, and those covariates affect the outcome. Propensity score methods tackle this by modeling the treatment assignment mechanism itself.

The Propensity Score

The propensity score is defined as $e(X) = P(\text{Treatment} = 1 \mid X)$ — the probability of receiving treatment given observed covariates $X$. The key theoretical result, due to Rosenbaum and Rubin (1983): conditioning on the propensity score achieves the same balance as conditioning on all the covariates simultaneously. Instead of matching on dozens of variables, you match on one number.

This reduces a high-dimensional problem to a one-dimensional one. But the propensity score is not a silver bullet — it only balances on observed covariates. Unobserved confounders remain a threat.

Estimating the Propensity Score

Logistic regression is the standard choice. Avoid overly complex models (gradient-boosted trees with 500 trees) — the goal is a well-specified model of treatment assignment, not maximum predictive accuracy.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)


def generate_observational_data(n: int = 3000) -> pd.DataFrame:
    """Simulate an observational study where treatment assignment
    is confounded by age, income, and prior engagement.

    True treatment effect: +500 in revenue.
    """
    age = rng.normal(45, 12, n).clip(18, 80)
    income = rng.normal(60_000, 20_000, n).clip(20_000, 150_000)
    prior_engagement = rng.poisson(5, n)

    # Treatment assignment depends on covariates (confounding)
    logit = (
        -2.0
        + 0.03 * (age - 45)
        + 0.00002 * (income - 60_000)
        + 0.15 * (prior_engagement - 5)
    )
    treatment_prob = 1 / (1 + np.exp(-logit))
    treatment = rng.binomial(1, treatment_prob)

    # Outcome depends on covariates AND treatment
    true_effect = 500  # The causal effect we want to recover
    revenue = (
        200
        + 5 * age
        + 0.005 * income
        + 50 * prior_engagement
        + true_effect * treatment
        + rng.normal(0, 300, n)
    )

    return pd.DataFrame({
        "age": age,
        "income": income,
        "prior_engagement": prior_engagement,
        "treatment": treatment,
        "revenue": revenue,
    })


df = generate_observational_data()

# Naive estimate (biased because treated group has higher income, etc.)
naive_effect = (
    df[df["treatment"] == 1]["revenue"].mean()
    - df[df["treatment"] == 0]["revenue"].mean()
)
print(f"Naive estimate of treatment effect: {naive_effect:.0f}")
print(f"True treatment effect: 500")
print(f"Bias: {naive_effect - 500:.0f}\n")

# Estimate propensity scores
covariates = ["age", "income", "prior_engagement"]
X = df[covariates].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(X_scaled, df["treatment"].values)

df["propensity_score"] = ps_model.predict_proba(X_scaled)[:, 1]

print("Propensity score distribution:")
print(f"  Treated:   mean={df.loc[df['treatment']==1, 'propensity_score'].mean():.3f}, "
      f"std={df.loc[df['treatment']==1, 'propensity_score'].std():.3f}")
print(f"  Untreated: mean={df.loc[df['treatment']==0, 'propensity_score'].mean():.3f}, "
      f"std={df.loc[df['treatment']==0, 'propensity_score'].std():.3f}")

The naive estimate is biased upward — treated individuals have higher income and engagement, which inflates their revenue regardless of treatment. The propensity score captures this treatment assignment mechanism.

Propensity Score Matching

Matching pairs each treated unit with the most similar untreated unit based on propensity score. After matching, you have a pseudo-randomized dataset where treated and untreated groups are balanced on measured covariates.

from scipy.spatial import KDTree
from typing import NamedTuple


class MatchResult(NamedTuple):
    ate: float
    matched_treated: np.ndarray
    matched_control: np.ndarray
    smd_before: dict[str, float]
    smd_after: dict[str, float]


def propensity_score_match(
    df: pd.DataFrame,
    ps_col: str = "propensity_score",
    treatment_col: str = "treatment",
    outcome_col: str = "revenue",
    covariates: list[str] | None = None,
    caliper: float = 0.05,
) -> MatchResult:
    """Nearest-neighbor propensity score matching with caliper.

    Caliper = maximum allowed distance in propensity score space.
    Without a caliper, bad matches inflate bias.
    """
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    # Build KD-tree on control propensity scores
    control_ps = control[ps_col].values.reshape(-1, 1)
    tree = KDTree(control_ps)

    treated_ps = treated[ps_col].values.reshape(-1, 1)
    distances, indices = tree.query(treated_ps, k=1)

    # Apply caliper: discard matches with distance > caliper
    mask = distances.flatten() <= caliper
    matched_treated_idx = treated.index[mask]
    matched_control_idx = control.index[indices.flatten()[mask]]

    matched_treated = df.loc[matched_treated_idx]
    matched_control = df.loc[matched_control_idx]

    # ATE from matched pairs
    ate = (
        matched_treated[outcome_col].values
        - matched_control[outcome_col].values
    ).mean()

    # Balance check: standardized mean differences
    smd_before: dict[str, float] = {}
    smd_after: dict[str, float] = {}
    if covariates:
        for cov in covariates:
            pooled_std = df[cov].std()
            smd_before[cov] = (
                treated[cov].mean() - control[cov].mean()
            ) / pooled_std
            smd_after[cov] = (
                matched_treated[cov].mean() - matched_control[cov].mean()
            ) / pooled_std

    return MatchResult(ate, matched_treated.values, matched_control.values,
                       smd_before, smd_after)


result = propensity_score_match(df, covariates=covariates)

print(f"\n=== Propensity Score Matching ===")
print(f"Matched ATE: {result.ate:.0f} (true: 500)")
print(f"\nCovariate balance (Standardized Mean Difference):")
print(f"  {'Covariate':<20} {'Before':>10} {'After':>10}")
print(f"  {'-'*40}")
for cov in covariates:
    print(f"  {cov:<20} {result.smd_before[cov]:>10.3f} {result.smd_after[cov]:>10.3f}")
print(f"\n  Goal: |SMD| < 0.1 after matching")

The balance check is not optional. If the standardized mean differences (SMDs) remain large after matching (|SMD| > 0.1), the matching has failed to create comparable groups and the treatment effect estimate should not be trusted. Common remedies: adjust the caliper, add interaction terms to the propensity model, or switch to a different method.

Propensity Score Matching

Inverse Probability Weighting (IPW)

Matching discards unmatched units, which wastes data and can reduce precision. IPW uses all observations but reweights them: treated units receive weight $1/e(X)$ and untreated units receive weight $1/(1-e(X))$. This creates a pseudo-population where treatment is independent of covariates.

The intuition: a treated unit with propensity score 0.9 is “expected” — lots of similar people were treated. It gets a weight of $1/0.9 \approx 1.1$. A treated unit with propensity score 0.1 is “surprising” — few similar people were treated, so it carries more information about the treatment effect. It gets weight $1/0.1 = 10$. This upweighting of unlikely-to-be-treated individuals among the treated group, and unlikely-to-be-untreated individuals among the control group, mimics what a randomized experiment would produce.

from scipy import stats


def ipw_ate(
    df: pd.DataFrame,
    ps_col: str = "propensity_score",
    treatment_col: str = "treatment",
    outcome_col: str = "revenue",
    trimming_threshold: float = 0.05,
) -> tuple[float, float, float]:
    """Estimate ATE using Inverse Probability Weighting.

    Returns (ATE, standard_error, 95% CI width).

    Trimming: discard units with extreme propensity scores
    to avoid infinite weights blowing up the estimate.
    """
    # Trim extreme propensity scores
    mask = (
        (df[ps_col] >= trimming_threshold)
        & (df[ps_col] <= 1 - trimming_threshold)
    )
    trimmed = df[mask].copy()
    n_trimmed = len(df) - len(trimmed)
    if n_trimmed > 0:
        print(f"  Trimmed {n_trimmed} units with extreme propensity scores")

    ps = trimmed[ps_col].values
    t = trimmed[treatment_col].values
    y = trimmed[outcome_col].values

    # IPW weights
    weights = t / ps + (1 - t) / (1 - ps)

    # Weighted outcomes
    weighted_treated = (t * y / ps).sum() / (t / ps).sum()
    weighted_control = ((1 - t) * y / (1 - ps)).sum() / ((1 - t) / (1 - ps)).sum()

    ate = weighted_treated - weighted_control

    # Standard error via influence function (Horvitz-Thompson)
    scores = t * y / ps - (1 - t) * y / (1 - ps) - ate * (t / ps - (1 - t) / (1 - ps))
    se = np.sqrt(np.var(scores, ddof=1) / len(scores))

    ci_95 = stats.norm.ppf(0.975) * se

    return ate, se, ci_95


ate, se, ci = ipw_ate(df)
print(f"\n=== Inverse Probability Weighting ===")
print(f"  ATE estimate: {ate:.0f}")
print(f"  Standard error: {se:.0f}")
print(f"  95% CI: [{ate - ci:.0f}, {ate + ci:.0f}]")
print(f"  True effect: 500")

Two critical implementation details:

Trimming. Propensity scores near 0 or 1 create enormous weights that dominate the estimate and inflate variance. A common practice is to trim observations with propensity scores below 0.05 or above 0.95. This introduces a small bias but dramatically reduces variance — a favorable trade-off in practice.

Positivity violation. If some covariate combination makes treatment deterministic (propensity score = 0 or 1), IPW cannot estimate the treatment effect for those individuals. This is called a positivity violation. Check the propensity score distribution — if there is no overlap between treated and control groups at certain scores, no statistical method can reliably estimate causal effects in that region.

When Propensity Score Methods Fail

Both matching and IPW share the same Achilles’ heel: the unconfoundedness assumption (also called “ignorability” or “selection on observables”). This assumption states that, conditional on observed covariates, treatment assignment is independent of potential outcomes. In plain language: after you control for everything you measured, there are no remaining hidden variables that affect both treatment and outcome.

This assumption is untestable. You cannot check whether unobserved confounders exist — that is a domain knowledge question, not a statistical one. If a sales team uses private customer conversations to decide who gets a discount, and those conversations are not in your data, no propensity score model will remove the bias.

Practical guidance for assessing unconfoundedness:

Ask the treatment assigner. Talk to the person or system that decided who gets treated. Understand the decision rules. If the decision depends on variables you have not measured, you have a problem.
Include all plausible confounders. Err on the side of including too many covariates rather than too few — but never include post-treatment variables (variables measured after treatment was applied), as these are mediators or colliders, not confounders.
Report honestly. State which confounders you controlled for and which plausible confounders might be missing. An honest analysis with acknowledged limitations is more valuable than a precise analysis with hidden assumptions.

When you lack confidence in unconfoundedness and you have panel data (observations over time), difference-in-differences — covered in the next section — offers an alternative that relies on a different, sometimes more plausible, set of assumptions.