Causal inference is the branch of data analysis concerned with answering "what if" questions — what would happen to an outcome Y if we changed a treatment or exposure X? Unlike simple correlation, causal inference seeks to isolate the effect of X on Y even when we only observe data passively (i.e. without running a controlled experiment). This matters whenever we want to guide policy, business decisions, or scientific conclusions — for example, estimating how a new marketing campaign will boost sales, whether a scholarship improves student grades, or how smoking affects health.

No single tool handles every scenario, because real-world data are messy: they may be confounded by unobserved factors, shaped by eligibility thresholds, or driven by time trends. Instead, researchers choose among a palette of methods, each tailored to a particular data structure and set of identifying assumptions.

In the simplest linear setting, with a model of the form Y = α + 𝛽X + ε, causal inference seeks to estimate the coefficient 𝛽 that answers "if we change X, how much does Y change?"

Core Methods

Controlled Regression

Regression Discontinuity (RDD)

Difference-in-Differences (DiD)

Instrumental Variables (IV)

ML + Causal Inference

1. Controlled Regression: Isolating the Effect of One Variable

When and Why to Use

Controlled regression is used when you want to understand the effect of one variable (X) on an outcome (Y) while accounting for other influencing factors (confounders, denoted as W). This method is especially useful for observational data where randomization is not possible.

Logic Behind the Method

The idea is to "control" for confounding variables by including them as additional predictors. This isolates the specific contribution of the treatment variable. The estimate approximates what a randomized experiment would show only if every relevant confounder is included in the model.
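To make the role of a confounder concrete, here is a minimal sketch using simulated data. The variable names, coefficients, and the helper `ols` are illustrative assumptions, not part of the example that follows. It compares a regression that omits the confounder W with one that includes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical setup: W drives both the treatment X and the outcome Y
w = rng.normal(0, 1, n)                       # confounder
x = 0.8 * w + rng.normal(0, 1, n)             # treatment depends on W
y = 2.0 * x + 3.0 * w + rng.normal(0, 1, n)   # true effect of X on Y is 2.0

def ols(columns, target):
    """Least-squares coefficients, with the intercept in position 0."""
    design = np.column_stack([np.ones(len(target))] + list(columns))
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

naive_beta = ols([x], y)[1]          # omits W, so it absorbs part of W's effect
controlled_beta = ols([x, w], y)[1]  # includes W as a control

print(f"Naive estimate:      {naive_beta:.2f}")
print(f"Controlled estimate: {controlled_beta:.2f}")
```

In this setup the naive coefficient is pushed well above the true value of 2.0, while the controlled coefficient should land close to it. The same logic fails silently when the confounder is not observed, which is what motivates the other methods in this article.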

Data Suitability

  • Suitable when you have continuous or categorical variables.
  • Works well when you can observe and measure all the relevant confounders.
  • Ideal for cross-sectional or panel data.

Python Code: Controlled Regression Example

Scenario: Examining the effect of exercise hours on weight loss, while controlling for age and BMI.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulating data
np.random.seed(0)
n = 200
age = np.random.normal(40, 10, n)  # Age with mean 40 and std 10
bmi = np.random.normal(25, 4, n)   # Baseline BMI
exercise = np.random.uniform(0, 5, n)  # Exercise hours per week
# Weight loss: each additional hour of exercise increases weight loss by 0.3 kg
weight_loss = 0.3 * exercise - 0.01 * (age - 40) + 0.05 * (bmi - 25) + np.random.normal(0, 0.5, n)

df = pd.DataFrame({'weight_loss': weight_loss, 'exercise': exercise, 'age': age, 'bmi': bmi})

# Fitting the controlled regression model
model = smf.ols('weight_loss ~ exercise + age + bmi', data=df).fit()
print(model.summary())

Interpreting Results:

  • The coefficient on exercise estimates the change in weight loss per additional hour of exercise, holding age and BMI fixed. It has a causal interpretation only if no relevant confounder has been left out.
  • The p-value indicates whether the effect is statistically significant.
  • The R-squared value shows how much variance in weight loss is explained by the model.

2. Regression Discontinuity Design (RDD): Exploiting Thresholds

When and Why to Use

RDD is used when treatment assignment occurs based on whether a variable crosses a specific cutoff. It is ideal when you have a sharp cutoff (like a policy change at a specific score or age).

Logic Behind the Method

If the running variable is close to the cutoff, individuals just above and below the threshold are assumed to be similar. The difference in outcomes at the cutoff can then be attributed to the treatment.
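The "similar just above and below" idea can be sketched as a plain comparison of means inside a window around the cutoff. The data-generating process below (a jump of 0.4 plus a mild linear trend) and the helper `local_jump` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
cutoff = 80.0

score = rng.uniform(60, 100, n)
treated = score >= cutoff
# Assumed process: a jump of 0.4 at the cutoff plus a smooth trend in the score
gpa = 2.5 + 0.4 * treated + 0.01 * (score - cutoff) + rng.normal(0, 0.2, n)

def local_jump(window):
    """Difference in mean outcome just above vs. just below the cutoff."""
    above = gpa[(score >= cutoff) & (score < cutoff + window)]
    below = gpa[(score < cutoff) & (score >= cutoff - window)]
    return above.mean() - below.mean()

for window in (10.0, 5.0, 1.0):
    print(f"window ±{window:>4}: estimated jump = {local_jump(window):.3f}")
```

Wide windows mix the jump with the underlying trend in the running variable, while narrow windows are less biased but use fewer observations. Including the running variable in a regression, as in the example below, is one way to ease that trade-off.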

Data Suitability

  • Use when a clear cutoff rule determines treatment.
  • Works well if you have data on both sides of the cutoff for comparison.

Python Code: RDD Example

Scenario: Estimating the effect of a scholarship on GPA where students with a score of 80 or higher receive the scholarship.

import numpy as np
import pandas as pd
import statsmodels.api as sm

np.random.seed(1)
n = 300
score = np.random.uniform(60, 100, n)  # Test scores between 60 and 100
treat = (score >= 80).astype(int)  # Treatment if score >= 80
# GPA increases by 0.4 with scholarship
gpa = 2.5 + 0.4 * treat + 0.01 * (score - 80) + np.random.normal(0, 0.2, n)

df = pd.DataFrame({'score': score, 'gpa': gpa, 'treat': treat})
df['running'] = df['score'] - 80  # Running variable centered at the cutoff

# Local linear RDD: keep observations within a bandwidth of 10 points
df = df[np.abs(df['running']) <= 10].copy()
# The interaction lets the slope differ on each side of the cutoff
df['treat_x_running'] = df['treat'] * df['running']
X = sm.add_constant(df[['treat', 'running', 'treat_x_running']])
rdd_model = sm.OLS(df['gpa'], X).fit()
print(rdd_model.summary())

Interpreting Results:

  • The coefficient on treat estimates the jump in GPA at the cutoff, i.e. the effect of the scholarship for students whose scores are close to 80.
  • A significant positive value suggests that the scholarship positively impacts students' performance.

3. Difference-in-Differences (DiD): Tracking Changes Over Time

When and Why to Use

DiD is used to compare changes in outcomes between treated and control groups before and after a treatment. It is useful when you have a natural experiment or a policy change affecting one group.

Logic Behind the Method

It relies on the parallel trends assumption: if no treatment had occurred, the difference between groups would have remained constant over time. By comparing the changes, DiD isolates the treatment effect.
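The arithmetic behind DiD can be written out with four group means. This is a sketch on simulated data; the shared trend of +5, the baseline gap of 10, and the effect of 20 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000  # observations per group-period cell

# Assumed setup: both groups share a +5 time trend; the treated group starts
# 10 units higher and gains an extra 20 units from the treatment.
control_pre = 100 + rng.normal(0, 5, n)
control_post = 100 + 5 + rng.normal(0, 5, n)
treated_pre = 110 + rng.normal(0, 5, n)
treated_post = 110 + 5 + 20 + rng.normal(0, 5, n)

change_treated = treated_post.mean() - treated_pre.mean()   # trend + effect
change_control = control_post.mean() - control_pre.mean()   # trend only
did_estimate = change_treated - change_control              # effect

print(f"Before/after change, treated: {change_treated:.1f}")
print(f"Before/after change, control: {change_control:.1f}")
print(f"Difference-in-differences:    {did_estimate:.1f}")
```

A simple before/after comparison in the treated group would mix the trend with the effect, and a simple post-period comparison between groups would mix in the baseline gap. Subtracting the control group's change removes both, provided the trend really is shared.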

Data Suitability

  • Requires panel data or repeated cross-sections observed over time.
  • Must have a treated and control group with pre- and post-treatment observations.

Python Code: DiD Example

Scenario: Analyzing the effect of a marketing campaign on sales in Region A compared to Region B.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(2)
data = []
for region in ['A', 'B']:
    for period in ['pre', 'post']:
        sales = 100 + (20 if region == 'A' and period == 'post' else 0) + np.random.normal(0, 5, 10)
        for sale in sales:
            data.append({'region': region, 'period': period, 'sales': sale})

df = pd.DataFrame(data)
df['post'] = (df['period'] == 'post').astype(int)
df['treat'] = (df['region'] == 'A').astype(int)

# DiD regression model
model = smf.ols('sales ~ post + treat + post:treat', data=df).fit()
print(model.summary())

Interpreting Results:

  • The interaction coefficient (post:treat) shows the difference in changes between treated and control groups.
  • A significant value indicates that the marketing campaign had a measurable impact on sales.

4. Instrumental Variables (IV): Tackling Endogeneity

When and Why to Use

Instrumental Variables (IV) are used when the treatment variable (X) is endogenous — meaning it is correlated with unobserved variables that also affect the outcome (Y). This endogeneity can bias the estimation of the causal effect. IVs help address this problem by leveraging an instrument (Z) that influences the treatment but not the outcome directly.

Logic Behind the Method

The basic idea of IV is to find a variable (Z) that:

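  1. Affects the treatment (X): this is called relevance.
  2. Does not affect the outcome (Y) except through the treatment: this is called the exclusion restriction.
  3. Is unrelated to the unobserved factors that affect the outcome: this is called independence (or exogeneity).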

The IV method uses two stages:

  1. First Stage: Estimate how the instrument affects the treatment.
  2. Second Stage: Use the predicted treatment from the first stage to estimate the effect on the outcome.
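The two stages can be sketched with plain least squares. The simulated variables, their coefficients, and the helper `slope_and_intercept` are illustrative assumptions. With a single instrument, the two-stage estimate equals the ratio cov(Z, Y) / cov(Z, X), which the last lines check.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000

z = rng.normal(0, 1, n)                       # instrument
u = rng.normal(0, 1, n)                       # unobserved confounder
x = 1.0 * z + 1.0 * u + rng.normal(0, 1, n)   # treatment
y = 2.0 * x + 3.0 * u + rng.normal(0, 1, n)   # true effect of X on Y is 2.0

def slope_and_intercept(regressor, target):
    """Simple linear regression of target on one regressor."""
    design = np.column_stack([np.ones(len(target)), regressor])
    (intercept, slope), *_ = np.linalg.lstsq(design, target, rcond=None)
    return slope, intercept

# Naive OLS is biased because u is omitted and correlated with x
ols_slope, _ = slope_and_intercept(x, y)

# First stage: predict the treatment from the instrument
fs_slope, fs_intercept = slope_and_intercept(z, x)
x_hat = fs_intercept + fs_slope * z

# Second stage: regress the outcome on the predicted treatment
iv_slope, _ = slope_and_intercept(x_hat, y)

# Equivalent ratio form for a single instrument
ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate:  {ols_slope:.2f}")
print(f"2SLS estimate: {iv_slope:.2f}")
print(f"Ratio form:    {ratio:.2f}")
```

The predicted treatment keeps only the variation in X that comes from Z, which by assumption is unrelated to the confounder. That is why the second-stage slope should sit near the true effect while the OLS slope does not.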

Data Suitability

  • Use when you suspect endogeneity or reverse causality.
  • Requires a valid instrument that influences the treatment but not the outcome directly.
  • Works well when randomization is not possible.

Python Code: IV Example

Scenario: Estimating the effect of education (X) on earnings (Y), using distance to college (Z) as an instrument.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.sandbox.regression.gmm import IV2SLS

# Simulating data
np.random.seed(3)
n = 250
distance = np.random.normal(0, 1, n)  # Instrument: distance to college
ability = np.random.normal(0, 1, n)   # Unobserved factor (confounder)
education = 12 + 2 * distance + ability + np.random.normal(0, 1, n)  # Treatment
earnings = 20000 + 3000 * education + 500 * ability + np.random.normal(0, 10000, n)  # Outcome

df = pd.DataFrame({'earnings': earnings, 'education': education, 'distance': distance})

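# IV regression using 2SLS, shown stage by stage for intuition
# First stage: Estimate education using distance
X = sm.add_constant(df[['distance']])
y = df['education']
first_stage = sm.OLS(y, X).fit()
print(f"First-stage F-statistic: {first_stage.fvalue:.1f}")

# Second stage: Estimate earnings using predicted education from the first stage
df['predicted_education'] = first_stage.predict(X)
second_stage = sm.OLS(df['earnings'], sm.add_constant(df['predicted_education'])).fit()
print(second_stage.summary())

# The manual second stage gives the 2SLS coefficient but not valid standard errors.
# IV2SLS fits both stages together and reports corrected standard errors.
iv_model = IV2SLS(df['earnings'], sm.add_constant(df[['education']]), instrument=X).fit()
print(iv_model.summary())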

Interpreting Results:

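  • The coefficient on predicted education estimates the causal effect of education on earnings, provided the instrument is valid.
  • The instrument's strength (relevance) can be checked with the first-stage F-statistic; a common rule of thumb is F > 10. The exclusion restriction cannot be tested this way and has to be argued from subject-matter knowledge.
  • The second-stage coefficient indicates how much an additional year of education increases earnings. Standard errors from a manually run second stage are not valid; a dedicated 2SLS routine corrects them.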
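As a rough illustration of the first-stage check, the sketch below computes the F-statistic for a strong and a weak instrument. The simulated coefficients and the helper `first_stage_f` are assumptions; the formula used holds for a first stage with one instrument and an intercept.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 250

def first_stage_f(instrument, treatment):
    """F-statistic of a one-instrument first stage: R^2 (n - 2) / (1 - R^2)."""
    r = np.corrcoef(instrument, treatment)[0, 1]
    return r**2 * (len(treatment) - 2) / (1 - r**2)

z = rng.normal(0, 1, n)
ability = rng.normal(0, 1, n)
noise = rng.normal(0, 1, n)

# Hypothetical treatments: the instrument moves one a lot and the other barely
education_strong = 12 + 2.0 * z + ability + noise
education_weak = 12 + 0.05 * z + ability + noise

f_strong = first_stage_f(z, education_strong)
f_weak = first_stage_f(z, education_weak)

print(f"First-stage F, strong instrument: {f_strong:.1f}")
print(f"First-stage F, weak instrument:   {f_weak:.1f}")
```

A low F-statistic signals that the instrument explains little of the treatment, in which case IV estimates tend to be noisy and biased toward the OLS estimate.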

Why Use IV Instead of Simple Regression?

Simple regression would be biased due to unobserved factors like ability, which affects both education and earnings.

IV helps isolate the exogenous variation in education driven by distance to college, which is assumed not to directly affect earnings.

5. Causal Machine Learning: Leveraging Complex Data Structures

When and Why to Use

Causal Machine Learning is employed when:

  • You have high-dimensional data with many variables.
  • The relationship between treatment and outcome may be nonlinear or heterogeneous.
  • You want to estimate how treatment effects vary across individuals or subgroups rather than a single average effect.

Logic Behind the Method

Traditional causal methods can struggle when there are many covariates or when relationships are strongly nonlinear. Causal machine learning combines flexible algorithms (like trees and forests) with causal inference techniques to handle:

  1. High-dimensional covariates.
  2. Nonlinear interactions.
  3. Complex relationships between treatment and outcome.

Key Techniques:

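  1. Double Selection (Lasso): Runs Lasso twice, once to select covariates that predict the treatment and once to select covariates that predict the outcome, then controls for the union of both sets.
  2. Double/Debiased Machine Learning (DML): Uses ML models to predict both the treatment and the outcome from the covariates, then estimates the effect by regressing the outcome residuals on the treatment residuals, usually with cross-fitting.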
  3. Causal Forests: Builds a model to estimate heterogeneous treatment effects for different subgroups.
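The residual-on-residual idea behind DML can be sketched by hand with scikit-learn. This is a simplified illustration under assumed data, not the EconML implementation: the data-generating process, the forest settings, and the helper `out_of_fold_prediction` are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
n = 2000

# Assumed process: covariates affect treatment and outcome nonlinearly
w = rng.uniform(-2, 2, (n, 3))
t = np.sin(w[:, 0]) + 0.5 * w[:, 1] ** 2 + rng.normal(0, 1, n)      # treatment
y = 1.5 * t + np.cos(w[:, 0]) + w[:, 1] ** 2 + rng.normal(0, 1, n)  # true effect is 1.5

def out_of_fold_prediction(features, target):
    """Cross-fitted predictions: each point is predicted by a model not trained on it."""
    model = RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
    return cross_val_predict(model, features, target, cv=5)

# Step 1: remove what the covariates explain from both treatment and outcome
t_resid = t - out_of_fold_prediction(w, t)
y_resid = y - out_of_fold_prediction(w, y)

# Step 2: regress outcome residuals on treatment residuals (no intercept)
theta = (t_resid @ y_resid) / (t_resid @ t_resid)

# For comparison: a simple regression of y on t that ignores the covariates
naive_slope = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

print(f"Naive slope:  {naive_slope:.2f}")
print(f"DML estimate: {theta:.2f}")
```

Cross-fitting matters here: if the forests predicted the same points they were trained on, overfitting would leak into the residuals and distort the final estimate.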

Python Code: Causal Machine Learning Example

Scenario: Predicting the effect of a marketing campaign (X) on sales (Y), accounting for customer demographics.

Double/Debiased Machine Learning (DML)

from sklearn.ensemble import RandomForestRegressor
from econml.dml import LinearDML
import numpy as np
import pandas as pd

# Simulating data
np.random.seed(4)
n = 500
age = np.random.uniform(18, 65, n)
income = np.random.uniform(30000, 100000, n)
campaign = np.random.binomial(1, 0.5, n)  # Treatment: campaign exposure
# True effect: campaign increases sales by 20 units
sales = 50 + 20 * campaign + 0.1 * age + 0.05 * income + np.random.normal(0, 10, n)

df = pd.DataFrame({'sales': sales, 'campaign': campaign, 'age': age, 'income': income})

# Double ML using a random forest as the machine learning model
model = LinearDML(model_y=RandomForestRegressor(), model_t=RandomForestRegressor())
model.fit(df['sales'], df['campaign'], X=df[['age', 'income']])

# Estimate the Average Treatment Effect (ATE)
ate = model.ate(df[['age', 'income']])
print(f"Estimated ATE: {ate}")

# Estimate Conditional Average Treatment Effects (CATE) for the first 5 customers
cate = model.effect(df[['age', 'income']].iloc[:5])
print(f"Estimated CATEs: {cate}")

Interpreting Results:

  • ATE: Shows the overall average effect of the marketing campaign on sales.
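  • CATE: Gives the estimated effect conditional on customer characteristics (here age and income), showing how the impact may vary across demographics. It is an average for customers with those characteristics, not a customer-specific effect.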
  • Flexible Models: Using random forests accounts for nonlinear relationships between variables.

Why Use Causal Machine Learning Instead of Traditional Methods?

  • Traditional models typically assume linear relationships and a modest number of covariates.
  • ML methods can handle complex, high-dimensional data and show how effects differ across individuals with different characteristics.
  • Particularly useful in personalized marketing, precision medicine, and scenarios where treatment effects vary across subgroups.

Final Thoughts

Causal inference is crucial when you need to determine the true effect of a treatment or intervention rather than just observing associations. The choice of method largely depends on the nature of your data, the research question, and the specific challenges you face (like confounding, non-linearity, or complex interactions).

Choosing the Right Method:

  • If you have observational data with measured confounders, start with Controlled Regression. It's straightforward but requires you to account for all relevant confounders.
  • If your data has a sharp cutoff or threshold that determines treatment, Regression Discontinuity Design (RDD) is the most appropriate. It leverages local randomization around the cutoff.
  • When comparing changes over time between treated and untreated groups, use Difference-in-Differences (DiD). This method is reliable if the parallel trends assumption holds.
  • If your treatment variable is endogenous (influenced by unobserved factors), leverage Instrumental Variables (IV). This method requires a valid instrument that affects the treatment but not the outcome directly.
  • If your data is complex, high-dimensional, or non-linear, go for Causal Machine Learning. These methods can uncover heterogeneous effects and adapt to complex relationships.

Causal Inference Method Selection Test

Answer the following Yes/No questions to identify the most suitable causal inference method for your analysis:

1. Controlled Regression

  1. Do you have observational data? (Yes/No)
  2. Can you measure all the confounding variables that may affect the outcome? (Yes/No)
  3. Are your variables continuous or categorical? (Yes/No)
  4. Do you have cross-sectional or panel data? (Yes/No)

If you answered "Yes" to all of the above, Controlled Regression is likely suitable.

2. Regression Discontinuity Design (RDD)

  1. Is your treatment assignment based on a specific cutoff or threshold? (Yes/No)
  2. Can you clearly identify a running variable that determines the assignment? (Yes/No)
  3. Do you have enough data close to the cutoff on both sides? (Yes/No)

If you answered "Yes" to all of the above, RDD is likely suitable.

3. Difference-in-Differences (DiD)

  1. Do you have time series data or data collected at multiple time points? (Yes/No)
  2. Do you have both treated and control groups? (Yes/No)
  3. Can you clearly identify pre- and post-intervention periods for both groups? (Yes/No)
  4. Can you reasonably assume that the treated and control groups would have followed parallel trends if not for the intervention? (Yes/No)

If you answered "Yes" to all of the above, DiD is likely suitable.

4. Instrumental Variables (IV)

  1. Do you suspect your treatment variable (X) is endogenous (affected by unobserved confounders)? (Yes/No)
  2. Do you have a valid instrument (Z) that affects the treatment but not the outcome directly? (Yes/No)
  3. Can you demonstrate that the instrument significantly affects the treatment? (Yes/No)
  4. Is the instrument plausibly unrelated to the outcome except through the treatment? (Yes/No)

If you answered "Yes" to all of the above, IV is likely suitable.

5. Machine Learning + Causal Inference

  1. Do you have high-dimensional data with many potential covariates? (Yes/No)
  2. Do you expect the relationship between variables to be non-linear or complex? (Yes/No)
  3. Are you interested in estimating heterogeneous treatment effects (i.e., effects that vary across subgroups)? (Yes/No)
  4. Do you have access to a large dataset for training models? (Yes/No)

If you answered "Yes" to all of the above, Machine Learning + Causal Inference is likely suitable.

Note:

For each method, if you have answered "Yes" to all questions in that section, the method is likely appropriate for your analysis.

If you have some "No" answers, consider revisiting your data structure and the assumptions required for that method.

If multiple methods are appropriate, choose the one that best fits the structure of your data and the nature of your research question.

By working through this checklist, you can narrow down the causal inference method that best fits your data, which leads to more robust and credible results in your analysis.

Best Practices:

  1. Understand Your Data: Choose a method that aligns with your data structure and the type of treatment assignment.
  2. Check Assumptions: Each method comes with its own set of assumptions (like parallel trends in DiD or no direct path from instrument to outcome in IV). Validate these before interpreting results.
  3. Combine Methods When Necessary: Sometimes combining methods (e.g., RDD with ML for flexibility) can enhance robustness and accuracy.
  4. Robustness Checks: Perform placebo tests, balance diagnostics, or sensitivity analyses to validate causal claims.
  5. Interpret with Caution: Even with sophisticated techniques, results may still be biased if assumptions are violated or if the instrument is weak.
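As one example of a robustness check, a placebo test for DiD re-runs the estimate on periods where no treatment occurred. The sketch below uses simulated data; the four periods, the treatment date, and the helpers `simulate` and `did` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000                   # units per group per period
periods = [0, 1, 2, 3]     # assumption: treatment starts in period 3
true_effect = 20.0

def simulate(group_offset, is_treated):
    """Outcomes per period with a shared +2 trend; the effect appears only in period 3."""
    return {
        p: 100 + group_offset + 2 * p
        + (true_effect if is_treated and p == 3 else 0.0)
        + rng.normal(0, 5, n)
        for p in periods
    }

treated = simulate(group_offset=10, is_treated=True)
control = simulate(group_offset=0, is_treated=False)

def did(before, after):
    """Difference-in-differences between two periods."""
    return ((treated[after].mean() - treated[before].mean())
            - (control[after].mean() - control[before].mean()))

real_estimate = did(before=2, after=3)      # spans the actual treatment date
placebo_estimate = did(before=1, after=2)   # both periods are pre-treatment

print(f"Real DiD estimate:    {real_estimate:.1f}")
print(f"Placebo DiD estimate: {placebo_estimate:.1f}")
```

A placebo estimate close to zero is consistent with parallel trends before the treatment. A large placebo estimate would suggest the groups were already diverging, which undermines the main result.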