<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[NeoForge Labs]]></title><description><![CDATA[NeoForge Labs]]></description><link>https://blog.neoforgelabs.tech</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1761042286610/bc7899d6-1259-4ffc-bbd3-b02ee001a6ac.png</url><title>NeoForge Labs</title><link>https://blog.neoforgelabs.tech</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 23 Apr 2026 13:03:18 GMT</lastBuildDate><atom:link href="https://blog.neoforgelabs.tech/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Part 3: Counterfactual Reasoning with Causal DAGs]]></title><description><![CDATA[In Parts 1 and 2, you learned why causality matters and how to build causal DAGs. Today, we're climbing to Level 3 of Pearl's Ladder: Counterfactual Reasoning.
This is the most powerful form of causal AI—reasoning about alternate realities and answer...]]></description><link>https://blog.neoforgelabs.tech/part-3-counterfactual-reasoning-with-causal-dags</link><guid isPermaLink="true">https://blog.neoforgelabs.tech/part-3-counterfactual-reasoning-with-causal-dags</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Developer]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Kelyn Njeri]]></dc:creator><pubDate>Fri, 16 Jan 2026 04:00:07 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768256359613/366919ff-4e0c-49dc-96e1-c062e4e3ae83.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<hr />
<p>In Parts 1 and 2, you learned why causality matters and how to build causal DAGs. Today, we're climbing to Level 3 of Pearl's Ladder: <strong>Counterfactual Reasoning</strong>.</p>
<p>This is the most powerful form of causal AI—reasoning about alternate realities and answering "what if" questions that standard ML can't touch.</p>
<p>By the end of this article, you'll:</p>
<ul>
<li><p>Understand what counterfactuals are and why they're powerful</p>
</li>
<li><p>Implement counterfactual inference with your DAG</p>
</li>
<li><p>Generate personalized explanations for individual cases</p>
</li>
<li><p>Estimate individual treatment effects</p>
</li>
<li><p>Build "what if" scenario analysis tools</p>
</li>
</ul>
<p>Let's reason about alternate realities.</p>
<h2 id="heading-what-are-counterfactuals"><strong>What Are Counterfactuals?</strong></h2>
<h3 id="heading-the-three-worlds-of-causality"><strong>The Three Worlds of Causality</strong></h3>
<p><strong>Factual World (What happened):</strong></p>
<ul>
<li><p>"I watered this plant heavily"</p>
</li>
<li><p>"The plant developed root rot"</p>
</li>
<li><p>This is observation—what we actually saw</p>
</li>
</ul>
<p><strong>Interventional World (What would happen):</strong></p>
<ul>
<li><p>"If I water the next plant moderately, what happens?"</p>
</li>
<li><p>"Disease probability drops to 20%"</p>
</li>
<li><p>This is prediction—what we expect in the future</p>
</li>
</ul>
<p><strong>Counterfactual World (What would have happened):</strong></p>
<ul>
<li><p>"Would THIS plant be healthy if I had watered it differently?"</p>
</li>
<li><p>"85% probability it would be healthy"</p>
</li>
<li><p>This is retrospection—alternate history for a specific instance</p>
</li>
</ul>
<h3 id="heading-why-counterfactuals-are-special"><strong>Why Counterfactuals Are Special</strong></h3>
<p>Counterfactuals require THREE pieces of information:</p>
<p><strong>1. The causal mechanism</strong> (from your DAG)</p>
<ul>
<li><p>How do variables causally relate?</p>
</li>
<li><p>Watering → Moisture → Pathogen → Disease</p>
</li>
</ul>
<p><strong>2. The specific instance</strong> (observed data)</p>
<ul>
<li><p>This plant had: high watering, high moisture, disease</p>
</li>
<li><p>Its vigor: 0.7, environmental stress: 0.4</p>
</li>
</ul>
<p><strong>3. The alternate action</strong> (the intervention)</p>
<ul>
<li><p>What if watering had been moderate instead?</p>
</li>
<li><p>How would moisture, pathogen, disease differ?</p>
</li>
</ul>
<p>Standard ML only has #2. Intervention (Level 2) has #1 and #3. Only counterfactuals combine all three.</p>
<h3 id="heading-the-counterfactual-formula"><strong>The Counterfactual Formula</strong></h3>
<p><strong>Notation:</strong> P(Y_x' | X=x, Y=y)</p>
<p><strong>Read as:</strong> "Probability of outcome Y under intervention x', given we observed X=x and Y=y in reality"</p>
<p><strong>Example:</strong></p>
<ul>
<li><p>P(Healthy | do(Watering=moderate), Watering=heavy, Diseased)</p>
</li>
<li><p>"Would plant be healthy with moderate watering, given we observed heavy watering and disease?"</p>
</li>
</ul>
<p>This is fundamentally different from:</p>
<ul>
<li><p>P(Healthy | Watering=moderate) — observational (correlation)</p>
</li>
<li><p>P(Healthy | do(Watering=moderate)) — interventional (average effect)</p>
</li>
</ul>
<p>Counterfactuals condition on BOTH the alternate intervention AND the factual observation.</p>
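<p>The three query types can be made concrete with a toy simulation. The sketch below uses a deliberately simple structural causal model with made-up numbers (not this series' plant DAG): a confounder <code>stress</code> drives both watering and moisture, so the observational and interventional probabilities disagree, while the counterfactual reuses one specific plant's recovered noise:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SCM (hypothetical numbers, for illustration only):
#   stress  ~ U(0, 1)                      (confounder)
#   water   = 1 if stress > 0.5 else 0     (stressed plants get watered heavily)
#   moist   = 6 + 10*stress + 5*water + u_m,  with u_m ~ N(0, 1)
#   disease = 1 if moist > 14 else 0

def simulate(n, do_water=None):
    stress = rng.uniform(0, 1, n)
    water = (stress > 0.5).astype(int) if do_water is None else np.full(n, do_water)
    u_m = rng.normal(0, 1, n)
    moist = 6 + 10 * stress + 5 * water + u_m
    return water, (moist > 14).astype(int)

# Rung 1 (observational): P(disease | water=1), confounded by stress
water, disease = simulate(100_000)
p_obs = disease[water == 1].mean()

# Rung 2 (interventional): P(disease | do(water=1)), averaged over all stress levels
_, disease_do = simulate(100_000, do_water=1)
p_do = disease_do.mean()

# Rung 3 (counterfactual): one specific plant, same noise, different action
stress_i, u_i = 0.9, 0.3                      # abduction: latents for THIS plant
moist_cf = 6 + 10 * stress_i + 5 * 0 + u_i    # action: do(water=0), keep u_i
disease_cf = int(moist_cf > 14)               # prediction: still diseased?

print(f"P(disease | water=1)     ~ {p_obs:.2f}")   # inflated by confounding
print(f"P(disease | do(water=1)) ~ {p_do:.2f}")    # true causal effect
print(f"This plant under do(water=0): disease = {disease_cf}")
```

<p>Because stressed plants are both watered heavily and prone to disease, rung 1 overstates the effect of watering; rung 2 removes the confounding; rung 3 answers the question for one individual plant, which here stays diseased even without heavy watering.</p>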
<hr />
<h2 id="heading-implementing-counterfactual-inference"><strong>Implementing Counterfactual Inference</strong></h2>
<h3 id="heading-the-three-steps-of-counterfactual-analysis"><strong>The Three Steps of Counterfactual Analysis</strong></h3>

<p><strong>Step 1: Abduction</strong> — Infer latent variables from observations</p>
<p><strong>Step 2: Action</strong> — Modify the model according to the intervention</p>
<p><strong>Step 3: Prediction</strong> — Compute the counterfactual outcome</p>
<p>Let's implement this with our plant disease DAG:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> dowhy <span class="hljs-keyword">import</span> CausalModel

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">CounterfactualEngine</span>:</span>
    <span class="hljs-string">"""
    Counterfactual reasoning engine for plant disease diagnosis.
    """</span>

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">__init__</span>(<span class="hljs-params">self, causal_dag, data</span>):</span>
        self.dag = causal_dag
        self.data = data
        self.model = self._build_model()

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_build_model</span>(<span class="hljs-params">self</span>):</span>
        <span class="hljs-string">"""Build the causal model from DAG."""</span>
        <span class="hljs-keyword">return</span> CausalModel(
            data=self.data,
            treatment=<span class="hljs-string">'leaf_moisture_hours'</span>,
            outcome=<span class="hljs-string">'symptom_severity'</span>,
            graph=self.dag,
            common_causes=[<span class="hljs-string">'environmental_stress'</span>, <span class="hljs-string">'watering_practice'</span>],
            effect_modifiers=[<span class="hljs-string">'plant_vigor'</span>]
        )

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">abduction</span>(<span class="hljs-params">self, sample_idx</span>):</span>
        <span class="hljs-string">"""
        Step 1: Infer latent variables from observed data.
        Given what we observed, what are the unobserved factors?
        """</span>
        observed = self.data.iloc[sample_idx]

        <span class="hljs-comment"># Infer noise terms (unobserved confounders)</span>
        <span class="hljs-comment"># These capture the instance-specific factors</span>
        noise_terms = {
            <span class="hljs-string">'u_moisture'</span>: self._infer_moisture_noise(observed),
            <span class="hljs-string">'u_pathogen'</span>: self._infer_pathogen_noise(observed),
            <span class="hljs-string">'u_disease'</span>: self._infer_disease_noise(observed),
            <span class="hljs-string">'u_severity'</span>: self._infer_severity_noise(observed)
        }

        <span class="hljs-keyword">return</span> observed, noise_terms

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_infer_moisture_noise</span>(<span class="hljs-params">self, obs</span>):</span>
        <span class="hljs-string">"""Infer moisture noise from observation."""</span>
        <span class="hljs-comment"># Expected moisture given inputs</span>
        expected = <span class="hljs-number">5.0</span> + obs[<span class="hljs-string">'environmental_stress'</span>] * <span class="hljs-number">10</span>
        <span class="hljs-keyword">if</span> obs[<span class="hljs-string">'watering_practice'</span>] == <span class="hljs-number">0</span>:
            expected -= <span class="hljs-number">3</span>
        <span class="hljs-keyword">elif</span> obs[<span class="hljs-string">'watering_practice'</span>] == <span class="hljs-number">2</span>:
            expected += <span class="hljs-number">5</span>

        <span class="hljs-comment"># Noise is observed - expected</span>
        <span class="hljs-keyword">return</span> obs[<span class="hljs-string">'leaf_moisture_hours'</span>] - expected

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_infer_pathogen_noise</span>(<span class="hljs-params">self, obs</span>):</span>
        <span class="hljs-string">"""Infer pathogen growth noise."""</span>
        expected = (obs[<span class="hljs-string">'leaf_moisture_hours'</span>] / <span class="hljs-number">24</span>) ** <span class="hljs-number">1.5</span>
        <span class="hljs-keyword">return</span> obs[<span class="hljs-string">'pathogen_growth'</span>] - expected

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_infer_disease_noise</span>(<span class="hljs-params">self, obs</span>):</span>
        <span class="hljs-string">"""Infer disease threshold noise."""</span>
        <span class="hljs-comment"># Deterministic threshold model: no separate noise term, so record the observed indicator</span>
        <span class="hljs-keyword">return</span> obs[<span class="hljs-string">'disease_present'</span>]

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_infer_severity_noise</span>(<span class="hljs-params">self, obs</span>):</span>
        <span class="hljs-string">"""Infer symptom severity noise."""</span>
        <span class="hljs-keyword">if</span> obs[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">0</span>:
            <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>
        expected = obs[<span class="hljs-string">'disease_present'</span>] * (<span class="hljs-number">1</span> - obs[<span class="hljs-string">'plant_vigor'</span>] * <span class="hljs-number">0.5</span>)
        <span class="hljs-keyword">return</span> obs[<span class="hljs-string">'symptom_severity'</span>] - expected

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">action</span>(<span class="hljs-params">self, observed, noise_terms, intervention</span>):</span>
        <span class="hljs-string">"""
        Step 2: Modify model according to intervention.
        Set the treatment variable to counterfactual value.
        """</span>
        <span class="hljs-comment"># Create counterfactual data point</span>
        cf_data = observed.copy()

        <span class="hljs-comment"># Apply intervention (break incoming edges)</span>
        <span class="hljs-keyword">for</span> var, value <span class="hljs-keyword">in</span> intervention.items():
            cf_data[var] = value

        <span class="hljs-keyword">return</span> cf_data, noise_terms

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">prediction</span>(<span class="hljs-params">self, cf_data, noise_terms</span>):</span>
        <span class="hljs-string">"""
        Step 3: Compute counterfactual outcome.
        Propagate intervention through causal graph.
        """</span>
        <span class="hljs-comment"># Re-compute downstream variables with intervention</span>

        <span class="hljs-comment"># Leaf moisture (intervened, so use counterfactual value)</span>
        cf_moisture = cf_data[<span class="hljs-string">'leaf_moisture_hours'</span>]

        <span class="hljs-comment"># Pathogen growth (function of new moisture + same noise)</span>
        cf_pathogen = (cf_moisture / <span class="hljs-number">24</span>) ** <span class="hljs-number">1.5</span> + noise_terms[<span class="hljs-string">'u_pathogen'</span>]
        cf_pathogen = np.clip(cf_pathogen, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>)

        <span class="hljs-comment"># Disease (function of new pathogen + same noise threshold)</span>
        cf_disease = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> cf_pathogen &gt; <span class="hljs-number">0.6</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

        <span class="hljs-comment"># Symptom severity (function of new disease + same plant vigor + same noise)</span>
        cf_severity = cf_disease * (<span class="hljs-number">1</span> - cf_data[<span class="hljs-string">'plant_vigor'</span>] * <span class="hljs-number">0.5</span>) + noise_terms[<span class="hljs-string">'u_severity'</span>]
        cf_severity = np.clip(cf_severity, <span class="hljs-number">0</span>, <span class="hljs-number">1</span>)

        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'leaf_moisture_hours'</span>: cf_moisture,
            <span class="hljs-string">'pathogen_growth'</span>: cf_pathogen,
            <span class="hljs-string">'disease_present'</span>: cf_disease,
            <span class="hljs-string">'symptom_severity'</span>: cf_severity
        }

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">counterfactual</span>(<span class="hljs-params">self, sample_idx, intervention</span>):</span>
        <span class="hljs-string">"""
        Complete counterfactual analysis.

        Args:
            sample_idx: Index of observed instance
            intervention: Dict of {variable: counterfactual_value}

        Returns:
            Dict with factual, counterfactual, and effect
        """</span>
        <span class="hljs-comment"># Step 1: Abduction</span>
        observed, noise_terms = self.abduction(sample_idx)

        <span class="hljs-comment"># Step 2: Action</span>
        cf_data, noise_terms = self.action(observed, noise_terms, intervention)

        <span class="hljs-comment"># Step 3: Prediction</span>
        cf_outcome = self.prediction(cf_data, noise_terms)

        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">'factual'</span>: {
                <span class="hljs-string">'leaf_moisture_hours'</span>: observed[<span class="hljs-string">'leaf_moisture_hours'</span>],
                <span class="hljs-string">'pathogen_growth'</span>: observed[<span class="hljs-string">'pathogen_growth'</span>],
                <span class="hljs-string">'disease_present'</span>: observed[<span class="hljs-string">'disease_present'</span>],
                <span class="hljs-string">'symptom_severity'</span>: observed[<span class="hljs-string">'symptom_severity'</span>]
            },
            <span class="hljs-string">'counterfactual'</span>: cf_outcome,
            <span class="hljs-string">'individual_effect'</span>: {
                <span class="hljs-string">'disease_change'</span>: cf_outcome[<span class="hljs-string">'disease_present'</span>] - observed[<span class="hljs-string">'disease_present'</span>],
                <span class="hljs-string">'severity_change'</span>: cf_outcome[<span class="hljs-string">'symptom_severity'</span>] - observed[<span class="hljs-string">'symptom_severity'</span>]
            },
            <span class="hljs-string">'explanation'</span>: self._generate_explanation(observed, cf_outcome)
        }

    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_generate_explanation</span>(<span class="hljs-params">self, factual, counterfactual</span>):</span>
        <span class="hljs-string">"""Generate natural language explanation of counterfactual."""</span>
        explanation = []

        <span class="hljs-comment"># Compare factual vs counterfactual</span>
        <span class="hljs-keyword">if</span> factual[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">1</span> <span class="hljs-keyword">and</span> counterfactual[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">0</span>:
            explanation.append(
                <span class="hljs-string">f"With moderate watering (reducing moisture from <span class="hljs-subst">{factual[<span class="hljs-string">'leaf_moisture_hours'</span>]:<span class="hljs-number">.1</span>f}</span> to "</span>
                <span class="hljs-string">f"<span class="hljs-subst">{counterfactual[<span class="hljs-string">'leaf_moisture_hours'</span>]:<span class="hljs-number">.1</span>f}</span> hours), this plant would have avoided disease."</span>
            )
        <span class="hljs-keyword">elif</span> factual[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">0</span> <span class="hljs-keyword">and</span> counterfactual[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">1</span>:
            explanation.append(
                <span class="hljs-string">f"If watering had been excessive (increasing moisture to "</span>
                <span class="hljs-string">f"<span class="hljs-subst">{counterfactual[<span class="hljs-string">'leaf_moisture_hours'</span>]:<span class="hljs-number">.1</span>f}</span> hours), this plant would have developed disease."</span>
            )
        <span class="hljs-keyword">else</span>:
            explanation.append(
                <span class="hljs-string">f"Disease status would remain unchanged, but symptom severity would change from "</span>
                <span class="hljs-string">f"<span class="hljs-subst">{factual[<span class="hljs-string">'symptom_severity'</span>]:<span class="hljs-number">.2</span>f}</span> to <span class="hljs-subst">{counterfactual[<span class="hljs-string">'symptom_severity'</span>]:<span class="hljs-number">.2</span>f}</span>."</span>
            )

        <span class="hljs-comment"># Add mechanism</span>
        explanation.append(
            <span class="hljs-string">f"Mechanism: Moisture affects pathogen growth (<span class="hljs-subst">{factual[<span class="hljs-string">'pathogen_growth'</span>]:<span class="hljs-number">.2</span>f}</span> → "</span>
            <span class="hljs-string">f"<span class="hljs-subst">{counterfactual[<span class="hljs-string">'pathogen_growth'</span>]:<span class="hljs-number">.2</span>f}</span>), which determines disease presence."</span>
        )

        <span class="hljs-keyword">return</span> <span class="hljs-string">" "</span>.join(explanation)


<span class="hljs-comment"># Usage Example</span>
<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    <span class="hljs-comment"># Load data and DAG (from Part 2)</span>
    <span class="hljs-keyword">from</span> part2_causal_dag <span class="hljs-keyword">import</span> generate_causal_data, causal_graph

    data = generate_causal_data(n_samples=<span class="hljs-number">1000</span>)
    cf_engine = CounterfactualEngine(causal_graph, data)

    <span class="hljs-comment"># Find a diseased plant</span>
    diseased_idx = data[data[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">1</span>].index[<span class="hljs-number">0</span>]

    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"COUNTERFACTUAL ANALYSIS"</span>)
    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)

    print(<span class="hljs-string">f"\nAnalyzing Plant #<span class="hljs-subst">{diseased_idx}</span>"</span>)
    print(<span class="hljs-string">f"Factual: Watering = <span class="hljs-subst">{data.loc[diseased_idx, <span class="hljs-string">'watering_practice'</span>]}</span>"</span>)
    print(<span class="hljs-string">f"         Moisture = <span class="hljs-subst">{data.loc[diseased_idx, <span class="hljs-string">'leaf_moisture_hours'</span>]:<span class="hljs-number">.1</span>f}</span> hours"</span>)
    print(<span class="hljs-string">f"         Disease = <span class="hljs-subst">{bool(data.loc[diseased_idx, <span class="hljs-string">'disease_present'</span>])}</span>"</span>)
    print(<span class="hljs-string">f"         Severity = <span class="hljs-subst">{data.loc[diseased_idx, <span class="hljs-string">'symptom_severity'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)

    <span class="hljs-comment"># Counterfactual: What if watering was optimal?</span>
    intervention = {<span class="hljs-string">'leaf_moisture_hours'</span>: <span class="hljs-number">6.0</span>}  <span class="hljs-comment"># Optimal moisture</span>

    result = cf_engine.counterfactual(diseased_idx, intervention)

    print(<span class="hljs-string">f"\nCounterfactual: Watering = optimal"</span>)
    print(<span class="hljs-string">f"                Moisture = <span class="hljs-subst">{result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'leaf_moisture_hours'</span>]:<span class="hljs-number">.1</span>f}</span> hours"</span>)
    print(<span class="hljs-string">f"                Disease = <span class="hljs-subst">{bool(result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'disease_present'</span>])}</span>"</span>)
    print(<span class="hljs-string">f"                Severity = <span class="hljs-subst">{result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)

    print(<span class="hljs-string">f"\nIndividual Treatment Effect:"</span>)
    print(<span class="hljs-string">f"  Disease change: <span class="hljs-subst">{result[<span class="hljs-string">'individual_effect'</span>][<span class="hljs-string">'disease_change'</span>]}</span>"</span>)
    print(<span class="hljs-string">f"  Severity change: <span class="hljs-subst">{result[<span class="hljs-string">'individual_effect'</span>][<span class="hljs-string">'severity_change'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)

    print(<span class="hljs-string">f"\nExplanation:"</span>)
    print(<span class="hljs-string">f"  <span class="hljs-subst">{result[<span class="hljs-string">'explanation'</span>]}</span>"</span>)
</code></pre>
<h3 id="heading-output-example"><strong>Output Example</strong></h3>
<pre><code class="lang-plaintext">============================================================
COUNTERFACTUAL ANALYSIS
============================================================

Analyzing Plant #42
Factual: Watering = 2 (overwatered)
         Moisture = 18.3 hours
         Disease = True
         Severity = 0.73

Counterfactual: Watering = optimal
                Moisture = 6.0 hours
                Disease = False
                Severity = 0.00

Individual Treatment Effect:
  Disease change: -1
  Severity change: -0.73

Explanation:
  With moderate watering (reducing moisture from 18.3 to 6.0 hours), 
  this plant would have avoided disease. Mechanism: Moisture affects 
  pathogen growth (0.82 → 0.31), which determines disease presence.
</code></pre>
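<p>The individual treatment effect in the output above generalizes beyond one plant: the same abduction, action, and prediction steps can be vectorized to score every unit at once. Here is a minimal sketch on a toy linear model (hypothetical equations, not the engine above), where abduction recovers each plant's noise exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

# Toy structural model (hypothetical): moisture = 8 + 6*heavy + u,
# and disease occurs when moisture exceeds 12.
heavy = rng.integers(0, 2, n)        # factual treatment per plant (0/1)
u = rng.normal(0, 2, n)              # latent, plant-specific noise
moist = 8 + 6 * heavy + u
disease = (moist > 12).astype(int)   # factual outcome

# Abduction: with a known linear equation, the noise is recovered directly
u_hat = moist - 8 - 6 * heavy

# Action + prediction: replay each plant under both arms with ITS own noise
y_heavy = ((8 + 6 * 1 + u_hat) > 12).astype(int)   # do(heavy=1)
y_light = ((8 + 6 * 0 + u_hat) > 12).astype(int)   # do(heavy=0)

ite = y_heavy - y_light   # per-plant effect of heavy watering on disease
print("ITE per plant:", ite)
print("Average effect:", ite.mean())
```

<p>Averaging <code>ite</code> over a population recovers the average treatment effect, while the per-plant values expose heterogeneity: plants whose noise already pushes moisture past the threshold are diseased under either action, so heavy watering changes nothing for them.</p>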
<hr />
<h2 id="heading-applications-of-counterfactual-reasoning"><strong>Applications of Counterfactual Reasoning</strong></h2>
<h3 id="heading-1-personalized-recommendations"><strong>1. Personalized Recommendations</strong></h3>
<p><strong>Standard ML:</strong> "Plants with disease X should receive treatment Y" (average effect)</p>
<p><strong>Counterfactual AI:</strong> "THIS plant would benefit most from intervention Z" (personalized)</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">recommend_intervention</span>(<span class="hljs-params">cf_engine, plant_idx</span>):</span>
    <span class="hljs-string">"""
    Find optimal intervention for specific plant.
    """</span>
    <span class="hljs-comment"># Test multiple interventions</span>
    interventions = {
        <span class="hljs-string">'reduce_watering'</span>: {<span class="hljs-string">'leaf_moisture_hours'</span>: <span class="hljs-number">5.0</span>},
        <span class="hljs-string">'moderate_watering'</span>: {<span class="hljs-string">'leaf_moisture_hours'</span>: <span class="hljs-number">8.0</span>},
        <span class="hljs-string">'increase_watering'</span>: {<span class="hljs-string">'leaf_moisture_hours'</span>: <span class="hljs-number">12.0</span>}
    }

    results = {}
    <span class="hljs-keyword">for</span> name, intervention <span class="hljs-keyword">in</span> interventions.items():
        result = cf_engine.counterfactual(plant_idx, intervention)
        results[name] = result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>]

    <span class="hljs-comment"># Find best intervention</span>
    best = min(results.items(), key=<span class="hljs-keyword">lambda</span> x: x[<span class="hljs-number">1</span>])

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'recommendation'</span>: best[<span class="hljs-number">0</span>],
        <span class="hljs-string">'expected_severity'</span>: best[<span class="hljs-number">1</span>],
        <span class="hljs-string">'all_options'</span>: results
    }

<span class="hljs-comment"># Example usage</span>
plant_idx = <span class="hljs-number">42</span>
recommendation = recommend_intervention(cf_engine, plant_idx)

print(<span class="hljs-string">f"Optimal intervention for Plant #<span class="hljs-subst">{plant_idx}</span>:"</span>)
print(<span class="hljs-string">f"  <span class="hljs-subst">{recommendation[<span class="hljs-string">'recommendation'</span>]}</span>"</span>)
print(<span class="hljs-string">f"  Expected severity: <span class="hljs-subst">{recommendation[<span class="hljs-string">'expected_severity'</span>]:<span class="hljs-number">.2</span>f}</span>"</span>)
print(<span class="hljs-string">f"\nAll options:"</span>)
<span class="hljs-keyword">for</span> intervention, severity <span class="hljs-keyword">in</span> recommendation[<span class="hljs-string">'all_options'</span>].items():
    print(<span class="hljs-string">f"  <span class="hljs-subst">{intervention}</span>: <span class="hljs-subst">{severity:<span class="hljs-number">.2</span>f}</span>"</span>)
</code></pre>
<h3 id="heading-2-explanation-generation"><strong>2. Explanation Generation</strong></h3>
<p><strong>Why did this plant get diseased?</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">explain_disease</span>(<span class="hljs-params">cf_engine, diseased_idx, healthy_idx</span>):</span>
    <span class="hljs-string">"""
    Explain why one plant got diseased and another didn't.
    """</span>
    diseased = cf_engine.data.iloc[diseased_idx]
    healthy = cf_engine.data.iloc[healthy_idx]

    <span class="hljs-comment"># Compare key differences</span>
    differences = []

    <span class="hljs-keyword">if</span> diseased[<span class="hljs-string">'watering_practice'</span>] != healthy[<span class="hljs-string">'watering_practice'</span>]:
        differences.append(
            <span class="hljs-string">f"Watering: Plant #<span class="hljs-subst">{diseased_idx}</span> was watered differently "</span>
            <span class="hljs-string">f"(<span class="hljs-subst">{diseased[<span class="hljs-string">'watering_practice'</span>]}</span> vs <span class="hljs-subst">{healthy[<span class="hljs-string">'watering_practice'</span>]}</span>)"</span>
        )

    <span class="hljs-keyword">if</span> abs(diseased[<span class="hljs-string">'plant_vigor'</span>] - healthy[<span class="hljs-string">'plant_vigor'</span>]) &gt; <span class="hljs-number">0.2</span>:
        differences.append(
            <span class="hljs-string">f"Vigor: Plant #<span class="hljs-subst">{diseased_idx}</span> had <span class="hljs-subst">{<span class="hljs-string">'lower'</span> <span class="hljs-keyword">if</span> diseased[<span class="hljs-string">'plant_vigor'</span>] &lt; healthy[<span class="hljs-string">'plant_vigor'</span>] <span class="hljs-keyword">else</span> <span class="hljs-string">'higher'</span>}</span> vigor "</span>
            <span class="hljs-string">f"(<span class="hljs-subst">{diseased[<span class="hljs-string">'plant_vigor'</span>]:<span class="hljs-number">.2</span>f}</span> vs <span class="hljs-subst">{healthy[<span class="hljs-string">'plant_vigor'</span>]:<span class="hljs-number">.2</span>f}</span>)"</span>
        )

    <span class="hljs-comment"># Counterfactual: Would diseased plant be healthy with healthy plant's watering?</span>
    intervention = {<span class="hljs-string">'leaf_moisture_hours'</span>: healthy[<span class="hljs-string">'leaf_moisture_hours'</span>]}
    cf_result = cf_engine.counterfactual(diseased_idx, intervention)

    <span class="hljs-keyword">if</span> cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">0</span>:
        differences.append(
            <span class="hljs-string">f"CRITICAL: If Plant #<span class="hljs-subst">{diseased_idx}</span> had received the same watering as "</span>
            <span class="hljs-string">f"Plant #<span class="hljs-subst">{healthy_idx}</span>, it would have remained healthy."</span>
        )

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'differences'</span>: differences,
        <span class="hljs-string">'counterfactual'</span>: cf_result,
        <span class="hljs-string">'root_cause'</span>: <span class="hljs-string">'watering_practice'</span> <span class="hljs-keyword">if</span> cf_result[<span class="hljs-string">'individual_effect'</span>][<span class="hljs-string">'disease_change'</span>] &lt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'plant_vigor'</span>
    }

<span class="hljs-comment"># Usage</span>
diseased_plant = data[data[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">1</span>].index[<span class="hljs-number">0</span>]
healthy_plant = data[data[<span class="hljs-string">'disease_present'</span>] == <span class="hljs-number">0</span>].index[<span class="hljs-number">0</span>]

explanation = explain_disease(cf_engine, diseased_plant, healthy_plant)

print(<span class="hljs-string">f"Why did Plant #<span class="hljs-subst">{diseased_plant}</span> get diseased?"</span>)
<span class="hljs-keyword">for</span> diff <span class="hljs-keyword">in</span> explanation[<span class="hljs-string">'differences'</span>]:
    print(<span class="hljs-string">f"  • <span class="hljs-subst">{diff}</span>"</span>)
print(<span class="hljs-string">f"\nRoot cause: <span class="hljs-subst">{explanation[<span class="hljs-string">'root_cause'</span>]}</span>"</span>)
</code></pre>
<h3 id="heading-3-regret-analysis"><strong>3. Regret Analysis</strong></h3>
<p><strong>What should I have done differently?</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">regret_analysis</span>(<span class="hljs-params">cf_engine, sample_idx</span>):</span>
    <span class="hljs-string">"""
    Analyze what optimal action would have been.
    """</span>
    actual = cf_engine.data.iloc[sample_idx]

    <span class="hljs-comment"># Test all possible watering practices</span>
    watering_options = [<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>]  <span class="hljs-comment"># under, optimal, over</span>

    results = {}
    <span class="hljs-keyword">for</span> watering <span class="hljs-keyword">in</span> watering_options:
        <span class="hljs-comment"># Compute expected moisture for this watering</span>
        expected_moisture = <span class="hljs-number">5.0</span> + actual[<span class="hljs-string">'environmental_stress'</span>] * <span class="hljs-number">10</span>
        <span class="hljs-keyword">if</span> watering == <span class="hljs-number">0</span>:
            expected_moisture -= <span class="hljs-number">3</span>
        <span class="hljs-keyword">elif</span> watering == <span class="hljs-number">2</span>:
            expected_moisture += <span class="hljs-number">5</span>

        intervention = {<span class="hljs-string">'leaf_moisture_hours'</span>: max(<span class="hljs-number">0</span>, min(<span class="hljs-number">24</span>, expected_moisture))}
        cf_result = cf_engine.counterfactual(sample_idx, intervention)

        results[watering] = {
            <span class="hljs-string">'disease'</span>: cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'disease_present'</span>],
            <span class="hljs-string">'severity'</span>: cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>]
        }

    <span class="hljs-comment"># Find optimal action</span>
    optimal = min(results.items(), key=<span class="hljs-keyword">lambda</span> x: (x[<span class="hljs-number">1</span>][<span class="hljs-string">'disease'</span>], x[<span class="hljs-number">1</span>][<span class="hljs-string">'severity'</span>]))

    actual_watering = int(actual[<span class="hljs-string">'watering_practice'</span>])  <span class="hljs-comment"># cast: mixed-dtype rows upcast ints to float, which would break list indexing below</span>
    regret = {
        <span class="hljs-string">'optimal_action'</span>: optimal[<span class="hljs-number">0</span>],
        <span class="hljs-string">'actual_action'</span>: actual_watering,
        <span class="hljs-string">'regret'</span>: results[actual_watering][<span class="hljs-string">'severity'</span>] - optimal[<span class="hljs-number">1</span>][<span class="hljs-string">'severity'</span>]
    }

    <span class="hljs-keyword">return</span> regret

<span class="hljs-comment"># Usage</span>
plant_idx = <span class="hljs-number">42</span>
regret = regret_analysis(cf_engine, plant_idx)

print(<span class="hljs-string">f"Regret Analysis for Plant #<span class="hljs-subst">{plant_idx}</span>:"</span>)
print(<span class="hljs-string">f"  Actual action: <span class="hljs-subst">{[<span class="hljs-string">'under'</span>, <span class="hljs-string">'optimal'</span>, <span class="hljs-string">'over'</span>][regret[<span class="hljs-string">'actual_action'</span>]]}</span> watering"</span>)
print(<span class="hljs-string">f"  Optimal action: <span class="hljs-subst">{[<span class="hljs-string">'under'</span>, <span class="hljs-string">'optimal'</span>, <span class="hljs-string">'over'</span>][regret[<span class="hljs-string">'optimal_action'</span>]]}</span> watering"</span>)
print(<span class="hljs-string">f"  Regret: <span class="hljs-subst">{regret[<span class="hljs-string">'regret'</span>]:<span class="hljs-number">.2</span>f}</span> severity points"</span>)

<span class="hljs-keyword">if</span> regret[<span class="hljs-string">'regret'</span>] &gt; <span class="hljs-number">0.1</span>:
    print(<span class="hljs-string">f"  ⚠️  Significant regret! Better watering would have reduced severity substantially."</span>)
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">f"  ✓ Action was near-optimal."</span>)
</code></pre>
<h3 id="heading-4-policy-evaluation"><strong>4. Policy Evaluation</strong></h3>
<p><strong>Was our intervention strategy effective?</strong></p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">evaluate_policy</span>(<span class="hljs-params">cf_engine, treated_indices, control_indices</span>):</span>
    <span class="hljs-string">"""
    Evaluate treatment effect using counterfactual reasoning.
    """</span>
    <span class="hljs-comment"># For treated group: What if they hadn't been treated?</span>
    treated_effects = []
    <span class="hljs-keyword">for</span> idx <span class="hljs-keyword">in</span> treated_indices:
        <span class="hljs-comment"># Assume treatment was reducing moisture</span>
        cf_result = cf_engine.counterfactual(
            idx, 
            {<span class="hljs-string">'leaf_moisture_hours'</span>: cf_engine.data.loc[idx, <span class="hljs-string">'leaf_moisture_hours'</span>] + <span class="hljs-number">5.0</span>}
        )
        treated_effects.append(cf_result[<span class="hljs-string">'individual_effect'</span>][<span class="hljs-string">'severity_change'</span>])

    <span class="hljs-comment"># For control group: What if they had been treated?</span>
    control_effects = []
    <span class="hljs-keyword">for</span> idx <span class="hljs-keyword">in</span> control_indices:
        cf_result = cf_engine.counterfactual(
            idx,
            {<span class="hljs-string">'leaf_moisture_hours'</span>: max(<span class="hljs-number">0</span>, cf_engine.data.loc[idx, <span class="hljs-string">'leaf_moisture_hours'</span>] - <span class="hljs-number">5.0</span>)}
        )
        control_effects.append(-cf_result[<span class="hljs-string">'individual_effect'</span>][<span class="hljs-string">'severity_change'</span>])

    <span class="hljs-comment"># Overall treatment effect</span>
    ate = np.mean(treated_effects + control_effects)

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'average_treatment_effect'</span>: ate,
        <span class="hljs-string">'treated_effect'</span>: np.mean(treated_effects),
        <span class="hljs-string">'control_effect'</span>: np.mean(control_effects),
        <span class="hljs-string">'heterogeneity'</span>: np.std(treated_effects + control_effects)
    }
</code></pre>
<hr />
<h2 id="heading-individual-treatment-effects-ite"><strong>Individual Treatment Effects (ITE)</strong></h2>
<h3 id="heading-beyond-average-treatment-effects"><strong>Beyond Average Treatment Effects</strong></h3>
<p><strong>Average Treatment Effect (ATE):</strong> What's the effect on average?</p>
<ul>
<li>"Reducing watering decreases disease by 15% on average"</li>
</ul>
<p><strong>Individual Treatment Effect (ITE):</strong> What's the effect for THIS individual?</p>
<ul>
<li><p>"For Plant #42, reducing watering would decrease disease by 85%"</p>
</li>
<li><p>"For Plant #17, reducing watering would have no effect"</p>
</li>
</ul>
<h3 id="heading-why-ite-matters"><strong>Why ITE Matters</strong></h3>
<p><strong>Precision medicine/agriculture:</strong></p>
<ul>
<li><p>Not everyone responds the same way</p>
</li>
<li><p>Treatment X might help person A but harm person B</p>
</li>
<li><p>Counterfactuals let us estimate personalized effects</p>
</li>
</ul>
<h3 id="heading-computing-ite"><strong>Computing ITE</strong></h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compute_ite</span>(<span class="hljs-params">cf_engine, sample_idx, treatment_var, treatment_value</span>):</span>
    <span class="hljs-string">"""
    Compute Individual Treatment Effect.

    ITE = Y_1 - Y_0
    where Y_1 is outcome under treatment, Y_0 is outcome under control
    """</span>
    <span class="hljs-comment"># Factual outcome (what actually happened)</span>
    factual = cf_engine.data.iloc[sample_idx]

    <span class="hljs-comment"># Counterfactual outcome (what would happen under treatment)</span>
    intervention = {treatment_var: treatment_value}
    cf_result = cf_engine.counterfactual(sample_idx, intervention)

    ite = cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>] - factual[<span class="hljs-string">'symptom_severity'</span>]

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'ite'</span>: ite,
        <span class="hljs-string">'factual_outcome'</span>: factual[<span class="hljs-string">'symptom_severity'</span>],
        <span class="hljs-string">'counterfactual_outcome'</span>: cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>],
        <span class="hljs-string">'would_benefit'</span>: ite &lt; <span class="hljs-number">-0.1</span>,  <span class="hljs-comment"># severity reduction of at least 0.1 (~10% on the 0-1 scale)</span>
        <span class="hljs-string">'confidence'</span>: <span class="hljs-string">'high'</span> <span class="hljs-keyword">if</span> abs(cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'pathogen_growth'</span>] - factual[<span class="hljs-string">'pathogen_growth'</span>]) &gt; <span class="hljs-number">0.2</span> <span class="hljs-keyword">else</span> <span class="hljs-string">'low'</span>
    }

<span class="hljs-comment"># Usage: Estimate ITE for multiple plants</span>
ite_results = []
<span class="hljs-keyword">for</span> idx <span class="hljs-keyword">in</span> range(<span class="hljs-number">100</span>):
    ite = compute_ite(cf_engine, idx, <span class="hljs-string">'leaf_moisture_hours'</span>, <span class="hljs-number">6.0</span>)
    ite_results.append({
        <span class="hljs-string">'plant_idx'</span>: idx,
        <span class="hljs-string">'ite'</span>: ite[<span class="hljs-string">'ite'</span>],
        <span class="hljs-string">'would_benefit'</span>: ite[<span class="hljs-string">'would_benefit'</span>]
    })

ite_df = pd.DataFrame(ite_results)

print(<span class="hljs-string">"Individual Treatment Effect Distribution:"</span>)
print(<span class="hljs-string">f"  Mean ITE: <span class="hljs-subst">{ite_df[<span class="hljs-string">'ite'</span>].mean():<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"  Std ITE: <span class="hljs-subst">{ite_df[<span class="hljs-string">'ite'</span>].std():<span class="hljs-number">.3</span>f}</span>"</span>)
print(<span class="hljs-string">f"  % who would benefit: <span class="hljs-subst">{ite_df[<span class="hljs-string">'would_benefit'</span>].mean():<span class="hljs-number">.1</span>%}</span>"</span>)

<span class="hljs-comment"># Identify who benefits most</span>
top_beneficiaries = ite_df.nsmallest(<span class="hljs-number">10</span>, <span class="hljs-string">'ite'</span>)
print(<span class="hljs-string">f"\nTop 10 beneficiaries from treatment:"</span>)
print(top_beneficiaries)
</code></pre>
<hr />
<h2 id="heading-counterfactual-fairness"><strong>Counterfactual Fairness</strong></h2>
<h3 id="heading-ensuring-fair-ai-decisions"><strong>Ensuring Fair AI Decisions</strong></h3>
<p><strong>The problem:</strong> ML models can discriminate based on protected attributes</p>
<p><strong>Counterfactual fairness:</strong> "Would the decision be the same if the person had a different protected attribute?"</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_counterfactual_fairness</span>(<span class="hljs-params">cf_engine, sample_idx, protected_attr, alt_value</span>):</span>
    <span class="hljs-string">"""
    Check if decision would change with different protected attribute.
    """</span>
    <span class="hljs-comment"># Factual decision</span>
    factual = cf_engine.data.iloc[sample_idx]
    factual_decision = <span class="hljs-string">"treat"</span> <span class="hljs-keyword">if</span> factual[<span class="hljs-string">'symptom_severity'</span>] &gt; <span class="hljs-number">0.5</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"monitor"</span>

    <span class="hljs-comment"># Counterfactual decision (with different protected attribute)</span>
    intervention = {protected_attr: alt_value}
    cf_result = cf_engine.counterfactual(sample_idx, intervention)
    cf_decision = <span class="hljs-string">"treat"</span> <span class="hljs-keyword">if</span> cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>] &gt; <span class="hljs-number">0.5</span> <span class="hljs-keyword">else</span> <span class="hljs-string">"monitor"</span>

    is_fair = factual_decision == cf_decision

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'is_fair'</span>: is_fair,
        <span class="hljs-string">'factual_decision'</span>: factual_decision,
        <span class="hljs-string">'counterfactual_decision'</span>: cf_decision,
        <span class="hljs-string">'protected_attr'</span>: protected_attr,
        <span class="hljs-string">'explanation'</span>: <span class="hljs-string">f"Decision <span class="hljs-subst">{<span class="hljs-string">'would'</span> <span class="hljs-keyword">if</span> is_fair <span class="hljs-keyword">else</span> <span class="hljs-string">'would NOT'</span>}</span> remain the same"</span>
    }
</code></pre>
<hr />
<h2 id="heading-practical-tips-for-counterfactual-reasoning"><strong>Practical Tips for Counterfactual Reasoning</strong></h2>
<h3 id="heading-1-validate-structural-equations"><strong>1. Validate Structural Equations</strong></h3>
<p>Your counterfactuals are only as good as your causal model:</p>
<ul>
<li><p>Test on known interventions</p>
</li>
<li><p>Compare to randomized trials when available</p>
</li>
<li><p>Check if counterfactual predictions match observed data</p>
</li>
</ul>
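<p>A quick way to put the third check into practice: fit each structural equation on a training split and score its predictions on held-out rows before trusting any counterfactual built from it. The sketch below is self-contained and uses synthetic data with an assumed sigmoid moisture-to-growth link, not the engine from this article:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic stand-in for field observations (the sigmoid link is an assumption)
moisture = rng.uniform(0, 24, n)
growth_obs = 1 / (1 + np.exp(-(moisture - 7.0))) + rng.normal(0, 0.05, n)

# Fit a candidate structural equation on a training split...
train_mask = np.arange(n) < 1500
hold_mask = ~train_mask
coef = np.polyfit(moisture[train_mask], growth_obs[train_mask], deg=3)

# ...then score it on held-out rows before trusting counterfactuals built from it
pred = np.polyval(coef, moisture[hold_mask])
rmse = np.sqrt(np.mean((pred - growth_obs[hold_mask]) ** 2))
print(f"held-out RMSE: {rmse:.3f}")
```

<p>If the held-out error is large relative to the outcome's scale, counterfactuals computed from that equation will inherit the error.</p>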
<h3 id="heading-2-handle-uncertainty"><strong>2. Handle Uncertainty</strong></h3>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">counterfactual_with_uncertainty</span>(<span class="hljs-params">cf_engine, sample_idx, intervention, n_samples=<span class="hljs-number">100</span></span>):</span>
    <span class="hljs-string">"""
    Compute counterfactual with uncertainty via bootstrapping.
    """</span>
    results = []

    <span class="hljs-keyword">for</span> _ <span class="hljs-keyword">in</span> range(n_samples):
        <span class="hljs-comment"># Resample exogenous noise each draw (assumes cf_engine.counterfactual is stochastic; a deterministic engine would return identical draws)</span>
        cf_result = cf_engine.counterfactual(sample_idx, intervention)
        results.append(cf_result[<span class="hljs-string">'counterfactual'</span>][<span class="hljs-string">'symptom_severity'</span>])

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">'mean'</span>: np.mean(results),
        <span class="hljs-string">'std'</span>: np.std(results),
        <span class="hljs-string">'ci_lower'</span>: np.percentile(results, <span class="hljs-number">2.5</span>),
        <span class="hljs-string">'ci_upper'</span>: np.percentile(results, <span class="hljs-number">97.5</span>)
    }
</code></pre>
<h3 id="heading-3-combine-with-domain-knowledge"><strong>3. Combine with Domain Knowledge</strong></h3>
<p>The most powerful counterfactuals come from:</p>
<ul>
<li><p>Causal structure (DAG)</p>
</li>
<li><p>Domain expertise (mechanisms)</p>
</li>
<li><p>Data (observations)</p>
</li>
</ul>
<p>Don't rely on any one alone.</p>
<hr />
<h2 id="heading-youve-mastered-counterfactuals"><strong>You've Mastered Counterfactuals</strong></h2>
<p>Congratulations! You now understand:</p>
<p>✅ What counterfactuals are and why they're powerful<br />✅ The three-step process: Abduction → Action → Prediction<br />✅ How to implement counterfactual inference<br />✅ Individual Treatment Effects (ITE)<br />✅ Applications: personalization, explanation, regret analysis<br />✅ Counterfactual fairness</p>
<p><strong>This is Level 3 reasoning.</strong> Most AI can't do this.</p>
<hr />
<h2 id="heading-whats-next-intervention-design"><strong>What's Next: Intervention Design</strong></h2>
<p>In <strong>Part 4</strong> (Wednesday, Jan 22), we move from analysis to action:</p>
<ul>
<li><p>How do we use counterfactuals to design optimal interventions?</p>
</li>
<li><p>What's the best treatment for each individual?</p>
</li>
<li><p>How do we optimize for multiple objectives?</p>
</li>
<li><p>How do we account for costs and constraints?</p>
</li>
</ul>
<p>We'll build a complete intervention recommendation engine that combines everything from Parts 1-3.</p>
<hr />
<h2 id="heading-your-homework"><strong>Your Homework</strong></h2>
<p><strong>1. Implement counterfactual engine</strong></p>
<ul>
<li><p>Use the code from this article</p>
</li>
<li><p>Test on your plant disease data</p>
</li>
<li><p>Generate counterfactual explanations</p>
</li>
</ul>
<p><strong>2. Experiment with interventions</strong></p>
<ul>
<li><p>Try different intervention values</p>
</li>
<li><p>Compare factual vs counterfactual outcomes</p>
</li>
<li><p>Find cases with high regret</p>
</li>
</ul>
<p><strong>3. Think about your domain</strong></p>
<ul>
<li><p>What counterfactual questions would be valuable?</p>
</li>
<li><p>What interventions do you want to optimize?</p>
</li>
<li><p>What constraints matter in practice?</p>
</li>
</ul>
<p><strong>4. Challenge yourself</strong></p>
<ul>
<li><p>Can you extend the engine to multiple treatments?</p>
</li>
<li><p>How would you handle continuous outcomes?</p>
</li>
<li><p>What about time-series counterfactuals?</p>
</li>
</ul>
<p>Bring these to Part 4. We're building the intervention engine.</p>
<hr />
<p><strong>Series Navigation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://hashnode.com/preview/6965608ba83646fb5b9d1077">← Part 2: Building Causal DAGs</a></p>
</li>
<li><p><strong>Part 3: Counterfactual Reasoning</strong> ← You are here</p>
</li>
<li><p>Part 4: Intervention Design → (Jan 22)</p>
</li>
<li><p>Part 5: Distributed Systems (Jan 24)</p>
</li>
</ul>
<p><strong>Code &amp; Resources:</strong></p>
<ul>
<li><p><a target="_blank" href="https://github.com/cod3smith/plant-disease-causal">GitHub Repository</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/yourusername/plant-disease-causal/tree/main/part3">Counterfactual Examples</a></p>
</li>
</ul>
<hr />
<p><em>Part of the NeoForge Labs research series on production-grade causal AI.</em></p>
<p><strong>Questions?</strong> I read every comment.</p>
]]></content:encoded></item><item><title><![CDATA[Part 2: Building Your First Causal DAG]]></title><description><![CDATA[In Part 1, you learned why causality matters. Correlation tells you what happens, but causation tells you why and what to do about it.
Today, we're building your first causal Directed Acyclic Graph (DAG)—the foundation of causal reasoning.
By the end...]]></description><link>https://blog.neoforgelabs.tech/part-2-building-your-first-causal-dag</link><guid isPermaLink="true">https://blog.neoforgelabs.tech/part-2-building-your-first-causal-dag</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[Developer]]></category><dc:creator><![CDATA[Kelyn Njeri]]></dc:creator><pubDate>Wed, 14 Jan 2026 04:00:18 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768253217725/ffb009f8-f2ca-4dbf-8f3d-c10eb343f454.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <a target="_blank" href="https://blog.neoforgelabs.tech/why-causality-matters-for-ai">Part 1</a>, you learned why causality matters. Correlation tells you <em>what</em> happens, but causation tells you <em>why</em> and <em>what to do about it</em>.</p>
<p>Today, we're building your first causal Directed Acyclic Graph (DAG)—the foundation of causal reasoning.</p>
<p>By the end of this article, you'll:</p>
<ul>
<li><p>Understand what DAGs are and why they're powerful</p>
</li>
<li><p>Build a complete causal model for plant disease detection</p>
</li>
<li><p>Learn to identify confounders, mediators, and colliders</p>
</li>
<li><p>Know how to validate your causal assumptions</p>
</li>
<li><p>Have working code to implement your DAG in Python</p>
</li>
</ul>
<p>No more theory. Let's build something real.</p>
<p><strong>What you'll need:</strong></p>
<ul>
<li><p>Python 3.12+</p>
</li>
<li><p>Basic understanding of probability</p>
</li>
<li><p>Curiosity about how things actually work</p>
</li>
</ul>
<p>Let's go.</p>
<hr />
<h2 id="heading-what-is-a-causal-dag"><strong>What Is a Causal DAG?</strong></h2>
<h3 id="heading-graphs-as-causal-models"><strong>Graphs as Causal Models</strong></h3>
<p>A <strong>Directed Acyclic Graph (DAG)</strong> is a visual representation of causal relationships:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*T5VqBGZNoyIABTkfrynEuw.png" alt /></p>
<p><strong>Three key components:</strong></p>
<p><strong>1. Nodes (variables):</strong> Things that can change</p>
<ul>
<li><p>Environmental temperature</p>
</li>
<li><p>Soil moisture</p>
</li>
<li><p>Plant health</p>
</li>
<li><p>Disease presence</p>
</li>
</ul>
<p><strong>2. Directed edges (arrows):</strong> Causal relationships</p>
<ul>
<li><p>A → B means "A causes B"</p>
</li>
<li><p>Direction matters: Temperature → Disease ≠ Disease → Temperature</p>
</li>
</ul>
<p><strong>3. Acyclic (no loops):</strong> No circular causation</p>
<ul>
<li><p>Can't have: A → B → C → A</p>
</li>
<li><p>Time flows forward, causes precede effects</p>
</li>
</ul>
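<p>The acyclicity constraint is easy to enforce mechanically. A minimal sketch using <code>networkx</code> (an assumed dependency, with illustrative node names):</p>

```python
import networkx as nx

# Arrows point from cause to effect
G = nx.DiGraph()
G.add_edges_from([
    ("Temperature", "Moisture"),
    ("Moisture", "Pathogen_Growth"),
    ("Pathogen_Growth", "Disease"),
])

print(nx.is_directed_acyclic_graph(G))  # True: a valid causal DAG

# A back-edge would create the forbidden loop Moisture -> ... -> Moisture
G.add_edge("Disease", "Moisture")
print(nx.is_directed_acyclic_graph(G))  # False: no longer a DAG
```

<p>Checking this programmatically catches accidental cycles before they silently corrupt downstream causal queries.</p>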
<h3 id="heading-why-graphs"><strong>Why Graphs?</strong></h3>
<p><strong>Compact representation of causal knowledge:</strong></p>
<p>Instead of writing:</p>
<pre><code class="lang-plaintext">If temperature is high AND humidity is high THEN moisture increases
If moisture is high AND air circulation is low THEN pathogen growth increases
If pathogen growth is high THEN disease risk increases
...
</code></pre>
<p>We draw:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*ps98UTS5tdguPOzP-GeRGg.png" alt /></p>
<p>DAG Knowledge Representation</p>
<p><strong>The graph encodes:</strong></p>
<ul>
<li><p>Direct causes (arrows)</p>
</li>
<li><p>Indirect causes (paths)</p>
</li>
<li><p>Independence relationships (absence of arrows)</p>
</li>
<li><p>Causal mechanisms (structure)</p>
</li>
</ul>
<h3 id="heading-reading-the-graph"><strong>Reading the Graph</strong></h3>
<p>From the DAG above, we can read:</p>
<p><strong>Direct effects:</strong></p>
<ul>
<li><p>Temperature directly causes Moisture</p>
</li>
<li><p>Pathogen Growth directly causes Disease</p>
</li>
</ul>
<p><strong>Indirect effects:</strong></p>
<ul>
<li><p>Temperature indirectly affects Disease (via Moisture → Pathogen Growth)</p>
</li>
<li><p>Humidity indirectly affects Disease (via same path)</p>
</li>
</ul>
<p><strong>Independence (no arrow):</strong></p>
<ul>
<li><p>Temperature does NOT directly cause Disease</p>
<ul>
<li>It only affects it through the moisture mechanism</li>
</ul>
</li>
<li><p>Air Circulation does NOT affect Moisture</p>
<ul>
<li>It only affects pathogen growth</li>
</ul>
</li>
</ul>
<p><strong>This is powerful</strong>: The structure tells us what variables are related and HOW.</p>
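<p>These readings can also be queried programmatically. A sketch with <code>networkx</code> (assumed dependency), encoding the example DAG above:</p>

```python
import networkx as nx

G = nx.DiGraph([
    ("Temperature", "Moisture"),
    ("Humidity", "Moisture"),
    ("Moisture", "Pathogen_Growth"),
    ("Air_Circulation", "Pathogen_Growth"),
    ("Pathogen_Growth", "Disease"),
])

# Direct effects: the children of a node
direct = set(G.successors("Temperature"))            # {'Moisture'}

# Indirect effects: everything reachable downstream
downstream = nx.descendants(G, "Temperature")        # includes 'Disease'

# The causal path(s) that carry an indirect effect
paths = list(nx.all_simple_paths(G, "Temperature", "Disease"))
```

<p>Absence of an edge is just as informative: <code>"Disease" not in G.successors("Temperature")</code> encodes that temperature has no direct effect on disease.</p>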
<h3 id="heading-the-dag-answers-intervention-questions"><strong>The DAG Answers Intervention Questions</strong></h3>
<p>Want to know: "What happens if I reduce humidity?"</p>
<p><strong>Follow the arrows:</strong></p>
<ol>
<li><p>Humidity ↓</p>
</li>
<li><p>→ Moisture ↓</p>
</li>
<li><p>→ Pathogen Growth ↓</p>
</li>
<li><p>→ Disease ↓</p>
</li>
</ol>
<p>The causal path tells us the effect of our intervention.</p>
<p>Want to know: "What happens if I increase temperature?"</p>
<p><strong>Check the paths:</strong></p>
<ul>
<li><p>Temperature ↑ → Moisture ↓ (surface water evaporates faster)</p>
</li>
<li><p>Moisture → Pathogen Growth (depends on pathogen type!)</p>
<ul>
<li><p>For fungi that need moisture: Growth ↓</p>
</li>
<li><p>For heat-loving pathogens: Growth ↑</p>
</li>
</ul>
</li>
</ul>
<p>The DAG shows us we need domain knowledge to complete the model.</p>
<h3 id="heading-what-dags-cannot-do"><strong>What DAGs Cannot Do</strong></h3>
<p>Important limitations:</p>
<p><strong>1. DAGs don't learn from data alone</strong></p>
<ul>
<li><p>You need domain knowledge to draw arrows</p>
</li>
<li><p>Data can validate or reject your DAG</p>
</li>
<li><p>But structure comes from understanding mechanisms</p>
</li>
</ul>
<p><strong>2. DAGs assume causality is stable</strong></p>
<ul>
<li><p>Same causes → same effects</p>
</li>
<li><p>Holds across contexts (mostly)</p>
</li>
<li><p>May break with extreme distribution shift</p>
</li>
</ul>
<p><strong>3. DAGs can be wrong</strong></p>
<ul>
<li><p>Missing arrows = missed confounders</p>
</li>
<li><p>Wrong direction = incorrect causal reasoning</p>
</li>
<li><p>Validation is crucial</p>
</li>
</ul>
<p><strong>But</strong>: A good DAG, grounded in domain expertise, is far more reliable than pure correlation.</p>
<hr />
<h2 id="heading-building-the-plant-disease-dag"><strong>Building the Plant Disease DAG</strong></h2>
<h3 id="heading-step-by-step-from-mechanism-to-graph"><strong>Step-by-Step: From Mechanism to Graph</strong></h3>
<p>Let's build our plant disease causal model systematically.</p>
<h4 id="heading-step-1-identify-root-causes-exogenous-variables"><strong>Step 1: Identify Root Causes (Exogenous Variables)</strong></h4>
<p>What are the fundamental inputs we can observe or control?</p>
<p><strong>1. Environmental Stress</strong></p>
<ul>
<li><p>Temperature extremes</p>
</li>
<li><p>Humidity levels</p>
</li>
<li><p>Light availability</p>
</li>
<li><p>Composite measure of environmental conditions</p>
</li>
</ul>
<p><strong>2. Watering Practice</strong></p>
<ul>
<li><p>Frequency</p>
</li>
<li><p>Amount</p>
</li>
<li><p>Under/optimal/overwatered</p>
</li>
<li><p>Farmer-controlled variable</p>
</li>
</ul>
<p><strong>3. Plant Vigor</strong></p>
<ul>
<li><p>Overall plant health</p>
</li>
<li><p>Genetic factors</p>
</li>
<li><p>Age and maturity</p>
</li>
<li><p>Baseline resilience</p>
</li>
</ul>
<p>These are <strong>exogenous</strong> (external) variables—they're not caused by other variables in our model.</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*pZq420unVGdjWatDSQcfHA.png" alt /></p>
<h4 id="heading-step-2-identify-the-causal-mechanism"><strong>Step 2: Identify the Causal Mechanism</strong></h4>
<p><strong>Question:</strong> How do root causes lead to disease?</p>
<p><strong>Domain knowledge from plant pathology:</strong></p>
<p><strong>1. High environmental stress + watering → Leaf Moisture</strong></p>
<ul>
<li><p>Hot weather increases evaporation</p>
</li>
<li><p>Watering increases surface water</p>
</li>
<li><p>Combined: creates conditions for pathogens</p>
</li>
</ul>
<p><strong>2. Leaf Moisture → Pathogen Growth</strong></p>
<ul>
<li><p>Fungi need moisture to germinate</p>
</li>
<li><p>Bacteria need water film to enter plant</p>
</li>
<li><p>Critical threshold: ~6-8 hours of leaf wetness</p>
</li>
</ul>
<p><strong>3. Pathogen Growth → Disease</strong></p>
<ul>
<li><p>Sufficient pathogen load → infection</p>
</li>
<li><p>Varies by plant immunity</p>
</li>
</ul>
<p><strong>4. Disease + Plant Vigor → Symptom Severity</strong></p>
<ul>
<li><p>Same disease manifests differently</p>
</li>
<li><p>Healthy plants show mild symptoms</p>
</li>
<li><p>Weak plants show severe symptoms</p>
</li>
</ul>
<p><strong>5. Symptom Severity → Observable Symptoms</strong></p>
<ul>
<li><p>What we actually see</p>
</li>
<li><p>Yellowing, spots, wilting, etc.</p>
</li>
</ul>
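<p>The mechanism above can be written down as structural equations. The sketch below is illustrative: the coefficients, watering offsets, and noise scales are assumptions chosen for the toy example, not fitted values, and the sigmoid centers pathogen growth near the 6-8 hour wetness threshold mentioned above:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def leaf_moisture(stress, watering):
    """Hours of leaf wetness from environmental stress and watering (0=under, 1=optimal, 2=over)."""
    offset = np.where(watering == 0, -3.0, np.where(watering == 2, 5.0, 0.0))
    base = 5.0 + 10.0 * stress + offset
    return np.clip(base + rng.normal(0, 0.5, np.shape(stress)), 0, 24)

def pathogen_growth(moisture):
    """Sigmoid response centered near the ~7 h wetness threshold."""
    return 1.0 / (1.0 + np.exp(-(moisture - 7.0)))

def disease(growth):
    """Infection occurs once pathogen load crosses a (noisy) threshold."""
    return (growth + rng.normal(0, 0.1, np.shape(growth)) > 0.6).astype(int)
```

<p>Each function reads off one arrow (or set of arrows into a node) from the DAG, which is exactly what makes interventions computable: overriding an input severs the incoming arrows for that variable.</p>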
<h4 id="heading-step-3-draw-the-arrows"><strong>Step 3: Draw the Arrows</strong></h4>
<p>Now we connect the dots:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*07XCjbhSY1VtRUF9zjCyNg.png" alt /></p>
<p><strong>Node Legend:</strong></p>
<ul>
<li><p><strong>Cyan</strong>: Exogenous (controllable or observable inputs)</p>
</li>
<li><p><strong>Purple</strong>: Intermediate mechanisms</p>
</li>
<li><p><strong>Pink</strong>: Latent (unobserved) variable</p>
</li>
</ul>
<h4 id="heading-step-4-validate-the-structure"><strong>Step 4: Validate the Structure</strong></h4>
<p>For each arrow, ask: "Does X <strong>directly cause</strong> Y?"</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Arrow</td><td>Justification</td><td>Validated?</td></tr>
</thead>
<tbody>
<tr>
<td>Environmental Stress → Leaf Moisture</td><td>Temperature/humidity affect surface water</td><td>✅</td></tr>
<tr>
<td>Watering → Leaf Moisture</td><td>Direct causal mechanism</td><td>✅</td></tr>
<tr>
<td>Leaf Moisture → Pathogen Growth</td><td>Pathogens need water</td><td>✅</td></tr>
<tr>
<td>Pathogen Growth → Disease</td><td>Sufficient load → infection</td><td>✅</td></tr>
<tr>
<td>Disease → Symptom Severity</td><td>Disease causes symptoms</td><td>✅</td></tr>
<tr>
<td>Plant Vigor → Symptom Severity</td><td>Vigor moderates expression</td><td>✅</td></tr>
<tr>
<td>Symptom Severity → Observable Symptoms</td><td>What we measure</td><td>✅</td></tr>
</tbody>
</table>
</div><p><strong>Missing arrows (intentionally):</strong></p>
<ul>
<li><p>Environmental Stress → Disease?</p>
<ul>
<li>NO direct arrow: stress affects disease ONLY through moisture</li>
</ul>
</li>
<li><p>Watering → Disease?</p>
<ul>
<li>NO direct arrow: watering affects disease ONLY through moisture</li>
</ul>
</li>
<li><p>Plant Vigor → Disease?</p>
<ul>
<li>NO direct arrow: vigor affects symptom severity, not disease presence</li>
</ul>
</li>
</ul>
<p>This is important! The absence of arrows encodes causal assumptions.</p>
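<p>Those missing arrows are testable: the DAG implies that watering is independent of downstream disease variables once leaf moisture is known. A self-contained sketch (synthetic data, assumed coefficients) that checks one such implication with a partial correlation:</p>

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# Simulate data consistent with the DAG: watering affects pathogen growth
# ONLY through leaf moisture (coefficients are illustrative assumptions)
watering = rng.choice([0.0, 1.0, 2.0], n)
stress = rng.beta(2, 5, n)
moisture = 3.0 * watering + 8.0 * stress + rng.normal(0, 0.5, n)
growth = 0.4 * moisture + rng.normal(0, 0.3, n)

def residuals(y, x):
    """Remove the best linear fit of x from y."""
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r_marginal = np.corrcoef(watering, growth)[0, 1]                 # clearly nonzero
r_partial = np.corrcoef(residuals(watering, moisture),
                        residuals(growth, moisture))[0, 1]       # near zero, as the DAG implies
```

<p>If the partial correlation came out far from zero on real data, that would be evidence the missing Watering → Pathogen Growth arrow is wrong.</p>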
<h4 id="heading-step-5-name-the-causal-roles"><strong>Step 5: Name the Causal Roles</strong></h4>
<p>Understanding special node types:</p>
<p><strong>Confounders:</strong></p>
<ul>
<li><p>Variables that affect both treatment and outcome</p>
</li>
<li><p>Example: If Environmental Stress affected both Watering AND Disease directly</p>
</li>
<li><p>Our model: No confounders (by design for simplicity)</p>
</li>
</ul>
<p><strong>Mediators:</strong></p>
<ul>
<li><p>Variables on the causal path</p>
</li>
<li><p>Example: Leaf Moisture mediates Environmental Stress → Disease</p>
</li>
<li><p>Pathogen Growth mediates Leaf Moisture → Disease</p>
</li>
</ul>
<p><strong>Colliders:</strong></p>
<ul>
<li><p>Variables caused by multiple parents</p>
</li>
<li><p>Example: Symptom Severity is a collider (caused by Disease AND Plant Vigor)</p>
</li>
<li><p>Special property: conditioning on colliders can create spurious associations!</p>
</li>
</ul>
<p><strong>Effect Modifiers:</strong></p>
<ul>
<li><p>Variables that change the magnitude of effects</p>
</li>
<li><p>Example: Plant Vigor modifies how Disease translates to Symptoms</p>
</li>
<li><p>High vigor → mild symptoms even with disease</p>
</li>
</ul>
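<p>Collider bias is worth seeing once in numbers. In the sketch below (synthetic data, assumed effect sizes), Disease and Plant Vigor are generated independently, yet restricting attention to symptomatic plants, i.e. conditioning on the collider Symptom Severity, manufactures an association between them:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Independent causes (effect sizes are illustrative assumptions)
disease = rng.binomial(1, 0.3, n).astype(float)
vigor = rng.normal(0.7, 0.15, n)

# Symptom severity: the collider, caused by both parents
severity = 0.6 * disease - 0.4 * vigor + rng.normal(0, 0.1, n)

# Unconditionally, disease and vigor are (nearly) uncorrelated
r_all = np.corrcoef(disease, vigor)[0, 1]

# Condition on the collider: keep only visibly symptomatic plants
mask = severity > 0.0
r_cond = np.corrcoef(disease[mask], vigor[mask])[0, 1]
```

<p>Among symptomatic plants, the disease-free ones must have low vigor to show symptoms at all, so disease and vigor appear correlated in the conditioned sample. This is why blindly "controlling for" every variable can backfire.</p>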
<h3 id="heading-our-final-dag"><strong>Our Final DAG</strong></h3>
<p>8 nodes, 7 edges, complete causal story:</p>
<p><strong>Root Causes</strong> → <strong>Mechanisms</strong> → <strong>Observable Outcomes</strong></p>
<p>This is our working model. In Part 3, we'll use it for counterfactual reasoning. In Part 4, we'll design interventions.</p>
<p>But first, let's implement it in code.</p>
<hr />
<h2 id="heading-implementing-the-dag-in-python"><strong>Implementing the DAG in Python</strong></h2>
<h3 id="heading-coding-your-dag-with-dowhy"><strong>Coding Your DAG with DoWhy</strong></h3>
<p>Let's make this concrete with Python code.</p>
<p><strong>Install dependencies:</strong></p>
<pre><code class="lang-bash">pip install dowhy pandas numpy networkx matplotlib
</code></pre>
<p><strong>Define the DAG:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> dowhy <span class="hljs-keyword">import</span> CausalModel
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np

<span class="hljs-comment"># Define the causal graph</span>
causal_graph = <span class="hljs-string">"""
digraph {
    Environmental_Stress -&gt; Leaf_Moisture;
    Watering_Practice -&gt; Leaf_Moisture;
    Leaf_Moisture -&gt; Pathogen_Growth;
    Pathogen_Growth -&gt; Disease_Present;
    Disease_Present -&gt; Symptom_Severity;
    Plant_Vigor -&gt; Symptom_Severity;
    Symptom_Severity -&gt; Observable_Symptoms;
}
"""</span>

<span class="hljs-comment"># Create sample data (we'll use synthetic for now)</span>
np.random.seed(<span class="hljs-number">42</span>)
n_samples = <span class="hljs-number">1000</span>

data = pd.DataFrame({
    <span class="hljs-string">'environmental_stress'</span>: np.random.beta(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, n_samples),  <span class="hljs-comment"># 0-1 scale</span>
    <span class="hljs-string">'watering_practice'</span>: np.random.choice([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>], n_samples),  <span class="hljs-comment"># 0=under, 1=optimal, 2=over</span>
    <span class="hljs-string">'plant_vigor'</span>: np.random.beta(<span class="hljs-number">8</span>, <span class="hljs-number">2</span>, n_samples),  <span class="hljs-comment"># Usually healthy</span>
    <span class="hljs-string">'leaf_moisture_hours'</span>: np.zeros(n_samples),  <span class="hljs-comment"># We'll compute</span>
    <span class="hljs-string">'pathogen_growth'</span>: np.zeros(n_samples),
    <span class="hljs-string">'disease_present'</span>: np.zeros(n_samples),
    <span class="hljs-string">'symptom_severity'</span>: np.zeros(n_samples),
})

<span class="hljs-comment"># Generate data according to causal structure</span>
<span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
    <span class="hljs-comment"># Leaf moisture depends on environmental stress and watering</span>
    base_moisture = <span class="hljs-number">5.0</span>  <span class="hljs-comment"># baseline</span>
    stress_effect = data.loc[i, <span class="hljs-string">'environmental_stress'</span>] * <span class="hljs-number">10</span>
    watering_effect = [<span class="hljs-number">-3</span>, <span class="hljs-number">0</span>, <span class="hljs-number">5</span>][data.loc[i, <span class="hljs-string">'watering_practice'</span>]]

    data.loc[i, <span class="hljs-string">'leaf_moisture_hours'</span>] = np.clip(
        base_moisture + stress_effect + watering_effect + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>),
        <span class="hljs-number">0</span>, <span class="hljs-number">24</span>
    )

    <span class="hljs-comment"># Pathogen growth depends on leaf moisture</span>
    moisture = data.loc[i, <span class="hljs-string">'leaf_moisture_hours'</span>]
    data.loc[i, <span class="hljs-string">'pathogen_growth'</span>] = np.clip(
        (moisture / <span class="hljs-number">24</span>) ** <span class="hljs-number">1.5</span> + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>),
        <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
    )

    <span class="hljs-comment"># Disease depends on pathogen growth</span>
    pathogen = data.loc[i, <span class="hljs-string">'pathogen_growth'</span>]
    data.loc[i, <span class="hljs-string">'disease_present'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> pathogen &gt; <span class="hljs-number">0.6</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

    <span class="hljs-comment"># Symptom severity depends on disease and plant vigor</span>
    disease = data.loc[i, <span class="hljs-string">'disease_present'</span>]
    vigor = data.loc[i, <span class="hljs-string">'plant_vigor'</span>]
    data.loc[i, <span class="hljs-string">'symptom_severity'</span>] = np.clip(
        disease * (<span class="hljs-number">1</span> - vigor * <span class="hljs-number">0.5</span>) + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>),
        <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
    )

print(data.head())
print(<span class="hljs-string">f"\nDisease prevalence: <span class="hljs-subst">{data[<span class="hljs-string">'disease_present'</span>].mean():<span class="hljs-number">.2</span>%}</span>"</span>)
</code></pre>
<p><strong>Create the causal model:</strong></p>
<pre><code class="lang-python">model = CausalModel(
    data=data,
    treatment=<span class="hljs-string">'leaf_moisture_hours'</span>,
    outcome=<span class="hljs-string">'symptom_severity'</span>,
    graph=causal_graph,
    common_causes=[<span class="hljs-string">'environmental_stress'</span>, <span class="hljs-string">'watering_practice'</span>],
    effect_modifiers=[<span class="hljs-string">'plant_vigor'</span>]
)

<span class="hljs-comment"># Visualize the graph</span>
model.view_model()

<span class="hljs-comment"># Identify the causal effect</span>
identified_estimand = model.identify_effect(proceed_when_unidentifiable=<span class="hljs-literal">True</span>)
print(identified_estimand)
</code></pre>
<p><strong>What this code does:</strong></p>
<ol>
<li><p><strong>Defines causal structure</strong> (the DAG)</p>
</li>
<li><p><strong>Generates synthetic data</strong> following causal equations</p>
</li>
<li><p><strong>Creates DoWhy model</strong> linking data to structure</p>
</li>
<li><p><strong>Identifies causal effect</strong> of leaf moisture on symptoms</p>
</li>
</ol>
<p><strong>The structural equations:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># These are the causal mechanisms encoded in the data generation:</span>

Leaf_Moisture = f(Environmental_Stress, Watering_Practice, noise)
Pathogen_Growth = g(Leaf_Moisture, noise)
Disease = h(Pathogen_Growth, noise)
Symptom_Severity = j(Disease, Plant_Vigor, noise)
</code></pre>
<p>Each function represents a causal mechanism. The DAG shows which variables go into which functions.</p>
<h3 id="heading-querying-the-model"><strong>Querying the Model</strong></h3>
<p>Now we can ask causal questions:</p>
<pre><code class="lang-python"><span class="hljs-comment"># Estimate causal effect</span>
estimate = model.estimate_effect(
    identified_estimand,
    method_name=<span class="hljs-string">"backdoor.linear_regression"</span>
)

print(<span class="hljs-string">f"Causal effect of leaf moisture on symptom severity: <span class="hljs-subst">{estimate.value:<span class="hljs-number">.4</span>f}</span>"</span>)
ci = np.ravel(estimate.get_confidence_intervals())  <span class="hljs-comment"># flatten; the returned shape varies by estimator</span>
print(<span class="hljs-string">f"95% Confidence Interval: [<span class="hljs-subst">{ci[<span class="hljs-number">0</span>]:<span class="hljs-number">.4</span>f}</span>, <span class="hljs-subst">{ci[<span class="hljs-number">1</span>]:<span class="hljs-number">.4</span>f}</span>]"</span>)
</code></pre>
<p>This tells us: <strong>For every additional hour of leaf moisture, symptom severity increases by X.</strong></p>
<p>That's a causal claim, not correlation!</p>
<p><strong>Example output:</strong></p>
<pre><code class="lang-plaintext">Causal effect of leaf moisture on symptom severity: 0.0234
95% Confidence Interval: [0.0198, 0.0271]

Interpretation: Each additional hour of leaf moisture causes 
a 0.0234-point increase in symptom severity on the 0-1 scale 
(statistically significant).
</code></pre>
<p>We'll do much more with this in Part 3 (counterfactuals) and Part 4 (interventions).</p>
<hr />
<h2 id="heading-common-dag-patterns-amp-pitfalls"><strong>Common DAG Patterns &amp; Pitfalls</strong></h2>
<h3 id="heading-causal-patterns-you-need-to-know"><strong>Causal Patterns You Need to Know</strong></h3>
<p>Understanding these patterns will help you build better DAGs.</p>
<h4 id="heading-pattern-1-confounding"><strong>Pattern 1: Confounding</strong></h4>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*jyMZQbFxoP34ClUcBtjBVg.png" alt /></p>
<p><strong>Problem:</strong> Confounder causes both treatment and outcome, creating spurious association.</p>
<p><strong>Example:</strong></p>
<ul>
<li><p>Season (C) → Watering frequency (T)</p>
</li>
<li><p>Season (C) → Disease prevalence (O)</p>
</li>
</ul>
<p>You observe: More watering correlates with more disease.<br />Reality: It's because summer has both more watering AND more disease.</p>
<p><strong>Solution:</strong> Control for confounders in analysis (we'll cover this in Part 4).</p>
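<p>A quick simulation makes the trap concrete. Below, season drives both watering and disease while watering has <em>no</em> effect on disease at all; the naive correlation is clearly positive anyway, and it vanishes once you hold season fixed. (The base rates and watering frequencies are made up for illustration.)</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

summer = rng.random(n) < 0.5
# Season -> Watering and Season -> Disease; there is NO Watering -> Disease arrow
watering = rng.normal(loc=np.where(summer, 5.0, 2.0))   # waterings per week
disease = rng.random(n) < np.where(summer, 0.30, 0.10)  # disease probability by season

naive = np.corrcoef(watering, disease)[0, 1]
within_summer = np.corrcoef(watering[summer], disease[summer])[0, 1]

print(f"Naive correlation:  {naive:+.3f}")          # spuriously positive
print(f"Within summer only: {within_summer:+.3f}")  # roughly zero
```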
<h4 id="heading-pattern-2-mediation"><strong>Pattern 2: Mediation</strong></h4>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*sZm4hOJIKH6Tb8LCpDGoEQ.png" alt /></p>
<p><strong>Definition:</strong> Mediator sits on the causal path between treatment and outcome.</p>
<p><strong>Example in our DAG:</strong></p>
<ul>
<li>Watering → Leaf Moisture → Pathogen Growth → Disease</li>
</ul>
<p>Leaf Moisture and Pathogen Growth are mediators.</p>
<p><strong>Why it matters:</strong></p>
<ul>
<li><p><strong>Total effect:</strong> Watering → Disease (full causal path)</p>
</li>
<li><p><strong>Direct effect:</strong> None (Watering doesn't cause Disease except through mediators)</p>
</li>
<li><p><strong>Mediated effect:</strong> Watering → Moisture → Pathogen → Disease</p>
</li>
</ul>
<p><strong>Intervention implications:</strong></p>
<ul>
<li><p>You can intervene at any point in the chain</p>
</li>
<li><p>Earlier intervention (reduce watering) prevents the entire cascade</p>
</li>
<li><p>Later intervention (fungicide for pathogen) only stops downstream effects</p>
</li>
</ul>
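<p>To put numbers on the "intervene early" point, we can rerun the toy mechanisms from the synthetic data-generation code above while forcing watering via <code>do(watering)</code>. This is a sketch against our made-up equations, not real agronomy; <code>disease_rate</code> is a helper written for this demo.</p>

```python
import numpy as np

def disease_rate(watering, n=50_000, seed=0):
    """Disease prevalence under do(watering), using the article's toy mechanisms."""
    rng = np.random.default_rng(seed)
    stress = rng.beta(2, 5, n)
    watering_effect = {0: -3, 1: 0, 2: 5}[watering]  # under / optimal / over
    moisture = np.clip(5.0 + stress * 10 + watering_effect + rng.normal(0, 1, n), 0, 24)
    pathogen = np.clip((moisture / 24) ** 1.5 + rng.normal(0, 0.1, n), 0, 1)
    return (pathogen > 0.6).mean()

for w, name in [(0, "under"), (1, "optimal"), (2, "over")]:
    print(f"do(watering={name}): disease rate = {disease_rate(w):.1%}")
```

Forcing under-watering dries the leaves, which starves the pathogen, which prevents disease: the entire downstream cascade responds to the upstream lever.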
<h4 id="heading-pattern-3-collider-bias-the-trap"><strong>Pattern 3: Collider Bias (The Trap)</strong></h4>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*My9KmnFlIqv_fWuCyBVxig.png" alt /></p>
<p><strong>Critical property:</strong> A and B are independent, but become correlated if you condition on C!</p>
<p><strong>Example:</strong></p>
<ul>
<li><p>Disease (A) → Symptom Severity (C)</p>
</li>
<li><p>Plant Vigor (B) → Symptom Severity (C)</p>
</li>
</ul>
<p>Symptom Severity is a collider.</p>
<p><strong>The trap:</strong> If you only analyze plants with severe symptoms (conditioning on collider), you'll find:</p>
<ul>
<li><p>Plants with low vigor tend to have disease</p>
</li>
<li><p>Plants with high vigor tend NOT to have disease</p>
</li>
</ul>
<p><strong>But this is spurious!</strong> You selected on the outcome.</p>
<p>In reality:</p>
<ul>
<li><p>Disease and Plant Vigor are independent (no arrow between them)</p>
</li>
<li><p>They only appear related when you filter by severe symptoms</p>
</li>
</ul>
<p><strong>Real-world example:</strong></p>
<p>Imagine you're studying what makes successful entrepreneurs. You only survey people who became billionaires (conditioning on outcome).</p>
<p>You find: High risk-taking OR exceptional luck leads to billions.</p>
<p>Among billionaires:</p>
<ul>
<li><p>High risk-takers had average luck</p>
</li>
<li><p>Low risk-takers had exceptional luck</p>
</li>
</ul>
<p><strong>Spurious correlation!</strong> Risk-taking and luck appear negatively correlated, but only because you conditioned on success (the collider).</p>
<p><strong>How to avoid:</strong> Don't condition on colliders unless you have a good reason.</p>
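<p>You can watch this happen in a few lines of simulation. Risk-taking and luck are generated completely independently; conditioning on the collider (becoming a billionaire) manufactures a strong negative correlation out of nothing. The threshold of 2.0 is arbitrary:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

risk_taking = rng.normal(size=n)   # independent by construction
luck = rng.normal(size=n)          # independent by construction
billionaire = (risk_taking + luck) > 2.0  # collider: needs some of both

corr_all = np.corrcoef(risk_taking, luck)[0, 1]
corr_rich = np.corrcoef(risk_taking[billionaire], luck[billionaire])[0, 1]

print(f"Full population:   {corr_all:+.3f}")   # near zero
print(f"Billionaires only: {corr_rich:+.3f}")  # strongly negative
```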
<h4 id="heading-pattern-4-selection-bias"><strong>Pattern 4: Selection Bias</strong></h4>
<p>Similar to collider bias, but about sample selection:</p>
<p><img src="https://cdn-images-1.medium.com/max/1600/1*imE2N5g4heBo7et3uSvTig.png" alt /></p>
<p><strong>Example:</strong> You train your model only on:</p>
<ul>
<li><p>Plants brought to clinic (Selection)</p>
</li>
<li><p>Which happens when: Disease is visible OR plant is expensive</p>
</li>
</ul>
<p>Now disease and plant value appear correlated—but only because of selection!</p>
<p><strong>Solution:</strong> Be aware of how your sample was selected.</p>
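<p>The same machinery demonstrates the clinic example. Disease and plant value are independent in the full population, but restricting to plants that reach the clinic (visibly diseased OR expensive) induces a correlation between them. The 20%/30% base rates below are arbitrary:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

diseased = rng.random(n) < 0.20    # independent of value
expensive = rng.random(n) < 0.30   # independent of disease
at_clinic = diseased | expensive   # selection: visible disease OR a costly plant

corr_population = np.corrcoef(diseased, expensive)[0, 1]
corr_clinic = np.corrcoef(diseased[at_clinic], expensive[at_clinic])[0, 1]

print(f"Full population: {corr_population:+.3f}")  # near zero
print(f"Clinic sample:   {corr_clinic:+.3f}")      # spuriously negative
```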
<h3 id="heading-dag-validation-checklist"><strong>DAG Validation Checklist</strong></h3>
<p>Before trusting your DAG:</p>
<ul>
<li><p>[ ] Every arrow represents a direct causal effect</p>
</li>
<li><p>[ ] Missing arrows represent true independence</p>
</li>
<li><p>[ ] No feedback loops (check: acyclic?)</p>
</li>
<li><p>[ ] Domain experts reviewed the structure</p>
</li>
<li><p>[ ] Edge cases considered (extreme values)</p>
</li>
<li><p>[ ] Alternative DAGs ruled out</p>
</li>
<li><p>[ ] Testable implications identified</p>
</li>
<li><p>[ ] Data validates independence claims</p>
</li>
</ul>
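<p>The acyclicity item is the easiest to automate. A short check with <code>networkx</code>, using our DAG's edge list: a topological order exists only for a true DAG, and it doubles as a causal ordering you can sanity-check against temporal order.</p>

```python
import networkx as nx

edges = [
    ("Environmental_Stress", "Leaf_Moisture"),
    ("Watering_Practice", "Leaf_Moisture"),
    ("Leaf_Moisture", "Pathogen_Growth"),
    ("Pathogen_Growth", "Disease_Present"),
    ("Disease_Present", "Symptom_Severity"),
    ("Plant_Vigor", "Symptom_Severity"),
]
g = nx.DiGraph(edges)

is_dag = nx.is_directed_acyclic_graph(g)
print("Acyclic:", is_dag)
if is_dag:
    # Causes appear before their effects in this ordering
    print("Causal ordering:", list(nx.topological_sort(g)))
```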
<hr />
<h2 id="heading-testing-your-dag"><strong>Testing Your DAG</strong></h2>
<h3 id="heading-how-to-know-if-your-dag-is-right"><strong>How to Know If Your DAG Is Right</strong></h3>
<p>A DAG makes testable predictions about independence relationships.</p>
<h4 id="heading-d-separation-reading-independence-from-structure"><strong>D-Separation: Reading Independence from Structure</strong></h4>
<p>Two variables are <strong>d-separated</strong> if all paths between them are blocked.</p>
<p><strong>Example from our DAG:</strong></p>
<p>Question: Are Environmental Stress and Plant Vigor independent?</p>
<p>Answer: <strong>Yes, they're independent.</strong> The only path between them runs through Symptom Severity, a collider, so the path is blocked.</p>
<p><strong>Testable prediction:</strong> In our data, Environmental Stress and Plant Vigor should be uncorrelated.</p>
<p>If we find correlation, our DAG is wrong!</p>
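<p>You can also check the structural claim before touching data: two variables are marginally d-connected only if one is an ancestor of the other or they share a common ancestor, since paths through a collider (like the one via Symptom Severity) are blocked. A sketch using <code>networkx</code>, with a helper written for this demo:</p>

```python
import networkx as nx

edges = [
    ("Environmental_Stress", "Leaf_Moisture"),
    ("Watering_Practice", "Leaf_Moisture"),
    ("Leaf_Moisture", "Pathogen_Growth"),
    ("Pathogen_Growth", "Disease_Present"),
    ("Disease_Present", "Symptom_Severity"),
    ("Plant_Vigor", "Symptom_Severity"),
]
g = nx.DiGraph(edges)

def marginally_d_connected(g, x, y):
    """Marginal dependence requires an ancestral link: one node is an
    ancestor of the other, or they share a common ancestor."""
    anc_x = nx.ancestors(g, x) | {x}
    anc_y = nx.ancestors(g, y) | {y}
    return not anc_x.isdisjoint(anc_y)

print(marginally_d_connected(g, "Environmental_Stress", "Plant_Vigor"))      # False
print(marginally_d_connected(g, "Environmental_Stress", "Disease_Present"))  # True
```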
<pre><code class="lang-python"><span class="hljs-comment"># Test independence</span>
<span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> pearsonr

correlation, p_value = pearsonr(
    data[<span class="hljs-string">'environmental_stress'</span>], 
    data[<span class="hljs-string">'plant_vigor'</span>]
)

print(<span class="hljs-string">f"Correlation: <span class="hljs-subst">{correlation:<span class="hljs-number">.4</span>f}</span>"</span>)
print(<span class="hljs-string">f"P-value: <span class="hljs-subst">{p_value:<span class="hljs-number">.4</span>f}</span>"</span>)

<span class="hljs-keyword">if</span> p_value &gt; <span class="hljs-number">0.05</span>:
    print(<span class="hljs-string">"✓ Independent (DAG validated)"</span>)
<span class="hljs-keyword">else</span>:
    print(<span class="hljs-string">"✗ Dependent (DAG may be wrong)"</span>)
</code></pre>
<h4 id="heading-conditional-independence-tests"><strong>Conditional Independence Tests</strong></h4>
<p>A subtler case: two variables may be dependent marginally, yet independent once you condition on a third.</p>
<p><strong>Example:</strong></p>
<p>Are Watering Practice and Disease independent given Leaf Moisture?</p>
<p><strong>Claim:</strong> Watering affects Disease ONLY through Moisture.</p>
<p><strong>Test:</strong> Given Moisture, Watering and Disease should be independent.</p>
<p>In notation: <code>Watering ⊥ Disease | Moisture</code></p>
<p><strong>Python test:</strong></p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> chi2_contingency

<span class="hljs-comment"># Group by leaf moisture levels</span>
data[<span class="hljs-string">'moisture_level'</span>] = pd.cut(
    data[<span class="hljs-string">'leaf_moisture_hours'</span>], 
    bins=<span class="hljs-number">3</span>, 
    labels=[<span class="hljs-string">'low'</span>, <span class="hljs-string">'med'</span>, <span class="hljs-string">'high'</span>]
)

<span class="hljs-comment"># Within each moisture level, test independence</span>
print(<span class="hljs-string">"Testing: Watering ⊥ Disease | Moisture\n"</span>)

<span class="hljs-keyword">for</span> level <span class="hljs-keyword">in</span> [<span class="hljs-string">'low'</span>, <span class="hljs-string">'med'</span>, <span class="hljs-string">'high'</span>]:
    subset = data[data[<span class="hljs-string">'moisture_level'</span>] == level]

    <span class="hljs-comment"># Contingency table: watering vs disease</span>
    table = pd.crosstab(
        subset[<span class="hljs-string">'watering_practice'</span>], 
        subset[<span class="hljs-string">'disease_present'</span>]
    )

    <span class="hljs-comment"># Chi-square test</span>
    chi2, p_value, dof, expected = chi2_contingency(table)

    print(<span class="hljs-string">f"Moisture <span class="hljs-subst">{level}</span>: p-value = <span class="hljs-subst">{p_value:<span class="hljs-number">.4</span>f}</span>"</span>)
    <span class="hljs-keyword">if</span> p_value &gt; <span class="hljs-number">0.05</span>:
        print(<span class="hljs-string">"  ✓ Independent (DAG validated)"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"  ✗ Dependent (DAG may be wrong)"</span>)
    print()
</code></pre>
<p><strong>Expected output:</strong></p>
<pre><code class="lang-plaintext">Testing: Watering ⊥ Disease | Moisture

Moisture low: p-value = 0.3421
  ✓ Independent (DAG validated)

Moisture med: p-value = 0.5634
  ✓ Independent (DAG validated)

Moisture high: p-value = 0.4523
  ✓ Independent (DAG validated)
</code></pre>
<p>If independence holds, our DAG structure is supported by data.</p>
<h4 id="heading-what-if-tests-fail"><strong>What If Tests Fail?</strong></h4>
<p>If your DAG fails independence tests:</p>
<p><strong>1. Missing arrow:</strong> Add direct causal link</p>
<ul>
<li>Example: Watering → Disease (direct effect we missed)</li>
</ul>
<p><strong>2. Wrong direction:</strong> Reverse an arrow</p>
<ul>
<li>Example: Maybe Disease → Moisture (sick plants retain water?)</li>
</ul>
<p><strong>3. Missing confounder:</strong> Add common cause</p>
<ul>
<li>Example: Season → both Watering AND Disease</li>
</ul>
<p><strong>4. Wrong assumptions:</strong> Reconsider causal mechanism</p>
<ul>
<li>Example: Different disease types have different causal paths</li>
</ul>
<p><strong>Iterate:</strong> Build DAG → Test → Revise → Repeat</p>
<p>This is the scientific method applied to causal structure!</p>
<h3 id="heading-advanced-validation-falsification-tests"><strong>Advanced Validation: Falsification Tests</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># DoWhy includes built-in refutation tests</span>
<span class="hljs-keyword">from</span> dowhy <span class="hljs-keyword">import</span> CausalModel

<span class="hljs-comment"># Refute by adding random common cause</span>
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name=<span class="hljs-string">"random_common_cause"</span>
)
print(refutation)

<span class="hljs-comment"># Expected: Effect should remain stable</span>
<span class="hljs-comment"># If effect changes dramatically, DAG may be missing confounders</span>
</code></pre>
<hr />
<h2 id="heading-practical-tips-for-dag-construction"><strong>Practical Tips for DAG Construction</strong></h2>
<h3 id="heading-start-simple-iterate"><strong>Start Simple, Iterate</strong></h3>
<p><strong>Don't try to model everything at once:</strong></p>
<ol>
<li><p><strong>Start with 3-5 key variables</strong></p>
<ul>
<li><p>Treatment of interest</p>
</li>
<li><p>Outcome of interest</p>
</li>
<li><p>1-3 confounders</p>
</li>
</ul>
</li>
<li><p><strong>Add complexity gradually</strong></p>
<ul>
<li><p>Mediators</p>
</li>
<li><p>Effect modifiers</p>
</li>
<li><p>Additional confounders</p>
</li>
</ul>
</li>
<li><p><strong>Test at each step</strong></p>
<ul>
<li><p>Validate new arrows</p>
</li>
<li><p>Check independence claims</p>
</li>
<li><p>Ensure model still makes sense</p>
</li>
</ul>
</li>
</ol>
<h3 id="heading-use-domain-expertise"><strong>Use Domain Expertise</strong></h3>
<p><strong>Best practices:</strong></p>
<ul>
<li><p><strong>Interview domain experts:</strong> "What causes X?" "Does Y affect Z directly?"</p>
</li>
<li><p><strong>Review literature:</strong> What causal mechanisms are established?</p>
</li>
<li><p><strong>Start with consensus:</strong> Build on well-known relationships</p>
</li>
<li><p><strong>Document assumptions:</strong> Write down why each arrow exists</p>
</li>
<li><p><strong>Invite criticism:</strong> Ask skeptics "What am I missing?"</p>
</li>
</ul>
<h3 id="heading-common-mistakes-to-avoid"><strong>Common Mistakes to Avoid</strong></h3>
<p><strong>1. Arrows everywhere</strong></p>
<ul>
<li><p>Don't connect everything</p>
</li>
<li><p>Missing arrows are meaningful (independence claims)</p>
</li>
</ul>
<p><strong>2. Correlation → Arrow</strong></p>
<ul>
<li><p>Just because X and Y correlate doesn't mean X → Y</p>
</li>
<li><p>Check for confounders first</p>
</li>
</ul>
<p><strong>3. Forgetting time</strong></p>
<ul>
<li><p>Causes must precede effects</p>
</li>
<li><p>Check temporal ordering</p>
</li>
</ul>
<p><strong>4. Ignoring mechanisms</strong></p>
<ul>
<li><p>Ask "HOW does X cause Y?"</p>
</li>
<li><p>If you can't explain it, maybe it's not causal</p>
</li>
</ul>
<p><strong>5. No validation</strong></p>
<ul>
<li><p>Always test your DAG</p>
</li>
<li><p>Data should support structure</p>
</li>
</ul>
<hr />
<h2 id="heading-complete-working-example"><strong>Complete Working Example</strong></h2>
<p>Here's a complete, runnable script you can use as a template:</p>
<pre><code class="lang-python"><span class="hljs-string">"""
Complete Causal DAG Implementation
Plant Disease Detection Example
"""</span>

<span class="hljs-keyword">from</span> dowhy <span class="hljs-keyword">import</span> CausalModel
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> scipy.stats <span class="hljs-keyword">import</span> chi2_contingency, pearsonr

<span class="hljs-comment"># Set random seed for reproducibility</span>
np.random.seed(<span class="hljs-number">42</span>)

<span class="hljs-comment"># Define causal graph</span>
causal_graph = <span class="hljs-string">"""
digraph {
    Environmental_Stress [label="Environmental Stress"];
    Watering_Practice [label="Watering Practice"];
    Plant_Vigor [label="Plant Vigor"];
    Leaf_Moisture [label="Leaf Moisture"];
    Pathogen_Growth [label="Pathogen Growth"];
    Disease_Present [label="Disease Present"];
    Symptom_Severity [label="Symptom Severity"];

    Environmental_Stress -&gt; Leaf_Moisture;
    Watering_Practice -&gt; Leaf_Moisture;
    Leaf_Moisture -&gt; Pathogen_Growth;
    Pathogen_Growth -&gt; Disease_Present;
    Disease_Present -&gt; Symptom_Severity;
    Plant_Vigor -&gt; Symptom_Severity;
}
"""</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">generate_causal_data</span>(<span class="hljs-params">n_samples=<span class="hljs-number">1000</span></span>):</span>
    <span class="hljs-string">"""Generate data following the causal DAG structure."""</span>

    data = pd.DataFrame({
        <span class="hljs-string">'environmental_stress'</span>: np.random.beta(<span class="hljs-number">2</span>, <span class="hljs-number">5</span>, n_samples),
        <span class="hljs-string">'watering_practice'</span>: np.random.choice([<span class="hljs-number">0</span>, <span class="hljs-number">1</span>, <span class="hljs-number">2</span>], n_samples),
        <span class="hljs-string">'plant_vigor'</span>: np.random.beta(<span class="hljs-number">8</span>, <span class="hljs-number">2</span>, n_samples),
        <span class="hljs-string">'leaf_moisture_hours'</span>: np.zeros(n_samples),
        <span class="hljs-string">'pathogen_growth'</span>: np.zeros(n_samples),
        <span class="hljs-string">'disease_present'</span>: np.zeros(n_samples),
        <span class="hljs-string">'symptom_severity'</span>: np.zeros(n_samples),
    })

    <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(n_samples):
        <span class="hljs-comment"># Causal mechanism 1: Environmental Stress + Watering → Leaf Moisture</span>
        base_moisture = <span class="hljs-number">5.0</span>
        stress_effect = data.loc[i, <span class="hljs-string">'environmental_stress'</span>] * <span class="hljs-number">10</span>
        watering_effect = [<span class="hljs-number">-3</span>, <span class="hljs-number">0</span>, <span class="hljs-number">5</span>][data.loc[i, <span class="hljs-string">'watering_practice'</span>]]

        data.loc[i, <span class="hljs-string">'leaf_moisture_hours'</span>] = np.clip(
            base_moisture + stress_effect + watering_effect + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">1</span>),
            <span class="hljs-number">0</span>, <span class="hljs-number">24</span>
        )

        <span class="hljs-comment"># Causal mechanism 2: Leaf Moisture → Pathogen Growth</span>
        moisture = data.loc[i, <span class="hljs-string">'leaf_moisture_hours'</span>]
        data.loc[i, <span class="hljs-string">'pathogen_growth'</span>] = np.clip(
            (moisture / <span class="hljs-number">24</span>) ** <span class="hljs-number">1.5</span> + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>),
            <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
        )

        <span class="hljs-comment"># Causal mechanism 3: Pathogen Growth → Disease</span>
        pathogen = data.loc[i, <span class="hljs-string">'pathogen_growth'</span>]
        data.loc[i, <span class="hljs-string">'disease_present'</span>] = <span class="hljs-number">1</span> <span class="hljs-keyword">if</span> pathogen &gt; <span class="hljs-number">0.6</span> <span class="hljs-keyword">else</span> <span class="hljs-number">0</span>

        <span class="hljs-comment"># Causal mechanism 4: Disease + Plant Vigor → Symptom Severity</span>
        disease = data.loc[i, <span class="hljs-string">'disease_present'</span>]
        vigor = data.loc[i, <span class="hljs-string">'plant_vigor'</span>]
        data.loc[i, <span class="hljs-string">'symptom_severity'</span>] = np.clip(
            disease * (<span class="hljs-number">1</span> - vigor * <span class="hljs-number">0.5</span>) + np.random.normal(<span class="hljs-number">0</span>, <span class="hljs-number">0.1</span>),
            <span class="hljs-number">0</span>, <span class="hljs-number">1</span>
        )

    <span class="hljs-keyword">return</span> data

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">validate_dag</span>(<span class="hljs-params">data</span>):</span>
    <span class="hljs-string">"""Run validation tests on the DAG structure."""</span>

    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"DAG VALIDATION TESTS"</span>)
    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)

    <span class="hljs-comment"># Test 1: Environmental Stress ⊥ Plant Vigor</span>
    print(<span class="hljs-string">"\n1. Testing: Environmental_Stress ⊥ Plant_Vigor"</span>)
    corr, p_val = pearsonr(data[<span class="hljs-string">'environmental_stress'</span>], data[<span class="hljs-string">'plant_vigor'</span>])
    print(<span class="hljs-string">f"   Correlation: <span class="hljs-subst">{corr:<span class="hljs-number">.4</span>f}</span>, P-value: <span class="hljs-subst">{p_val:<span class="hljs-number">.4</span>f}</span>"</span>)
    <span class="hljs-keyword">if</span> p_val &gt; <span class="hljs-number">0.05</span>:
        print(<span class="hljs-string">"   ✓ Independent (as expected)"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"   ✗ Dependent (DAG may be wrong!)"</span>)

    <span class="hljs-comment"># Test 2: Watering ⊥ Disease | Leaf Moisture</span>
    print(<span class="hljs-string">"\n2. Testing: Watering ⊥ Disease | Leaf_Moisture"</span>)
    data[<span class="hljs-string">'moisture_level'</span>] = pd.cut(
        data[<span class="hljs-string">'leaf_moisture_hours'</span>], 
        bins=<span class="hljs-number">3</span>, 
        labels=[<span class="hljs-string">'low'</span>, <span class="hljs-string">'med'</span>, <span class="hljs-string">'high'</span>]
    )

    independence_holds = <span class="hljs-literal">True</span>
    <span class="hljs-keyword">for</span> level <span class="hljs-keyword">in</span> [<span class="hljs-string">'low'</span>, <span class="hljs-string">'med'</span>, <span class="hljs-string">'high'</span>]:
        subset = data[data[<span class="hljs-string">'moisture_level'</span>] == level]
        <span class="hljs-keyword">if</span> len(subset) &lt; <span class="hljs-number">10</span>:
            <span class="hljs-keyword">continue</span>

        table = pd.crosstab(subset[<span class="hljs-string">'watering_practice'</span>], subset[<span class="hljs-string">'disease_present'</span>])
        chi2, p_val, dof, expected = chi2_contingency(table)

        print(<span class="hljs-string">f"   Moisture <span class="hljs-subst">{level}</span>: p-value = <span class="hljs-subst">{p_val:<span class="hljs-number">.4</span>f}</span>"</span>, end=<span class="hljs-string">""</span>)
        <span class="hljs-keyword">if</span> p_val &gt; <span class="hljs-number">0.05</span>:
            print(<span class="hljs-string">" ✓"</span>)
        <span class="hljs-keyword">else</span>:
            print(<span class="hljs-string">" ✗"</span>)
            independence_holds = <span class="hljs-literal">False</span>

    <span class="hljs-keyword">if</span> independence_holds:
        print(<span class="hljs-string">"   ✓ Conditional independence holds"</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">"   ✗ Conditional independence violated"</span>)

    print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">estimate_causal_effect</span>(<span class="hljs-params">data, causal_graph</span>):</span>
    <span class="hljs-string">"""Estimate causal effect using DoWhy."""</span>

    print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"CAUSAL EFFECT ESTIMATION"</span>)
    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)

    <span class="hljs-comment"># Create causal model</span>
    model = CausalModel(
        data=data,
        treatment=<span class="hljs-string">'leaf_moisture_hours'</span>,
        outcome=<span class="hljs-string">'symptom_severity'</span>,
        graph=causal_graph,
        common_causes=[<span class="hljs-string">'environmental_stress'</span>, <span class="hljs-string">'watering_practice'</span>],
        effect_modifiers=[<span class="hljs-string">'plant_vigor'</span>]
    )

    <span class="hljs-comment"># Identify causal effect</span>
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=<span class="hljs-literal">True</span>)
    print(<span class="hljs-string">"\nIdentified Estimand:"</span>)
    print(identified_estimand)

    <span class="hljs-comment"># Estimate effect</span>
    estimate = model.estimate_effect(
        identified_estimand,
        method_name=<span class="hljs-string">"backdoor.linear_regression"</span>
    )

    print(<span class="hljs-string">f"\nCausal Effect: <span class="hljs-subst">{estimate.value:<span class="hljs-number">.4</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Interpretation: Each additional hour of leaf moisture"</span>)
    print(<span class="hljs-string">f"causes a <span class="hljs-subst">{estimate.value:<span class="hljs-number">.4</span>f}</span> increase in symptom severity"</span>)

    <span class="hljs-comment"># Refutation test</span>
    print(<span class="hljs-string">"\nRefutation Test (Random Common Cause):"</span>)
    refutation = model.refute_estimate(
        identified_estimand,
        estimate,
        method_name=<span class="hljs-string">"random_common_cause"</span>
    )
    print(refutation)

    <span class="hljs-keyword">return</span> model, estimate

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">main</span>():</span>
    <span class="hljs-string">"""Run complete DAG analysis."""</span>

    print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"BUILDING CAUSAL DAG: PLANT DISEASE DETECTION"</span>)
    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)

    <span class="hljs-comment"># Generate data</span>
    print(<span class="hljs-string">"\nGenerating synthetic data (n=1000)..."</span>)
    data = generate_causal_data(n_samples=<span class="hljs-number">1000</span>)

    print(<span class="hljs-string">f"\nData Summary:"</span>)
    print(<span class="hljs-string">f"Disease prevalence: <span class="hljs-subst">{data[<span class="hljs-string">'disease_present'</span>].mean():<span class="hljs-number">.2</span>%}</span>"</span>)
    print(<span class="hljs-string">f"Mean symptom severity: <span class="hljs-subst">{data[<span class="hljs-string">'symptom_severity'</span>].mean():<span class="hljs-number">.3</span>f}</span>"</span>)
    print(<span class="hljs-string">f"Mean leaf moisture: <span class="hljs-subst">{data[<span class="hljs-string">'leaf_moisture_hours'</span>].mean():<span class="hljs-number">.2</span>f}</span> hours"</span>)

    <span class="hljs-comment"># Validate DAG</span>
    validate_dag(data)

    <span class="hljs-comment"># Estimate causal effect</span>
    model, estimate = estimate_causal_effect(data, causal_graph)

    print(<span class="hljs-string">"\n"</span> + <span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"ANALYSIS COMPLETE"</span>)
    print(<span class="hljs-string">"="</span> * <span class="hljs-number">60</span>)
    print(<span class="hljs-string">"\nNext Steps:"</span>)
    print(<span class="hljs-string">"1. Part 3: Use this DAG for counterfactual reasoning"</span>)
    print(<span class="hljs-string">"2. Part 4: Design interventions based on causal effects"</span>)
    print(<span class="hljs-string">"3. Part 5: Scale to production systems"</span>)

    <span class="hljs-keyword">return</span> data, model, estimate

<span class="hljs-keyword">if</span> __name__ == <span class="hljs-string">"__main__"</span>:
    data, model, estimate = main()
</code></pre>
<p><strong>Save this as</strong> <code>causal_dag.py</code> and run:</p>
<pre><code class="lang-bash">python causal_dag.py
</code></pre>
<hr />
<h2 id="heading-youve-built-a-causal-model"><strong>You've Built a Causal Model</strong></h2>
<p>Congratulations! You now have:</p>
<p>✅ A complete causal DAG for plant disease<br />✅ Understanding of confounders, mediators, colliders<br />✅ Working Python implementation with DoWhy<br />✅ Methods to validate your causal structure<br />✅ Template code you can adapt to any domain</p>
<p><strong>This is the foundation.</strong> Everything we do next builds on this DAG.</p>
<hr />
<h2 id="heading-whats-next-counterfactual-reasoning"><strong>What's Next: Counterfactual Reasoning</strong></h2>
<p>In <strong>Part 3</strong> (Friday, Jan 16), we'll use this DAG to answer questions like:</p>
<ul>
<li><p>"This plant has disease. <strong>Would it be healthy if I had watered less?</strong>"</p>
</li>
<li><p>"I applied intervention X. <strong>What would have happened without it?</strong>"</p>
</li>
<li><p>"<strong>Why</strong> did this specific plant get diseased when that one didn't?"</p>
</li>
</ul>
<p>These are <strong>counterfactual</strong> questions—the most powerful form of causal reasoning.</p>
<p>We'll implement:</p>
<ul>
<li><p>Counterfactual inference algorithms</p>
</li>
<li><p>"What if" scenario analysis</p>
</li>
<li><p>Personalized explanation generation</p>
</li>
<li><p>Individual treatment effect estimation</p>
</li>
</ul>
<hr />
<h2 id="heading-your-homework-before-part-3"><strong>Your Homework Before Part 3</strong></h2>
<p><strong>1. Run the code</strong> in this article</p>
<ul>
<li><p>Generate the data</p>
</li>
<li><p>Build the DAG</p>
</li>
<li><p>Validate the structure</p>
</li>
<li><p>Estimate causal effects</p>
</li>
</ul>
<p><strong>2. Modify the DAG</strong></p>
<ul>
<li><p>Add a new variable (e.g., "Soil Quality")</p>
</li>
<li><p>Add corresponding arrows</p>
</li>
<li><p>Update the data generation</p>
</li>
<li><p>Test if it still validates</p>
</li>
</ul>
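<p>If you're unsure where to start, recall that DoWhy accepts the graph as a DOT string, so adding "Soil Quality" is mostly a matter of new edges. The base graph below is a simplified stand-in for illustration, not the article's exact definition:</p>
<pre><code class="lang-python"># Simplified base graph (illustrative); extend it with the new variable.
causal_graph = """
digraph {
    watering_practice -> leaf_moisture_hours;
    environmental_stress -> leaf_moisture_hours;
    environmental_stress -> symptom_severity;
    leaf_moisture_hours -> symptom_severity;

    soil_quality -> plant_vigor;       // new variable...
    soil_quality -> symptom_severity;  // ...and its direct effects
    plant_vigor -> symptom_severity;
}
"""
</code></pre>
<p>Remember to generate data for the new variable too, then rerun the validation tests to see whether the implied independencies still hold.</p>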
<p><strong>3. Apply to your domain</strong></p>
<ul>
<li><p>Think about a problem you're working on</p>
</li>
<li><p>Identify 5-7 key variables</p>
</li>
<li><p>Draw a DAG on paper</p>
</li>
<li><p>What causal questions would you want to answer?</p>
</li>
</ul>
<p><strong>4. Prepare questions</strong></p>
<ul>
<li><p>What's unclear about DAG construction?</p>
</li>
<li><p>What validation tests are you curious about?</p>
</li>
<li><p>What challenges do you foresee for your domain?</p>
</li>
</ul>
<p>Bring these to Part 3. We're going deeper.</p>
<hr />
<p><strong>Series Navigation:</strong></p>
<ul>
<li><p><a target="_blank" href="https://blog.neoforgelabs.tech/why-causality-matters-for-ai">← Part 1: Why Causality Matters</a></p>
</li>
<li><p><strong>Part 2: Building Your First Causal DAG</strong> ← You are here</p>
</li>
<li><p><a target="_blank" href="https://blog.neoforgelabs.tech/part-3-counterfactual-reasoning-with-causal-dags">Part 3: Counterfactual Reasoning</a> (Jan 16)</p>
</li>
<li><p>Part 4: Intervention Design (Jan 21)</p>
</li>
<li><p>Part 5: Distributed Systems (Jan 23)</p>
</li>
</ul>
<p><strong>Code &amp; Resources:</strong></p>
<ul>
<li><p><a target="_blank" href="https://github.com/cod3smith/plant-disease-causal">GitHub Repository</a></p>
</li>
<li><p><a target="_blank" href="https://microsoft.github.io/dowhy/">DoWhy Documentation</a></p>
</li>
<li><p><a target="_blank" href="http://dagitty.net/">DAGitty (Interactive DAG Tool)</a></p>
</li>
</ul>
<hr />
<p><em>This is part of my research at NeoForge Labs on causal AI systems. Follow along as we build production-grade causal reasoning from scratch.</em></p>
<p><strong>Questions?</strong> Drop them in the comments below. I read and respond to everything.</p>
<p><strong>Found this useful?</strong> Share it with someone who's struggling with production ML failures. Let's build better AI together.</p>
]]></content:encoded></item><item><title><![CDATA[Part 1: Why Causality Matters for AI]]></title><description><![CDATA[Your AI model achieves 95% accuracy predicting plant diseases from images. Impressive, right?
You deploy it to farmers. It works… until it doesn't. When farmers follow its recommendations, nothing happens. Sometimes, things get worse. The model saw p...]]></description><link>https://blog.neoforgelabs.tech/part-1-why-causality-matters-for-ai</link><guid isPermaLink="true">https://blog.neoforgelabs.tech/part-1-why-causality-matters-for-ai</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[Programming Blogs]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Kelyn Njeri]]></dc:creator><pubDate>Mon, 12 Jan 2026 04:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768049889796/645dac51-c859-44ea-9508-2729538a558d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your AI model achieves 95% accuracy predicting plant diseases from images. Impressive, right?</p>
<p>You deploy it to farmers. It works… until it doesn't. When farmers follow its recommendations, nothing happens. Sometimes, things get worse. The model saw patterns, learned correlations, but understood nothing about why diseases occur or what actually causes them. This is the correlation trap, and it's everywhere in modern AI.</p>
<p>Today, we're going to explore why the future of AI isn't just about bigger models or more data. It's about causality: understanding the mechanisms that generate our data, not just the patterns within it. By the end of the series, you'll build a causal reasoning system that doesn't just predict plant diseases, it explains why they occur and recommends interventions that actually work.</p>
<p>Let's start with the fundamental question: What's the difference between correlation and causation, and why should you care?</p>
<hr />
<h2 id="heading-the-pattern-recognition-machine">The Pattern Recognition Machine</h2>
<p>Modern machine learning is fundamentally a pattern-matching engine. Given data about X and Y, it learns:</p>
<p><strong>P(Y | X)</strong> - "What is the probability of Y given that we observe X?"</p>
<p>This works brilliantly for:</p>
<ul>
<li><p><strong>Image Classification:</strong> "Given these pixels, is this a cat?"</p>
</li>
<li><p><strong>Recommendation systems:</strong> "Given this user's history, what will they like?"</p>
</li>
<li><p><strong>Spam detection:</strong> "Given this email's features, is it spam?"</p>
</li>
</ul>
<p>But here's the problem: <strong>observing X is not the same as changing X.</strong></p>
<h3 id="heading-the-classic-trap-ice-cream-and-drowning">The Classic Trap: Ice Cream and Drowning</h3>
<p>Imagine you're building a public safety AI. Your model discovers a strong correlation:</p>
<p>When ice cream sales go up, drowning deaths also go up.</p>
<p>Should you ban ice cream to prevent drowning?</p>
<p>Obviously not. The real causal structure is:</p>
<ul>
<li><p>Hot weather increases ice cream sales</p>
</li>
<li><p>Hot weather increases the number of people who go swimming, which leads to more drowning deaths</p>
</li>
</ul>
<p>In this case, hot weather is a <strong>confounder</strong>: it causes both variables. Ice cream sales and drowning deaths are correlated but not causally related. Your ML model sees the correlation but has no idea about the mechanism.</p>
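<p>This trap is easy to reproduce. The sketch below (NumPy, invented numbers) generates both variables from temperature alone, so there is no causal link between them, then shows the correlation appear marginally and vanish once the confounder is controlled for:</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(42)
n = 5000

temperature = rng.normal(25, 7, n)                   # confounder: hot weather
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)  # caused by weather only
drownings = 0.5 * temperature + rng.normal(0, 5, n)  # caused by weather only

# Marginally, sales and drownings look strongly related...
print(np.corrcoef(ice_cream, drownings)[0, 1])       # roughly 0.5

# ...but after removing the confounder's contribution, the link vanishes.
resid_ice = ice_cream - 2.0 * temperature
resid_drown = drownings - 0.5 * temperature
print(np.corrcoef(resid_ice, resid_drown)[0, 1])     # near 0
</code></pre>
<p>A model trained only on <code>ice_cream</code> and <code>drownings</code> would happily learn the spurious relationship; only knowledge of the confounder lets you remove it.</p>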
<h3 id="heading-why-this-breaks-in-production">Why This Breaks in Production</h3>
<p>You might think, "Sure, but that's an obvious example. In practice, we'd catch that." Would you?</p>
<p>Consider our plant disease detector:</p>
<ul>
<li><p>It learns: Yellowing leaves → Nitrogen deficiency</p>
</li>
<li><p>Correlation: 90% accuracy</p>
</li>
</ul>
<p>But what it misses:</p>
<ul>
<li><p>Overwatering → Root rot → Yellowing</p>
</li>
<li><p>Fungal infection → Yellowing</p>
</li>
<li><p>Natural senescence → Yellowing</p>
</li>
</ul>
<p>The model sees "yellowing = nitrogen deficiency" because that's the most common pattern in the training data. But when you apply nitrogen fertilizer to an overwatered plant, you make the problem worse.</p>
<p><strong>Correlation told you what's common. Causation tells you what actually works.</strong></p>
<hr />
<h2 id="heading-pearls-ladder-the-three-levels-of-intelligence">Pearl's Ladder: The Three Levels of Intelligence</h2>
<p>Judea Pearl, the godfather of causal inference, describes three levels of causal reasoning:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*xIeo5xDE2ek1b4E-XnzC2A.png" alt /></p>
<p>Let's break these down with our plant disease example:</p>
<h3 id="heading-level-1-association-seeing">Level 1: Association (Seeing)</h3>
<p><strong>Question:</strong> "Given these symptoms, what disease is likely?"<br /><strong>Notation:</strong> P(Disease | Symptoms)<br /><strong>ML Capability:</strong> ✅ Current AI excels here</p>
<p><strong>Example:</strong></p>
<ul>
<li><p>Observation: Plant has brown spots and yellowing leaves</p>
</li>
<li><p>Model predicts: "85% probability of early blight"</p>
</li>
</ul>
<p>This is correlation. The model sees patterns but doesn't understand the mechanisms.</p>
<h3 id="heading-level-2-intervention-doing">Level 2: Intervention (Doing)</h3>
<p><strong>Question:</strong> "What happens if I change watering frequency?"<br /><strong>Notation:</strong> P(Disease | do(Watering = optimal))<br /><strong>ML Capability:</strong> ❌ Most AI fails here</p>
<p>The <strong>do()</strong> operator is crucial. It represents intervention: actively changing a variable rather than merely observing it.</p>
<p><strong>Example:</strong></p>
<ul>
<li><p><strong>Observational:</strong> P(Disease | Watering = high) might show correlation</p>
</li>
<li><p><strong>Interventional:</strong> P(Disease | do(Watering = optimal)) shows causal effect</p>
</li>
</ul>
<p><strong>The difference:</strong></p>
<ul>
<li><p>Observation: Plants that are overwatered tend to be diseased (maybe because sick plants retain water?)</p>
</li>
<li><p>Intervention: If we reduce watering, does disease decrease? (causal effect)</p>
</li>
</ul>
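<p>The gap between these two quantities shows up clearly in a quick simulation. The mechanism below is invented for illustration: frail plants are both overwatered (by worried growers) and more disease-prone, so the observational conditional overstates the causal effect of watering:</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hidden confounder: frailty raises both watering and disease risk.
frail = rng.binomial(1, 0.3, n)
watering = rng.binomial(1, 0.2 + 0.6 * frail)   # growers overwater frail plants
disease = rng.binomial(1, 0.1 + 0.4 * frail + 0.1 * watering)

# Observational: P(Disease | Watering = high), inflated by the confounder.
print(disease[watering == 1].mean())            # about 0.45

# Interventional: P(Disease | do(Watering = high)). Force watering for every
# plant, leave frailty untouched, and replay the same disease mechanism.
disease_do = rng.binomial(1, 0.1 + 0.4 * frail + 0.1)
print(disease_do.mean())                        # about 0.32
</code></pre>
<p>The observational number mixes the effect of watering with the effect of frailty; the interventional number isolates what watering itself does.</p>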
<h3 id="heading-level-3-counterfactuals-imagining">Level 3: Counterfactuals (Imagining)</h3>
<p><strong>Question:</strong> "Would this plant be healthy if I had watered it differently?"<br /><strong>Notation:</strong> P(Healthy | Watered differently, saw disease)<br /><strong>ML Capability:</strong> ❌❌ Almost no AI does this</p>
<p>This is the most powerful level. You're asking about alternate realities:</p>
<p><strong>Example:</strong></p>
<ul>
<li><p>Factual: "I watered heavily, and the plant developed root rot"</p>
</li>
<li><p>Counterfactual: "<strong>If I had watered moderately, would the plant be healthy?</strong>"</p>
</li>
</ul>
<p>This requires understanding:</p>
<ol>
<li><p>The causal mechanism (overwatering → root rot)</p>
</li>
<li><p>The specific instance (this plant, these conditions)</p>
</li>
<li><p>Alternate histories (what would have been different)</p>
</li>
</ol>
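<p>These three ingredients map onto Pearl's abduction-action-prediction recipe, which we'll implement properly in Part 3. A toy structural equation (coefficient and numbers invented) makes the steps concrete:</p>
<pre><code class="lang-python"># Toy mechanism: root_rot = 0.8 * watering + noise,
# where noise captures this particular plant's quirks.

watering_factual = 1.0    # heavy watering
root_rot_factual = 0.9    # what we actually observed

# 1. Abduction: infer this plant's noise term from the factual outcome.
noise = root_rot_factual - 0.8 * watering_factual   # 0.1

# 2. Action: set watering to the counterfactual value.
watering_cf = 0.4         # moderate watering instead

# 3. Prediction: replay the same mechanism with the same noise.
root_rot_cf = 0.8 * watering_cf + noise
print(round(root_rot_cf, 2))   # 0.42: far less root rot in the alternate world
</code></pre>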
<p><strong>Most AI systems operate at Level 1. Human experts operate at Levels 2 and 3. We're going to build AI that does the same.</strong></p>
<hr />
<h2 id="heading-the-problems-with-pure-correlation">The Problems with Pure Correlation</h2>
<p>Let's be concrete about why correlation-based ML fails in practice:</p>
<h3 id="heading-problem-1-distribution-shift">Problem 1: Distribution Shift</h3>
<p>Your model learns from data collected in:</p>
<ul>
<li><p>Season: Summer</p>
</li>
<li><p>Location: Greenhouse A</p>
</li>
<li><p>Conditions: Controlled environment</p>
</li>
</ul>
<p>You deploy to:</p>
<ul>
<li><p>Season: Winter</p>
</li>
<li><p>Location: Outdoor farm</p>
</li>
<li><p>Conditions: Wild weather variation</p>
</li>
</ul>
<p><strong>What happens?</strong> All the correlations change. Your model has no idea what remains true (causal relationships) vs. what was just a coincidence (spurious correlation).</p>
<h3 id="heading-problem-2-spurious-correlations">Problem 2: Spurious Correlations</h3>
<p>Training data artifact: most diseased plants in your dataset happen to sit near the south wall of the greenhouse, so the model learns to associate the south wall with disease.</p>
<p><strong>Reality:</strong> South wall gets more light → higher temperature → more humidity → disease.</p>
<p>When you tell a farmer, "move your plants away from south-facing walls," you've given useless advice based on spurious correlation.</p>
<p><strong>With causal knowledge:</strong> You'd recommend humidity control, which actually addresses the mechanism.</p>
<h3 id="heading-problem-3-no-intervention-guidance">Problem 3: No Intervention Guidance</h3>
<p>Even when your model correctly identifies disease, it can't answer:</p>
<ul>
<li><p>What should I do about it?</p>
</li>
<li><p>Which intervention will be most effective?</p>
</li>
<li><p>What's the root cause I should address?</p>
</li>
</ul>
<p>It can only tell you: "This looks like early blight" (association).</p>
<p>It cannot tell you: "Reduce watering and improve air circulation" (intervention).</p>
<h3 id="heading-what-we-need-instead">What We Need Instead</h3>
<p>A causal model that:</p>
<ol>
<li><p><strong>Explains mechanisms:</strong> Why does disease occur?</p>
</li>
<li><p><strong>Predicts interventions:</strong> What happens if I change X?</p>
</li>
<li><p><strong>Handles distribution shift:</strong> Which relationships are stable across contexts?</p>
</li>
<li><p><strong>Enables counterfactual reasoning:</strong> What would have happened if…?</p>
</li>
</ol>
<p>This is what we're building in this series.</p>
<hr />
<h2 id="heading-a-different-approach-causal-graphs">A Different Approach: Causal Graphs</h2>
<p>Instead of learning correlations from data, we explicitly model causal relationships:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*sO4LcMxgk9iCUliG3qIqaQ.png" alt /></p>
<p>This Directed Acyclic Graph (DAG) represents our causal understanding:</p>
<ul>
<li><p><strong>Arrows show causation</strong>, not just correlation</p>
</li>
<li><p><strong>No arrow means no direct causal effect</strong></p>
</li>
<li><p><strong>Structure encodes domain knowledge</strong></p>
</li>
</ul>
<p>With this graph, we can answer intervention questions:</p>
<p><strong>Q:</strong> "What happens if I reduce watering?"<br /><strong>A:</strong> Follow the causal path: Watering ↓ → Moisture ↓ → Pathogen Growth ↓ → Disease ↓</p>
<p>This is fundamentally different from correlation. We're modeling the <strong>data-generating process</strong>, not just patterns in data.</p>
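<p>Encoding this structure in code takes little more than a parent list per node. The sketch below (node names are illustrative; later parts build the full graph with DoWhy) also checks the defining property of a DAG, acyclicity:</p>
<pre><code class="lang-python"># Each node maps to its direct causes (parents).
parents = {
    "watering": [],
    "humidity": [],
    "leaf_moisture": ["watering", "humidity"],
    "pathogen_growth": ["leaf_moisture"],
    "disease": ["pathogen_growth"],
    "symptoms": ["disease"],
}

def is_acyclic(parents):
    """Kahn-style check: repeatedly remove nodes whose parents are all gone."""
    remaining = dict(parents)
    while remaining:
        ready = [n for n, ps in remaining.items()
                 if all(p not in remaining for p in ps)]
        if not ready:
            return False   # every remaining node still has a parent: a cycle
        for n in ready:
            del remaining[n]
    return True

print(is_acyclic(parents))   # True: a valid causal DAG
</code></pre>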
<h3 id="heading-the-power-of-do">The Power of do()</h3>
<p>The <strong>do()</strong> operator represents intervention:</p>
<ul>
<li><p><strong>P(Disease | Watering = high):</strong> Observation (what we see)</p>
</li>
<li><p><strong>P(Disease | do(Watering = low)):</strong> Intervention (what would happen if we change it)</p>
</li>
</ul>
<p>These are different!</p>
<p>Observation includes confounders. Maybe plants that are naturally disease-prone are also overwatered by worried farmers.</p>
<p>Intervention breaks the confounding. We're asking: independent of everything else, what's the causal effect?</p>
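<p>When the confounder is observed, the interventional quantity can be recovered from purely observational data with the backdoor adjustment: estimate the outcome within each stratum of the confounder, then reweight by the confounder's marginal distribution. The simulation below uses an invented mechanism in which frail plants are both overwatered and more disease-prone:</p>
<pre><code class="lang-python">import numpy as np

rng = np.random.default_rng(1)
n = 200_000
frail = rng.binomial(1, 0.3, n)                 # observed confounder
watering = rng.binomial(1, 0.2 + 0.6 * frail)
disease = rng.binomial(1, 0.1 + 0.4 * frail + 0.1 * watering)

# Naive observational estimate: confounded.
naive = disease[watering == 1].mean()

# Backdoor adjustment: stratify on frailty, then reweight.
adjusted = sum(
    disease[np.logical_and(watering == 1, frail == z)].mean() * (frail == z).mean()
    for z in (0, 1)
)
print(round(naive, 2), round(adjusted, 2))      # roughly 0.45 vs 0.32
</code></pre>
<p>The adjusted number matches the true interventional effect under this mechanism, which is exactly what DoWhy's backdoor estimators automate.</p>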
<h3 id="heading-whats-coming-in-this-series">What's Coming in This Series</h3>
<p>Over the next 5 articles, you'll learn to:</p>
<ol>
<li><p><strong>Part 2:</strong> Build causal DAGs from domain knowledge</p>
</li>
<li><p><strong>Part 3:</strong> Use counterfactual reasoning to predict alternate outcomes</p>
</li>
<li><p><strong>Part 4:</strong> Design interventions based on causal effects</p>
</li>
<li><p><strong>Part 5:</strong> Scale causal inference to production systems</p>
</li>
</ol>
<p>By the end, you'll have built a complete causal diagnostic system for plant diseases, and you'll understand how to apply these principles to any domain.</p>
<hr />
<h2 id="heading-case-study-the-yellowing-leaves-mystery">Case Study: The Yellowing Leaves Mystery</h2>
<p>Let's make this concrete with a real diagnostic scenario.</p>
<h3 id="heading-the-correlation-approach">The Correlation Approach</h3>
<p>Farmer brings you a plant with yellowing leaves.</p>
<p>Your ML model:</p>
<ol>
<li><p>Analyzes image</p>
</li>
<li><p>Matches pattern to training data</p>
</li>
<li><p>Outputs: "80% probability: Nitrogen deficiency"</p>
</li>
</ol>
<p><strong>Recommendation:</strong> Apply nitrogen fertilizer</p>
<h3 id="heading-what-actually-happens">What Actually Happens</h3>
<p>Farmer applies nitrogen. Plant gets worse.</p>
<p><strong>Why?</strong> The actual cause was overwatering leading to root rot. Adding nitrogen to an already-sick plant stressed it further.</p>
<h3 id="heading-the-causal-approach">The Causal Approach</h3>
<p>Instead of just pattern matching, we reason causally:</p>
<p><img src="https://miro.medium.com/v2/resize:fit:1400/1*7_qX_2josISgTpcfCaoLyA.png" alt /></p>
<p><strong>Causal diagnostic process:</strong></p>
<ol>
<li><p><strong>Identify possible causes</strong> (multiple hypotheses)</p>
</li>
<li><p><strong>Check diagnostic indicators</strong> for each cause</p>
</li>
<li><p><strong>Find root cause</strong> via causal mechanism</p>
</li>
<li><p><strong>Recommend intervention</strong> targeting the actual cause</p>
</li>
</ol>
<p><strong>Results:</strong></p>
<ul>
<li><p>Soil moisture: Very high ✓</p>
</li>
<li><p>Soil nitrogen: Normal levels</p>
</li>
<li><p>Leaf spots: None</p>
</li>
<li><p>Affected leaves: Throughout plant</p>
</li>
</ul>
<p><strong>Diagnosis:</strong> Overwatering → Root rot → Nutrient uptake impaired → Yellowing</p>
<p><strong>Intervention:</strong> Reduce watering, improve drainage, let soil dry</p>
<p><strong>Outcome:</strong> Plant recovers</p>
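<p>The diagnostic process above can be sketched as a multi-hypothesis scorer: each candidate root cause predicts a pattern of indicators, and candidates are ranked by how well that pattern matches the measurements. The rules below are illustrative stand-ins, not a real diagnostic model:</p>
<pre><code class="lang-python">observations = {
    "soil_moisture": "high",
    "soil_nitrogen": "normal",
    "leaf_spots": "none",
    "affected_leaves": "throughout",
}

# Expected indicator pattern for each candidate root cause.
hypotheses = {
    "overwatering_root_rot": {"soil_moisture": "high", "leaf_spots": "none",
                              "affected_leaves": "throughout"},
    "nitrogen_deficiency": {"soil_nitrogen": "low", "affected_leaves": "older"},
    "fungal_infection": {"leaf_spots": "present"},
}

def score(expected, observed):
    """Fraction of a hypothesis's expected indicators seen in the observations."""
    hits = sum(1 for k, v in expected.items() if observed.get(k) == v)
    return hits / len(expected)

ranked = sorted(hypotheses, key=lambda h: score(hypotheses[h], observations),
                reverse=True)
print(ranked[0])   # overwatering_root_rot
</code></pre>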
<h3 id="heading-the-difference">The Difference</h3>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Correlation ML</td><td>Causal Reasoning</td></tr>
</thead>
<tbody>
<tr>
<td>Pattern matching</td><td>Mechanism understanding</td></tr>
<tr>
<td>Single prediction</td><td>Multiple hypotheses</td></tr>
<tr>
<td>No "why"</td><td>Explains root cause</td></tr>
<tr>
<td>Generic recommendation</td><td>Targeted intervention</td></tr>
<tr>
<td>Fails on edge cases</td><td>Handles novel scenarios</td></tr>
</tbody>
</table>
</div><p><strong>This is why causality matters.</strong></p>
<hr />
<h2 id="heading-where-were-headed">Where We're Headed</h2>
<p>You've now seen why correlation isn't enough. Pattern matching fails when:</p>
<ul>
<li><p>Distributions shift</p>
</li>
<li><p>Interventions are needed</p>
</li>
<li><p>You need to explain "why"</p>
</li>
</ul>
<p>In <strong>Part 2</strong>, we'll get hands-on. You'll learn to:</p>
<ul>
<li><p>Build your first causal DAG</p>
</li>
<li><p>Encode domain knowledge as a graph structure</p>
</li>
<li><p>Identify confounders, mediators, and colliders</p>
</li>
<li><p>Validate your causal assumptions</p>
</li>
</ul>
<p>We'll continue with our plant disease example, constructing the complete causal graph that maps environmental factors → physiological responses → observable symptoms.</p>
<p>By the end of Part 2, you'll have a working causal model, the foundation for everything that comes after.</p>
<h3 id="heading-your-challenge">Your Challenge</h3>
<p>Before Part 2, think about a problem in your domain:</p>
<ul>
<li><p>What patterns do your ML models learn?</p>
</li>
<li><p>What's the actual causal mechanism?</p>
</li>
<li><p>Where have you seen correlation fail?</p>
</li>
</ul>
<p>Bring these questions to Part 2. We're going to build something better.</p>
<hr />
<p><strong>Series Navigation:</strong></p>
<ul>
<li><p><strong>Part 1: Why Causality Matters</strong> ← You are here</p>
</li>
<li><p><a class="post-section-overview" href="#">Part 2: Building Your First Causal DAG</a> (Jan 15)</p>
</li>
<li><p><a class="post-section-overview" href="https://blog.neoforgelabs.tech/part-3-counterfactual-reasoning-with-causal-dags">Part 3: Counterfactual Reasoning</a> (Jan 17)</p>
</li>
<li><p>Part 4: Intervention Design (Jan 22)</p>
</li>
<li><p>Part 5: Distributed Systems (Jan 24)</p>
</li>
</ul>
<hr />
<p><em>This is part of my research at NeoForge Labs on causal AI systems. Follow along as we build production-grade causal reasoning from scratch.</em></p>
<p><strong>Questions?</strong> Drop them in the comments below.</p>
]]></content:encoded></item></channel></rss>