Guide to Creating Your First Causal DAG

In Part 1, you learned why causality matters. Correlation tells you what happens, but causation tells you why and what to do about it.

Today, we're building your first causal Directed Acyclic Graph (DAG)—the foundation of causal reasoning.

By the end of this article, you'll:

Understand what DAGs are and why they're powerful
Build a complete causal model for plant disease detection
Learn to identify confounders, mediators, and colliders
Know how to validate your causal assumptions
Have working code to implement your DAG in Python

No more theory. Let's build something real.

What you'll need:

Python 3.12+
Basic understanding of probability
Curiosity about how things actually work

Let's go.

What Is a Causal DAG?

Graphs as Causal Models

A Directed Acyclic Graph (DAG) is a visual representation of causal relationships:

Three key components:

1. Nodes (variables): Things that can change

Environmental temperature
Soil moisture
Plant health
Disease presence

2. Directed edges (arrows): Causal relationships

A → B means "A causes B"
Direction matters: Temperature → Disease ≠ Disease → Temperature

3. Acyclic (no loops): No circular causation

Can't have: A → B → C → A
Time flows forward, causes precede effects

Why Graphs?

Compact representation of causal knowledge:

Instead of writing:

If temperature is high AND humidity is high THEN moisture increases
If moisture is high AND air circulation is low THEN pathogen growth increases
If pathogen growth is high THEN disease risk increases
...

We draw:

DAG Knowledge Representation

The graph encodes:

Direct causes (arrows)
Indirect causes (paths)
Independence relationships (absence of arrows)
Causal mechanisms (structure)

Reading the Graph

From the DAG above, we can read:

Direct effects:

Temperature directly causes Moisture
Pathogen Growth directly causes Disease

Indirect effects:

Temperature indirectly affects Disease (via Moisture → Pathogen Growth)
Humidity indirectly affects Disease (via same path)

Independence (no arrow):

Temperature does NOT directly cause Disease
- It only affects it through the moisture mechanism
Air Circulation does NOT affect Moisture
- It only affects pathogen growth

This is powerful: The structure tells us what variables are related and HOW.

The DAG Answers Intervention Questions

Want to know: "What happens if I reduce humidity?"

Follow the arrows:

Humidity ↓
→ Moisture ↓
→ Pathogen Growth ↓
→ Disease ↓

The causal path tells us the effect of our intervention.

Want to know: "What happens if I increase temperature?"

Check the paths:

Temperature → Moisture ↑
Moisture → Pathogen Growth (depends on pathogen type!)
- For fungi that need moisture: Growth ↓
- For heat-loving pathogens: Growth ↑

The DAG shows us we need domain knowledge to complete the model.

What DAGs Cannot Do

Important limitations:

1. DAGs don't learn from data alone

You need domain knowledge to draw arrows
Data can validate or reject your DAG
But structure comes from understanding mechanisms

2. DAGs assume causality is stable

Same causes → same effects
Holds across contexts (mostly)
May break with extreme distribution shift

3. DAGs can be wrong

Missing arrows = missed confounders
Wrong direction = incorrect causal reasoning
Validation is crucial

But: A good DAG, grounded in domain expertise, is far more reliable than pure correlation.

Building the Plant Disease DAG

Step-by-Step: From Mechanism to Graph

Let's build our plant disease causal model systematically.

Step 1: Identify Root Causes (Exogenous Variables)

What are the fundamental inputs we can observe or control?

1. Environmental Stress

Temperature extremes
Humidity levels
Light availability
Composite measure of environmental conditions

2. Watering Practice

Frequency
Amount
Under/optimal/overwatered
Farmer-controlled variable

3. Plant Vigor

Overall plant health
Genetic factors
Age and maturity
Baseline resilience

These are exogenous (external) variables—they're not caused by other variables in our model.

Step 2: Identify the Causal Mechanism

Question: How do root causes lead to disease?

Domain knowledge from plant pathology:

1. High environmental stress + watering → Leaf Moisture

Hot weather increases evaporation
Watering increases surface water
Combined: creates conditions for pathogens

2. Leaf Moisture → Pathogen Growth

Fungi need moisture to germinate
Bacteria need water film to enter plant
Critical threshold: ~6-8 hours of leaf wetness

3. Pathogen Growth → Disease

Sufficient pathogen load → infection
Varies by plant immunity

4. Disease + Plant Vigor → Symptom Severity

Same disease manifests differently
Healthy plants show mild symptoms
Weak plants show severe symptoms

5. Symptom Severity → Observable Symptoms

What we actually see
Yellowing, spots, wilting, etc.

Step 3: Draw the Arrows

Now we connect the dots:

Node Legend:

Cyan: Exogenous (controllable or observable inputs)
Purple: Intermediate mechanisms
Pink: Latent (unobserved) variable

Step 4: Validate the Structure

For each arrow, ask: "Does X directly cause Y?"

Arrow	Justification	Validated?
Environmental Stress → Leaf Moisture	Temperature/humidity affect surface water	✅
Watering → Leaf Moisture	Direct causal mechanism	✅
Leaf Moisture → Pathogen Growth	Pathogens need water	✅
Pathogen Growth → Disease	Sufficient load → infection	✅
Disease → Symptom Severity	Disease causes symptoms	✅
Plant Vigor → Symptom Severity	Vigor moderates expression	✅
Symptom Severity → Observable Symptoms	What we measure	✅

Missing arrows (intentionally):

Environmental Stress → Disease?
- NO direct arrow: stress affects disease ONLY through moisture
Watering → Disease?
- NO direct arrow: watering affects disease ONLY through moisture
Plant Vigor → Disease?
- NO direct arrow: vigor affects symptom severity, not disease presence

This is important! The absence of arrows encodes causal assumptions.

Step 5: Name the Causal Roles

Understanding special node types:

Confounders:

Variables that affect both treatment and outcome
Example: If Environmental Stress affected both Watering AND Disease directly
Our model: No confounders (by design for simplicity)

Mediators:

Variables on the causal path
Example: Leaf Moisture mediates Environmental Stress → Disease
Pathogen Growth mediates Leaf Moisture → Disease

Colliders:

Variables caused by multiple parents
Example: Symptom Severity is a collider (caused by Disease AND Plant Vigor)
Special property: conditioning on colliders can create spurious associations!

Effect Modifiers:

Variables that change the magnitude of effects
Example: Plant Vigor modifies how Disease translates to Symptoms
High vigor → mild symptoms even with disease

Our Final DAG

7 nodes, 7 edges, complete causal story:

Root Causes → Mechanisms → Observable Outcomes

This is our working model. In Part 3, we'll use it for counterfactual reasoning. In Part 4, we'll design interventions.

But first, let's implement it in code.

Implementing the DAG in Python

Coding Your DAG with DoWhy

Let's make this concrete with Python code.

Install dependencies:

pip install dowhy pandas numpy networkx matplotlib

Define the DAG:

from dowhy import CausalModel
import pandas as pd
import numpy as np

# Define the causal graph
causal_graph = """
digraph {
    Environmental_Stress -> Leaf_Moisture;
    Watering_Practice -> Leaf_Moisture;
    Leaf_Moisture -> Pathogen_Growth;
    Pathogen_Growth -> Disease_Present;
    Disease_Present -> Symptom_Severity;
    Plant_Vigor -> Symptom_Severity;
    Symptom_Severity -> Observable_Symptoms;
}
"""

# Create sample data (we'll use synthetic for now)
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'environmental_stress': np.random.beta(2, 5, n_samples),  # 0-1 scale
    'watering_practice': np.random.choice([0, 1, 2], n_samples),  # 0=under, 1=optimal, 2=over
    'plant_vigor': np.random.beta(8, 2, n_samples),  # Usually healthy
    'leaf_moisture_hours': np.zeros(n_samples),  # We'll compute
    'pathogen_growth': np.zeros(n_samples),
    'disease_present': np.zeros(n_samples),
    'symptom_severity': np.zeros(n_samples),
})

# Generate data according to causal structure
for i in range(n_samples):
    # Leaf moisture depends on environmental stress and watering
    base_moisture = 5.0  # baseline
    stress_effect = data.loc[i, 'environmental_stress'] * 10
    watering_effect = [-3, 0, 5][data.loc[i, 'watering_practice']]

    data.loc[i, 'leaf_moisture_hours'] = np.clip(
        base_moisture + stress_effect + watering_effect + np.random.normal(0, 1),
        0, 24
    )

    # Pathogen growth depends on leaf moisture
    moisture = data.loc[i, 'leaf_moisture_hours']
    data.loc[i, 'pathogen_growth'] = np.clip(
        (moisture / 24) ** 1.5 + np.random.normal(0, 0.1),
        0, 1
    )

    # Disease depends on pathogen growth
    pathogen = data.loc[i, 'pathogen_growth']
    data.loc[i, 'disease_present'] = 1 if pathogen > 0.6 else 0

    # Symptom severity depends on disease and plant vigor
    disease = data.loc[i, 'disease_present']
    vigor = data.loc[i, 'plant_vigor']
    data.loc[i, 'symptom_severity'] = np.clip(
        disease * (1 - vigor * 0.5) + np.random.normal(0, 0.1),
        0, 1
    )

print(data.head())
print(f"\nDisease prevalence: {data['disease_present'].mean():.2%}")

Create the causal model:

model = CausalModel(
    data=data,
    treatment='leaf_moisture_hours',
    outcome='symptom_severity',
    graph=causal_graph,
    common_causes=['environmental_stress', 'watering_practice'],
    effect_modifiers=['plant_vigor']
)

# Visualize the graph
model.view_model()

# Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

What this code does:

Defines causal structure (the DAG)
Generates synthetic data following causal equations
Creates DoWhy model linking data to structure
Identifies causal effect of leaf moisture on symptoms

The structural equations:

# These are the causal mechanisms encoded in the data generation:

Leaf_Moisture = f(Environmental_Stress, Watering_Practice, noise)
Pathogen_Growth = g(Leaf_Moisture, noise)
Disease = h(Pathogen_Growth, noise)
Symptom_Severity = j(Disease, Plant_Vigor, noise)

Each function represents a causal mechanism. The DAG shows which variables go into which functions.

Querying the Model

Now we can ask causal questions:

# Estimate causal effect
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)

print(f"Causal effect of leaf moisture on symptom severity: {estimate.value:.4f}")
print(f"95% Confidence Interval: [{estimate.get_confidence_intervals()[0]:.4f}, {estimate.get_confidence_intervals()[1]:.4f}]")

This tells us: For every additional hour of leaf moisture, symptom severity increases by X.

That's a causal claim, not correlation!

Example output:

Causal effect of leaf moisture on symptom severity: 0.0234
95% Confidence Interval: [0.0198, 0.0271]

Interpretation: Each additional hour of leaf moisture causes 
a 2.34% increase in symptom severity (statistically significant).

We'll do much more with this in Part 3 (counterfactuals) and Part 4 (interventions).

Common DAG Patterns & Pitfalls

Causal Patterns You Need to Know

Understanding these patterns will help you build better DAGs.

Pattern 1: Confounding

Problem: Confounder causes both treatment and outcome, creating spurious association.

Example:

Season (C) → Watering frequency (T)
Season (C) → Disease prevalence (O)

You observe: More watering correlates with more disease.
Reality: It's because summer has both more watering AND more disease.

Solution: Control for confounders in analysis (we'll cover this in Part 4).

Pattern 2: Mediation

Definition: Mediator sits on the causal path between treatment and outcome.

Example in our DAG:

Watering → Leaf Moisture → Pathogen Growth → Disease

Leaf Moisture and Pathogen Growth are mediators.

Why it matters:

Total effect: Watering → Disease (full causal path)
Direct effect: None (Watering doesn't cause Disease except through mediators)
Mediated effect: Watering → Moisture → Pathogen → Disease

Intervention implications:

You can intervene at any point in the chain
Earlier intervention (reduce watering) prevents the entire cascade
Later intervention (fungicide for pathogen) only stops downstream effects

Pattern 3: Collider Bias (The Trap)

Critical property: A and B are independent, but become correlated if you condition on C!

Example:

Disease (A) → Symptom Severity (C)
Plant Vigor (B) → Symptom Severity (C)

Symptom Severity is a collider.

The trap: If you only analyze plants with severe symptoms (conditioning on collider), you'll find:

Plants with low vigor tend to have disease
Plants with high vigor tend NOT to have disease

But this is spurious! You selected on the outcome.

In reality:

Disease and Plant Vigor are independent (no arrow between them)
They only appear related when you filter by severe symptoms

Real-world example:

Imagine you're studying what makes successful entrepreneurs. You only survey people who became billionaires (conditioning on outcome).

You find: High risk-taking OR exceptional luck leads to billions.

Among billionaires:

High risk-takers had average luck
Low risk-takers had exceptional luck

Spurious correlation! Risk-taking and luck appear negatively correlated, but only because you conditioned on success (the collider).

How to avoid: Don't condition on colliders unless you have a good reason.

Pattern 4: Selection Bias

Similar to collider bias, but about sample selection:

Example: You train your model only on:

Plants brought to clinic (Selection)
Which happens when: Disease is visible OR plant is expensive

Now disease and plant value appear correlated—but only because of selection!

Solution: Be aware of how your sample was selected.

DAG Validation Checklist

Before trusting your DAG:

[ ] Every arrow represents a direct causal effect
[ ] Missing arrows represent true independence
[ ] No feedback loops (check: acyclic?)
[ ] Domain experts reviewed the structure
[ ] Edge cases considered (extreme values)
[ ] Alternative DAGs ruled out
[ ] Testable implications identified
[ ] Data validates independence claims

Testing Your DAG

How to Know If Your DAG Is Right

A DAG makes testable predictions about independence relationships.

D-Separation: Reading Independence from Structure

Two variables are d-separated if all paths between them are blocked.

Example from our DAG:

Question: Are Environmental Stress and Plant Vigor independent?

Answer: Yes, they're independent (no path connects them).

Testable prediction: In our data, Environmental Stress and Plant Vigor should be uncorrelated.

If we find correlation, our DAG is wrong!

# Test independence
from scipy.stats import pearsonr

correlation, p_value = pearsonr(
    data['environmental_stress'], 
    data['plant_vigor']
)

print(f"Correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value > 0.05:
    print("✓ Independent (DAG validated)")
else:
    print("✗ Dependent (DAG may be wrong)")

Conditional Independence Tests

More complex: variables may be independent given others.

Example:

Are Watering Practice and Disease independent given Leaf Moisture?

Claim: Watering affects Disease ONLY through Moisture.

Test: Given Moisture, Watering and Disease should be independent.

In notation: Watering ⊥ Disease | Moisture

Python test:

from scipy.stats import chi2_contingency

# Group by leaf moisture levels
data['moisture_level'] = pd.cut(
    data['leaf_moisture_hours'], 
    bins=3, 
    labels=['low', 'med', 'high']
)

# Within each moisture level, test independence
print("Testing: Watering ⊥ Disease | Moisture\n")

for level in ['low', 'med', 'high']:
    subset = data[data['moisture_level'] == level]

    # Contingency table: watering vs disease
    table = pd.crosstab(
        subset['watering_practice'], 
        subset['disease_present']
    )

    # Chi-square test
    chi2, p_value, dof, expected = chi2_contingency(table)

    print(f"Moisture {level}: p-value = {p_value:.4f}")
    if p_value > 0.05:
        print("  ✓ Independent (DAG validated)")
    else:
        print("  ✗ Dependent (DAG may be wrong)")
    print()

Expected output:

Testing: Watering ⊥ Disease | Moisture

Moisture low: p-value = 0.3421
  ✓ Independent (DAG validated)

Moisture med: p-value = 0.5634
  ✓ Independent (DAG validated)

Moisture high: p-value = 0.4523
  ✓ Independent (DAG validated)

If independence holds, our DAG structure is supported by data.

What If Tests Fail?

If your DAG fails independence tests:

1. Missing arrow: Add direct causal link

Example: Watering → Disease (direct effect we missed)

2. Wrong direction: Reverse an arrow

Example: Maybe Disease → Moisture (sick plants retain water?)

3. Missing confounder: Add common cause

Example: Season → both Watering AND Disease

4. Wrong assumptions: Reconsider causal mechanism

Example: Different disease types have different causal paths

Iterate: Build DAG → Test → Revise → Repeat

This is the scientific method applied to causal structure!

Advanced Validation: Falsification Tests

# DoWhy includes built-in refutation tests
from dowhy import CausalModel

# Refute by adding random common cause
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)
print(refutation)

# Expected: Effect should remain stable
# If effect changes dramatically, DAG may be missing confounders

Practical Tips for DAG Construction

Start Simple, Iterate

Don't try to model everything at once:

Start with 3-5 key variables
- Treatment of interest
- Outcome of interest
- 1-3 confounders
Add complexity gradually
- Mediators
- Effect modifiers
- Additional confounders
Test at each step
- Validate new arrows
- Check independence claims
- Ensure model still makes sense

Use Domain Expertise

Best practices:

Interview domain experts: "What causes X?" "Does Y affect Z directly?"
Review literature: What causal mechanisms are established?
Start with consensus: Build on well-known relationships
Document assumptions: Write down why each arrow exists
Invite criticism: Ask skeptics "What am I missing?"

Common Mistakes to Avoid

1. Arrows everywhere

Don't connect everything
Missing arrows are meaningful (independence claims)

2. Correlation → Arrow

Just because X and Y correlate doesn't mean X → Y
Check for confounders first

3. Forgetting time

Causes must precede effects
Check temporal ordering

4. Ignoring mechanisms

Ask "HOW does X cause Y?"
If you can't explain it, maybe it's not causal

5. No validation

Always test your DAG
Data should support structure

Complete Working Example

Here's a complete, runnable script you can use as a template:

"""
Complete Causal DAG Implementation
Plant Disease Detection Example
"""

from dowhy import CausalModel
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, pearsonr

# Set random seed for reproducibility
np.random.seed(42)

# Define causal graph
causal_graph = """
digraph {
    Environmental_Stress [label="Environmental Stress"];
    Watering_Practice [label="Watering Practice"];
    Plant_Vigor [label="Plant Vigor"];
    Leaf_Moisture [label="Leaf Moisture"];
    Pathogen_Growth [label="Pathogen Growth"];
    Disease_Present [label="Disease Present"];
    Symptom_Severity [label="Symptom Severity"];

    Environmental_Stress -> Leaf_Moisture;
    Watering_Practice -> Leaf_Moisture;
    Leaf_Moisture -> Pathogen_Growth;
    Pathogen_Growth -> Disease_Present;
    Disease_Present -> Symptom_Severity;
    Plant_Vigor -> Symptom_Severity;
}
"""

def generate_causal_data(n_samples=1000):
    """Generate data following the causal DAG structure."""

    data = pd.DataFrame({
        'environmental_stress': np.random.beta(2, 5, n_samples),
        'watering_practice': np.random.choice([0, 1, 2], n_samples),
        'plant_vigor': np.random.beta(8, 2, n_samples),
        'leaf_moisture_hours': np.zeros(n_samples),
        'pathogen_growth': np.zeros(n_samples),
        'disease_present': np.zeros(n_samples),
        'symptom_severity': np.zeros(n_samples),
    })

    for i in range(n_samples):
        # Causal mechanism 1: Environmental Stress + Watering → Leaf Moisture
        base_moisture = 5.0
        stress_effect = data.loc[i, 'environmental_stress'] * 10
        watering_effect = [-3, 0, 5][data.loc[i, 'watering_practice']]

        data.loc[i, 'leaf_moisture_hours'] = np.clip(
            base_moisture + stress_effect + watering_effect + np.random.normal(0, 1),
            0, 24
        )

        # Causal mechanism 2: Leaf Moisture → Pathogen Growth
        moisture = data.loc[i, 'leaf_moisture_hours']
        data.loc[i, 'pathogen_growth'] = np.clip(
            (moisture / 24) ** 1.5 + np.random.normal(0, 0.1),
            0, 1
        )

        # Causal mechanism 3: Pathogen Growth → Disease
        pathogen = data.loc[i, 'pathogen_growth']
        data.loc[i, 'disease_present'] = 1 if pathogen > 0.6 else 0

        # Causal mechanism 4: Disease + Plant Vigor → Symptom Severity
        disease = data.loc[i, 'disease_present']
        vigor = data.loc[i, 'plant_vigor']
        data.loc[i, 'symptom_severity'] = np.clip(
            disease * (1 - vigor * 0.5) + np.random.normal(0, 0.1),
            0, 1
        )

    return data

def validate_dag(data):
    """Run validation tests on the DAG structure."""

    print("=" * 60)
    print("DAG VALIDATION TESTS")
    print("=" * 60)

    # Test 1: Environmental Stress ⊥ Plant Vigor
    print("\n1. Testing: Environmental_Stress ⊥ Plant_Vigor")
    corr, p_val = pearsonr(data['environmental_stress'], data['plant_vigor'])
    print(f"   Correlation: {corr:.4f}, P-value: {p_val:.4f}")
    if p_val > 0.05:
        print("   ✓ Independent (as expected)")
    else:
        print("   ✗ Dependent (DAG may be wrong!)")

    # Test 2: Watering ⊥ Disease | Leaf Moisture
    print("\n2. Testing: Watering ⊥ Disease | Leaf_Moisture")
    data['moisture_level'] = pd.cut(
        data['leaf_moisture_hours'], 
        bins=3, 
        labels=['low', 'med', 'high']
    )

    independence_holds = True
    for level in ['low', 'med', 'high']:
        subset = data[data['moisture_level'] == level]
        if len(subset) < 10:
            continue

        table = pd.crosstab(subset['watering_practice'], subset['disease_present'])
        chi2, p_val, dof, expected = chi2_contingency(table)

        print(f"   Moisture {level}: p-value = {p_val:.4f}", end="")
        if p_val > 0.05:
            print(" ✓")
        else:
            print(" ✗")
            independence_holds = False

    if independence_holds:
        print("   ✓ Conditional independence holds")
    else:
        print("   ✗ Conditional independence violated")

    print("\n" + "=" * 60)

def estimate_causal_effect(data, causal_graph):
    """Estimate causal effect using DoWhy."""

    print("\n" + "=" * 60)
    print("CAUSAL EFFECT ESTIMATION")
    print("=" * 60)

    # Create causal model
    model = CausalModel(
        data=data,
        treatment='leaf_moisture_hours',
        outcome='symptom_severity',
        graph=causal_graph,
        common_causes=['environmental_stress', 'watering_practice'],
        effect_modifiers=['plant_vigor']
    )

    # Identify causal effect
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
    print("\nIdentified Estimand:")
    print(identified_estimand)

    # Estimate effect
    estimate = model.estimate_effect(
        identified_estimand,
        method_name="backdoor.linear_regression"
    )

    print(f"\nCausal Effect: {estimate.value:.4f}")
    print(f"Interpretation: Each additional hour of leaf moisture")
    print(f"causes a {estimate.value:.4f} increase in symptom severity")

    # Refutation test
    print("\nRefutation Test (Random Common Cause):")
    refutation = model.refute_estimate(
        identified_estimand,
        estimate,
        method_name="random_common_cause"
    )
    print(refutation)

    return model, estimate

def main():
    """Run complete DAG analysis."""

    print("\n" + "=" * 60)
    print("BUILDING CAUSAL DAG: PLANT DISEASE DETECTION")
    print("=" * 60)

    # Generate data
    print("\nGenerating synthetic data (n=1000)...")
    data = generate_causal_data(n_samples=1000)

    print(f"\nData Summary:")
    print(f"Disease prevalence: {data['disease_present'].mean():.2%}")
    print(f"Mean symptom severity: {data['symptom_severity'].mean():.3f}")
    print(f"Mean leaf moisture: {data['leaf_moisture_hours'].mean():.2f} hours")

    # Validate DAG
    validate_dag(data)

    # Estimate causal effect
    model, estimate = estimate_causal_effect(data, causal_graph)

    print("\n" + "=" * 60)
    print("ANALYSIS COMPLETE")
    print("=" * 60)
    print("\nNext Steps:")
    print("1. Part 3: Use this DAG for counterfactual reasoning")
    print("2. Part 4: Design interventions based on causal effects")
    print("3. Part 5: Scale to production systems")

    return data, model, estimate

if __name__ == "__main__":
    data, model, estimate = main()

Save this as causal_dag.py and run:

python causal_dag.py

You've Built a Causal Model

Congratulations! You now have:

✅ A complete causal DAG for plant disease
✅ Understanding of confounders, mediators, colliders
✅ Working Python implementation with DoWhy
✅ Methods to validate your causal structure
✅ Template code you can adapt to any domain

This is the foundation. Everything we do next builds on this DAG.

What's Next: Counterfactual Reasoning

In Part 3 (Friday, Jan 16), we'll use this DAG to answer questions like:

"This plant has disease. Would it be healthy if I had watered less?"
"I applied intervention X. What would have happened without it?"
"Why did this specific plant get diseased when that one didn't?"

These are counterfactual questions—the most powerful form of causal reasoning.

We'll implement:

Counterfactual inference algorithms
"What if" scenario analysis
Personalized explanation generation
Individual treatment effect estimation

Your Homework Before Part 3

1. Run the code in this article

Generate the data
Build the DAG
Validate the structure
Estimate causal effects

2. Modify the DAG

Add a new variable (e.g., "Soil Quality")
Add corresponding arrows
Update the data generation
Test if it still validates

3. Apply to your domain

Think about a problem you're working on
Identify 5-7 key variables
Draw a DAG on paper
What causal questions would you want to answer?

4. Prepare questions

What's unclear about DAG construction?
What validation tests are you curious about?
What challenges do you foresee for your domain?

Bring these to Part 3. We're going deeper.

Series Navigation:

← Part 1: Why Causality Matters
Part 2: Building Your First Causal DAG ← You are here
Part 3: Counterfactual Reasoning (Jan 16)
Part 4: Intervention Design (Jan 21)
Part 5: Distributed Systems (Jan 23)

Code & Resources:

This is part of my research at NeoForge Labs on causal AI systems. Follow along as we build production-grade causal reasoning from scratch.

Questions? Drop them in the comments below. I read and respond to everything.

Found this useful? Share it with someone who's struggling with production ML failures. Let's build better AI together.

Command Palette

What Is a Causal DAG?

Graphs as Causal Models

Why Graphs?

Reading the Graph

The DAG Answers Intervention Questions

What DAGs Cannot Do

Building the Plant Disease DAG

Step-by-Step: From Mechanism to Graph

Step 1: Identify Root Causes (Exogenous Variables)

Step 2: Identify the Causal Mechanism

Step 3: Draw the Arrows

Step 4: Validate the Structure

Step 5: Name the Causal Roles

Our Final DAG

Implementing the DAG in Python

Coding Your DAG with DoWhy

Querying the Model

Common DAG Patterns & Pitfalls

Causal Patterns You Need to Know

Pattern 1: Confounding

Pattern 2: Mediation

Pattern 3: Collider Bias (The Trap)

Pattern 4: Selection Bias

DAG Validation Checklist

Testing Your DAG

How to Know If Your DAG Is Right

D-Separation: Reading Independence from Structure

Conditional Independence Tests

What If Tests Fail?

Advanced Validation: Falsification Tests

Practical Tips for DAG Construction

Start Simple, Iterate

Use Domain Expertise

Common Mistakes to Avoid

Complete Working Example

You've Built a Causal Model

What's Next: Counterfactual Reasoning

Your Homework Before Part 3

Comments

Causal AI

Part 3: Counterfactual Reasoning with Causal DAGs

More from this blog