Part 2: Building Your First Causal DAG

From Domain Knowledge to Graph Structure


Senior Software Engineer with a knack for Python, Golang, TypeScript, and Elixir. I am also a bit of a Rust enthusiast. I am excited by all things scalability and microservices. Join me on this journey to becoming a unicorn 10x Engineer.

In Part 1, you learned why causality matters. Correlation tells you what happens, but causation tells you why and what to do about it.

Today, we're building your first causal Directed Acyclic Graph (DAG)—the foundation of causal reasoning.

By the end of this article, you'll:

  • Understand what DAGs are and why they're powerful

  • Build a complete causal model for plant disease detection

  • Learn to identify confounders, mediators, and colliders

  • Know how to validate your causal assumptions

  • Have working code to implement your DAG in Python

No more theory. Let's build something real.

What you'll need:

  • Python 3.12+

  • Basic understanding of probability

  • Curiosity about how things actually work

Let's go.


What Is a Causal DAG?

Graphs as Causal Models

A Directed Acyclic Graph (DAG) is a visual representation of causal relationships:

Three key components:

1. Nodes (variables): Things that can change

  • Environmental temperature

  • Soil moisture

  • Plant health

  • Disease presence

2. Directed edges (arrows): Causal relationships

  • A → B means "A causes B"

  • Direction matters: Temperature → Disease ≠ Disease → Temperature

3. Acyclic (no loops): No circular causation

  • Can't have: A → B → C → A

  • Time flows forward, causes precede effects
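As a rough sketch of these three components, a DAG can be stored as a plain mapping from each node to its direct effects, and the "no loops" rule can be checked with Python's standard `graphlib` module. The `is_acyclic` helper and the A/B/C node names are illustrative only and are not part of any causal library.

```python
from graphlib import CycleError, TopologicalSorter

def is_acyclic(edges):
    """Return True if the directed graph (cause -> list of effects) has no loop."""
    # TopologicalSorter expects node -> predecessors, so invert the mapping.
    predecessors = {node: set() for node in edges}
    for cause, effects in edges.items():
        for effect in effects:
            predecessors.setdefault(effect, set()).add(cause)
    try:
        list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return False
    return True

chain = {"A": ["B"], "B": ["C"], "C": []}      # A -> B -> C
loop = {"A": ["B"], "B": ["C"], "C": ["A"]}    # A -> B -> C -> A

print(is_acyclic(chain))  # True
print(is_acyclic(loop))   # False
```

A topological order exists exactly when the graph has no cycle, which is why the sorter doubles as an acyclicity check.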

Why Graphs?

Compact representation of causal knowledge:

Instead of writing:

If temperature is high AND humidity is high THEN moisture increases
If moisture is high AND air circulation is low THEN pathogen growth increases
If pathogen growth is high THEN disease risk increases
...

We draw:

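[Figure: DAG Knowledge Representation. Temperature → Moisture ← Humidity; Moisture → Pathogen Growth ← Air Circulation; Pathogen Growth → Disease]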

The graph encodes:

  • Direct causes (arrows)

  • Indirect causes (paths)

  • Independence relationships (absence of arrows)

  • Causal mechanisms (structure)

Reading the Graph

From the DAG above, we can read:

Direct effects:

  • Temperature directly causes Moisture

  • Pathogen Growth directly causes Disease

Indirect effects:

  • Temperature indirectly affects Disease (via Moisture → Pathogen Growth)

  • Humidity indirectly affects Disease (via same path)

No direct effect (no arrow):

  • Temperature does NOT directly cause Disease

    • It only affects it through the moisture mechanism
  • Air Circulation does NOT affect Moisture

    • It only affects pathogen growth

This is powerful: The structure tells us what variables are related and HOW.
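The distinction between direct and indirect effects can be read off mechanically. The sketch below assumes the edge list reconstructed from the description of the example graph (the figure itself is not reproduced here), and the helper names `directed_paths` and `has_direct_edge` are hypothetical.

```python
# Edge list reconstructed from the description of the example graph.
graph = {
    "Temperature": ["Moisture"],
    "Humidity": ["Moisture"],
    "Air Circulation": ["Pathogen Growth"],
    "Moisture": ["Pathogen Growth"],
    "Pathogen Growth": ["Disease"],
    "Disease": [],
}

def directed_paths(graph, start, end, prefix=()):
    """Yield every directed path from start to end as a tuple of nodes."""
    path = prefix + (start,)
    if start == end:
        yield path
        return
    for child in graph[start]:
        yield from directed_paths(graph, child, end, path)

def has_direct_edge(graph, cause, effect):
    """True if there is an arrow cause -> effect."""
    return effect in graph[cause]

for cause in ["Temperature", "Air Circulation", "Pathogen Growth"]:
    kind = "direct" if has_direct_edge(graph, cause, "Disease") else "indirect"
    for path in directed_paths(graph, cause, "Disease"):
        print(f"{kind:8} {' -> '.join(path)}")
```

An empty result from `directed_paths`, for example from Air Circulation to Moisture, means the graph claims no causal influence in that direction at all.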

The DAG Answers Intervention Questions

Want to know: "What happens if I reduce humidity?"

Follow the arrows:

  1. Humidity ↓

  2. → Moisture ↓

  3. → Pathogen Growth ↓

  4. → Disease ↓

The causal path tells us the effect of our intervention.
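To make "follow the arrows" concrete, here is a toy simulation of that intervention. The linear equations and every coefficient are invented purely for illustration; only the direction of the arrows comes from the graph. The point is that forcing humidity down changes each variable downstream of it.

```python
def simulate(temperature, humidity, air_circulation):
    """Toy structural equations for the example graph (made-up coefficients)."""
    moisture = 0.3 * temperature + 0.6 * humidity
    pathogen_growth = 0.8 * moisture - 0.5 * air_circulation
    disease_risk = 0.9 * pathogen_growth
    return {
        "moisture": moisture,
        "pathogen_growth": pathogen_growth,
        "disease_risk": disease_risk,
    }

baseline = simulate(temperature=0.5, humidity=0.8, air_circulation=0.2)
# Intervention: set humidity to a lower value and leave the other inputs alone.
intervened = simulate(temperature=0.5, humidity=0.4, air_circulation=0.2)

for name in baseline:
    print(f"{name:16} {baseline[name]:.3f} -> {intervened[name]:.3f}")
```

With real data the functional forms are unknown, which is why the later sections estimate effects instead of assuming coefficients.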

Want to know: "What happens if I increase temperature?"

Check the paths:

  • Temperature → Moisture (↑ when humidity is also high; ↓ when heat mainly speeds up evaporation)

  • Moisture → Pathogen Growth (depends on pathogen type!)

    • For fungi that need moisture: Growth ↓ if the leaves dry out

    • For heat-loving pathogens: Growth ↑

The DAG shows us we need domain knowledge to complete the model.

What DAGs Cannot Do

Important limitations:

1. DAGs don't learn from data alone

  • You need domain knowledge to draw arrows

  • Data can support or contradict your DAG, but cannot prove it

  • But structure comes from understanding mechanisms

2. DAGs assume causality is stable

  • Same causes → same effects

  • Holds across contexts (mostly)

  • May break with extreme distribution shift

3. DAGs can be wrong

  • Missing arrows = missed confounders

  • Wrong direction = incorrect causal reasoning

  • Validation is crucial

But: A good DAG, grounded in domain expertise, is far more reliable than pure correlation.


Building the Plant Disease DAG

Step-by-Step: From Mechanism to Graph

Let's build our plant disease causal model systematically.

Step 1: Identify Root Causes (Exogenous Variables)

What are the fundamental inputs we can observe or control?

1. Environmental Stress

  • Temperature extremes

  • Humidity levels

  • Light availability

  • Composite measure of environmental conditions

2. Watering Practice

  • Frequency

  • Amount

  • Under/optimal/overwatered

  • Farmer-controlled variable

3. Plant Vigor

  • Overall plant health

  • Genetic factors

  • Age and maturity

  • Baseline resilience

These are exogenous (external) variables—they're not caused by other variables in our model.

Step 2: Identify the Causal Mechanism

Question: How do root causes lead to disease?

Domain knowledge from plant pathology:

1. High environmental stress + watering → Leaf Moisture

  • Hot weather increases evaporation

  • Watering increases surface water

  • Combined: creates conditions for pathogens

2. Leaf Moisture → Pathogen Growth

  • Fungi need moisture to germinate

  • Bacteria need water film to enter plant

  • Critical threshold: ~6-8 hours of leaf wetness

3. Pathogen Growth → Disease

  • Sufficient pathogen load → infection

  • Varies by plant immunity

4. Disease + Plant Vigor → Symptom Severity

  • Same disease manifests differently

  • Healthy plants show mild symptoms

  • Weak plants show severe symptoms

5. Symptom Severity → Observable Symptoms

  • What we actually see

  • Yellowing, spots, wilting, etc.

Step 3: Draw the Arrows

Now we connect the dots:

Node Legend:

  • Cyan: Exogenous (controllable or observable inputs)

  • Purple: Intermediate mechanisms

  • Pink: Latent (unobserved) variable

Step 4: Validate the Structure

For each arrow, ask: "Does X directly cause Y?"

| Arrow | Justification | Validated? |
| --- | --- | --- |
| Environmental Stress → Leaf Moisture | Temperature/humidity affect surface water | |
| Watering → Leaf Moisture | Direct causal mechanism | |
| Leaf Moisture → Pathogen Growth | Pathogens need water | |
| Pathogen Growth → Disease | Sufficient load → infection | |
| Disease → Symptom Severity | Disease causes symptoms | |
| Plant Vigor → Symptom Severity | Vigor moderates expression | |
| Symptom Severity → Observable Symptoms | What we measure | |

Missing arrows (intentionally):

  • Environmental Stress → Disease?

    • NO direct arrow: stress affects disease ONLY through moisture
  • Watering → Disease?

    • NO direct arrow: watering affects disease ONLY through moisture
  • Plant Vigor → Disease?

    • NO direct arrow: vigor affects symptom severity, not disease presence

This is important! The absence of arrows encodes causal assumptions.
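These "only through moisture" claims can be checked against the edge list itself. In the sketch below, the `reachable` helper is hypothetical: it walks the arrows from a starting node and can be told to skip a blocked node. If Disease is still reachable after blocking Leaf Moisture, the graph contains a bypass the text says should not exist.

```python
# The seven arrows from the Step 4 validation table.
edges = [
    ("Environmental Stress", "Leaf Moisture"),
    ("Watering", "Leaf Moisture"),
    ("Leaf Moisture", "Pathogen Growth"),
    ("Pathogen Growth", "Disease"),
    ("Disease", "Symptom Severity"),
    ("Plant Vigor", "Symptom Severity"),
    ("Symptom Severity", "Observable Symptoms"),
]

def reachable(edges, start, blocked=frozenset()):
    """Return all nodes reachable from start along arrows, skipping blocked nodes."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen

for cause in ["Environmental Stress", "Watering"]:
    total = "Disease" in reachable(edges, cause)
    bypass = "Disease" in reachable(edges, cause, blocked={"Leaf Moisture"})
    print(f"{cause}: affects Disease = {total}, path avoiding Leaf Moisture = {bypass}")

print("Plant Vigor affects Disease:", "Disease" in reachable(edges, "Plant Vigor"))
```

This only verifies that the drawn graph matches the stated assumptions; whether the assumptions hold in the real world is a separate question for data and domain experts.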

Step 5: Name the Causal Roles

Understanding special node types:

Confounders:

  • Variables that affect both treatment and outcome

  • Example: If Environmental Stress affected both Watering AND Disease directly

  • Our model: No confounders (by design for simplicity)

Mediators:

  • Variables on the causal path

  • Example: Leaf Moisture mediates Environmental Stress → Disease

  • Pathogen Growth mediates Leaf Moisture → Disease

Colliders:

  • Variables where two arrows on a path meet head to head (two or more parents)

  • Example: Symptom Severity is a collider (caused by Disease AND Plant Vigor)

  • Special property: conditioning on colliders can create spurious associations!

Effect Modifiers:

  • Variables that change the magnitude of effects

  • Example: Plant Vigor modifies how Disease translates to Symptoms

  • High vigor → mild symptoms even with disease
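Some of these roles can be derived from the edge list automatically. The sketch below is a minimal illustration with hypothetical helper names: it finds exogenous nodes (no parents), nodes with two or more parents (collider candidates), and the mediators between a chosen treatment and outcome.

```python
EDGES = [
    ("Environmental Stress", "Leaf Moisture"),
    ("Watering", "Leaf Moisture"),
    ("Leaf Moisture", "Pathogen Growth"),
    ("Pathogen Growth", "Disease"),
    ("Disease", "Symptom Severity"),
    ("Plant Vigor", "Symptom Severity"),
    ("Symptom Severity", "Observable Symptoms"),
]

nodes = {name for edge in EDGES for name in edge}
parents = {n: {cause for cause, effect in EDGES if effect == n} for n in nodes}

exogenous = sorted(n for n in nodes if not parents[n])
multi_parent = sorted(n for n in nodes if len(parents[n]) >= 2)

def downstream(pairs, start):
    """All nodes reachable from start by following (source, target) pairs."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for source, target in pairs:
            if source == node and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def mediators(edges, treatment, outcome):
    """Nodes lying on some directed path from treatment to outcome."""
    reversed_edges = [(effect, cause) for cause, effect in edges]
    return downstream(edges, treatment) & downstream(reversed_edges, outcome)

print("Exogenous:", exogenous)
print("Two or more parents:", multi_parent)
print("Mediators (Environmental Stress -> Disease):",
      sorted(mediators(EDGES, "Environmental Stress", "Disease")))
```

Note that Leaf Moisture also has two parents, so conditioning on it can link Environmental Stress and Watering in the same way that conditioning on Symptom Severity links Disease and Plant Vigor.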

Our Final DAG

8 nodes, 7 edges, complete causal story:

Root Causes → Mechanisms → Observable Outcomes

This is our working model. In Part 3, we'll use it for counterfactual reasoning. In Part 4, we'll design interventions.

But first, let's implement it in code.


Implementing the DAG in Python

Coding Your DAG with DoWhy

Let's make this concrete with Python code.

Install dependencies:

pip install dowhy pandas numpy scipy networkx matplotlib

Define the DAG:

from dowhy import CausalModel
import pandas as pd
import numpy as np

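# Define the causal graph.
# Node names must match the DataFrame column names so DoWhy can link graph and data.
# observable_symptoms has no column in the data below.
causal_graph = """
digraph {
    environmental_stress -> leaf_moisture_hours;
    watering_practice -> leaf_moisture_hours;
    leaf_moisture_hours -> pathogen_growth;
    pathogen_growth -> disease_present;
    disease_present -> symptom_severity;
    plant_vigor -> symptom_severity;
    symptom_severity -> observable_symptoms;
}
"""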

# Create sample data (we'll use synthetic for now)
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'environmental_stress': np.random.beta(2, 5, n_samples),  # 0-1 scale
    'watering_practice': np.random.choice([0, 1, 2], n_samples),  # 0=under, 1=optimal, 2=over
    'plant_vigor': np.random.beta(8, 2, n_samples),  # Usually healthy
    'leaf_moisture_hours': np.zeros(n_samples),  # We'll compute
    'pathogen_growth': np.zeros(n_samples),
    'disease_present': np.zeros(n_samples),
    'symptom_severity': np.zeros(n_samples),
})

# Generate data according to causal structure
for i in range(n_samples):
    # Leaf moisture depends on environmental stress and watering
    base_moisture = 5.0  # baseline
    stress_effect = data.loc[i, 'environmental_stress'] * 10
    watering_effect = [-3, 0, 5][data.loc[i, 'watering_practice']]

    data.loc[i, 'leaf_moisture_hours'] = np.clip(
        base_moisture + stress_effect + watering_effect + np.random.normal(0, 1),
        0, 24
    )

    # Pathogen growth depends on leaf moisture
    moisture = data.loc[i, 'leaf_moisture_hours']
    data.loc[i, 'pathogen_growth'] = np.clip(
        (moisture / 24) ** 1.5 + np.random.normal(0, 0.1),
        0, 1
    )

    # Disease depends on pathogen growth
    pathogen = data.loc[i, 'pathogen_growth']
    data.loc[i, 'disease_present'] = 1 if pathogen > 0.6 else 0

    # Symptom severity depends on disease and plant vigor
    disease = data.loc[i, 'disease_present']
    vigor = data.loc[i, 'plant_vigor']
    data.loc[i, 'symptom_severity'] = np.clip(
        disease * (1 - vigor * 0.5) + np.random.normal(0, 0.1),
        0, 1
    )

print(data.head())
print(f"\nDisease prevalence: {data['disease_present'].mean():.2%}")

Create the causal model:

model = CausalModel(
    data=data,
    treatment='leaf_moisture_hours',
    outcome='symptom_severity',
    graph=causal_graph,  # the graph already encodes every parent relationship
    effect_modifiers=['plant_vigor']
)

# Visualize the graph
model.view_model()

# Identify the causal effect
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)

What this code does:

  1. Defines causal structure (the DAG)

  2. Generates synthetic data following causal equations

  3. Creates DoWhy model linking data to structure

  4. Identifies causal effect of leaf moisture on symptoms

The structural equations:

# These are the causal mechanisms encoded in the data generation:

Leaf_Moisture = f(Environmental_Stress, Watering_Practice, noise)
Pathogen_Growth = g(Leaf_Moisture, noise)
Disease = h(Pathogen_Growth)  # a hard threshold in this simulation, no separate noise term
Symptom_Severity = j(Disease, Plant_Vigor, noise)

Each function represents a causal mechanism. The DAG shows which variables go into which functions.
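The row-by-row loop above is easy to read but slow for large samples. As an alternative sketch, the same four mechanisms can be written as array operations. The function name `generate_vectorized` is hypothetical, and because it draws random numbers in a different order (using `np.random.default_rng`), it will not reproduce the loop's exact values even with the same seed.

```python
import numpy as np

def generate_vectorized(n_samples=1000, seed=42):
    """Vectorized sketch of the same mechanisms; returns a dict of arrays."""
    rng = np.random.default_rng(seed)

    # Exogenous inputs
    stress = rng.beta(2, 5, n_samples)
    watering = rng.integers(0, 3, n_samples)  # 0=under, 1=optimal, 2=over
    vigor = rng.beta(8, 2, n_samples)

    # Leaf_Moisture = f(Environmental_Stress, Watering_Practice, noise)
    watering_effect = np.array([-3.0, 0.0, 5.0])[watering]
    moisture = np.clip(
        5.0 + 10 * stress + watering_effect + rng.normal(0, 1, n_samples), 0, 24
    )

    # Pathogen_Growth = g(Leaf_Moisture, noise)
    pathogen = np.clip((moisture / 24) ** 1.5 + rng.normal(0, 0.1, n_samples), 0, 1)

    # Disease = h(Pathogen_Growth): hard threshold, as in the loop version
    disease = (pathogen > 0.6).astype(int)

    # Symptom_Severity = j(Disease, Plant_Vigor, noise)
    severity = np.clip(
        disease * (1 - 0.5 * vigor) + rng.normal(0, 0.1, n_samples), 0, 1
    )

    return {
        "environmental_stress": stress,
        "watering_practice": watering,
        "plant_vigor": vigor,
        "leaf_moisture_hours": moisture,
        "pathogen_growth": pathogen,
        "disease_present": disease,
        "symptom_severity": severity,
    }

cols = generate_vectorized()
print(f"Disease prevalence: {cols['disease_present'].mean():.2%}")
```

Passing the returned dict to `pd.DataFrame(cols)` gives a table with the same column names as the loop version.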

Querying the Model

Now we can ask causal questions:

# Estimate causal effect
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression"
)

print(f"Causal effect of leaf moisture on symptom severity: {estimate.value:.4f}")
# Compute the interval once and flatten it, in case it comes back as a nested array
ci_low, ci_high = np.asarray(estimate.get_confidence_intervals()).ravel()[:2]
print(f"95% Confidence Interval: [{ci_low:.4f}, {ci_high:.4f}]")

This tells us: For every additional hour of leaf moisture, symptom severity increases by X.

That's a causal claim, not correlation!

Example output:

Causal effect of leaf moisture on symptom severity: 0.0234
95% Confidence Interval: [0.0198, 0.0271]

Interpretation: Each additional hour of leaf moisture raises symptom severity
by about 0.023 on the 0-1 scale (2.3 percentage points), and the interval excludes zero.

We'll do much more with this in Part 3 (counterfactuals) and Part 4 (interventions).


Common DAG Patterns & Pitfalls

Causal Patterns You Need to Know

Understanding these patterns will help you build better DAGs.

Pattern 1: Confounding

Problem: Confounder causes both treatment and outcome, creating spurious association.

Example:

  • Season (C) → Watering frequency (T)

  • Season (C) → Disease prevalence (O)

You observe: More watering correlates with more disease.
Reality: It's because summer has both more watering AND more disease.

Solution: Control for confounders in analysis (we'll cover this in Part 4).

Pattern 2: Mediation

Definition: Mediator sits on the causal path between treatment and outcome.

Example in our DAG:

  • Watering → Leaf Moisture → Pathogen Growth → Disease

Leaf Moisture and Pathogen Growth are mediators.

Why it matters:

  • Total effect: Watering → Disease (full causal path)

  • Direct effect: None (Watering doesn't cause Disease except through mediators)

  • Mediated effect: Watering → Moisture → Pathogen → Disease

Intervention implications:

  • You can intervene at any point in the chain

  • Earlier intervention (reduce watering) prevents the entire cascade

  • Later intervention (fungicide for pathogen) only stops downstream effects

Pattern 3: Collider Bias (The Trap)

Critical property: A and B are independent, but become correlated if you condition on C!

Example:

  • Disease (A) → Symptom Severity (C)

  • Plant Vigor (B) → Symptom Severity (C)

Symptom Severity is a collider.

The trap: If you only analyze plants with severe symptoms (conditioning on collider), you'll find:

  • Plants with low vigor tend to have disease

  • Plants with high vigor tend NOT to have disease

But this is spurious! You selected on the outcome.

In reality:

  • Disease and Plant Vigor are independent (no arrow between them)

  • They only appear related when you filter by severe symptoms

Real-world example:

Imagine you're studying what makes successful entrepreneurs. You only survey people who became billionaires (conditioning on outcome).

You find: High risk-taking OR exceptional luck leads to billions.

Among billionaires:

  • High risk-takers had average luck

  • Low risk-takers had exceptional luck

Spurious correlation! Risk-taking and luck appear negatively correlated, but only because you conditioned on success (the collider).

How to avoid: Don't condition on colliders unless you have a good reason.

Pattern 4: Selection Bias

Similar to collider bias, but about sample selection:

Example: You train your model only on:

  • Plants brought to clinic (Selection)

  • Which happens when: Disease is visible OR plant is expensive

Now disease and plant value appear correlated—but only because of selection!

Solution: Be aware of how your sample was selected.

DAG Validation Checklist

Before trusting your DAG:

  • [ ] Every arrow represents a direct causal effect

  • [ ] Every missing arrow reflects a real absence of direct effect

  • [ ] No feedback loops (check: acyclic?)

  • [ ] Domain experts reviewed the structure

  • [ ] Edge cases considered (extreme values)

  • [ ] Alternative DAGs ruled out

  • [ ] Testable implications identified

  • [ ] Data validates independence claims


Testing Your DAG

How to Know If Your DAG Is Right

A DAG makes testable predictions about independence relationships.

D-Separation: Reading Independence from Structure

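Two variables are d-separated by a set of conditioning variables (possibly empty) if every path between them is blocked given that set. A path is blocked by a conditioned-on mediator or common cause, or by a collider that is not conditioned on.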

Example from our DAG:

Question: Are Environmental Stress and Plant Vigor independent?

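Answer: Yes. The only path between them runs through Symptom Severity, a collider, so it stays blocked as long as we don't condition on Symptom Severity or anything downstream of it.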

Testable prediction: In our data, Environmental Stress and Plant Vigor should be uncorrelated.

If we find correlation, our DAG is wrong!

# Test independence
from scipy.stats import pearsonr

correlation, p_value = pearsonr(
    data['environmental_stress'], 
    data['plant_vigor']
)

print(f"Correlation: {correlation:.4f}")
print(f"P-value: {p_value:.4f}")

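# A large p-value means no linear dependence was detected; it does not prove independence
if p_value > 0.05:
    print("✓ No dependence detected (consistent with the DAG)")
else:
    print("✗ Dependent (DAG may be wrong)")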
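The independence claims a DAG implies can also be derived mechanically instead of by eye. The sketch below is one standard way to test d-separation (restrict to ancestors, "moralize" the graph, remove the conditioning set, then check connectivity). The function name `d_separated` and the edge list are written out here for illustration and do not come from DoWhy.

```python
from itertools import combinations

EDGES = [
    ("Environmental Stress", "Leaf Moisture"),
    ("Watering", "Leaf Moisture"),
    ("Leaf Moisture", "Pathogen Growth"),
    ("Pathogen Growth", "Disease"),
    ("Disease", "Symptom Severity"),
    ("Plant Vigor", "Symptom Severity"),
    ("Symptom Severity", "Observable Symptoms"),
]

def d_separated(edges, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG."""
    given = set(given)
    parents = {}
    for cause, effect in edges:
        parents.setdefault(effect, set()).add(cause)

    # 1. Keep only x, y, the conditioning set, and all of their ancestors.
    keep, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in keep:
            keep.add(node)
            stack.extend(parents.get(node, ()))

    # 2. Moralize: link each child to its parents, and co-parents to each other.
    neighbours = {node: set() for node in keep}
    for child in keep:
        ps = parents.get(child, set())
        for p in ps:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for a, b in combinations(ps, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    # 3. Drop the conditioning set and look for any remaining route from x to y.
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node] - given:
            if nxt == y:
                return False
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True

print(d_separated(EDGES, "Environmental Stress", "Plant Vigor"))                        # True
print(d_separated(EDGES, "Environmental Stress", "Plant Vigor", {"Symptom Severity"}))  # False
print(d_separated(EDGES, "Watering", "Disease"))                                        # False
print(d_separated(EDGES, "Watering", "Disease", {"Leaf Moisture"}))                     # True
```

Each `True` is a testable prediction: the corresponding (conditional) independence should hold in data generated by the DAG. The second line shows the collider effect, where conditioning on Symptom Severity opens a path that was otherwise blocked.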

Conditional Independence Tests

More complex: variables may be independent given others.

Example:

Are Watering Practice and Disease independent given Leaf Moisture?

Claim: Watering affects Disease ONLY through Moisture.

Test: Given Moisture, Watering and Disease should be independent.

In notation: Watering ⊥ Disease | Moisture

Python test:

from scipy.stats import chi2_contingency

# Group by leaf moisture levels
data['moisture_level'] = pd.cut(
    data['leaf_moisture_hours'], 
    bins=3, 
    labels=['low', 'med', 'high']
)

# Within each moisture level, test independence
print("Testing: Watering ⊥ Disease | Moisture\n")

for level in ['low', 'med', 'high']:
    subset = data[data['moisture_level'] == level]

    # Contingency table: watering vs disease
    table = pd.crosstab(
        subset['watering_practice'], 
        subset['disease_present']
    )

    # Chi-square test
    chi2, p_value, dof, expected = chi2_contingency(table)

    print(f"Moisture {level}: p-value = {p_value:.4f}")
    if p_value > 0.05:
        print("  ✓ Independent (DAG validated)")
    else:
        print("  ✗ Dependent (DAG may be wrong)")
    print()

Illustrative output (exact p-values depend on the generated data):

Testing: Watering ⊥ Disease | Moisture

Moisture low: p-value = 0.3421
  ✓ Independent (DAG validated)

Moisture med: p-value = 0.5634
  ✓ Independent (DAG validated)

Moisture high: p-value = 0.4523
  ✓ Independent (DAG validated)

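If independence holds in every stratum, the data are consistent with this part of the DAG. Keep in mind that three coarse bins only approximate conditioning on moisture, so some residual dependence can survive inside a bin.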

What If Tests Fail?

If your DAG fails independence tests:

1. Missing arrow: Add direct causal link

  • Example: Watering → Disease (direct effect we missed)

2. Wrong direction: Reverse an arrow

  • Example: Maybe Disease → Moisture (sick plants retain water?)

3. Missing confounder: Add common cause

  • Example: Season → both Watering AND Disease

4. Wrong assumptions: Reconsider causal mechanism

  • Example: Different disease types have different causal paths

Iterate: Build DAG → Test → Revise → Repeat

This is the scientific method applied to causal structure!

Advanced Validation: Falsification Tests

# DoWhy includes built-in refutation tests
from dowhy import CausalModel

# Refute by adding random common cause
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause"
)
print(refutation)

# Expected: Effect should remain stable
# If effect changes dramatically, DAG may be missing confounders

Practical Tips for DAG Construction

Start Simple, Iterate

Don't try to model everything at once:

  1. Start with 3-5 key variables

    • Treatment of interest

    • Outcome of interest

    • 1-3 confounders

  2. Add complexity gradually

    • Mediators

    • Effect modifiers

    • Additional confounders

  3. Test at each step

    • Validate new arrows

    • Check independence claims

    • Ensure model still makes sense

Use Domain Expertise

Best practices:

  • Interview domain experts: "What causes X?" "Does Y affect Z directly?"

  • Review literature: What causal mechanisms are established?

  • Start with consensus: Build on well-known relationships

  • Document assumptions: Write down why each arrow exists

  • Invite criticism: Ask skeptics "What am I missing?"
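One lightweight way to document assumptions is to keep each arrow next to its justification and generate the graph definition from that single list. The sketch below is only a suggestion: the names `DOCUMENTED_EDGES`, `to_dot`, and `assumption_report` are hypothetical, and the node names follow the DataFrame column names as an assumption so that graph and data line up.

```python
# Each arrow carries its justification, copied from the Step 4 table.
DOCUMENTED_EDGES = [
    ("environmental_stress", "leaf_moisture_hours", "Temperature/humidity affect surface water"),
    ("watering_practice", "leaf_moisture_hours", "Direct causal mechanism"),
    ("leaf_moisture_hours", "pathogen_growth", "Pathogens need water"),
    ("pathogen_growth", "disease_present", "Sufficient load -> infection"),
    ("disease_present", "symptom_severity", "Disease causes symptoms"),
    ("plant_vigor", "symptom_severity", "Vigor moderates expression"),
    ("symptom_severity", "observable_symptoms", "What we measure"),
]

def to_dot(documented_edges):
    """Render the arrows as a DOT digraph string."""
    body = "\n".join(f"    {cause} -> {effect};" for cause, effect, _ in documented_edges)
    return "digraph {\n" + body + "\n}"

def assumption_report(documented_edges):
    """One line per arrow, suitable for sharing with domain experts."""
    return "\n".join(f"{cause} -> {effect}: {why}" for cause, effect, why in documented_edges)

print(to_dot(DOCUMENTED_EDGES))
print()
print(assumption_report(DOCUMENTED_EDGES))
```

Keeping the graph and its rationale in one place means a reviewer who disputes an arrow can see immediately what argument it rests on.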

Common Mistakes to Avoid

1. Arrows everywhere

  • Don't connect everything

  • Missing arrows are meaningful (independence claims)

2. Correlation → Arrow

  • Just because X and Y correlate doesn't mean X → Y

  • Check for confounders first

3. Forgetting time

  • Causes must precede effects

  • Check temporal ordering

4. Ignoring mechanisms

  • Ask "HOW does X cause Y?"

  • If you can't explain it, maybe it's not causal

5. No validation

  • Always test your DAG

  • Data should support structure


Complete Working Example

Here's a complete, runnable script you can use as a template:

"""
Complete Causal DAG Implementation
Plant Disease Detection Example
"""

from dowhy import CausalModel
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2_contingency, pearsonr

# Set random seed for reproducibility
np.random.seed(42)

# Define causal graph.
# Node IDs match the DataFrame column names so DoWhy can link graph and data;
# the labels are only for display.
causal_graph = """
digraph {
    environmental_stress [label="Environmental Stress"];
    watering_practice [label="Watering Practice"];
    plant_vigor [label="Plant Vigor"];
    leaf_moisture_hours [label="Leaf Moisture"];
    pathogen_growth [label="Pathogen Growth"];
    disease_present [label="Disease Present"];
    symptom_severity [label="Symptom Severity"];

    environmental_stress -> leaf_moisture_hours;
    watering_practice -> leaf_moisture_hours;
    leaf_moisture_hours -> pathogen_growth;
    pathogen_growth -> disease_present;
    disease_present -> symptom_severity;
    plant_vigor -> symptom_severity;
}
"""

def generate_causal_data(n_samples=1000):
    """Generate data following the causal DAG structure."""

    data = pd.DataFrame({
        'environmental_stress': np.random.beta(2, 5, n_samples),
        'watering_practice': np.random.choice([0, 1, 2], n_samples),
        'plant_vigor': np.random.beta(8, 2, n_samples),
        'leaf_moisture_hours': np.zeros(n_samples),
        'pathogen_growth': np.zeros(n_samples),
        'disease_present': np.zeros(n_samples),
        'symptom_severity': np.zeros(n_samples),
    })

    for i in range(n_samples):
        # Causal mechanism 1: Environmental Stress + Watering → Leaf Moisture
        base_moisture = 5.0
        stress_effect = data.loc[i, 'environmental_stress'] * 10
        watering_effect = [-3, 0, 5][data.loc[i, 'watering_practice']]

        data.loc[i, 'leaf_moisture_hours'] = np.clip(
            base_moisture + stress_effect + watering_effect + np.random.normal(0, 1),
            0, 24
        )

        # Causal mechanism 2: Leaf Moisture → Pathogen Growth
        moisture = data.loc[i, 'leaf_moisture_hours']
        data.loc[i, 'pathogen_growth'] = np.clip(
            (moisture / 24) ** 1.5 + np.random.normal(0, 0.1),
            0, 1
        )

        # Causal mechanism 3: Pathogen Growth → Disease
        pathogen = data.loc[i, 'pathogen_growth']
        data.loc[i, 'disease_present'] = 1 if pathogen > 0.6 else 0

        # Causal mechanism 4: Disease + Plant Vigor → Symptom Severity
        disease = data.loc[i, 'disease_present']
        vigor = data.loc[i, 'plant_vigor']
        data.loc[i, 'symptom_severity'] = np.clip(
            disease * (1 - vigor * 0.5) + np.random.normal(0, 0.1),
            0, 1
        )

    return data

def validate_dag(data):
    """Run validation tests on the DAG structure."""

    print("=" * 60)
    print("DAG VALIDATION TESTS")
    print("=" * 60)

    # Test 1: Environmental Stress ⊥ Plant Vigor
    print("\n1. Testing: Environmental_Stress ⊥ Plant_Vigor")
    corr, p_val = pearsonr(data['environmental_stress'], data['plant_vigor'])
    print(f"   Correlation: {corr:.4f}, P-value: {p_val:.4f}")
    if p_val > 0.05:
        print("   ✓ Independent (as expected)")
    else:
        print("   ✗ Dependent (DAG may be wrong!)")

    # Test 2: Watering ⊥ Disease | Leaf Moisture
    print("\n2. Testing: Watering ⊥ Disease | Leaf_Moisture")
    # Bin moisture in a local Series so the caller's DataFrame is not modified
    moisture_level = pd.cut(
        data['leaf_moisture_hours'],
        bins=3,
        labels=['low', 'med', 'high']
    )

    independence_holds = True
    for level in ['low', 'med', 'high']:
        subset = data[moisture_level == level]
        if len(subset) < 10:
            continue

        table = pd.crosstab(subset['watering_practice'], subset['disease_present'])
        chi2, p_val, dof, expected = chi2_contingency(table)

        print(f"   Moisture {level}: p-value = {p_val:.4f}", end="")
        if p_val > 0.05:
            print(" ✓")
        else:
            print(" ✗")
            independence_holds = False

    if independence_holds:
        print("   ✓ Conditional independence holds")
    else:
        print("   ✗ Conditional independence violated")

    print("\n" + "=" * 60)

def estimate_causal_effect(data, causal_graph):
    """Estimate causal effect using DoWhy."""

    print("\n" + "=" * 60)
    print("CAUSAL EFFECT ESTIMATION")
    print("=" * 60)

    # Create causal model (the graph already encodes every parent relationship)
    model = CausalModel(
        data=data,
        treatment='leaf_moisture_hours',
        outcome='symptom_severity',
        graph=causal_graph,
        effect_modifiers=['plant_vigor']
    )

    # Identify causal effect
    identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
    print("\nIdentified Estimand:")
    print(identified_estimand)

    # Estimate effect
    estimate = model.estimate_effect(
        identified_estimand,
        method_name="backdoor.linear_regression"
    )

    print(f"\nCausal Effect: {estimate.value:.4f}")
    print(f"Interpretation: Each additional hour of leaf moisture")
    print(f"causes a {estimate.value:.4f} increase in symptom severity")

    # Refutation test
    print("\nRefutation Test (Random Common Cause):")
    refutation = model.refute_estimate(
        identified_estimand,
        estimate,
        method_name="random_common_cause"
    )
    print(refutation)

    return model, estimate

def main():
    """Run complete DAG analysis."""

    print("\n" + "=" * 60)
    print("BUILDING CAUSAL DAG: PLANT DISEASE DETECTION")
    print("=" * 60)

    # Generate data
    print("\nGenerating synthetic data (n=1000)...")
    data = generate_causal_data(n_samples=1000)

    print(f"\nData Summary:")
    print(f"Disease prevalence: {data['disease_present'].mean():.2%}")
    print(f"Mean symptom severity: {data['symptom_severity'].mean():.3f}")
    print(f"Mean leaf moisture: {data['leaf_moisture_hours'].mean():.2f} hours")

    # Validate DAG
    validate_dag(data)

    # Estimate causal effect
    model, estimate = estimate_causal_effect(data, causal_graph)

    print("\n" + "=" * 60)
    print("ANALYSIS COMPLETE")
    print("=" * 60)
    print("\nNext Steps:")
    print("1. Part 3: Use this DAG for counterfactual reasoning")
    print("2. Part 4: Design interventions based on causal effects")
    print("3. Part 5: Scale to production systems")

    return data, model, estimate

if __name__ == "__main__":
    data, model, estimate = main()

Save this as causal_dag.py and run:

python causal_dag.py

You've Built a Causal Model

Congratulations! You now have:

✅ A complete causal DAG for plant disease
✅ Understanding of confounders, mediators, colliders
✅ Working Python implementation with DoWhy
✅ Methods to validate your causal structure
✅ Template code you can adapt to any domain

This is the foundation. Everything we do next builds on this DAG.


What's Next: Counterfactual Reasoning

In Part 3 (Friday, Jan 16), we'll use this DAG to answer questions like:

  • "This plant has disease. Would it be healthy if I had watered less?"

  • "I applied intervention X. What would have happened without it?"

  • "Why did this specific plant get diseased when that one didn't?"

These are counterfactual questions—the most powerful form of causal reasoning.

We'll implement:

  • Counterfactual inference algorithms

  • "What if" scenario analysis

  • Personalized explanation generation

  • Individual treatment effect estimation


Your Homework Before Part 3

1. Run the code in this article

  • Generate the data

  • Build the DAG

  • Validate the structure

  • Estimate causal effects

2. Modify the DAG

  • Add a new variable (e.g., "Soil Quality")

  • Add corresponding arrows

  • Update the data generation

  • Test if it still validates

3. Apply to your domain

  • Think about a problem you're working on

  • Identify 5-7 key variables

  • Draw a DAG on paper

  • What causal questions would you want to answer?

4. Prepare questions

  • What's unclear about DAG construction?

  • What validation tests are you curious about?

  • What challenges do you foresee for your domain?

Bring these to Part 3. We're going deeper.



This is part of my research at NeoForge Labs on causal AI systems. Follow along as we build production-grade causal reasoning from scratch.

Questions? Drop them in the comments below. I read and respond to everything.

Found this useful? Share it with someone who's struggling with production ML failures. Let's build better AI together.