
BN Structure Learning -- Causal Discovery from Data

Demonstrates Bayesian Network structure learning: discovering causal relationships directly from observational CSV data, learning parameters, evaluating model fit, and running inference on the learned model. Two structure learning algorithms (Hill-Climb and K2) are compared to identify high-confidence causal links.

Discover causal structure hidden in your data — learn Bayesian network topology and parameters directly from CSV observations using Hill-Climb and K2 search algorithms.

Scenarios: golf · bmx · sourdough

Community — runs on the OSS / Community SDK edition.

Difficulty: Intermediate 🟦 · BN

  • Summary: Bayesian network structure learning from data
  • Scenario: Learn Bayesian network structure from observational data
  • tech_tags in manifest: BN — example id bn-structure-learning in conformance/examples_manifest.json.
  • SDK: Use an installed SDK tree (NXUSKIT_SDK_DIR, NXUSKIT_LIB_PATH as needed); test-examples.sh resolves Go/Rust/Python deps from that tree only — see README.md, scripts/setup-sdk.sh, and scripts/test-examples.sh.
  • Languages in this example: go, rust (paths under this directory; Python may live under a sibling python/ or shared reference per Language Implementations).
| Feature | Description | Rust | Go |
|---|---|---|---|
| BnNetwork (empty) | Create an empty network for structure learning | nxuskit_bn_net_create() | NewBnNetwork() |
| Hill-Climb Search | Greedy structure search with edge add/remove/reverse | nxuskit_bn_search_structure(..., "hill_climb", ...) | net.SearchStructure(..., "hill_climb") |
| K2 Search | Order-based structure search with variable ordering | nxuskit_bn_search_structure(..., "k2", ...) | net.SearchStructure(..., "k2") |
| BIC Scoring | Bayesian Information Criterion (penalizes complexity) | scoring = "bic" | Scoring: "bic" |
| BDeu Scoring | Bayesian Dirichlet equivalent uniform scoring | scoring = "bdeu" | Scoring: "bdeu" |
| MLE Learning | Maximum Likelihood Estimation for CPT parameters | nxuskit_bn_learn_mle() | net.LearnMLE() |
| Log-Likelihood | Evaluate model fit against training data | nxuskit_bn_log_likelihood() | net.LogLikelihood() |
| VE Inference | Variable Elimination on the learned model | nxuskit_bn_infer(..., "ve", ...) | net.Infer(ev, "ve") |


┌────────────┐  Column names   ┌──────────────┐  Learned edges   ┌──────────────┐
│  CSV Data  │ ──────────────> │  Structure   │ ───────────────> │  Parameter   │
│ (200 rows) │                 │  Learning    │                  │  Learning    │
└────────────┘                 │  (HC / K2)   │                  │  (MLE)       │
                               └──────────────┘                  └──────┬───────┘
                               ┌──────────────┐    Fit score     ┌──────┴───────┐
                               │  Algorithm   │ <─────────────── │  Log-        │
                               │  Comparison  │                  │  Likelihood  │
                               └──────────────┘                  └──────┬───────┘
                                                                 ┌──────┴───────┐
                                                                 │  Inference   │
                                                                 │  (VE)        │
                                                                 └──────────────┘

Step 1 — Load CSV Data: Reads the scenario CSV file, discovers column names (which become BN variables) and row count.

Step 2 — Hill-Climb + BIC: Runs greedy structure search starting from an empty graph. At each step, the algorithm tries adding, removing, or reversing an edge, accepting the change that most improves the BIC score. BIC balances fit against model complexity via a log(N) penalty term.
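
Every add or reverse move considered by a hill-climb search has to keep the graph acyclic, otherwise the result is not a valid Bayesian network. The sketch below shows one common way to test that with a depth-first reachability check. It is a standalone illustration, not the SDK's implementation; the `Graph` type and both function names are hypothetical.

```go
package main

import "fmt"

// Graph maps each node to its children (directed edges parent -> child).
// Hypothetical type for illustration; the SDK's representation may differ.
type Graph map[string][]string

// reachable reports whether dst can be reached from src along directed edges.
func reachable(g Graph, src, dst string) bool {
	seen := map[string]bool{}
	stack := []string{src}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if n == dst {
			return true
		}
		if seen[n] {
			continue
		}
		seen[n] = true
		stack = append(stack, g[n]...)
	}
	return false
}

// wouldCreateCycle reports whether adding from -> to would close a cycle,
// which happens exactly when `from` is already reachable from `to`.
func wouldCreateCycle(g Graph, from, to string) bool {
	return reachable(g, to, from)
}

func main() {
	g := Graph{
		"weather":       {"soil_moisture"},
		"soil_moisture": {"fairway_firmness"},
	}
	fmt.Println(wouldCreateCycle(g, "fairway_firmness", "weather")) // true
	fmt.Println(wouldCreateCycle(g, "fertilizer", "green_speed"))   // false
}
```

A search loop would call a check like this before scoring each candidate add, and after temporarily removing the original edge when testing a reversal.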

Step 3 — K2 + BDeu: Runs order-based structure search using the CSV column ordering. K2 processes variables in order, greedily adding parent edges that improve the BDeu score. BDeu uses an equivalent sample size (ESS) hyperparameter that controls the strength of the prior.
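
The order-based loop described above can be sketched as follows. This is a minimal illustration with a pluggable score function, assuming a `maxParents` cap; `ScoreFunc`, `k2`, and the toy score in `main` are hypothetical names and values, not part of the SDK.

```go
package main

import "fmt"

// ScoreFunc returns the local score of child given a candidate parent set.
// In the example this role is played by BDeu; here it is a placeholder.
type ScoreFunc func(child string, parents []string) float64

// k2 processes variables in the given order. For each variable it repeatedly
// adds the single earlier variable that most improves the local score, and
// stops when no addition helps or maxParents is reached.
func k2(order []string, maxParents int, score ScoreFunc) map[string][]string {
	parents := make(map[string][]string, len(order))
	for i, child := range order {
		current := []string{}
		best := score(child, current)
		for len(current) < maxParents {
			bestCand, bestScore := "", best
			for _, cand := range order[:i] { // only earlier variables may be parents
				if contains(current, cand) {
					continue
				}
				trial := append(append([]string{}, current...), cand)
				if s := score(child, trial); s > bestScore {
					bestCand, bestScore = cand, s
				}
			}
			if bestCand == "" {
				break // no single addition improves the score
			}
			current = append(current, bestCand)
			best = bestScore
		}
		parents[child] = current
	}
	return parents
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}

func main() {
	// Toy score (made up): rewards weather as a parent of soil_moisture and
	// charges 1 per parent so that unhelpful parents are rejected.
	toy := func(child string, ps []string) float64 {
		s := -float64(len(ps))
		if child == "soil_moisture" && contains(ps, "weather") {
			s += 5
		}
		return s
	}
	fmt.Println(k2([]string{"weather", "fertilizer", "soil_moisture"}, 2, toy))
}
```

Because candidates come only from `order[:i]`, the result is acyclic by construction, which is also why a different column order can yield a different graph.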

Step 4 — MLE Parameter Learning: Fits conditional probability tables (CPTs) to the Hill-Climb structure using Maximum Likelihood Estimation with Laplace smoothing (pseudocount=1.0) to avoid zero probabilities.

Step 5 — Log-Likelihood Evaluation: Computes how well the learned model explains the training data. Per-sample log-likelihood allows comparison across different dataset sizes.

Step 6 — Inference: Runs Variable Elimination on the learned model with sample evidence to demonstrate that the learned network supports standard BN queries.
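
To give a feel for what Variable Elimination does, the sketch below answers a query on a three-node chain A -> B -> C by summing out the middle variable. It is a hand-rolled illustration of the smallest possible case, not the SDK's inference engine; the `CPT` type, `queryChain`, and all probabilities are made up for the example.

```go
package main

import "fmt"

// CPT maps a parent state to a distribution over the child's states.
type CPT map[string]map[string]float64

// queryChain computes P(C | A = a) in a chain A -> B -> C by eliminating B:
// P(c | a) = sum over b of P(b | a) * P(c | b).
func queryChain(bGivenA, cGivenB CPT, a string) map[string]float64 {
	out := map[string]float64{}
	for b, pb := range bGivenA[a] {
		for c, pc := range cGivenB[b] {
			out[c] += pb * pc
		}
	}
	return out
}

func main() {
	// Made-up numbers for illustration only; not taken from the scenario data.
	moistureGivenWeather := CPT{
		"rainy": {"wet": 0.8, "dry": 0.2},
		"sunny": {"wet": 0.1, "dry": 0.9},
	}
	firmnessGivenMoisture := CPT{
		"wet": {"soft": 0.7, "firm": 0.3},
		"dry": {"soft": 0.2, "firm": 0.8},
	}
	post := queryChain(moistureGivenWeather, firmnessGivenMoisture, "rainy")
	fmt.Printf("P(soft | rainy) = %.2f\n", post["soft"]) // 0.60
	fmt.Printf("P(firm | rainy) = %.2f\n", post["firm"]) // 0.40
}
```

A general implementation multiplies and marginalizes factors over arbitrary graphs; the chain case only shows the sum-out step.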

Step 7 — Algorithm Comparison: Compares edges discovered by both algorithms. Shared edges represent high-confidence causal relationships found independently by two different search strategies.
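
The comparison amounts to a set intersection over directed edges. A minimal sketch, assuming edges are compared with their direction (the example may treat direction differently); `Edge` and `sharedEdges` are hypothetical names, and the edge lists in `main` are invented.

```go
package main

import (
	"fmt"
	"sort"
)

// Edge is a directed link Parent -> Child.
type Edge struct{ Parent, Child string }

// sharedEdges returns the edges present in both lists with the same direction,
// sorted for stable output.
func sharedEdges(a, b []Edge) []Edge {
	inA := make(map[Edge]bool, len(a))
	for _, e := range a {
		inA[e] = true
	}
	var out []Edge
	for _, e := range b {
		if inA[e] {
			out = append(out, e)
			delete(inA, e) // guard against duplicates in b
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Parent != out[j].Parent {
			return out[i].Parent < out[j].Parent
		}
		return out[i].Child < out[j].Child
	})
	return out
}

func main() {
	// Invented edge lists standing in for the two search results.
	hillClimb := []Edge{{"weather", "soil_moisture"}, {"fertilizer", "green_speed"}}
	k2 := []Edge{{"fertilizer", "green_speed"}, {"mowing", "rough_thickness"}}
	fmt.Println(sharedEdges(hillClimb, k2)) // [{fertilizer green_speed}]
}
```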

Attach an installed SDK (NXUSKIT_SDK_DIR). See the repository README.md and scripts/test-examples.sh.

# From `/examples/integrations/bn-structure-learning`:
# Each build runs in a subshell so the working directory is unchanged afterwards.
(cd rust && cargo build)
(cd go && make build)
Run the Rust example:
cd rust
cargo run -- --scenario golf
cargo run -- --scenario bmx --verbose
cargo run -- --scenario sourdough --step
Run the Go example:
cd go
make build
./bin/bn-structure-learning --scenario golf
./bin/bn-structure-learning --scenario bmx --verbose
./bin/bn-structure-learning --scenario sourdough --step

Or directly:

cd go
go run . --scenario golf

Golf: Models how weather, soil conditions, maintenance practices, and fertilizer affect golf course playing conditions. The data encodes realistic correlations: rainy weather increases soil moisture, which softens fairways; heavy fertilizer increases green speed; longer mowing increases rough thickness.

  • Variables: weather, soil_moisture, mowing, foot_traffic, fertilizer, green_speed, fairway_firmness, rough_thickness
  • Expected causal links: weather -> soil_moisture -> fairway_firmness, fertilizer -> green_speed, mowing -> rough_thickness
  • Inference demo: P(green_speed | weather=rainy, fertilizer=heavy)

BMX: Models how rider skill, technique, and jump characteristics affect BMX race outcomes. High skill correlates with perfect pump timing and fast speed; extreme jumps with low skill dramatically increase crash risk.

  • Variables: jump_height, berm_angle, pump_timing, speed, skill, lap_time, crash_risk, style_score
  • Expected causal links: skill -> pump_timing, skill -> jump_height, speed -> lap_time, jump_height + skill -> crash_risk
  • Inference demo: P(lap_time | skill=pro, pump_timing=perfect)

Sourdough: Models how feeding schedule, flour choice, temperature, and starter maturity affect sourdough bread characteristics. Warm temperatures with mature starters produce fast rises and dense bubbles; rye flour and infrequent feeding lead to sour flavors.

  • Variables: feeding_schedule, flour_type, ambient_temp, hydration, starter_age, rise_time, bubble_density, flavor_profile
  • Expected causal links: ambient_temp + starter_age -> rise_time -> bubble_density, flour_type -> flavor_profile, feeding_schedule -> starter_age
  • Inference demo: P(flavor_profile | flour_type=rye, ambient_temp=warm)
Both implementations accept --verbose and --step flags, which can be combined:
# Verbose mode -- show raw JSON results and intermediate data
cargo run -- --scenario golf --verbose # Rust
go run . --scenario golf --verbose # Go
# Step mode -- pause at each step with explanations
cargo run -- --scenario bmx --step # Rust
go run . --scenario bmx --step # Go
# Combined mode
cargo run -- --scenario sourdough --verbose --step
go run . --scenario sourdough --verbose --step

Or use environment variables:

export NXUSKIT_VERBOSE=1
export NXUSKIT_STEP=1
| Property | Hill-Climb | K2 |
|---|---|---|
| Search strategy | Greedy local search | Order-based forward search |
| Starting point | Empty graph | Empty graph + variable ordering |
| Operations | Add, remove, reverse edges | Add parent edges only |
| Ordering required | No | Yes (results depend on ordering) |
| Score function | BIC (default) | BDeu (default) |
| Complexity | O(n^2 * max_steps) | O(n^2 * max_parents) |
| Strengths | Flexible, no ordering needed | Fast, principled Bayesian scoring |
| Weaknesses | Can get stuck in local optima | Sensitive to variable ordering |

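BIC (Bayesian Information Criterion): BIC = LL - (k/2) * ln(N), where LL is the log-likelihood, k is the number of free parameters, and N is the sample size. Each additional parameter costs ln(N)/2, so the per-parameter penalty grows with the dataset; because LL itself grows roughly linearly in N, a larger dataset can still justify a denser structure when the extra edges improve fit.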
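
As a quick numeric illustration of the formula, the sketch below counts free parameters for one discrete node and applies the penalty. The helper names and the log-likelihood value are made up for the example; they are not SDK calls or results.

```go
package main

import (
	"fmt"
	"math"
)

// freeParams returns the number of free CPT parameters for one discrete node:
// (states - 1) for each joint configuration of its parents.
func freeParams(states int, parentStates []int) int {
	q := 1
	for _, s := range parentStates {
		q *= s
	}
	return (states - 1) * q
}

// bic implements BIC = LL - (k/2) * ln(N).
func bic(logLik float64, k, n int) float64 {
	return logLik - float64(k)/2*math.Log(float64(n))
}

func main() {
	// A 3-state node with one 2-state parent and one 3-state parent.
	k := freeParams(3, []int{2, 3})
	fmt.Println(k) // 12

	// -250.0 is an invented log-likelihood; 200 matches the row count in the diagram.
	fmt.Printf("%.2f\n", bic(-250.0, k, 200))
}
```

For a whole network, k is the sum of `freeParams` over all nodes, which is why adding a parent to a many-state node is expensive under BIC.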

BDeu (Bayesian Dirichlet equivalent uniform): A Bayesian score that uses a Dirichlet prior. The equivalent sample size (ESS) parameter controls prior strength: small ESS values prefer simpler structures, large ESS values are more permissive.
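
The standard closed form of the BDeu local score can be written with log-gamma functions. The sketch below follows that textbook formula, with the ESS split uniformly across parent configurations and states; it is an assumption that the SDK computes the score the same way, and `bdeuLocal` and the counts are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// lgamma wraps math.Lgamma and drops the sign, which is +1 for positive inputs.
func lgamma(x float64) float64 {
	v, _ := math.Lgamma(x)
	return v
}

// bdeuLocal returns the log BDeu score of one node.
// counts[j][k] is the number of rows where the parents take configuration j
// and the node takes state k; ess is the equivalent sample size.
func bdeuLocal(counts [][]float64, ess float64) float64 {
	q := float64(len(counts))    // number of parent configurations
	r := float64(len(counts[0])) // number of node states
	alphaJ := ess / q
	alphaJK := ess / (q * r)
	score := 0.0
	for _, row := range counts {
		nj := 0.0
		for _, njk := range row {
			nj += njk
			score += lgamma(alphaJK+njk) - lgamma(alphaJK)
		}
		score += lgamma(alphaJ) - lgamma(alphaJ+nj)
	}
	return score
}

func main() {
	// Invented counts: 2 parent configurations, 2 node states.
	counts := [][]float64{{30, 10}, {5, 55}}
	for _, ess := range []float64{1, 10} {
		fmt.Printf("ESS=%v score=%.3f\n", ess, bdeuLocal(counts, ess))
	}
}
```

The network score is the sum of local scores over all nodes, so a search only needs to rescore the nodes whose parent sets changed.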

Maximum Likelihood Estimation counts co-occurrences in the data to estimate conditional probability tables. Laplace smoothing (pseudocount > 0) adds a small count to every cell, preventing zero probabilities, which would otherwise drive the log-likelihood to negative infinity for any row containing an unseen combination.
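
The counting-and-normalizing step can be sketched in a few lines. This is a minimal illustration under the assumption that counts are already tabulated per parent configuration; `estimateCPT` and the counts are hypothetical, not the SDK's API or data.

```go
package main

import "fmt"

// estimateCPT turns co-occurrence counts into P(child | parent configuration).
// counts[j][k] is the number of rows with parent configuration j and child state k.
// pseudocount is added to every cell before normalizing (Laplace smoothing).
// With pseudocount 0, a row whose counts are all zero yields NaN entries.
func estimateCPT(counts [][]float64, pseudocount float64) [][]float64 {
	cpt := make([][]float64, len(counts))
	for j, row := range counts {
		total := 0.0
		for _, n := range row {
			total += n + pseudocount
		}
		cpt[j] = make([]float64, len(row))
		for k, n := range row {
			cpt[j][k] = (n + pseudocount) / total
		}
	}
	return cpt
}

func main() {
	counts := [][]float64{{8, 0}, {3, 9}} // invented counts
	fmt.Println(estimateCPT(counts, 0))   // first row contains a zero probability
	fmt.Println(estimateCPT(counts, 1.0)) // smoothing lifts it to 1/10
}
```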

Log-likelihood measures how well the model’s CPTs explain the observed data: LL = sum_i sum_j log P(x_ij | parents(x_j)), where i indexes data rows, j indexes variables, and parents(x_j) takes the parent values from the same row. Higher (less negative) values indicate better fit. Per-sample log-likelihood (LL / N) normalizes for dataset size.
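
The double sum can be written directly as two nested loops. The sketch below uses a hypothetical `Node` type with string-keyed CPTs and a made-up two-node network; it illustrates the formula and is not the SDK's data model.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Node holds a variable's parents and its CPT. The CPT is keyed first by the
// parent values joined with "," (empty string for root nodes), then by the
// node's own value.
type Node struct {
	Parents []string
	CPT     map[string]map[string]float64
}

// logLikelihood sums log P(x_ij | parents(x_j)) over all rows i and variables j.
// A missing CPT entry reads as probability 0 and yields -Inf.
func logLikelihood(net map[string]Node, rows []map[string]string) float64 {
	ll := 0.0
	for _, row := range rows {
		for name, node := range net {
			vals := make([]string, len(node.Parents))
			for i, p := range node.Parents {
				vals[i] = row[p]
			}
			ll += math.Log(node.CPT[strings.Join(vals, ",")][row[name]])
		}
	}
	return ll
}

func main() {
	// Made-up two-node network weather -> soil_moisture.
	net := map[string]Node{
		"weather": {CPT: map[string]map[string]float64{"": {"rainy": 0.5, "sunny": 0.5}}},
		"soil_moisture": {
			Parents: []string{"weather"},
			CPT: map[string]map[string]float64{
				"rainy": {"wet": 0.8, "dry": 0.2},
				"sunny": {"wet": 0.1, "dry": 0.9},
			},
		},
	}
	rows := []map[string]string{
		{"weather": "rainy", "soil_moisture": "wet"},
		{"weather": "sunny", "soil_moisture": "dry"},
	}
	ll := logLikelihood(net, rows)
	fmt.Printf("LL = %.4f, per-sample = %.4f\n", ll, ll/float64(len(rows)))
}
```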

CSV data format:
  • Header row: First row contains column names (become BN variable names)
  • Encoding: UTF-8 with LF line endings
  • Delimiter: Comma-separated values
  • Values: Categorical (discrete) values only for structure learning
  • Missing values: Rows with empty cells are skipped with a warning
  • Sorting: Primary sort by first column, secondary by second column
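
A loader that follows the header and missing-value rules above might look like the sketch below. It uses Go's standard `encoding/csv` package; `loadCSV` and `parseCSV` are hypothetical helpers, not functions from the example's source, and the sample data is invented.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// loadCSV reads a header row plus data rows and drops any row containing an
// empty cell. It returns the column names, the kept rows, and the skip count.
func loadCSV(r io.Reader) (header []string, rows [][]string, skipped int, err error) {
	records, err := csv.NewReader(r).ReadAll()
	if err != nil {
		return nil, nil, 0, err
	}
	if len(records) == 0 {
		return nil, nil, 0, fmt.Errorf("empty CSV: missing header row")
	}
	header = records[0]
	for _, rec := range records[1:] {
		complete := true
		for _, cell := range rec {
			if strings.TrimSpace(cell) == "" {
				complete = false
				break
			}
		}
		if complete {
			rows = append(rows, rec)
		} else {
			skipped++
		}
	}
	return header, rows, skipped, nil
}

// parseCSV is a convenience wrapper for in-memory data.
func parseCSV(data string) ([]string, [][]string, int, error) {
	return loadCSV(strings.NewReader(data))
}

func main() {
	data := "weather,soil_moisture\nrainy,wet\nsunny,\nsunny,dry\n"
	header, rows, skipped, err := parseCSV(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(header, len(rows), skipped) // [weather soil_moisture] 2 1
}
```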
To add a new scenario:
  1. Create a new directory under scenarios/
  2. Add a data.csv file with header row and at least 50 data rows
  3. Ensure correlations in the data reflect the causal structure you expect to discover
  4. Add the scenario configuration to scenario_config() (Rust) or knownScenarios (Go)
  5. Create expected-output.json with expected edge ranges and inference results
| Application | How this example applies |
|---|---|
| Epidemiology | Discover disease risk factor relationships from patient records |
| Manufacturing | Identify root causes of defects from production data |
| Finance | Map causal relationships between economic indicators |
| Genomics | Learn gene regulatory networks from expression data |
| Quality control | Find which process parameters affect product quality |
Run the test suites from this example's directory:
# Rust
(cd rust && cargo test)
# Go
(cd go && go test -v)

Each scenario includes an expected-output.json that describes expected edge count ranges, inference results, and fit evaluation bounds for regression testing.