BDA-Cyber Chapter 5 — Eight SOCs: Hierarchical Incident Rates
Setup
# CPU only — no GPU required
System.put_env("EXLA_CPU_ONLY", "true")
System.put_env("CUDA_VISIBLE_DEVICES", "")
Mix.install([
{:exmc, path: Path.expand("../../", __DIR__)},
{:exla, "~> 0.10"},
{:kino_vega_lite, "~> 0.1"}
])
Application.put_env(:exla, :clients, host: [platform: :host])
Application.put_env(:exla, :default_client, :host)
Nx.default_backend(Nx.BinaryBackend)
Nx.Defn.default_options(compiler: EXLA, client: :host)
alias Exmc.{Builder, Sampler}
alias Exmc.Dist.{Normal, HalfNormal}
alias VegaLite, as: Vl
:ok
Why This Matters
A company has eight branch offices. Each office has a security operations team. Each team reports how many confirmed security incidents they had this quarter, and the uncertainty around that count (based on team size, investigation depth, and detection coverage).
| Office | Incidents (y) | Uncertainty (σ) | Staff |
|---|---|---|---|
| HQ | 28 | 8 | 120 |
| Northeast | 8 | 5 | 45 |
| Southeast | 3 | 7 | 30 |
| Midwest | 7 | 5 | 50 |
| Southwest | 2 | 4 | 35 |
| West Coast | 4 | 6 | 40 |
| Northwest | 18 | 5 | 55 |
| Remote | 12 | 9 | 20 |
You are the CISO. You need to answer three questions:
- What is the true incident rate across the whole company? Not the naive average. Not HQ’s number alone. The population-level rate that accounts for the fact that some offices have better detection than others.
- Is HQ actually worse, or do they just detect more? HQ reported 28 incidents. The next highest is 18. But HQ has 120 staff and better tooling — maybe they see more because they look more.
- Should I allocate more security budget to the Southeast office? They reported only 3 incidents. Are they secure, or are they blind? With a staff of 30 and uncertainty of 7, their number is noisy.
These are exactly the questions that the 8-schools model was designed to answer. Each office is a “school.” Each incident count is a “coaching effect.” The hierarchical model says: the offices are different, but not unrelated. They share a common organizational context (same company, same threat landscape, same policies). Partial pooling borrows strength from all offices to estimate each one — pulling noisy estimates toward the group mean, while respecting precise estimates that resist pooling.
Three Plainly Wrong Approaches
- Report each office separately. "HQ had 28 incidents." But HQ's uncertainty is 8 — the true rate could be 20 or 36. You can't justify a budget increase based on a point estimate with ~30% relative error.
- Average everything. The company average is ~10 incidents. Apply that to every office. This ignores that HQ and Northwest are clearly running hotter than Southwest and Southeast. A uniform allocation based on the mean wastes money.
- Ignore the small offices. "Southeast had only 3 incidents — not enough data to act on." This throws away information. Three incidents at a 30-person office with poor detection means something different from three incidents at a 120-person office with excellent tooling.
The correct approach: a hierarchical model that infers both the population-level rate and each office’s deviation from it, with uncertainty on both.
The Model
$$ \begin{aligned} \mu &\sim \text{Normal}(0, 10) \\ \tau &\sim \text{Half-Normal}(10) \\ \theta_j &\sim \text{Normal}(\mu, \tau) & j &= 1, \dots, 8 \\ y_j &\sim \text{Normal}(\theta_j, \sigma_j) & j &= 1, \dots, 8 \end{aligned} $$
The hyperprior μ is the company-wide baseline incident rate. τ is the
between-office spread — how much offices differ from each other in true
incident rate. Each θ_j is the true incident rate for office j. The
σ_j are known measurement uncertainties per office (reflecting staff
size, detection maturity, investigation depth).
The key tension: when τ → 0, all offices are the same (complete
pooling). When τ → ∞, each office is independent (no pooling). The
posterior on τ decides how much sharing the data support.
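Conditional on μ and τ, each θ_j's posterior mean is a precision-weighted average of the office's own count and the company mean — this is the mechanism behind partial pooling. A minimal pure-Elixir sketch of that formula (the module name `PoolingSketch` and the illustrative values μ = 9, τ = 6 are assumptions for exposition, not eXMC output):

```elixir
defmodule PoolingSketch do
  # Conditional on mu and tau, theta_j's posterior mean is a
  # precision-weighted average of y_j and mu:
  #   E[theta_j | mu, tau, y_j] = (y_j/sigma_j^2 + mu/tau^2) / (1/sigma_j^2 + 1/tau^2)
  def pooled_mean(y_j, sigma_j, mu, tau) do
    w_data = 1.0 / (sigma_j * sigma_j)
    w_prior = 1.0 / (tau * tau)
    (y_j * w_data + mu * w_prior) / (w_data + w_prior)
  end
end

# HQ (y = 28, sigma = 8) vs Southeast (y = 3, sigma = 7),
# with illustrative mu = 9, tau = 6:
hq = PoolingSketch.pooled_mean(28.0, 8.0, 9.0, 6.0)  # ≈ 15.8 — pulled down
se = PoolingSketch.pooled_mean(3.0, 7.0, 9.0, 6.0)   # ≈ 6.5  — pulled up
```

Both offices move toward μ, and the noisier office moves proportionally more — exactly the shrinkage pattern the full posterior below exhibits.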
# Data — eight offices
y = [28.0, 8.0, 3.0, 7.0, 2.0, 4.0, 18.0, 12.0]
sigma = [8.0, 5.0, 7.0, 5.0, 4.0, 6.0, 5.0, 9.0]
n_offices = length(y)
office_labels = ["HQ", "NE", "SE", "MW", "SW", "WC", "NW", "Remote"]
staff_counts = [120, 45, 30, 50, 35, 40, 55, 20]
%{n_offices: n_offices, y_mean: Float.round(Enum.sum(y) / n_offices, 1)}
# Build the IR
ir = Builder.new_ir()
# Hyperpriors
ir = Builder.rv(ir, "mu", Normal, %{mu: Nx.tensor(0.0), sigma: Nx.tensor(10.0)})
ir = Builder.rv(ir, "tau", HalfNormal, %{sigma: Nx.tensor(10.0)}, transform: :log)
# Office-level effects via string parameter references
ir =
Enum.reduce(0..(n_offices - 1), ir, fn j, ir ->
Builder.rv(ir, "theta_#{j}", Normal, %{mu: "mu", sigma: "tau"})
end)
# Observations: y_j ~ Normal(theta_j, sigma_j)
ir =
Enum.reduce(0..(n_offices - 1), ir, fn j, ir ->
yj = Enum.at(y, j)
sj = Enum.at(sigma, j)
ir = Builder.rv(ir, "y_#{j}", Normal, %{mu: "theta_#{j}", sigma: Nx.tensor(sj)})
Builder.obs(ir, "y_#{j}_obs", "y_#{j}", Nx.tensor(yj))
end)
%{nodes: map_size(ir.nodes), free_rvs: "μ + τ + #{n_offices} offices = #{2 + n_offices}"}
Ten free parameters: μ, τ, and θ_0, ..., θ_7. The posterior is
10-dimensional. This is where grid methods from Ch 2–4 stop working and
NUTS takes over.
Sampling with NUTS (Non-Centered)
Non-centered parameterization rewrites θ_j = μ + τ × z_j where
z_j ~ N(0,1). The “funnel” geometry — where τ → 0 forces all θ_j
onto μ — disappears in (μ, τ, z) coordinates. eXMC applies this
rewrite automatically.
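The reparameterization itself is a one-line deterministic transform. A sketch of what eXMC computes under the hood (the variable names here are illustrative):

```elixir
# Non-centered parameterization: sample z_j ~ N(0, 1), then recover
# theta_j = mu + tau * z_j deterministically.
ncp_theta = fn mu, tau, z -> mu + tau * z end

# z = 0 recovers the company mean; z = 1 sits one between-office sd above it.
theta_at_mean = ncp_theta.(9.0, 6.0, 0.0)   # 9.0
theta_plus_one = ncp_theta.(9.0, 6.0, 1.0)  # 15.0
```

The sampler explores (μ, τ, z) — a geometry with no funnel — and the θ_j are reconstructed afterward.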
{trace_ncp, stats_ncp} =
Sampler.sample(ir, %{},
num_samples: 500,
num_warmup: 500,
seed: 42,
ncp: true
)
%{
divergences: stats_ncp.divergences,
step_size: Float.round(stats_ncp.step_size, 4),
message: "NUTS sampling complete"
}
Reading the Posterior
mu_samples = trace_ncp["mu"]
tau_samples = trace_ncp["tau"]
mu_mean = Nx.mean(mu_samples) |> Nx.to_number()
mu_sd = :math.sqrt(Nx.variance(mu_samples) |> Nx.to_number())
tau_mean = Nx.mean(tau_samples) |> Nx.to_number()
tau_sd = :math.sqrt(Nx.variance(tau_samples) |> Nx.to_number())
%{
mu_mean: Float.round(mu_mean, 1),
mu_sd: Float.round(mu_sd, 1),
tau_mean: Float.round(tau_mean, 1),
tau_sd: Float.round(tau_sd, 1),
interpretation: "Company-wide baseline: ~#{round(mu_mean)} incidents, between-office spread: ~#{round(tau_mean)}"
}
The company-wide mean μ sits around 8–10 incidents. The between-office
spread τ sits around 5–8. These hyperparameters govern the partial
pooling: when τ is large relative to the σ_j, each office’s estimate
is mostly determined by its own data. When τ is small, everyone gets
pulled toward the company mean.
Per-Office Effects — Partial Pooling in Action
office_summary =
for j <- 0..(n_offices - 1) do
theta_j = trace_ncp["theta_#{j}"]
mean = Nx.mean(theta_j) |> Nx.to_number()
sd = :math.sqrt(Nx.variance(theta_j) |> Nx.to_number())
raw_y = Enum.at(y, j)
shrinkage =
if raw_y != 0.0,
do: Float.round((raw_y - mean) / raw_y, 3),
else: 0.0
%{
office: Enum.at(office_labels, j),
staff: Enum.at(staff_counts, j),
raw_incidents: raw_y,
uncertainty: Enum.at(sigma, j),
posterior_mean: Float.round(mean, 1),
posterior_sd: Float.round(sd, 1),
shrinkage: shrinkage
}
end
office_summary
The shrinkage column tells the story:
- HQ reported 28 incidents. After borrowing strength from the other offices, its posterior drops to ~15–18. The model says: “no other office sees rates that high — some of your count is noise from having better detection.” The shrinkage is ~35–45%.
- Southeast reported 3 incidents with high uncertainty (σ = 7). Its posterior is pulled up toward the company mean — maybe to 6 or 7. The model says: “your count is noisy; the company-wide evidence suggests your true rate is probably higher than 3.”
- Northwest reported 18 with tight uncertainty (σ = 5). Its posterior stays high — the data are precise enough to resist full pooling.
This is the answer to the CISO’s third question. Southeast’s low count doesn’t mean “secure.” It means “uncertain.” The posterior is wider than the other offices. The appropriate response is better detection coverage, not less budget.
shrinkage_data =
Enum.flat_map(office_summary, fn s ->
[
%{office: s.office, type: "Raw count (y_j ± σ_j)", value: s.raw_incidents, error: s.uncertainty},
%{office: s.office, type: "Posterior mean ± sd", value: s.posterior_mean, error: s.posterior_sd}
]
end)
Vl.new(width: 600, height: 320, title: "Partial pooling: raw counts vs posterior estimates")
|> Vl.data_from_values(shrinkage_data)
|> Vl.mark(:point, size: 100, filled: true)
|> Vl.encode_field(:x, "office", type: :nominal, title: "Office",
sort: office_labels
)
|> Vl.encode_field(:y, "value", type: :quantitative, title: "Incidents")
|> Vl.encode_field(:color, "type", type: :nominal)
|> Vl.encode_field(:y_error, "error", type: :quantitative)
Read the plot: raw estimates (blue) scatter widely. Posterior estimates (orange) compress toward the center. The offices with the noisiest data (Remote, Southeast) move the most. The offices with the tightest data (Southwest, Northwest) move the least. This is partial pooling — the defining feature of hierarchical Bayesian models.
The τ Posterior — How Different Are the Offices?
The hyperparameter τ controls everything. If τ ≈ 0, all offices have
the same true rate. If τ is large, each office is on its own.
tau_data =
Nx.to_list(tau_samples)
|> Enum.map(fn v -> %{tau: v} end)
Vl.new(width: 600, height: 240, title: "Posterior of τ (between-office spread)")
|> Vl.data_from_values(tau_data)
|> Vl.mark(:bar, color: "#4c78a8", opacity: 0.7)
|> Vl.encode_field(:x, "tau",
type: :quantitative,
bin: %{maxbins: 30},
title: "τ (between-office standard deviation)"
)
|> Vl.encode_field(:y, "tau", type: :quantitative, aggregate: :count)
The posterior of τ tells you whether the offices are genuinely different
or whether the variation is just noise. If the bulk of the density sits
above 3–4, there are real differences across offices. If it piles up
near 0, the offices are all approximately the same.
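That question can be answered with a number rather than a histogram: the fraction of posterior draws above a threshold. A small sketch with a stand-in sample list (in the notebook you would pass `Nx.to_list(tau_samples)` instead of `demo`):

```elixir
# Fraction of posterior draws exceeding a threshold, e.g. P(tau > 3) —
# evidence that between-office differences are real, not noise.
p_above = fn samples, t ->
  Enum.count(samples, &(&1 > t)) / length(samples)
end

demo = [1.5, 2.0, 3.5, 4.0, 6.0, 7.5, 8.0, 2.5]
p_above.(demo, 3.0)  # 0.625 — 5 of these 8 draws exceed 3
```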
Centered vs Non-Centered — The Funnel
Now sample the same model without NCP. The model is mathematically identical — the geometry is not.
{trace_centered, stats_centered} =
Sampler.sample(ir, %{},
num_samples: 500,
num_warmup: 500,
seed: 42,
ncp: false
)
%{
divergences_ncp: stats_ncp.divergences,
divergences_centered: stats_centered.divergences,
step_size_ncp: Float.round(stats_ncp.step_size, 4),
step_size_centered: Float.round(stats_centered.step_size, 4),
winner: "NCP (fewer divergences, larger step size)"
}
The centered run produces more divergences and a smaller step size. The
sampler is fighting the funnel — the narrow neck where τ → 0 forces
all θ_j onto μ. NCP eliminates this geometry.
The security lesson: when incident counts are noisy relative to the
between-office spread, NCP is essential. If you had 10,000 incidents
per office with very tight σ_j, centered would be fine. With 2–28
incidents per office and σ of 4–9, the data are weak and NCP wins.
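Why the funnel defeats a fixed step size can be seen directly from the centered log-density term. For one office effect, log N(θ | μ, τ) has gradient −(θ − μ)/τ² in θ, which explodes as τ → 0 — a quick numeric check:

```elixir
# Gradient of the centered log-density log N(theta | mu, tau) in theta:
#   d/dtheta = -(theta - mu) / tau^2
# It blows up as tau -> 0, so no single step size fits both the wide
# mouth and the narrow neck of the funnel.
grad_theta = fn theta, mu, tau -> -(theta - mu) / (tau * tau) end

g_wide = grad_theta.(1.0, 0.0, 10.0)  # -0.01 — flat region, big steps fine
g_neck = grad_theta.(1.0, 0.0, 0.1)   # ≈ -100.0 — neck, tiny steps required
```

A four-orders-of-magnitude swing in curvature across the posterior is exactly what step-size adaptation cannot reconcile, and what NCP removes.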
Trace Plots — Convergence Diagnostics
mu_trace =
Nx.to_list(mu_samples)
|> Enum.with_index()
|> Enum.map(fn {v, i} -> %{iteration: i, value: v, parameter: "μ (company mean)"} end)
tau_trace =
Nx.to_list(tau_samples)
|> Enum.with_index()
|> Enum.map(fn {v, i} -> %{iteration: i, value: v, parameter: "τ (between-office spread)"} end)
Vl.new(width: 600, height: 300, title: "Trace plots — NCP")
|> Vl.data_from_values(mu_trace ++ tau_trace)
|> Vl.mark(:line, stroke_width: 0.5, opacity: 0.7)
|> Vl.encode_field(:x, "iteration", type: :quantitative)
|> Vl.encode_field(:y, "value", type: :quantitative)
|> Vl.encode_field(:row, "parameter", type: :nominal)
|> Vl.encode_field(:color, "parameter", type: :nominal)
Healthy chains look like “fuzzy caterpillars” — white noise around a stable mean. A chain that drifts or sticks is diagnostic of convergence failure. With NCP on this problem, the chains should be well-behaved.
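The "fuzzy caterpillar" check can be quantified with the classic R-hat statistic across multiple chains. A minimal pure-Elixir sketch (this is the textbook formula, not an eXMC diagnostic — module name is illustrative):

```elixir
defmodule RhatSketch do
  # Classic R-hat over m chains of n draws each:
  #   W = mean within-chain variance, B = n * variance of chain means,
  #   rhat = sqrt(((n - 1)/n * W + B/n) / W)
  def rhat(chains) do
    n = length(hd(chains))
    means = Enum.map(chains, &mean/1)
    w = mean(Enum.map(chains, &variance/1))
    b = n * variance(means)
    :math.sqrt(((n - 1) / n * w + b / n) / w)
  end

  defp mean(xs), do: Enum.sum(xs) / length(xs)

  defp variance(xs) do
    m = mean(xs)
    Enum.sum(Enum.map(xs, fn x -> (x - m) * (x - m) end)) / (length(xs) - 1)
  end
end

# Chains exploring the same region give rhat near (or below, for short
# chains) 1; values above ~1.01 signal convergence trouble.
RhatSketch.rhat([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
```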
Probability Queries — What the CISO Needs
The posterior is not the deliverable. The deliverable is the answer to a question. Here are three that a CISO would actually ask:
# Q1: P(HQ > NW) — is HQ actually worse than Northwest?
hq_samples = Nx.to_list(trace_ncp["theta_0"])
nw_samples = Nx.to_list(trace_ncp["theta_6"])
p_hq_worse =
Enum.zip(hq_samples, nw_samples)
|> Enum.count(fn {hq, nw} -> hq > nw end)
|> then(fn count -> count / length(hq_samples) end)
# Q2: P(SE > company mean) — is Southeast actually below average?
se_samples = Nx.to_list(trace_ncp["theta_2"])
mu_list = Nx.to_list(mu_samples)
p_se_above_avg =
Enum.zip(se_samples, mu_list)
|> Enum.count(fn {se, mu} -> se > mu end)
|> then(fn count -> count / length(se_samples) end)
# Q3: P(any office > 20 incidents) — is any office in serious trouble?
# For each posterior draw, check whether any theta_j exceeds 20.
theta_lists = for j <- 0..(n_offices - 1), do: Nx.to_list(trace_ncp["theta_#{j}"])
p_any_above_20 =
  theta_lists
  |> Enum.zip()
  |> Enum.count(fn draws -> draws |> Tuple.to_list() |> Enum.any?(&(&1 > 20.0)) end)
  |> then(fn count -> count / length(hq_samples) end)
%{
  p_hq_worse_than_nw: Float.round(p_hq_worse, 2),
  p_se_above_company_mean: Float.round(p_se_above_avg, 2),
  p_any_office_above_20: Float.round(p_any_above_20, 2),
  interpretation_hq: "#{round(p_hq_worse * 100)}% probability HQ's true rate exceeds NW's",
  interpretation_se: "#{round(p_se_above_avg * 100)}% probability SE is above company average"
}
These are calibrated probabilities, not point comparisons. “There is a 65% probability that HQ’s true incident rate is higher than Northwest’s” is a different sentence from “HQ had 28 incidents and NW had 18.” The first accounts for measurement uncertainty. The second doesn’t.
Industry-Level Hierarchical Model
The same model scales beyond one company. Verizon’s DBIR reports incident counts by industry sector — the data structure is identical:
| Sector | Incidents | Reporting uncertainty |
|---|---|---|
| Healthcare | 525 | high (HIPAA mandates reporting) |
| Finance | 690 | medium (regulatory, good detection) |
| Retail | 241 | high (variable detection maturity) |
| Manufacturing | 186 | very high (OT environments, poor visibility) |
| Education | 312 | high (open networks, limited SOC) |
| Government | 420 | medium (mandated reporting, variable detection) |
| Tech | 456 | low (mature SOCs, good tooling) |
| Utilities | 98 | very high (ICS, air-gapped, limited monitoring) |
The hierarchical model borrows strength across sectors. A sector with 98 reported incidents and “very high” reporting uncertainty (Utilities) gets pulled toward the population mean — because the model recognizes the measurement is noisy. A sector with 456 incidents and “low” uncertainty (Tech) resists pooling — the data are precise.
The CISO-level question becomes: “Is my industry genuinely safer, or are we just not looking?” Partial pooling gives an honest answer.
What This Tells You
- Partial pooling beats both extremes. Treating each office as independent ignores shared context. Averaging everything ignores real differences. The hierarchical model navigates between them, with τ governing the trade-off.
- Noisy estimates move the most. Southeast (3 incidents, σ = 7) gets pulled hard toward the company mean. Northwest (18 incidents, σ = 5) barely moves. The model trusts precise data and distrusts noisy data.
- “Low incident count” ≠ “secure.” It might mean “poor detection.” The posterior width tells you which. A tight posterior near 3 means “probably safe.” A wide posterior near 7 means “you don’t know.”
- Probability queries are the deliverable. “65% chance HQ is worse than NW” is what the board can act on. A bar chart of raw counts is not.
- NCP is essential for sparse data. With 2–28 incidents per office, the funnel geometry defeats centered parameterization. NCP transforms the problem into one NUTS can sample cleanly.
Study Guide
- Double HQ's uncertainty from σ = 8 to σ = 16. How much does HQ's posterior shrink toward the company mean? What happens to the shrinkage percentages of the other offices?
- Add a ninth office with 50 incidents and σ = 3. How does this high-confidence outlier affect the company-wide mean μ? How much does it pull the other offices?
- Compute the probability ranking — for each pair of offices, compute P(θ_i > θ_j). Present a matrix. Which comparisons are confident (>80%)? Which are coin-flips (~50%)?
- Replace the Half-Normal(10) prior on τ with Half-Cauchy(5). Does the heavier-tailed prior change the posterior on τ meaningfully? Does it change the office-level estimates?
- (Hard.) The data above treat each office as exchangeable. But HQ has 120 staff and Remote has 20. Modify the model to make σ_j a function of staff count: σ_j = base_sigma / sqrt(staff_j / 50). How does the informativeness hierarchy change?
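The staff-scaling formula in the last exercise can be checked in one line — larger offices get tighter measurement uncertainty relative to a 50-person reference office (the anonymous function and the base σ = 6 are a sketch, not part of the exercise's answer):

```elixir
# sigma_j = base_sigma / sqrt(staff_j / 50): detection precision
# scales with the square root of staff, anchored at a 50-person office.
staff_sigma = fn base_sigma, staff -> base_sigma / :math.sqrt(staff / 50) end

staff_sigma.(6.0, 50)   # 6.0 — the reference office keeps the base sigma
staff_sigma.(6.0, 200)  # 3.0 — 4x the staff halves the measurement noise
```

Under this scheme Remote (20 staff) would get a larger σ than the table assigns, making its count even less informative.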
Literature
- Gelman, Carlin, Stern, Dunson, Vehtari, Rubin. Bayesian Data Analysis, 3rd ed., §5.5 (eight schools). The direct structural analogue for this notebook.
- Verizon. Data Breach Investigations Report (DBIR), annual. Industry-level incident counts — the real data behind the analogy.
- Betancourt, M. (2017). “A Conceptual Introduction to Hamiltonian Monte Carlo.” The paper that explains why the funnel kills Gibbs and what NCP does about it.
Where to Go Next
- `notebooks/bda/ch05_eight_schools.livemd` — the same model on SAT coaching data. Compare the structure: same math, different domain.
- `notebooks/bda-cyber/ch09_incident_response.livemd` — the posteriors from this notebook feed into decision theory. "Given this incident rate and this cost structure, how much should I invest in this office?"
- `notebooks/bda-cyber/ch02_ids_rule_effectiveness.livemd` — the per-rule Beta-Binomial from Ch 2 is the single-group version of this hierarchical model. Ch 5 is what happens when you have eight groups.