# BDA-Cyber Chapter 11 — Gibbs and Metropolis on Network Traffic
## Setup

```elixir
# CPU only — no GPU required
System.put_env("EXLA_CPU_ONLY", "true")
System.put_env("CUDA_VISIBLE_DEVICES", "")

Mix.install([
  {:exmc, path: Path.expand("../../", __DIR__)},
  {:exla, "~> 0.10"},
  {:kino_vega_lite, "~> 0.1"}
])

Application.put_env(:exla, :clients, host: [platform: :host])
Application.put_env(:exla, :default_client, :host)
Nx.default_backend(Nx.BinaryBackend)
Nx.Defn.default_options(compiler: EXLA, client: :host)

alias VegaLite, as: Vl
:ok
```
## Why This Matters
Chapter 10 gave you rejection and importance sampling — and showed why they die in high dimensions. The fix: Markov Chain Monte Carlo. Build a Markov chain whose stationary distribution is the target posterior, then walk it. After enough steps, the chain visits each region in proportion to its density.
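To see stationarity concretely, here is a minimal two-state illustration (an addition for intuition, not part of the chapter's target): a chain that moves A→B with probability 0.3 and B→A with probability 0.6 has stationary distribution (2/3, 1/3) by detailed balance, and its long-run visit frequencies converge to exactly that.

```elixir
# Two-state Markov chain: P(a -> b) = 0.3, P(b -> a) = 0.6.
# Detailed balance gives pi(a) = 2/3, pi(b) = 1/3; visit frequencies match.
rng = :rand.seed_s(:exsss, 7)

{visits_a, _final_state, _rng} =
  Enum.reduce(1..50_000, {0, :a, rng}, fn _, {count, state, r} ->
    {u, r} = :rand.uniform_s(r)

    next =
      case state do
        :a -> if u < 0.3, do: :b, else: :a
        :b -> if u < 0.6, do: :a, else: :b
      end

    {count + if(next == :a, do: 1, else: 0), next, r}
  end)

# Long-run fraction of time spent in state :a — close to 2/3
Float.round(visits_a / 50_000, 2)
```

The same principle carries over unchanged when the two states become a continuous posterior: design the transition rule so the target is stationary, then count visits.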
Two foundational MCMC algorithms:
- Gibbs sampling: sample each variable from its full conditional, holding everything else fixed. Requires knowing the conditionals.
- Metropolis sampling: propose a move, accept or reject based on the density ratio. Requires only evaluating the log density.
Every modern sampler — NUTS, HMC, slice sampling — descends from one of these. Implementing both from scratch tells you exactly what NUTS had to invent and why.
The target: a bivariate distribution representing correlated network features — connection duration and bytes transferred. These are correlated in real traffic (longer connections transfer more bytes), and that correlation is precisely what makes naive samplers struggle.
## The Target — Correlated Network Features
A bivariate normal with correlation ρ = 0.85:
$$ \begin{pmatrix} \text{duration} \\ \text{bytes} \end{pmatrix} \sim \mathrm{Normal}\!\left( \begin{pmatrix} 30 \\ 5000 \end{pmatrix},\; \begin{pmatrix} 100 & 850 \\ 850 & 10000 \end{pmatrix} \right) $$
Units: duration in seconds with σ = 10, bytes transferred with σ = 100, correlation 0.85 (so covariance 0.85 · 10 · 100 = 850).
```elixir
# Target parameters (standardized for sampling, then rescale)
mu1 = 0.0 # standardized duration
mu2 = 0.0 # standardized bytes
rho = 0.85
sigma1 = 1.0
sigma2 = 1.0

defmodule TrafficMCMC do
  # Log density of a bivariate normal with means mu, SDs s, correlation rho
  def log_pdf(t1, t2, mu1, mu2, s1, s2, rho) do
    z1 = (t1 - mu1) / s1
    z2 = (t2 - mu2) / s2
    quad = (z1 * z1 - 2 * rho * z1 * z2 + z2 * z2) / (1 - rho * rho)
    -0.5 * quad - :math.log(2 * :math.pi() * s1 * s2 * :math.sqrt(1 - rho * rho))
  end
end

:ok
```
## Gibbs Sampling
For a bivariate normal, the full conditionals are known:
$$ \theta_1 \mid \theta_2 \sim \mathrm{Normal}\!\left(\mu_1 + \rho \frac{\sigma_1}{\sigma_2}(\theta_2 - \mu_2),\; \sigma_1^2(1 - \rho^2)\right) $$
and symmetrically for θ₂ | θ₁. Gibbs alternates between these.
```elixir
n_gibbs = 5_000
rng = :rand.seed_s(:exsss, 42)

# Conditional standard deviation (same for both coordinates since sigma1 = sigma2 = 1)
cond_sd = :math.sqrt(1 - rho * rho)

{gibbs_chain, _} =
  Enum.reduce(1..n_gibbs, {[{0.0, 0.0}], rng}, fn _, {[{_t1, t2} | _] = chain, r} ->
    # Sample t1 | t2
    cond_mean_1 = mu1 + rho * (sigma1 / sigma2) * (t2 - mu2)
    {z, r} = :rand.normal_s(r)
    new_t1 = cond_mean_1 + cond_sd * z

    # Sample t2 | t1 (using the updated t1)
    cond_mean_2 = mu2 + rho * (sigma2 / sigma1) * (new_t1 - mu1)
    {z, r} = :rand.normal_s(r)
    new_t2 = cond_mean_2 + cond_sd * z

    {[{new_t1, new_t2} | chain], r}
  end)

gibbs_chain = Enum.reverse(gibbs_chain)

# Summary statistics
t1_samples = Enum.map(gibbs_chain, &elem(&1, 0))
t2_samples = Enum.map(gibbs_chain, &elem(&1, 1))
gibbs_mean_1 = Enum.sum(t1_samples) / length(t1_samples)
gibbs_mean_2 = Enum.sum(t2_samples) / length(t2_samples)

# Empirical correlation
cov_12 =
  Enum.zip(t1_samples, t2_samples)
  |> Enum.reduce(0.0, fn {a, b}, acc -> acc + (a - gibbs_mean_1) * (b - gibbs_mean_2) end)
  |> then(fn s -> s / length(t1_samples) end)

var_1 =
  Enum.reduce(t1_samples, 0.0, fn a, acc -> acc + (a - gibbs_mean_1) ** 2 end) /
    length(t1_samples)

var_2 =
  Enum.reduce(t2_samples, 0.0, fn a, acc -> acc + (a - gibbs_mean_2) ** 2 end) /
    length(t2_samples)

emp_corr = cov_12 / (:math.sqrt(var_1) * :math.sqrt(var_2))

%{
  n_samples: length(gibbs_chain),
  mean_duration: Float.round(gibbs_mean_1, 3),
  mean_bytes: Float.round(gibbs_mean_2, 3),
  empirical_correlation: Float.round(emp_corr, 3),
  target_correlation: rho
}
```
```elixir
gibbs_data =
  gibbs_chain
  |> Enum.take(2000)
  |> Enum.map(fn {t1, t2} -> %{duration: t1, bytes: t2} end)

Vl.new(width: 400, height: 400, title: "Gibbs samples — correlated network features")
|> Vl.data_from_values(gibbs_data)
|> Vl.mark(:circle, size: 4, opacity: 0.3, color: "#4c78a8")
|> Vl.encode_field(:x, "duration", type: :quantitative, title: "Duration (standardized)")
|> Vl.encode_field(:y, "bytes", type: :quantitative, title: "Bytes (standardized)")
```
The scatter plot shows the tilted ellipse of the bivariate normal. Gibbs recovers it correctly — the empirical correlation matches the target ρ.
But notice: Gibbs moves along axis-aligned conditionals. At high correlation (ρ = 0.85), each step moves mostly along one axis, making slow diagonal progress. The effective sample size per iteration is low. This is the seed of the problem that HMC solves.
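How slow, roughly? With the noise dropped, one full Gibbs sweep moves each coordinate to its conditional mean, which multiplies it by ρ² ≈ 0.72 — so the chain forgets its starting point only geometrically, at about 1/(1 − ρ²) ≈ 3.7 sweeps per effectively independent draw. A deterministic sketch of this (an illustration added here, not the chapter's sampler):

```elixir
# Noise-free skeleton of one Gibbs sweep: replace each coordinate with its
# conditional mean. The net effect multiplies a coordinate by rho^2 per sweep,
# so high correlation means slow geometric decay toward the mode.
rho = 0.85

sweep = fn {_t1, t2} ->
  new_t1 = rho * t2       # conditional mean of t1 given t2
  new_t2 = rho * new_t1   # conditional mean of t2 given the new t1
  {new_t1, new_t2}
end

# Start far from the mode and watch the sweep-by-sweep shrinkage
trajectory = Enum.scan(1..8, {5.0, 5.0}, fn _, point -> sweep.(point) end)

# Successive t2 values shrink by a factor of rho^2 = 0.7225 each sweep
Enum.map(trajectory, fn {_, t2} -> Float.round(t2, 3) end)
```

At ρ = 0.95 the factor rises to 0.9025 and the same logic predicts roughly 10 sweeps per effective draw — which is what the study-guide exercise below asks you to verify empirically.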
### Gibbs Trace Plot
```elixir
trace_data =
  gibbs_chain
  |> Enum.with_index()
  |> Enum.flat_map(fn {{t1, t2}, i} ->
    [
      %{iteration: i, value: t1, variable: "Duration"},
      %{iteration: i, value: t2, variable: "Bytes"}
    ]
  end)
  # 4000 rows = the first 2000 iterations of both variables
  |> Enum.take(4000)

Vl.new(width: 600, height: 300, title: "Gibbs trace plots")
|> Vl.data_from_values(trace_data)
|> Vl.mark(:line, stroke_width: 0.5, opacity: 0.5)
|> Vl.encode_field(:x, "iteration", type: :quantitative)
|> Vl.encode_field(:y, "value", type: :quantitative)
|> Vl.encode_field(:row, "variable", type: :nominal)
|> Vl.encode_field(:color, "variable", type: :nominal)
```
## Metropolis Sampling
Metropolis doesn’t need full conditionals — just the ability to evaluate the log density at any point. Propose a random step; always accept if it moves “uphill,” otherwise accept the downhill move with probability equal to the density ratio.
```elixir
n_metro = 10_000
proposal_sd = 0.5 # step size
rng = :rand.seed_s(:exsss, 42)

{metro_chain, n_accepted, _} =
  Enum.reduce(1..n_metro, {[{0.0, 0.0}], 0, rng}, fn _, {[{t1, t2} | _] = chain, n_acc, r} ->
    # Propose: symmetric random walk
    {z1, r} = :rand.normal_s(r)
    {z2, r} = :rand.normal_s(r)
    prop_t1 = t1 + proposal_sd * z1
    prop_t2 = t2 + proposal_sd * z2

    # Log acceptance ratio
    log_current = TrafficMCMC.log_pdf(t1, t2, mu1, mu2, sigma1, sigma2, rho)
    log_proposed = TrafficMCMC.log_pdf(prop_t1, prop_t2, mu1, mu2, sigma1, sigma2, rho)
    log_alpha = log_proposed - log_current

    {u, r} = :rand.uniform_s(r)

    if :math.log(u) < log_alpha do
      {[{prop_t1, prop_t2} | chain], n_acc + 1, r}
    else
      {[{t1, t2} | chain], n_acc, r}
    end
  end)

metro_chain = Enum.reverse(metro_chain)
accept_rate = n_accepted / n_metro

%{
  n_samples: length(metro_chain),
  acceptance_rate: "#{Float.round(accept_rate * 100, 1)}%",
  optimal_acceptance: "~23.4% asymptotically in d (Roberts et al. 1997)",
  proposal_sd: proposal_sd
}
```
```elixir
metro_data =
  metro_chain
  |> Enum.take(2000)
  |> Enum.map(fn {t1, t2} -> %{duration: t1, bytes: t2} end)

Vl.new(width: 400, height: 400, title: "Metropolis samples — proposal_sd=#{proposal_sd}")
|> Vl.data_from_values(metro_data)
|> Vl.mark(:circle, size: 4, opacity: 0.3, color: "#e45756")
|> Vl.encode_field(:x, "duration", type: :quantitative, title: "Duration (standardized)")
|> Vl.encode_field(:y, "bytes", type: :quantitative, title: "Bytes (standardized)")
```
## Tuning the Proposal — Too Small, Too Large, Just Right
```elixir
proposal_sds = [0.1, 0.5, 1.5, 5.0]

tuning_results =
  for psd <- proposal_sds do
    rng = :rand.seed_s(:exsss, 42)

    {chain, n_acc, _} =
      Enum.reduce(1..5000, {[{0.0, 0.0}], 0, rng}, fn _, {[{t1, t2} | _] = c, na, r} ->
        {z1, r} = :rand.normal_s(r)
        {z2, r} = :rand.normal_s(r)
        pt1 = t1 + psd * z1
        pt2 = t2 + psd * z2
        lc = TrafficMCMC.log_pdf(t1, t2, mu1, mu2, sigma1, sigma2, rho)
        lp = TrafficMCMC.log_pdf(pt1, pt2, mu1, mu2, sigma1, sigma2, rho)
        {u, r} = :rand.uniform_s(r)

        if :math.log(u) < lp - lc do
          {[{pt1, pt2} | c], na + 1, r}
        else
          {[{t1, t2} | c], na, r}
        end
      end)

    t1s = Enum.map(chain, &elem(&1, 0))

    # Crude ESS estimate: n / (1 + 2 * sum of autocorrelations),
    # truncated here to the lag-1 term
    mean1 = Enum.sum(t1s) / length(t1s)
    var1 = Enum.reduce(t1s, 0.0, fn x, acc -> acc + (x - mean1) ** 2 end) / length(t1s)

    lag1 =
      t1s
      |> Enum.chunk_every(2, 1, :discard)
      |> Enum.reduce(0.0, fn [a, b], acc -> acc + (a - mean1) * (b - mean1) end)
      |> then(fn s -> s / (length(t1s) * var1) end)

    ess_approx = length(t1s) / (1 + 2 * max(lag1, 0))

    %{
      proposal_sd: psd,
      acceptance: "#{Float.round(n_acc / 5000 * 100, 1)}%",
      ess_approx: round(ess_approx)
    }
  end

tuning_results
```
| proposal_sd | acceptance | ESS | Diagnosis |
|---|---|---|---|
| 0.1 | ~95% | low | Tiny steps: accepts everything, moves nowhere |
| 0.5 | ~60% | moderate | Workable but slow exploration |
| 1.5 | ~25% | highest | Near the 23.4% guideline |
| 5.0 | ~5% | low | Overshoots, rejects almost everything |
The ~23.4% optimal acceptance rate for random-walk Metropolis (Roberts et al. 1997) is strictly an asymptotic result as dimension grows — low-dimensional targets favor somewhat higher rates — but it is a useful guideline even here. The proposal_sd that lands near it produces the highest ESS: the best trade-off between exploration (large steps) and acceptance (not overshooting).
This is what NUTS automates. It finds the step size that gives ~65% acceptance (the HMC optimal), and it uses gradient information to propose moves that follow the posterior’s curvature instead of stumbling randomly. Every NUTS run you’ve done in Ch 5 was a descendant of this Metropolis loop — with the step size tuned automatically and the random walk replaced by Hamiltonian dynamics.
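The adaptation idea itself fits in a few lines. Below is a hypothetical Robbins–Monro-style adapter on a 1-D standard normal target — a sketch of the principle, not NUTS's dual-averaging scheme, and the 0.234 target is borrowed from the high-dimensional guideline above: after each proposal, scale proposal_sd up when the move was accepted and down otherwise, with a decaying rate, so acceptance drifts toward the target.

```elixir
# Hypothetical Robbins-Monro adapter (a sketch; NUTS uses dual averaging):
# after every step multiply proposal_sd by exp(gamma * (accepted - target)),
# with decaying gamma, so the running acceptance rate drifts toward target.
target_accept = 0.234
log_pdf = fn x -> -0.5 * x * x end  # 1-D standard normal, up to a constant

{_x, adapted_sd, n_acc, _rng} =
  Enum.reduce(1..20_000, {0.0, 1.0, 0, :rand.seed_s(:exsss, 1)}, fn i, {x, sd, acc, r} ->
    {z, r} = :rand.normal_s(r)
    prop = x + sd * z
    {u, r} = :rand.uniform_s(r)
    accepted? = :math.log(u) < log_pdf.(prop) - log_pdf.(x)

    x = if accepted?, do: prop, else: x
    a = if accepted?, do: 1.0, else: 0.0

    # Decaying adaptation rate keeps the scheme stable
    sd = sd * :math.exp(1.0 / i * (a - target_accept))

    {x, sd, acc + trunc(a), r}
  end)

%{adapted_sd: Float.round(adapted_sd, 2), acceptance: Float.round(n_acc / 20_000, 3)}
```

Starting from sd = 1.0 (which accepts far too often on this target), the adapter widens the proposal until acceptance falls toward the target — the same feedback loop NUTS runs during warmup, applied to its leapfrog step size instead.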
## What This Tells You
- Gibbs is elegant when you have conditionals. For conjugate models, each step is an exact draw. But axis-aligned moves struggle with correlation — and correlation is the norm in real data.
- Metropolis is universal. It only needs log density evaluations. No conjugacy, no conditionals. But tuning the proposal is critical, and optimal tuning depends on the geometry of the target.
- High correlation kills both. At ρ = 0.85, both Gibbs and Metropolis take many steps to cross the posterior diagonally. NUTS uses the gradient to follow the contour directly.
- NUTS is Metropolis with a physics engine. Instead of random proposals, HMC simulates a particle sliding on the log-density surface. NUTS adds adaptive step size and a “no U-turn” stopping rule. The result: near-independent samples from complex posteriors. Every `Sampler.sample/3` call in other notebooks runs this machinery.
## Study Guide

- Increase ρ to 0.95. How does the Gibbs ESS change? How does Metropolis ESS change at each proposal_sd? At what ρ does Gibbs become practically useless?
- Replace the bivariate normal with a banana-shaped distribution: θ₂ = θ₁² + ε. Implement Metropolis on this target. Does the acceptance rate change? What about ESS?
- Add the C2 beacon mode from Ch 10 (a second Gaussian at (3, 3)). Run Metropolis. Does the chain visit both modes? How many iterations does it take to jump between them?
- (Hard.) Implement a simple Hamiltonian Monte Carlo sampler (leapfrog integrator + accept/reject) for the bivariate normal. Compare its ESS to Metropolis at the same number of evaluations. This is the step between Metropolis and NUTS.
## Literature
- Gelman et al. BDA3, §11.1–11.4 (Gibbs and Metropolis).
- Roberts, G.O., Gelman, A., & Gilks, W.R. (1997). “Weak convergence and optimal scaling of random walk Metropolis algorithms.” Annals of Applied Probability. The 23.4% result.
- Neal, R. (2011). “MCMC using Hamiltonian dynamics.” Handbook of MCMC. The bridge from Metropolis to HMC.
## Where to Go Next

- `notebooks/bda/ch11_gibbs_metropolis.livemd` — the same algorithms on a toy bivariate normal.
- `notebooks/bda-cyber/ch05_eight_socs.livemd` — NUTS solving the problem that Gibbs and Metropolis cannot: the hierarchical funnel.