
Hierarchical Weibull Reliability

notebooks/10_reliability_weibull.livemd

Setup


# CPU only — no GPU required
System.put_env("EXLA_CPU_ONLY", "true")
System.put_env("CUDA_VISIBLE_DEVICES", "")
Mix.install([
  {:exmc, path: Path.expand("../", __DIR__)},
  {:exla, "~> 0.10"},
  {:kino_vega_lite, "~> 0.1"}
])
Application.put_env(:exla, :clients, host: [platform: :host])
Application.put_env(:exla, :default_client, :host)
Nx.default_backend(Nx.BinaryBackend)
Nx.Defn.default_options(compiler: EXLA, client: :host)

Why This Matters

You shipped 1,000 units. After 12 months, 47 have failed. The rest are still running. What is the failure rate? When should you expect the next warranty claim wave?

A naive estimate: 47/1000 = 4.7% failure rate. This ignores the 953 units that haven’t failed yet — they carry information too. Their survival is evidence that the failure rate isn’t as high as the early failures suggest. Ignoring them biases every estimate upward.
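To see the bias concretely, here is a toy sketch. The numbers are invented to match the scenario above (failures averaging 6 months, survivors observed for 12 months), and the estimator is a simple exponential MLE, not the Weibull model used later: the failure rate computed from the 47 failure times alone versus from the total time at risk including survivors.

```elixir
# Hypothetical numbers: 47 failures by 12 months, 953 still running.
# Assume (illustratively) that failures averaged 6 months in service.
failed_times = List.duplicate(6.0, 47)
censored_times = List.duplicate(12.0, 953)

# Exponential MLE of the failure rate: failures / total time at risk.
# Ignoring survivors drops 953 * 12 unit-months from the denominator.
rate_ignoring = length(failed_times) / Enum.sum(failed_times)
rate_with_cens = length(failed_times) / (Enum.sum(failed_times) + Enum.sum(censored_times))

IO.puts("rate ignoring survivors: #{Float.round(rate_ignoring, 4)} /month")
IO.puts("rate with censoring:     #{Float.round(rate_with_cens, 4)} /month")
```

The survivor-blind estimate is roughly 40x too pessimistic here, which is exactly the upward bias described above.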

Survival analysis handles this through censoring: units still running are right-censored observations. The Weibull model captures the shape of the failure curve — increasing hazard (wear-out), decreasing hazard (infant mortality), or constant (random failures). A hierarchical Weibull pools across component types, so a new product with 5 failures borrows from the fleet’s 500. The posterior over shape and scale gives you P(failure before 24 months) — the number the warranty department actually needs.
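The warranty number falls straight out of the Weibull CDF, F(t) = 1 - exp(-(t/lambda)^k). A minimal sketch with hypothetical shape and scale values (in months) standing in for a posterior draw:

```elixir
# Weibull CDF: probability of failure before time t
weibull_cdf = fn t, k, lambda ->
  1.0 - :math.exp(-:math.pow(t / lambda, k))
end

# Illustrative values for one component type: k > 1 means wear-out
k = 1.8
lambda = 60.0
p24 = weibull_cdf.(24.0, k, lambda)
IO.puts("P(failure before 24 months) = #{Float.round(p24, 4)}")
```

In the full model you would evaluate this over every posterior draw of (k_j, lambda_j) and report the resulting distribution, not a single point.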

The Reliability Model

Hierarchical Weibull reliability models failure times across component types with partial pooling of shape and scale parameters. This is a common industrial pattern for fleet-level reliability estimation where individual component types have limited data.
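The mechanics of partial pooling can be illustrated with a toy normal-normal shrinkage calculation (all numbers invented, and this is a simplification of what the hierarchical posterior actually does): a sparse type's estimate is pulled toward the fleet mean, with weight set by relative precision.

```elixir
# Toy shrinkage: precision-weighted average of a type's own estimate
# and the fleet-level mean. All numbers are illustrative.
fleet_mean = 2.0
fleet_var = 0.25      # between-type variance (hyperprior spread)
type_obs_mean = 3.0
type_obs_var = 1.0    # sampling variance for a type with few failures

w = 1 / type_obs_var / (1 / type_obs_var + 1 / fleet_var)
pooled = w * type_obs_mean + (1 - w) * fleet_mean
IO.puts("pooled estimate: #{pooled}")
```

A noisy type (large `type_obs_var`) gets a small weight `w` and shrinks hard toward the fleet; a well-observed type keeps its own estimate.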

This is a d=44 model: 4 hyperparameters + 20 types × 2 NCP params (shape + scale).

Model structure:

  • log_k_mean ~ Normal(0.5, 1.0) — log-shape hyperprior mean
  • log_k_sigma ~ HalfCauchy(1.0) — log-shape spread
  • log_l_mean ~ Normal(2.0, 1.0) — log-scale hyperprior mean
  • log_l_sigma ~ HalfCauchy(1.0) — log-scale spread
  • k_raw_j, l_raw_j ~ Normal(0, 1) for j = 0..19 — NCP shape/scale params
  • k_j = exp(log_k_mean + log_k_sigma * k_raw_j) — Weibull shape for type j
  • lambda_j = exp(log_l_mean + log_l_sigma * l_raw_j) — Weibull scale for type j
  • t_{ij} ~ Weibull(k_j, lambda_j) with 20% right-censored observations
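The last line of the structure above can be simulated directly via inverse-CDF sampling: if U ~ Uniform(0,1), then lambda * (-ln U)^(1/k) ~ Weibull(k, lambda). The actual generator lives in benchmark/reliability_data.exs; the parameters and censoring horizon below are illustrative only.

```elixir
# Inverse-CDF draw from Weibull(k, lambda)
sample_weibull = fn k, lambda ->
  u = :rand.uniform()
  lambda * :math.pow(-:math.log(u), 1.0 / k)
end

# Illustrative parameters; right-censor at a fixed observation horizon
k = 1.5
lambda = 10.0
horizon = 12.0

:rand.seed(:exsss, {42, 42, 42})

obs =
  for _ <- 1..50 do
    t = sample_weibull.(k, lambda)
    if t > horizon, do: {:censored, horizon}, else: {:failed, t}
  end

n_cens = Enum.count(obs, &match?({:censored, _}, &1))
IO.puts("censored: #{n_cens}/50")
```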
Code.require_file("../benchmark/reliability_data.exs", __DIR__)
Code.require_file("../benchmark/reliability_model.exs", __DIR__)

data = Exmc.Benchmark.ReliabilityData.generate(seed: 42)

IO.puts("Component types: #{data.n_types}")
IO.puts("Observations per type: #{data.n_per_type}")

# Show censoring summary
total_obs = data.n_types * data.n_per_type
total_cens = Enum.sum(Enum.map(data.obs_by_type, & &1.n_cens))
IO.puts("Total observations: #{total_obs} (#{total_cens} right-censored, #{Float.round(100 * total_cens / total_obs, 1)}%)")

Building and Sampling

The model uses the new Exmc.Dist.Weibull distribution with right-censoring support via Exmc.Dist.Censored. The Weibull survival function S(t) = exp(-(t/lambda)^k) gives the log-likelihood contribution for censored observations.
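That split can be written out directly: uncensored observations contribute the Weibull log-density, censored observations the log-survival log S(t) = -(t/lambda)^k. A standalone sketch of the likelihood (not the Exmc.Dist.Censored internals), with made-up data:

```elixir
# Weibull log-pdf: log k - log lambda + (k-1) log(t/lambda) - (t/lambda)^k
log_pdf = fn t, k, l ->
  z = t / l
  :math.log(k) - :math.log(l) + (k - 1) * :math.log(z) - :math.pow(z, k)
end

# Right-censored contribution is the log-survival function
log_surv = fn t, k, l -> -:math.pow(t / l, k) end

# Total log-likelihood: density terms for failures, survival terms for censored
loglik = fn failures, censored, k, l ->
  Enum.sum(Enum.map(failures, &log_pdf.(&1, k, l))) +
    Enum.sum(Enum.map(censored, &log_surv.(&1, k, l)))
end

ll = loglik.([3.0, 7.5, 11.2], [12.0, 12.0], 1.5, 10.0)
IO.puts("log-likelihood: #{Float.round(ll, 3)}")
```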

ir = Exmc.Benchmark.ReliabilityModel.build(data)
init = Exmc.Benchmark.ReliabilityModel.init_values(data)

IO.puts("Free parameters: #{map_size(init)}")
t0 = System.monotonic_time(:millisecond)

{trace, stats} =
  Exmc.NUTS.Sampler.sample(ir, init,
    num_warmup: 1000,
    num_samples: 1000,
    seed: 42,
    ncp: false # model is already non-centered by hand (k_raw/l_raw), so skip sampler-side NCP
  )

wall_s = (System.monotonic_time(:millisecond) - t0) / 1000.0
IO.puts("Wall time: #{Float.round(wall_s, 1)}s")
IO.puts("Step size: #{Float.round(stats.step_size, 4)}")
IO.puts("Divergences: #{stats.divergences}")

Hyperprior Posteriors

alias Exmc.Diagnostics

hyperparams = Exmc.Benchmark.ReliabilityModel.reconstruct_params(trace, data)

for {name, tensor} <- hyperparams do
  samples = Nx.to_flat_list(tensor)
  mean = Enum.sum(samples) / length(samples)
  ess = Diagnostics.ess(samples)
  IO.puts("#{name}: mean=#{Float.round(mean, 3)}, ESS=#{Float.round(ess, 0)}")
end

Per-Type Shape and Scale Recovery

# Reconstruct per-type Weibull parameters from NCP trace
type_params =
  Enum.map(0..(data.n_types - 1), fn j ->
    log_k_mean = Nx.to_flat_list(trace["log_k_mean"])
    log_k_sigma = Nx.to_flat_list(trace["log_k_sigma"])
    k_raw = Nx.to_flat_list(trace["k_raw_#{j}"])

    log_l_mean = Nx.to_flat_list(trace["log_l_mean"])
    log_l_sigma = Nx.to_flat_list(trace["log_l_sigma"])
    l_raw = Nx.to_flat_list(trace["l_raw_#{j}"])

    k_samples =
      Enum.zip_with([log_k_mean, log_k_sigma, k_raw], fn [m, s, z] ->
        :math.exp(m + s * z)
      end)

    l_samples =
      Enum.zip_with([log_l_mean, log_l_sigma, l_raw], fn [m, s, z] ->
        :math.exp(m + s * z)
      end)

    k_mean = Enum.sum(k_samples) / length(k_samples)
    l_mean = Enum.sum(l_samples) / length(l_samples)

    %{
      type: j,
      k_post: k_mean,
      l_post: l_mean,
      k_true: Enum.at(data.true_shapes, j),
      l_true: Enum.at(data.true_scales, j)
    }
  end)
alias VegaLite, as: Vl

# Shape recovery plot
Vl.new(width: 500, height: 400, title: "Weibull Shape (k): True vs Posterior")
|> Vl.data_from_values(type_params)
|> Vl.mark(:circle, size: 80, opacity: 0.8)
|> Vl.encode_field(:x, "k_true", type: :quantitative, title: "True k")
|> Vl.encode_field(:y, "k_post", type: :quantitative, title: "Posterior Mean k")
# Scale recovery plot
Vl.new(width: 500, height: 400, title: "Weibull Scale (λ): True vs Posterior")
|> Vl.data_from_values(type_params)
|> Vl.mark(:circle, size: 80, opacity: 0.8, color: "orange")
|> Vl.encode_field(:x, "l_true", type: :quantitative, title: "True λ")
|> Vl.encode_field(:y, "l_post", type: :quantitative, title: "Posterior Mean λ")

Hazard Curves

The hazard function h(t) = (k/lambda) * (t/lambda)^(k-1) gives the instantaneous failure rate. For k > 1, hazard increases (wear-out); for k < 1, it decreases (infant mortality).

# Plot hazard curves for 5 selected component types
selected_types = [0, 4, 9, 14, 19]
t_grid = Enum.map(0..100, &(&1 * 0.3))

hazard_data =
  Enum.flat_map(selected_types, fn j ->
    tp = Enum.at(type_params, j)
    k = tp.k_post
    l = tp.l_post

    Enum.map(t_grid, fn t ->
      t = max(t, 0.01)
      h = (k / l) * :math.pow(t / l, k - 1)
      %{t: t, hazard: min(h, 5.0), type: "Type #{j} (k=#{Float.round(k, 1)})"}
    end)
  end)

Vl.new(width: 600, height: 400, title: "Estimated Hazard Curves by Component Type")
|> Vl.data_from_values(hazard_data)
|> Vl.mark(:line)
|> Vl.encode_field(:x, "t", type: :quantitative, title: "Time")
|> Vl.encode_field(:y, "hazard", type: :quantitative, title: "Hazard Rate h(t)")
|> Vl.encode_field(:color, "type", type: :nominal, title: "Component Type")