
Shift detection — EWMA charts

livebooks/02_shift_ewma.livemd


The question: a metric that moved and stayed moved

You ship a deploy. Latency doesn’t spike — nothing pages — but it settles a little higher than before and stays there. A week later someone asks why the dashboard looks “off.” Each individual window looks fine: the new level is only about 1.5 standard deviations above where it used to sit, and a per-window “3-sigma tripwire” never sees a single point far enough out to complain.

A quick vocabulary pass, since the rest of this notebook leans on it:

  • mean (or target) — the average value when things are healthy.
  • standard deviation (written sigma, σ) — the typical size of the random wiggle around that mean. “1.5σ above target” means “one and a half typical wiggles above where it should be.”
  • variance — sigma squared; the same idea expressed as a squared quantity, which is what the math actually adds up.

This notebook builds the detector that catches the sustained moderate shift — the move that’s too small to trip a per-window alarm but too real to ignore.

Mix.install([
  {:mobius_smarts, path: Path.expand("..", __DIR__)},
  {:kino, "~> 0.14"},
  {:kino_vega_lite, "~> 0.1"}
])
alias VegaLite, as: Vl
alias MobiusSmarts.Detect.Shift

The impression: nudge a fraction of the way each window

The trick is to keep a running impression of the metric’s level. Every new window nudges the impression a fraction lambda of the way toward the new value:

impression = lambda * new_value + (1 - lambda) * previous_impression

In words: the impression is mostly its old self, with a small splash of the latest reading mixed in. Random noise nudges it up and down in equal measure, so those nudges cancel and the impression barely moves. This one line is the whole idea — it’s called an EWMA (exponentially weighted moving average). Here it is as a single Enum.scan over a noisy but flat series.

:rand.seed(:exsss, {1, 2, 3})

target = 50.0
sigma = 2.0
lambda = 0.2

flat = for _ <- 1..80, do: target + sigma * :rand.normal()

impression =
  Enum.scan(flat, target, fn x, prev -> lambda * x + (1.0 - lambda) * prev end)

raw_rows = for {v, i} <- Enum.with_index(flat), do: %{i: i, value: v, series: "raw"}

smoothed_rows =
  for {v, i} <- Enum.with_index(impression), do: %{i: i, value: v, series: "impression"}

Vl.new(width: 700, height: 300, title: "Noisy flat metric vs. its impression")
|> Vl.data_from_values(raw_rows ++ smoothed_rows)
|> Vl.mark(:line)
|> Vl.encode_field(:x, "i", type: :quantitative, title: "window")
|> Vl.encode_field(:y, "value", type: :quantitative, scale: [zero: false], title: "latency (ms)")
|> Vl.encode_field(:color, "series", type: :nominal)
|> Vl.encode(:opacity, value: 0.9)

Look at the two lines: the jagged one is the raw metric, the flatter one is the impression (the legend shows which color is which). Most of the noise is gone but the level, about 50, is intact. That flatter line is what we’ll watch instead of the raw data.

Lambda intuition: how much memory?

lambda is the only knob that matters here. It decides how much of each new reading mixes in:

  • lambda = 1.0 — keep all of the new value, none of the old. The impression is the raw series: no memory at all.
  • small lambda (say 0.05) — keep almost all of the old impression. Long memory, very smooth, slow to react.
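
To put numbers on “memory”: unrolling the update shows that a reading k windows old carries weight lambda · (1 - lambda)^k in the current impression. The sketch below is plain Elixir and not part of MobiusSmarts; the helper names weight and windows_for are made up for illustration. It asks how many of the most recent windows together supply 90% of the impression.

```elixir
# Hypothetical helpers for illustration; not part of MobiusSmarts.

# Weight of a reading that is k windows old (k = 0 is the newest reading).
weight = fn lam, k -> lam * :math.pow(1.0 - lam, k) end

# Smallest n such that the newest n readings carry at least `share` of the
# total weight. The newest n weights sum to 1 - (1 - lam)^n.
windows_for = fn lam, share ->
  if lam >= 1.0 do
    # No memory: the newest reading carries all of the weight.
    1
  else
    ceil(:math.log(1.0 - share) / :math.log(1.0 - lam))
  end
end

for lam <- [0.05, 0.2, 0.5, 1.0] do
  newest = Float.round(weight.(lam, 0), 3)
  n = windows_for.(lam, 0.9)
  IO.puts("lambda=#{lam}: newest reading weighs #{newest}, 90% of the weight sits in the last #{n} windows")
end
```

Under these assumptions, lambda = 0.2 draws 90% of the impression from roughly the last 11 windows, and lambda = 0.05 from roughly the last 45.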

Same flat data, four values of lambda:

ewma = fn lam ->
  Enum.scan(flat, target, fn x, prev -> lam * x + (1.0 - lam) * prev end)
end

lambda_rows =
  for lam <- [0.05, 0.2, 0.5, 1.0],
      {v, i} <- Enum.with_index(ewma.(lam)) do
    %{i: i, value: v, lambda: "λ=#{lam}"}
  end

Vl.new(width: 700, height: 300, title: "Same data, four lambdas")
|> Vl.data_from_values(lambda_rows)
|> Vl.mark(:line)
|> Vl.encode_field(:x, "i", type: :quantitative, title: "window")
|> Vl.encode_field(:y, "value", type: :quantitative, scale: [zero: false], title: "latency (ms)")
|> Vl.encode_field(:color, "lambda", type: :nominal)

The λ=1.0 line is the raw series exactly — it has no memory, so it’s just as jagged as the input. As lambda shrinks the line gets smoother and lags further behind. The library default is 0.2: enough memory to erase noise, little enough lag to react within a handful of windows.

The band: how far can the impression wander by chance?

A smooth line is nice, but to detect anything we need to know when the impression has moved more than chance alone would explain. That’s the band — an upper and lower control limit (UCL / LCL) around the target.

The clever part: because we know the recipe that built the impression, we can compute exactly how far it can wander under pure noise at each step. The variance of the impression at time t is:

Var(z_t) = sigma² · (lambda / (2 - lambda)) · (1 - (1 - lambda)^(2t))

In words: the spread of the impression grows from nearly zero and levels off. The (1 - (1 - lambda)^(2t)) factor is small early (a two-window-old impression hasn’t had time to wander, so even a small deviation early on is meaningful) and climbs to 1 as t grows, at which point the spread settles to a fixed sigma² · lambda / (2 - lambda). The band is just the target plus/minus L times the square root of that variance (L = 3 by default, the usual “3-sigma” width).
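
As a check on the formula, here is a minimal sketch that evaluates the band half-width directly in plain Elixir. It makes no claim about how Shift.chart/2 is implemented internally; half_width is a hypothetical helper, and the inputs are this notebook's sigma = 2.0, lambda = 0.2 and L = 3.

```elixir
# Hypothetical helper for illustration; not part of MobiusSmarts.
# Half-width of the band at step t (t = 1 is the first window):
#   L * sqrt(sigma^2 * lambda / (2 - lambda) * (1 - (1 - lambda)^(2t)))
half_width = fn sigma, lam, l, t ->
  variance =
    sigma * sigma * (lam / (2.0 - lam)) * (1.0 - :math.pow(1.0 - lam, 2 * t))

  l * :math.sqrt(variance)
end

for t <- [1, 2, 5, 12, 80] do
  IO.puts("t=#{t}: half-width #{Float.round(half_width.(2.0, 0.2, 3.0, t), 3)}")
end
```

With these inputs the half-width is 1.2 at t = 1 and approaches 3 · 2.0 · sqrt(0.2 / 1.8) = 2.0 as t grows.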

Shift.chart/2 computes all of this. (We pass target:/sigma: explicitly because this notebook invents them; on real Mobius data, hand the map from MobiusSmarts.Detect.Jump.baseline/3 straight to Shift.chart(avgs, baseline: baseline) — it picks the right noise scale, sigma_avg, itself.) Here it is on a healthy series:

:rand.seed(:exsss, {171, 342, 513})

healthy = for _ <- 1..80, do: target + sigma * :rand.normal()
healthy_chart = Shift.chart(healthy, target: target, sigma: sigma)

hs = Nx.to_flat_list(healthy_chart.smoothed)
hu = Nx.to_flat_list(healthy_chart.ucl)
hl = Nx.to_flat_list(healthy_chart.lcl)

# Half-width grows from narrow to a fixed value:
IO.inspect(Float.round(Enum.at(hu, 0) - target, 3), label: "half-width at t=1")
IO.inspect(Float.round(Enum.at(hu, 1) - target, 3), label: "half-width at t=2")
IO.inspect(Float.round(List.last(hu) - target, 3), label: "half-width at t=80")

band_rows =
  for i <- 0..(length(hs) - 1) do
    %{i: i, impression: Enum.at(hs, i), ucl: Enum.at(hu, i), lcl: Enum.at(hl, i)}
  end

Vl.new(width: 700, height: 300, title: "Healthy series with its time-varying band")
|> Vl.data_from_values(band_rows)
|> Vl.layers([
  Vl.new()
  |> Vl.mark(:line, color: "#888")
  |> Vl.encode_field(:x, "i", type: :quantitative, title: "window")
  |> Vl.encode_field(:y, "ucl", type: :quantitative, scale: [zero: false]),
  Vl.new()
  |> Vl.mark(:line, color: "#888")
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "lcl", type: :quantitative),
  Vl.new()
  |> Vl.mark(:line, color: "#1f77b4")
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "impression", type: :quantitative, title: "latency (ms)")
])

Look at the grey lines: the band starts narrow (half-width about 1.2 at the first window) and flares open to its settled half-width of 2.0 within a dozen windows. The blue impression stays comfortably inside; this series is healthy, so nothing fires.

The payoff: a sustained 1.5σ shift

Now the deploy story. The metric runs at target for 40 windows, then a deploy bumps it up by 1.5σ (3 ms) and it stays there. First, the naive per-window check that fails in the intro: is any single window more than 3σ from target?

shift_at = 40
shift_size_sigma = 1.5

:rand.seed(:exsss, {171, 342, 513})

shifted =
  for i <- 0..79 do
    base = if i >= shift_at, do: target + shift_size_sigma * sigma, else: target
    base + sigma * :rand.normal()
  end

# Naive 3-sigma per-window tripwire:
naive_alarms =
  shifted
  |> Enum.with_index()
  |> Enum.filter(fn {v, _i} -> abs(v - target) > 3 * sigma end)

IO.inspect(length(naive_alarms), label: "per-window 3σ alarms (whole series)")

Zero. The shift is real and sustained, but no single window pokes far enough out for a per-window tripwire to notice. Now the EWMA chart on the exact same data:

shift_chart = Shift.chart(shifted, target: target, sigma: sigma)

IO.inspect(shift_chart.first_violation, label: "first_violation index")
IO.inspect(shift_chart.first_violation - shift_at, label: "windows after the shift")

ss = Nx.to_flat_list(shift_chart.smoothed)
su = Nx.to_flat_list(shift_chart.ucl)
sl = Nx.to_flat_list(shift_chart.lcl)
sv = Nx.to_flat_list(shift_chart.violations)

base_rows =
  for i <- 0..(length(ss) - 1) do
    %{i: i, impression: Enum.at(ss, i), ucl: Enum.at(su, i), lcl: Enum.at(sl, i)}
  end

violation_rows =
  for {1, i} <- Enum.zip(sv, 0..(length(sv) - 1)), do: %{i: i, value: Enum.at(ss, i)}

fv_rows = [%{i: shift_chart.first_violation}]

Vl.new(width: 700, height: 340, title: "Sustained 1.5σ shift caught by the EWMA chart")
|> Vl.layers([
  Vl.new()
  |> Vl.data_from_values(base_rows)
  |> Vl.mark(:line, color: "#888")
  |> Vl.encode_field(:x, "i", type: :quantitative, title: "window")
  |> Vl.encode_field(:y, "ucl", type: :quantitative, scale: [zero: false]),
  Vl.new()
  |> Vl.data_from_values(base_rows)
  |> Vl.mark(:line, color: "#888")
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "lcl", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(base_rows)
  |> Vl.mark(:line, color: "#1f77b4")
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "impression", type: :quantitative, title: "latency (ms)"),
  Vl.new()
  |> Vl.data_from_values(violation_rows)
  |> Vl.mark(:point, color: "#d62728", filled: true, size: 45)
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "value", type: :quantitative),
  Vl.new()
  |> Vl.data_from_values(fv_rows)
  |> Vl.mark(:rule, color: "#d62728", stroke_dash: [4, 4])
  |> Vl.encode_field(:x, "i", type: :quantitative)
])

Look at where the blue impression line crosses the upper grey band: the red dashed rule marks first_violation at window 42 — just 2 windows after the shift landed at window 40. The red dots are every window the impression sits outside the band. The per-window check saw nothing; the impression was dragged steadily off target and tripped the band almost immediately.

The tradeoff: detection delay vs. shift size

Smaller lambda smooths harder and can catch smaller shifts, but it reacts later. Bigger shifts are caught faster. Move the sliders, then re-run the two cells below to see the delay change. The defaults reproduce the case above (lambda 0.2, shift 1.5σ).

lambda_input = Kino.Input.range("lambda", min: 0.05, max: 1.0, step: 0.05, default: 0.2)
shift_input = Kino.Input.range("shift size (sigma)", min: 0.5, max: 3.0, step: 0.5, default: 1.5)
chosen_lambda = Kino.Input.read(lambda_input)
chosen_shift = Kino.Input.read(shift_input)

:rand.seed(:exsss, {171, 342, 513})

tuned_series =
  for i <- 0..79 do
    base = if i >= shift_at, do: target + chosen_shift * sigma, else: target
    base + sigma * :rand.normal()
  end

tuned = Shift.chart(tuned_series, target: target, sigma: sigma, lambda: chosen_lambda)

delay =
  case tuned.first_violation do
    nil -> "never detected"
    fv -> "#{fv - shift_at} windows after the shift"
  end

Kino.DataTable.new([
  %{lambda: chosen_lambda, shift_sigma: chosen_shift, first_violation: tuned.first_violation, delay: delay}
])

Try lambda = 1.0 (no memory) with the default 1.5σ shift: it reports “never detected” — with no memory the impression is just the raw series, and we already saw the raw series never trips a 3σ band. Now drop the shift to 0.5σ at lambda = 0.2: also never detected, because half a sigma is buried in the noise. Small lambda plus a real sustained shift is the sweet spot.
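
A back-of-the-envelope version of the same tradeoff, with the noise removed: after k shifted windows the expected impression has moved shift · (1 - (1 - lambda)^k) toward the new level, and it has to clear the settled band of L · sqrt(lambda / (2 - lambda)), both measured in sigmas. The sketch below is plain Elixir; expected_delay is a hypothetical helper, and it assumes the shift lands after the band has settled.

```elixir
# Hypothetical helper for illustration; not part of MobiusSmarts.
# Noise-free estimate of how many shifted windows the impression must absorb
# (counting the first shifted window as 1) before its expected value clears
# the settled band. Everything is in units of sigma.
expected_delay = fn lam, shift_sigmas, l ->
  band = l * :math.sqrt(lam / (2.0 - lam))

  cond do
    # The impression can never rise above the shift itself.
    shift_sigmas <= band ->
      :never

    # No memory: the impression jumps to the new level in one window.
    lam >= 1.0 ->
      1

    true ->
      # Solve shift * (1 - (1 - lam)^k) > band for the smallest integer k.
      k = :math.log(1.0 - band / shift_sigmas) / :math.log(1.0 - lam)
      floor(k) + 1
  end
end

for {lam, shift} <- [{0.2, 1.5}, {0.2, 0.5}, {1.0, 1.5}, {0.2, 3.0}] do
  IO.inspect(expected_delay.(lam, shift, 3.0), label: "lambda=#{lam}, shift=#{shift}σ")
end
```

At lambda = 0.2 the settled band is 1.0σ, so a 1.5σ shift clears it in expectation after 5 shifted windows, while a 0.5σ shift never does. At lambda = 1.0 the band is the full 3σ, so a 1.5σ shift never clears it either. A particular run can trip earlier or later than this estimate, because noise also moves the impression.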

Bonus: the impression is your dashboard line

There’s a free win in all of this. The impression isn’t only a detector input; it’s also the natural line to draw on a dashboard. It’s the raw metric with most of the noise taken out, so a human glancing at it sees the level and the trend without the jitter. The impression line in the very first chart was already a better dashboard than the raw data underneath it.

Streaming form: one value at a time

In production you don’t have the whole series — values arrive one window at a time. Shift.new/1 and Shift.step/2 carry the impression forward with O(1) state per metric (just the current impression and a step counter), so it fits comfortably on a device. Here we replay the shifted series through the streaming API and print the step where the status first turns non-:ok.

state0 = Shift.new(target: target, sigma: sigma)

{first_streaming_violation, _final_state} =
  shifted
  |> Enum.with_index()
  |> Enum.reduce({nil, state0}, fn {x, idx}, {found, state} ->
    {status, next_state} = Shift.step(state, x)
    found = if found == nil and status != :ok, do: {idx, status}, else: found
    {found, next_state}
  end)

IO.inspect(first_streaming_violation, label: "{step, status} of first streaming violation")

It reports {42, :upper_violation} — the same window 42 the batch chart/2 found, reached by folding one value at a time instead of looking at the whole series at once.
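
To make the “O(1) state” claim concrete, here is a minimal sketch of what a streaming EWMA step could look like. EwmaSketch is a hypothetical stand-in written for this explanation, not MobiusSmarts' implementation: the struct fields, the :l option and the :lower_violation status are assumptions, chosen to mirror the :ok and :upper_violation statuses seen above.

```elixir
defmodule EwmaSketch do
  # Hypothetical stand-in for illustration; not MobiusSmarts' implementation.
  # The only values that change from step to step are z (the impression)
  # and t (the step counter).
  defstruct [:target, :sigma, :lambda, :l, :z, :t]

  def new(opts) do
    target = Keyword.fetch!(opts, :target)

    %__MODULE__{
      target: target,
      sigma: Keyword.fetch!(opts, :sigma),
      lambda: Keyword.get(opts, :lambda, 0.2),
      l: Keyword.get(opts, :l, 3.0),
      z: target,
      t: 0
    }
  end

  def step(%__MODULE__{} = s, x) do
    t = s.t + 1
    # Nudge the impression a fraction lambda of the way toward x.
    z = s.lambda * x + (1.0 - s.lambda) * s.z

    # Time-varying band half-width from the variance formula.
    spread = s.lambda / (2.0 - s.lambda) * (1.0 - :math.pow(1.0 - s.lambda, 2 * t))
    half_width = s.l * s.sigma * :math.sqrt(spread)

    status =
      cond do
        z > s.target + half_width -> :upper_violation
        z < s.target - half_width -> :lower_violation
        true -> :ok
      end

    {status, %{s | z: z, t: t}}
  end
end

state = EwmaSketch.new(target: 50.0, sigma: 2.0)
{status_a, state} = EwmaSketch.step(state, 50.5)
{status_b, _state} = EwmaSketch.step(state, 60.0)
IO.inspect({status_a, status_b}, label: "statuses")
```

The first reading leaves the impression at 50.1, well inside the 1.2 half-width, so the status is :ok. The second reading pulls it to 52.08, above the second-window limit of about 51.54, so the status is :upper_violation.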

Blind spots: where Shift hands off to its siblings

Shift sits in the middle of a size/speed gradient. It deliberately ignores two things its siblings catch.

A single huge spike barely moves the impression. Because each window only contributes a fraction lambda, one freak reading gets diluted. Here a healthy series gets one 4σ spike at window 40 — a value of 58 ms, which a per-window check flags loudly:

:rand.seed(:exsss, {7, 14, 21})

healthy2 = for _ <- 1..80, do: target + sigma * :rand.normal()
spike_idx = 40
spiked = List.update_at(healthy2, spike_idx, fn _ -> target + 4.0 * sigma end)

raw_spike_alarms = Enum.count(spiked, fn v -> abs(v - target) > 3 * sigma end)
spike_chart = Shift.chart(spiked, target: target, sigma: sigma)
sp = Nx.to_flat_list(spike_chart.smoothed)

IO.inspect(Enum.at(spiked, spike_idx), label: "raw value at the spike (ms)")
IO.inspect(raw_spike_alarms, label: "per-window 3σ alarms")
IO.inspect(Float.round(Enum.at(sp, spike_idx) - target, 2), label: "impression rise at the spike")
IO.inspect(spike_chart.first_violation, label: "EWMA first_violation")

The raw value hits 58 ms (4σ out) and a per-window tripwire fires once. But the impression only lifts about 1.33 ms, well inside its band half-width of 2.0, so first_violation comes back nil: no violation. That’s by design. A one-window spike is MobiusSmarts.Detect.Jump’s job (Shewhart chart, no memory, full reaction to single points); see 01_jump_shewhart_charts.livemd.
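
The dilution is simple arithmetic, sketched below in plain Elixir with this notebook's parameters. The names spike_lift, band and min_spike are made up for illustration, and the calculation assumes the impression sits exactly on target just before the spike and that the band has settled.

```elixir
# Illustration only, using this notebook's parameters.
lambda = 0.2
sigma = 2.0
l = 3.0

# A lone spike of s sigmas moves an on-target impression by lambda * s * sigma.
spike_lift = fn spike_sigmas -> lambda * spike_sigmas * sigma end

# Settled half-width of the band, in ms.
band = l * sigma * :math.sqrt(lambda / (2.0 - lambda))

# Smallest lone spike (in sigmas) whose lift would exceed the settled band:
# lambda * s > L * sqrt(lambda / (2 - lambda))  =>  s > L / sqrt(lambda * (2 - lambda))
min_spike = l / :math.sqrt(lambda * (2.0 - lambda))

IO.inspect(Float.round(spike_lift.(4.0), 2), label: "lift from a 4σ spike (ms)")
IO.inspect(Float.round(band, 2), label: "settled band half-width (ms)")
IO.inspect(Float.round(min_spike, 2), label: "lone spike needed to trip the band (σ)")
```

A 4σ spike lifts an on-target impression by 1.6 ms against a 2.0 ms band, and a lone spike would need to exceed about 5σ to trip the band on its own. The 1.33 ms quoted above is a little under 1.6, which is what you would expect if the impression sat slightly below target just before the spike.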

spike_rows = for {v, i} <- Enum.with_index(spiked), do: %{i: i, value: v, series: "raw"}

impression_rows =
  for {v, i} <- Enum.with_index(sp), do: %{i: i, value: v, series: "impression"}

band_only =
  for {u, i} <- Enum.with_index(Nx.to_flat_list(spike_chart.ucl)), do: %{i: i, ucl: u}

Vl.new(width: 700, height: 300, title: "A 4σ spike: loud in the raw line, a shrug in the impression")
|> Vl.layers([
  Vl.new()
  |> Vl.data_from_values(spike_rows ++ impression_rows)
  |> Vl.mark(:line)
  |> Vl.encode_field(:x, "i", type: :quantitative, title: "window")
  |> Vl.encode_field(:y, "value", type: :quantitative, scale: [zero: false], title: "latency (ms)")
  |> Vl.encode_field(:color, "series", type: :nominal),
  Vl.new()
  |> Vl.data_from_values(band_only)
  |> Vl.mark(:line, color: "#888", stroke_dash: [4, 4])
  |> Vl.encode_field(:x, "i")
  |> Vl.encode_field(:y, "ucl", type: :quantitative)
])

Look at window 40: the raw line leaps to the top of the chart while the impression only ticks up and stays under the dashed band, then settles right back. One spike isn’t a shift.

A very small, slow drift takes ages. At the other end, a creep of a fraction of a sigma over many windows nudges the impression so gently it may never clear the band. That’s MobiusSmarts.Detect.Drift’s job (CUSUM, full memory: it accumulates tiny deviations instead of discounting them); see 03_drift_cusum.livemd. Run all three in parallel and each covers the others’ blind spots.
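
To see “accumulate” versus “discount” side by side, here is a minimal noise-free sketch. It uses the textbook one-sided CUSUM recurrence with the conventional slack k = 0.5 and threshold h = 5 (both in sigmas); those values and the constant 0.75σ offset are assumptions chosen for illustration, and this is not the API or the defaults of MobiusSmarts.Detect.Drift.

```elixir
# Illustration only: a constant offset of 0.75 sigma, with no noise,
# expressed as deviations from target in units of sigma.
offsets = List.duplicate(0.75, 40)

# Textbook one-sided CUSUM: add up whatever exceeds the slack k, never go
# below zero, alarm once the running sum passes h.
k = 0.5
h = 5.0
cusum = Enum.scan(offsets, 0.0, fn d, s -> max(0.0, s + d - k) end)
cusum_alarm = Enum.find_index(cusum, &(&1 > h))

# EWMA on the same deviations, compared against its settled 3-sigma band.
ewma = Enum.scan(offsets, 0.0, fn d, z -> 0.2 * d + 0.8 * z end)
ewma_band = 3.0 * :math.sqrt(0.2 / (2.0 - 0.2))
ewma_alarm = Enum.find_index(ewma, &(&1 > ewma_band))

IO.inspect(cusum_alarm, label: "CUSUM first alarm index")
IO.inspect(ewma_alarm, label: "EWMA first alarm index")
```

The CUSUM sum gains 0.25 per window and passes 5 at index 20, the 21st window. The EWMA impression levels off below 0.75σ, short of its 1.0σ band, so it never alarms on this input.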