DistilBERT question answering on Emily

livebooks/distilbert_qa.livemd

Mix.install(
  [
    {:emily, "~> 0.7"},
    {:bumblebee, "~> 0.7"},
    {:tokenizers, "~> 0.5"},
    {:nx, "~> 0.12"},
    {:kino, "~> 0.14"}
  ],
  config: [
    nx: [default_backend: Emily.Backend]
  ]
)

Overview

This notebook runs a DistilBERT question-answering pipeline on Emily.Backend. The backend is installed as the Nx global default by the Mix.install/2 config above, so every subsequent Nx call dispatches to MLX without further setup.

The tokenizer and model both come from Bumblebee. The only integration points with Emily are the Mix.install config line and, optionally, the Emily.Compiler attachment further down.
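As a quick sanity check, the global default can be inspected before loading anything. This is a minimal sketch that assumes the Mix.install/2 cell above has already run; it uses only Nx.default_backend/0, which returns a {module, options} tuple for the current process.

```elixir
# Assumes the Mix.install/2 cell above has already run in this notebook.
# Nx.default_backend/0 returns a {module, options} tuple.
{backend, _opts} = Nx.default_backend()
IO.inspect(backend, label: "default backend")

# Tensors created from here on are allocated on that backend.
t = Nx.tensor([1.0, 2.0, 3.0])
IO.inspect(Nx.sum(t), label: "sum")
```

If the printed module is not Emily.Backend, the config in the setup cell most likely did not take effect.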

Loading the model

{:ok, model_info} =
  Bumblebee.load_model({:hf, "distilbert-base-uncased-distilled-squad"})

{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased-distilled-squad"})

The checkpoint is ~250 MB on first fetch; subsequent runs use the Bumblebee cache at ~/.cache/bumblebee.

Building a serving

serving =
  Bumblebee.Text.question_answering(model_info, tokenizer,
    defn_options: [compiler: Emily.Compiler, native: true, native_fallback: :raise]
  )

Emily.Compiler pins the result backend to Emily.Backend and caps partition concurrency at 1 (for per-process concurrency, use Emily.Stream; see the other notebook). native: true lowers the whole forward pass through Emily’s native Expr compiler, so each call is a single NIF replay rather than op-by-op dispatch. native_fallback: :raise makes any op that cannot lower raise an error instead of silently degrading to the evaluator. This pipeline lowers fully native, so :raise never trips.

Running a query

context =
  "Elixir is a dynamic, functional programming language that runs on the Erlang VM. " <>
    "It was created by José Valim in 2011."

question = "Who created Elixir?"

Nx.Serving.run(serving, %{question: question, context: context})

The expected output is a map shaped like the following, where each underscore stands in for a value that depends on the run:

%{
  results: [
    %{text: "José Valim", start: _, end: _, score: _}
  ]
}
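To use the answer downstream, the result map can be reduced to its best candidate. The sketch below is illustrative only: QAResult is a hypothetical helper module, and the sample map uses made-up offsets and scores in the shape shown above rather than real model output.

```elixir
defmodule QAResult do
  # Hypothetical helper, not part of Bumblebee or Emily.

  # Returns {text, score} for the highest-scoring answer,
  # or nil when the results list is empty.
  def best(%{results: []}), do: nil

  def best(%{results: results}) do
    %{text: text, score: score} = Enum.max_by(results, & &1.score)
    {text, score}
  end
end

# Made-up values for illustration; a real run fills these in.
sample = %{
  results: [
    %{text: "José Valim", start: 0, end: 0, score: 0.93},
    %{text: "Valim", start: 0, end: 0, score: 0.05}
  ]
}

IO.inspect(QAResult.best(sample), label: "best answer")
```

In the notebook, the same call works on the real output: QAResult.best(Nx.Serving.run(serving, %{question: question, context: context})).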

Telemetry

Under native: true the forward is a single NIF replay, so the op-by-op [:emily, :eval, :stop] span never fires — there’s no per-op boundary to time. The native-compiler event to watch instead is [:emily, :compiler, :fallback]: a tripwire that fires only if an op can’t lower and routes through the evaluator. Attach it, run the forward, then read Emily.Memory.stats/0 (which itself emits [:emily, :memory, :stats]):

:telemetry.attach(
  "distilbert-qa-fallback",
  [:emily, :compiler, :fallback],
  fn _event, %{count: count}, %{reason: reason}, _config ->
    IO.puts("native fallback (#{count}): #{reason}")
  end,
  nil
)

Emily.Memory.reset_peak()
Nx.Serving.run(serving, %{question: question, context: context})
%{active: active, peak: peak} = Emily.Memory.stats()

IO.puts("no fallback above => forward lowered fully native")

IO.puts(
  "MLX memory — active #{div(active, 1024 * 1024)} MiB, " <>
    "peak #{div(peak, 1024 * 1024)} MiB"
)
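Reading the console for a missing message is easy to get wrong, so the tripwire can also be turned into a number. The following is a minimal sketch, not Emily's own tooling: FallbackCounter is a hypothetical module, it counts events rather than interpreting their measurements, and it assumes :telemetry is available as it is in this notebook. The :telemetry.execute/3 call only simulates one fallback so the sketch runs on its own.

```elixir
defmodule FallbackCounter do
  # Hypothetical helper, not part of Emily.
  @event [:emily, :compiler, :fallback]

  # Attaches a handler and returns the counter that it increments.
  def attach(id) do
    counter = :counters.new(1, [])
    :ok = :telemetry.attach(id, @event, &__MODULE__.handle/4, counter)
    counter
  end

  # Counts one per event, whatever the measurements and metadata contain.
  def handle(_event, _measurements, _metadata, counter) do
    :counters.add(counter, 1, 1)
  end

  def total(counter), do: :counters.get(counter, 1)
end

counter = FallbackCounter.attach("distilbert-qa-fallback-counter")

# In the notebook, run the forward pass here instead of simulating,
# e.g. with Nx.Serving.run/2 on the serving built earlier.
:telemetry.execute([:emily, :compiler, :fallback], %{count: 1}, %{reason: :simulated})

IO.puts("fallbacks observed: #{FallbackCounter.total(counter)}")
:telemetry.detach("distilbert-qa-fallback-counter")
```

With the simulated event replaced by a real forward pass, a total of 0 indicates that no fallback event fired during that run.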

See Emily.Telemetry for the full event catalogue, including the [:emily, :fallback, *] span that fires whenever an op routes through Nx.BinaryBackend.