DistilBERT question answering on Emily
Mix.install(
  [
    {:emily, "~> 0.7"},
    {:bumblebee, "~> 0.7"},
    {:tokenizers, "~> 0.5"},
    {:nx, "~> 0.12"},
    {:kino, "~> 0.14"}
  ],
  config: [
    nx: [default_backend: Emily.Backend]
  ]
)
Overview
This notebook runs a DistilBERT question-answering pipeline on
Emily.Backend. The backend is installed as the Nx global default by
the Mix.install/2 config above, so every subsequent Nx call
dispatches to MLX without further setup.
The featurizer, tokenizer, and model all come from Bumblebee. The
only integration with Emily is the Mix.install config line and,
optionally, the Emily.Compiler attachment further down.
Loading the model
{:ok, model_info} =
  Bumblebee.load_model({:hf, "distilbert-base-uncased-distilled-squad"})
{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "distilbert-base-uncased-distilled-squad"})
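The checkpoint is ~250 MB on first fetch; subsequent runs reuse the
Bumblebee cache. The cache lives at ~/.cache/bumblebee on Linux; the
default location is OS-dependent and can be overridden with the
BUMBLEBEE_CACHE_DIR environment variable.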
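To check where the checkpoint will land before downloading it, the lookup can be approximated with the standard library alone. This is a minimal sketch, assuming Bumblebee honours BUMBLEBEE_CACHE_DIR and otherwise falls back to the operating system's user cache directory; it does not call Bumblebee, and cache_dir is just an illustrative variable name. Bumblebee.cache_dir/0 should report the authoritative value.

```elixir
# Assumption: this mirrors how Bumblebee is understood to pick its cache
# directory; it does not call Bumblebee itself.
cache_dir =
  case System.get_env("BUMBLEBEE_CACHE_DIR") do
    # No override: use the OS-specific user cache directory.
    nil -> to_string(:filename.basedir(:user_cache, "bumblebee"))
    # Override set: expand it to an absolute path.
    dir -> Path.expand(dir)
  end

IO.puts("Assumed Bumblebee cache directory: #{cache_dir}")
IO.puts("Already populated? #{File.dir?(cache_dir)}")
```

On macOS the fallback typically resolves under ~/Library/Caches rather than ~/.cache, which is worth knowing when clearing a stale download.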
Building a serving
serving =
  Bumblebee.Text.question_answering(model_info, tokenizer,
    defn_options: [compiler: Emily.Compiler, native: true, native_fallback: :raise]
  )
Emily.Compiler pins the result backend to Emily.Backend and caps
partition concurrency at 1 (use Emily.Stream for per-process
concurrency; see the other notebook). native: true lowers the
whole forward pass through Emily’s native Expr compiler, so each call
is a single NIF replay rather than op-by-op dispatch.
native_fallback: :raise makes any op that cannot lower raise an error
instead of silently degrading to the evaluator. This pipeline lowers
fully native, so :raise never trips.
Running a query
context =
  "Elixir is a dynamic, functional programming language that runs on the Erlang VM. " <>
    "It was created by José Valim in 2011."
question = "Who created Elixir?"
Nx.Serving.run(serving, %{question: question, context: context})
The expected output is a map shaped like:
%{
  results: [
    %{text: "José Valim", start: _, end: _, score: _}
  ]
}
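Given that shape, the answer text can be pulled out with a single pattern match. The sketch below uses a hand-written result map so it runs without the model; the start, end, and score values are illustrative placeholders, not real model output.

```elixir
# Hypothetical result, written by hand to match the shape shown above.
# The offsets and score are placeholders, not values produced by the model.
result = %{
  results: [
    %{text: "José Valim", start: 99, end: 109, score: 0.98}
  ]
}

# Destructure the first (top) answer; this raises MatchError if the list is empty.
%{results: [%{text: answer, score: score} | _rest]} = result

IO.puts("answer: #{answer} (score #{Float.round(score, 2)})")
```

In the notebook, result would instead be the return value of Nx.Serving.run/2.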
Telemetry
Under native: true the forward is a single NIF replay, so the
op-by-op [:emily, :eval, :stop] span never fires — there’s no
per-op boundary to time. The native-compiler event to watch instead
is [:emily, :compiler, :fallback]: a tripwire that fires only if an
op can’t lower and routes through the evaluator. Attach it, run the
forward, then read Emily.Memory.stats/0 (which itself emits
[:emily, :memory, :stats]):
:telemetry.attach(
  "distilbert-qa-fallback",
  [:emily, :compiler, :fallback],
  fn _event, %{count: count}, %{reason: reason}, _config ->
    IO.puts("native fallback (#{count}): #{reason}")
  end,
  nil
)
Emily.Memory.reset_peak()
Nx.Serving.run(serving, %{question: question, context: context})
%{active: active, peak: peak} = Emily.Memory.stats()
IO.puts("no fallback above => forward lowered fully native")
IO.puts(
  "MLX memory — active #{div(active, 1024 * 1024)} MiB, " <>
    "peak #{div(peak, 1024 * 1024)} MiB"
)
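The telemetry library logs a notice when a handler is an anonymous function and recommends a captured module function instead. A possible variant of the fallback handler is sketched below. FallbackLogger is a hypothetical module name, the event shape (a count measurement and a reason in the metadata) is taken from the handler above, and inspect/1 is used so that a non-string reason still prints. The direct call at the end feeds it a synthetic event, so the sketch runs without Emily or :telemetry loaded.

```elixir
defmodule FallbackLogger do
  # Hypothetical module; the event name and keys follow the handler used above.
  def handle_event([:emily, :compiler, :fallback], %{count: count}, %{reason: reason}, _config) do
    message = "native fallback (#{count}): #{inspect(reason)}"
    IO.puts(message)
    # Returned only to make the handler easy to check; :telemetry ignores it.
    message
  end
end

# In the notebook, attach it in place of the anonymous function:
# :telemetry.attach(
#   "distilbert-qa-fallback",
#   [:emily, :compiler, :fallback],
#   &FallbackLogger.handle_event/4,
#   nil
# )

# Synthetic event, purely to exercise the handler; :example is a made-up reason.
logged =
  FallbackLogger.handle_event(
    [:emily, :compiler, :fallback],
    %{count: 1},
    %{reason: :example},
    nil
  )
```

With native_fallback: :raise set on the serving, this handler is a second line of defence; it matters more if the option is relaxed.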
See Emily.Telemetry for the full event catalogue, including the
[:emily, :fallback, *] span that fires whenever an op routes
through Nx.BinaryBackend.