Powered by AppSignal & Oban Pro

ModernBERT sequence classification on Emily

modernbert_classification.livemd

ModernBERT sequence classification on Emily

Mix.install(
  [
    {:emily, "~> 0.4"},
    {:bumblebee, "~> 0.7"},
    {:tokenizers, "~> 0.5"},
    {:nx, "~> 0.12"},
    {:kino, "~> 0.14"}
  ],
  config: [
    nx: [default_backend: Emily.Backend]
  ]
)

Overview

This notebook runs a ModernBERT sequence-classification fine-tune on Emily.Backend. ModernBERT is one of the three new model families that landed with Bumblebee 0.7 — a long-context (8192 token) BERT successor with RoPE positional encoding, GeGLU activations, and alternating local/global attention. It’s the most interesting of the three for Emily because it’s the first encoder with rotary embeddings to land in the conformance suite, and it exercises both local and global attention paths in a single forward.

The integration with Emily is the Mix.install config line above; no further setup is required.

Loading the model

repo = "tasksource/ModernBERT-base-nli"

{:ok, model_info} = Bumblebee.load_model({:hf, repo})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, repo})

tasksource/ModernBERT-base-nli is an NLI fine-tune (entailment / neutral / contradiction across MNLI, ANLI, FEVER, …). Swap in any ModernBERT-based *ForSequenceClassification checkpoint — the Bumblebee 0.7 auto-detect resolves the architecture and the serving pipeline is identical. The base checkpoint is ~600 MB on first fetch.

Building a classification serving

serving =
  Bumblebee.Text.text_classification(model_info, tokenizer,
    top_k: 3,
    defn_options: [compiler: Emily.Compiler]
  )

Emily.Compiler pins the result backend to Emily.Backend and caps partition concurrency at 1. For an encoder this size, a single MLX command queue saturates the GPU on a single inference job — use Emily.Stream if you need parallel inferences.

Classifying a few examples

ModernBERT NLI fine-tunes expect the sentence-pair format [SEP]. Bumblebee’s tokenizer accepts a {premise, hypothesis} tuple and emits the right framing.

examples = [
  {"Elixir runs on the BEAM virtual machine.", "Elixir runs on a virtual machine."},
  {"Elixir runs on the BEAM virtual machine.", "Elixir is a compiled language with no runtime."},
  {"Cats sleep most of the day.", "Cats are nocturnal predators."}
]

Nx.Serving.run(serving, examples)

Each result is a %{predictions: [%{label: _, score: _} | _]} map. With this fine-tune the top label will be one of "entailment", "neutral", or "contradiction".

Local-vs-global attention path

ModernBERT alternates :sliding_attention (local window, default 128 tokens) with :full_attention (global, every block sees every other) across its 22 layers. Both paths run on Emily.Backend through the standard scaled-dot-product attention kernel — the local window is realised as an additive mask, so there’s no separate code path on the Emily side. Long-document inputs (~2k+ tokens) are where the saved compute on local layers actually shows up in throughput.

Telemetry

Emily emits :telemetry events at the evaluation boundary. Attach a handler to sample timing for each forward pass:

:telemetry.attach(
  "modernbert-cls",
  [:emily, :eval, :stop],
  fn _event, %{duration: duration}, _meta, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("eval #{ms} ms")
  end,
  nil
)

Nx.Serving.run(serving, hd(examples))

See Emily.Telemetry for the full event catalogue, including the [:emily, :block, :fallback] event that fires whenever an op routes through Nx.BinaryBackend.