SmolLM3-3B chat generation on Emily

notebooks/smollm3_chat.livemd

Mix.install(
  [
    {:emily, "~> 0.4"},
    {:bumblebee, "~> 0.7"},
    {:tokenizers, "~> 0.5"},
    {:nx, "~> 0.12"},
    {:kino, "~> 0.14"}
  ],
  config: [
    nx: [default_backend: Emily.Backend]
  ]
)

Overview

This notebook loads HuggingFaceTB/SmolLM3-3B through Bumblebee and greedy-decodes a chat-style completion on Emily.Backend. SmolLM3 is one of the three new model families that landed with Bumblebee 0.7 — a 3B-parameter Llama-style decoder with GQA, RoPE, and RMSNorm. It targets the “small but capable” niche between Qwen3-0.6B and Qwen3-4B.

The checkpoint is ~6 GB on first fetch. Budget several minutes for the cold run.
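The ~6 GB figure matches a back-of-the-envelope estimate. The sketch below assumes roughly 3 billion parameters stored as 16-bit floats; both inputs are approximations and are not read from the checkpoint itself.

```elixir
# Rough download-size estimate. Both inputs are assumptions, not measured values.
params = 3.0e9          # ~3B parameters
bytes_per_param = 2     # 16-bit weights (e.g. bf16)

approx_gb = params * bytes_per_param / 1.0e9

IO.puts("~#{approx_gb} GB")
```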

Loading the model

{:ok, model_info} =
  Bumblebee.load_model({:hf, "HuggingFaceTB/SmolLM3-3B"})

{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "HuggingFaceTB/SmolLM3-3B"})

{:ok, generation_config} =
  Bumblebee.load_generation_config({:hf, "HuggingFaceTB/SmolLM3-3B"})

Bumblebee 0.7’s auto-detect maps SmolLM3ForCausalLM to {Bumblebee.Text.SmolLm3, :for_causal_language_modeling}, so no module:/architecture: overrides are needed.

Building a generation serving

config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 128,
    strategy: %{type: :greedy_search}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, config,
    defn_options: [compiler: Emily.Compiler]
  )

Emily.Compiler pins the result backend to Emily.Backend and caps partition concurrency at 1. Per-process parallelism is the job of Emily.Stream (see further down).
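The `:greedy_search` strategy configured above picks the highest-scoring token at every step and stops at an end-of-sequence token or after `max_new_tokens`. The toy sketch below shows that loop in pure Elixir. `GreedySketch`, `logits_fun`, and `:eos_id` are illustrative names, and this is not Bumblebee's implementation, only the idea behind it.

```elixir
defmodule GreedySketch do
  # Toy greedy decoder. `logits_fun` maps the token ids seen so far to a list
  # of scores, one per vocabulary id (hypothetical stand-in for a model).
  def decode(prompt_ids, logits_fun, opts) do
    max_new = Keyword.fetch!(opts, :max_new_tokens)
    eos = Keyword.fetch!(opts, :eos_id)
    loop(prompt_ids, [], logits_fun, max_new, eos)
  end

  # Token budget exhausted: return what was generated.
  defp loop(_ids, acc, _fun, 0, _eos), do: Enum.reverse(acc)

  defp loop(ids, acc, fun, remaining, eos) do
    # Argmax over the scores: pair each score with its id, keep the best.
    {_score, next} =
      ids
      |> fun.()
      |> Enum.with_index()
      |> Enum.max_by(fn {score, _id} -> score end)

    if next == eos do
      Enum.reverse(acc)
    else
      loop(ids ++ [next], [next | acc], fun, remaining - 1, eos)
    end
  end
end

# A fake "model" over a 3-token vocabulary where id 2 is end-of-sequence.
fake_logits = fn ids ->
  case length(ids) do
    1 -> [0.1, 0.9, 0.0]
    2 -> [0.7, 0.2, 0.1]
    _ -> [0.0, 0.0, 1.0]
  end
end

generated = GreedySketch.decode([0], fake_logits, max_new_tokens: 5, eos_id: 2)
```

Because every step is a plain argmax, the same prompt always yields the same completion, which makes greedy decoding convenient for smoke-testing a backend.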

Running a chat-style completion

SmolLM3 is instruction-tuned around the <|im_start|> / <|im_end|> chat template. Bumblebee 0.7 doesn’t ship a Jinja runtime, so the framing has to be assembled in plain Elixir — the helper below mirrors the official chat template closely enough for the common system / user / assistant case.

defmodule ChatPrompt do
  def build(messages, opts \\ []) do
    think? = Keyword.get(opts, :think, false)
    body = Enum.map_join(messages, "\n", &format/1)
    # An empty think block tells the model to skip its reasoning prelude.
    think_tag = if think?, do: "", else: "<think>\n\n</think>\n\n"
    body <> "\n<|im_start|>assistant\n" <> think_tag
  end

  defp format(%{role: role, content: content}) do
    "<|im_start|>" <> role <> "\n" <> content <> "<|im_end|>"
  end
end

messages = [
  %{role: "system", content: "You are a terse assistant. Answer in one sentence."},
  %{role: "user", content: "Why is Elixir well suited to building chat servers?"}
]

prompt = ChatPrompt.build(messages)

%{results: [%{text: reply}]} = Nx.Serving.run(serving, prompt)

reply
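To see what the model actually receives, the framing can be rendered without loading any weights. This stand-alone sketch restates the per-message format from `ChatPrompt` in an anonymous function and leaves out the reasoning prefix; the short messages are made up for illustration.

```elixir
# Same per-message framing as ChatPrompt.format/1, restated so this cell
# runs on its own.
format = fn %{role: role, content: content} ->
  "<|im_start|>" <> role <> "\n" <> content <> "<|im_end|>"
end

demo_messages = [
  %{role: "system", content: "Be brief."},
  %{role: "user", content: "Hi"}
]

# Join the turns, then open an assistant turn for the model to complete.
rendered = Enum.map_join(demo_messages, "\n", format) <> "\n<|im_start|>assistant\n"

IO.puts(rendered)
```

Each turn is wrapped in `<|im_start|>role` and `<|im_end|>`, and the trailing open assistant turn is what cues the model to answer.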

> Reasoning mode. SmolLM3 is a hybrid reasoning model: by default it emits a <think>...</think> block before the answer. The ChatPrompt helper injects an empty <think>\n\n</think> after the assistant role to short-circuit that prelude, which is what you want for plain chat. Pass think: true to keep the chain-of-thought output.
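With `think: true` the reply carries the reasoning and the answer in one string. The sketch below separates the two, assuming the reasoning is closed by a literal `</think>` tag and may or may not start with `<think>`. `ReplySplitter` is a hypothetical helper, not part of Bumblebee or Emily, and whether the tags survive decoding depends on the tokenizer settings.

```elixir
defmodule ReplySplitter do
  @open "<think>"
  @close "</think>"

  # Returns {reasoning, answer}. Reasoning is nil when no closing tag is found.
  def split(reply) when is_binary(reply) do
    case String.split(reply, @close, parts: 2) do
      [reasoning, answer] ->
        {clean(reasoning), String.trim(answer)}

      [answer] ->
        {nil, String.trim(answer)}
    end
  end

  # Drops surrounding whitespace and an optional leading open tag.
  defp clean(reasoning) do
    reasoning
    |> String.trim()
    |> String.replace_prefix(@open, "")
    |> String.trim()
  end
end

{reasoning, answer} =
  ReplySplitter.split("<think>\nProcesses are cheap.\n</think>\n\nBecause of the BEAM.")
```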

Concurrent serving via Emily.Stream

For concurrent inference on a shared model, each worker should own its own MLX command queue. Emily.Stream.with_stream/2 does that per-process:

task1 =
  Task.async(fn ->
    Emily.Stream.with_stream(Emily.Stream.new(:gpu), fn ->
      Nx.Serving.run(serving, prompt)
    end)
  end)

task2 =
  Task.async(fn ->
    Emily.Stream.with_stream(Emily.Stream.new(:gpu), fn ->
      Nx.Serving.run(serving, prompt)
    end)
  end)

{Task.await(task1, :infinity), Task.await(task2, :infinity)}

Weights are shared across streams — no duplication — so the memory cost of adding a stream is the Metal command buffer, not the model. Create streams once at worker init, not per-request.
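One way to follow the "create once at worker init" advice is to give each worker process a long-lived stream in its `init/1`. The sketch below is an illustration under stated assumptions: `StreamWorker` is a hypothetical module, and the only Emily calls it relies on are `Emily.Stream.new/1` and `Emily.Stream.with_stream/2` as used earlier in this notebook. Both are injectable through options so the worker can be exercised without a GPU.

```elixir
defmodule StreamWorker do
  # Hypothetical worker: one process, one long-lived stream.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  # Runs the zero-arity `fun` inside this worker's stream and returns its result.
  def run(worker, fun) when is_function(fun, 0) do
    GenServer.call(worker, {:run, fun}, :infinity)
  end

  @impl true
  def init(opts) do
    # Defaults mirror the Emily.Stream calls shown above; the options exist
    # only so the sketch can be tested with stand-ins.
    new_stream = Keyword.get(opts, :new_stream, fn -> Emily.Stream.new(:gpu) end)
    with_stream = Keyword.get(opts, :with_stream, &Emily.Stream.with_stream/2)

    # The stream is created once here, not on every request.
    {:ok, %{stream: new_stream.(), with_stream: with_stream}}
  end

  @impl true
  def handle_call({:run, fun}, _from, state) do
    # The work executes in the worker process, which owns the stream.
    {:reply, state.with_stream.(state.stream, fun), state}
  end
end
```

A call would then look like `StreamWorker.run(worker, fn -> Nx.Serving.run(serving, prompt) end)`. Each worker handles one request at a time, so concurrency comes from starting several workers, for example under a supervisor.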

Telemetry

Emily emits :telemetry events at the evaluation boundary. Attach a handler to sample timing for each forward pass:

:telemetry.attach(
  "smollm3-eval",
  [:emily, :eval, :stop],
  fn _event, %{duration: duration}, _meta, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("eval #{ms} ms")
  end,
  nil
)

Nx.Serving.run(serving, prompt)
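Printing every sample gets noisy over a 128-token generation. The sketch below collects the durations instead and summarises them on demand. `EvalTimer` is a hypothetical module; it assumes the same `[:emily, :eval, :stop]` event and native-unit `duration` measurement used in the handler above.

```elixir
defmodule EvalTimer do
  # Hypothetical collector for eval durations, stored in milliseconds.
  use Agent

  def start_link(_opts \\ []) do
    Agent.start_link(fn -> [] end, name: __MODULE__)
  end

  # Matches the :telemetry handler signature (event, measurements, metadata, config).
  def handle_event(_event, %{duration: duration}, _meta, _config) do
    ms = System.convert_time_unit(duration, :native, :millisecond)
    Agent.update(__MODULE__, &[ms | &1])
  end

  # Aggregates the samples recorded so far.
  def summary do
    case Agent.get(__MODULE__, & &1) do
      [] ->
        %{count: 0}

      samples ->
        %{count: length(samples), total_ms: Enum.sum(samples), max_ms: Enum.max(samples)}
    end
  end
end
```

After `EvalTimer.start_link()`, attach it with `:telemetry.attach("smollm3-eval-timer", [:emily, :eval, :stop], &EvalTimer.handle_event/4, nil)` and call `EvalTimer.summary()` once generation finishes. Passing a captured module function rather than an anonymous one also avoids the local-function performance warning that `:telemetry` logs.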

See Emily.Telemetry for the full event catalogue.