
# Image Captioning

`livebooks/captioning.livemd`

> #### Livebook Desktop users {: .info}
>
> Livebook Desktop launches as a GUI app and does not inherit your terminal's PATH. If `Mix.install` below fails with "cargo: command not found" or similar, your Rust toolchain isn't visible to Livebook Desktop. Fix by creating `~/.livebookdesktop.sh` with at minimum:
>
> ```sh
> export PATH="$HOME/.cargo/bin:$PATH"
> # If you use mise/asdf, also activate them here.
> ```
>
> Restart Livebook Desktop after editing the file. See the project README for details on which toolchains are needed.

```elixir
Mix.install(
  [
    {:image_vision, "~> 0.2"},
    {:kino, "~> 0.14"},
    # Captioning uses a Bumblebee serving (BLIP).
    {:bumblebee, "~> 0.6"},
    {:nx, "~> 0.10"},
    {:exla, "~> 0.10"}
  ],
  config: [
    # Tell ImageVision.Application to start the captioning serving
    # under its own supervisor. BLIP weights (~990 MB) download from
    # HuggingFace on first run — by far the heaviest of the library's
    # default models. Allow several minutes for the first cell below.
    image_vision: [captioner: [autostart: true]]
  ]
)

# Use EXLA as the Nx backend for any tensor work in this notebook.
Nx.global_default_backend(EXLA.Backend)
```

`Image.Captioning` describes an image in plain English. It is useful for accessibility (alt text), image search, content indexing, and any tooling that needs a quick natural-language summary. The default model is BLIP base (BSD-3-Clause, ~990 MB).
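For the alt-text use case, a generated caption can be dropped straight into an HTML `alt` attribute. A minimal sketch — the `AltText` module and `alt_attr/1` helper here are hypothetical, not part of the library:

```elixir
defmodule AltText do
  # Turn a model-generated caption into an HTML alt attribute.
  # Captions are plain text, but we escape double quotes defensively
  # and drop a trailing period, which reads oddly in alt text.
  def alt_attr(caption) do
    value =
      caption
      |> String.trim()
      |> String.trim_trailing(".")
      |> String.replace("\"", "&quot;")

    ~s(alt="#{value}")
  end
end

AltText.alt_attr("a dog sitting on a beach.")
# => ~s(alt="a dog sitting on a beach")
```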

## Upload an image

```elixir
input = Kino.Input.image("Image to describe")
```

Read the upload in a separate cell, so it runs after you have picked a file:

```elixir
image =
  input
  |> Kino.Input.read()
  |> Image.from_kino!()
```

## Generate a caption

```elixir
caption = Image.Captioning.caption(image)

Kino.Layout.grid(
  [
    Kino.Layout.grid([image, Kino.Markdown.new("**Original**")], boxed: true),
    Kino.Markdown.new("**Caption**\n\n> #{caption}")
  ],
  columns: 2
)
```

## Caption a few

Try several images and see how the model describes them:

```elixir
inputs = Kino.Input.image("Add another image to caption", multiple: true)
```

```elixir
case Kino.Input.read(inputs) do
  [] ->
    Kino.Markdown.new("_Upload one or more images above to see captions._")

  files ->
    rows =
      for file <- files do
        img = Image.from_kino!(file)
        text = Image.Captioning.caption(img)
        %{"Image" => img, "Caption" => text}
      end

    Kino.DataTable.new(rows)
end
```
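The loop above captions images one at a time. Because the serving batches concurrent requests, overlapping the calls can be cheaper than a serial loop. A sketch using `Task.async_stream` — the `BatchCaption` module is hypothetical, and the captioner is passed in as a fun (in this notebook you would pass `&Image.Captioning.caption/1`):

```elixir
defmodule BatchCaption do
  # Caption many images concurrently, preserving input order.
  # `captioner` is any 1-arity fun, e.g. &Image.Captioning.caption/1.
  def caption_all(images, captioner, concurrency \\ 4) do
    images
    |> Task.async_stream(captioner,
      max_concurrency: concurrency,
      timeout: :timer.minutes(2)
    )
    |> Enum.map(fn {:ok, caption} -> caption end)
  end
end
```

`Task.async_stream` keeps results in input order by default, so captions line up with their images.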

## Why the first call is slow

A captioner is much heavier than a classifier. The encoder runs once per image, but the text decoder runs autoregressively — one forward pass per generated token. A typical 20-word caption is ~20 forward passes. The serving warm-up loads ~990 MB of weights and JIT-compiles the decoder, which takes 10–30 seconds on first call. Subsequent calls reuse the loaded model and are much faster.
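The per-token cost is easy to estimate back-of-envelope. The numbers below are illustrative assumptions, not measurements of BLIP:

```elixir
# Autoregressive decoding cost model: one encoder pass per image,
# plus one decoder pass per generated token.
encoder_ms = 40
decoder_ms_per_token = 30
tokens = 20

total_ms = encoder_ms + tokens * decoder_ms_per_token
# => 640
```

This is why caption length, not image size, tends to dominate steady-state latency.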