# Image Captioning
> #### Livebook Desktop users {: .info}
>
> Livebook Desktop launches as a GUI app and does not inherit your terminal’s PATH. If the `Mix.install` cell below fails with “cargo: command not found” or similar, your Rust toolchain isn’t visible to Livebook Desktop. Fix this by creating `~/.livebookdesktop.sh` with at minimum:
>
> ```sh
> export PATH="$HOME/.cargo/bin:$PATH"
> # If you use mise/asdf, also activate them here.
> ```
>
> Restart Livebook Desktop after editing the file. See the project README for details on which toolchains are needed.
```elixir
Mix.install(
  [
    {:image_vision, "~> 0.2"},
    {:kino, "~> 0.14"},
    # Captioning uses a Bumblebee serving (BLIP).
    {:bumblebee, "~> 0.6"},
    {:nx, "~> 0.10"},
    {:exla, "~> 0.10"}
  ],
  config: [
    # Tell ImageVision.Application to start the captioning serving
    # under its own supervisor. BLIP weights (~990 MB) download from
    # HuggingFace on first run — by far the heaviest of the library's
    # default models. Allow several minutes for the first cell below.
    image_vision: [captioner: [autostart: true]]
  ]
)
```
```elixir
# Use EXLA as the Nx backend for any tensor work in this notebook.
Nx.global_default_backend(EXLA.Backend)
```
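As a quick sanity check, a freshly created tensor should now live on `EXLA.Backend` (if the cell result mentions `Nx.BinaryBackend` instead, the line above didn't take effect):

```elixir
# The inspect output for this cell should name EXLA.Backend.
Nx.iota({2, 2})
```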
`Image.Captioning` describes an image in plain English. Useful for accessibility (alt text), image search, content indexing, and tooling that needs a quick natural-language summary. The default model is BLIP base (BSD-3-Clause, ~990 MB).
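The `:captioner` config above hides the model plumbing. For the curious, a BLIP captioning serving built directly with Bumblebee looks roughly like the sketch below; this mirrors Bumblebee's public `image_to_text` API and is an assumption about what `ImageVision` wires up internally, not its actual code:

```elixir
repo = {:hf, "Salesforce/blip-image-captioning-base"}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

serving =
  Bumblebee.Vision.image_to_text(model_info, featurizer, tokenizer, generation_config,
    compile: [batch_size: 1],
    defn_options: [compiler: EXLA]
  )

# Given an image tensor (e.g. from Image.to_nx!/1):
# %{results: [%{text: caption}]} = Nx.Serving.run(serving, tensor)
```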
## Upload an image
```elixir
input = Kino.Input.image("Image to describe")
```

```elixir
image =
  input
  |> Kino.Input.read()
  |> Image.from_kino!()
```
## Generate a caption
```elixir
caption = Image.Captioning.caption(image)

Kino.Layout.grid(
  [
    Kino.Layout.grid([image, Kino.Markdown.new("**Original**")], boxed: true),
    Kino.Markdown.new("**Caption**\n\n> #{caption}")
  ],
  columns: 2
)
```
## Caption a few
Try several images and see how the model describes them:
```elixir
inputs = Kino.Input.image("Add another image to caption", multiple: true)
```

```elixir
case Kino.Input.read(inputs) do
  [] ->
    Kino.Markdown.new("_Upload one or more images above to see captions._")

  files ->
    rows =
      for file <- files do
        img = Image.from_kino!(file)
        text = Image.Captioning.caption(img)
        %{"Image" => img, "Caption" => text}
      end

    Kino.DataTable.new(rows)
end
```
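Uploads aren't required. If the images already live on disk, the same loop works with `Image.open!/1` from the `image` library (the glob below is a placeholder; point it at a folder of your own):

```elixir
# Placeholder path: adjust to a directory containing a few images.
paths = Path.wildcard(Path.expand("~/Pictures/*.jpg"))

rows =
  for path <- Enum.take(paths, 5) do
    img = Image.open!(path)
    %{"File" => Path.basename(path), "Caption" => Image.Captioning.caption(img)}
  end

Kino.DataTable.new(rows)
```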
## Why the first call is slow
A captioner is much heavier than a classifier. The encoder runs once per image, but the text decoder runs autoregressively, one forward pass per generated token, so a typical 20-word caption costs roughly 20 forward passes. The serving warm-up loads ~990 MB of weights and JIT-compiles the decoder, which takes 10–30 seconds on the first call. Subsequent calls reuse the loaded model and are much faster.
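To measure the difference yourself, time two consecutive calls with `:timer.tc/1` (a rough sketch; if the cells above already captioned an image, the serving is warm and both numbers will be close):

```elixir
# The first call pays for weight loading + JIT compilation; the second does not.
{cold_us, _caption} = :timer.tc(fn -> Image.Captioning.caption(image) end)
{warm_us, _caption} = :timer.tc(fn -> Image.Captioning.caption(image) end)

IO.puts("cold: #{div(cold_us, 1000)} ms, warm: #{div(warm_us, 1000)} ms")
```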