
# Zero-Shot Classification

> #### Livebook Desktop users {: .info}
>
> Livebook Desktop launches as a GUI app and does not inherit your terminal’s PATH. If `Mix.install` below fails with “cargo: command not found” or similar, your Rust toolchain isn’t visible to Livebook Desktop. Fix by creating `~/.livebookdesktop.sh` with at minimum:
>
> ```sh
> export PATH="$HOME/.cargo/bin:$PATH"
> # If you use mise/asdf, also activate them here.
> ```
>
> Restart Livebook Desktop after editing the file. See the project README for details on which toolchains are needed.

```elixir
Mix.install([
  {:image_vision, "~> 0.2"},
  {:kino, "~> 0.14"},
  # Zero-shot uses a Bumblebee CLIP model directly via the persistent_term
  # cache. No serving config needed — first call loads the model lazily.
  {:bumblebee, "~> 0.6"},
  {:nx, "~> 0.10"},
  {:exla, "~> 0.10"}
])

# Use EXLA as the Nx backend so CLIP runs on the optimised path.
Nx.global_default_backend(EXLA.Backend)
```

`Image.ZeroShot` classifies an image against arbitrary labels you supply at call time; no retraining is needed. Where `Image.Classification` is locked to its 1000 ImageNet classes, here you decide the label space per call. Powered by OpenAI CLIP ViT-B/32 (MIT, ~600 MB; downloads on first call).
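
To get a feel for the call shape before wiring up a form, here is a minimal sketch of a direct call. The file path is a placeholder and the output is illustrative; the result shape (a list of `%{label: ..., score: ...}` maps) is the one consumed by the form handler below.

```elixir
# Minimal sketch; "puppy.jpg" is a placeholder path (use any local image).
image = Image.open!("puppy.jpg")

# Labels are free-form text supplied at call time.
Image.ZeroShot.classify(image, ["a dog", "a cat", "a car"])
# Illustrative output:
#=> [
#=>   %{label: "a dog", score: 0.97},
#=>   %{label: "a cat", score: 0.02},
#=>   %{label: "a car", score: 0.01}
#=> ]
```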

## Upload an image and pick labels

```elixir
form =
  Kino.Control.form(
    [
      image: Kino.Input.image("Image to classify"),
      labels:
        Kino.Input.text("Candidate labels (comma-separated)",
          default: "a dog, a cat, a horse, a car, a bicycle"
        )
    ],
    submit: "Classify"
  )
```

## Run the classifier

```elixir
# Kino.listen/2 discards the callback's return value, so render results
# into a frame instead of returning them.
frame = Kino.Frame.new()

form
|> Kino.Control.stream()
|> Kino.listen(fn %{data: %{image: img_input, labels: labels_input}} ->
  if img_input == nil or labels_input in [nil, ""] do
    Kino.Frame.render(
      frame,
      Kino.Markdown.new("_Upload an image and supply at least one label, then submit._")
    )
  else
    image = Image.from_kino!(img_input)

    labels =
      labels_input
      |> String.split(",")
      |> Enum.map(&String.trim/1)
      |> Enum.reject(&(&1 == ""))

    rows =
      image
      |> Image.ZeroShot.classify(labels)
      |> Enum.map(fn %{label: label, score: score} ->
        %{"Label" => label, "Confidence" => "#{Float.round(score * 100, 1)}%"}
      end)

    Kino.Frame.render(
      frame,
      Kino.Layout.grid(
        [
          Kino.Layout.grid([image, Kino.Markdown.new("**Original**")], boxed: true),
          Kino.DataTable.new(rows)
        ],
        columns: 2
      )
    )
  end
end)

frame
```

## Image-to-image similarity

CLIP also embeds images in the same space as text, so you can ask “how similar are these two pictures?” — useful for “find similar images” without standing up a vector database.
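
A score like this is commonly computed as the cosine similarity between two embedding vectors; whether `Image.ZeroShot.similarity/2` does exactly that is an implementation detail, but a small Nx sketch shows the idea (stand-in 4-element vectors; real CLIP ViT-B/32 embeddings have 512 dimensions):

```elixir
# Cosine similarity: dot(a, b) / (norm(a) * norm(b)).
# Near 1.0 for near-parallel vectors, near 0.0 for orthogonal ones.
a = Nx.tensor([0.1, 0.8, 0.3, 0.5])
b = Nx.tensor([0.2, 0.7, 0.4, 0.4])

Nx.dot(a, b)
|> Nx.divide(Nx.multiply(Nx.LinAlg.norm(a), Nx.LinAlg.norm(b)))
```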

```elixir
similarity_form =
  Kino.Control.form(
    [
      image1: Kino.Input.image("First image"),
      image2: Kino.Input.image("Second image")
    ],
    submit: "Compare"
  )
```

```elixir
# Same pattern as above: render into a frame, since the listener's
# return value is discarded.
similarity_frame = Kino.Frame.new()

similarity_form
|> Kino.Control.stream()
|> Kino.listen(fn %{data: %{image1: a, image2: b}} ->
  if a && b do
    img1 = Image.from_kino!(a)
    img2 = Image.from_kino!(b)
    score = Image.ZeroShot.similarity(img1, img2)

    Kino.Frame.render(
      similarity_frame,
      Kino.Layout.grid(
        [
          Kino.Layout.grid([img1, Kino.Markdown.new("**Image 1**")], boxed: true),
          Kino.Layout.grid([img2, Kino.Markdown.new("**Image 2**")], boxed: true),
          Kino.Markdown.new("**Similarity:** #{Float.round(score, 3)}")
        ],
        columns: 2
      )
    )
  else
    Kino.Frame.render(similarity_frame, Kino.Markdown.new("_Upload two images, then submit._"))
  end
end)

similarity_frame
```

Higher similarity (closer to 1.0) means more visually/semantically similar in CLIP’s learned embedding space. Two photos of different dogs typically score higher than a dog and a car, even when pixel-level differences are large.
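
That is enough for a naive "find similar" pass over a small collection. A sketch, where `query` and `candidates` are hypothetical bindings for images you have already loaded:

```elixir
# Hypothetical: rank candidate images by CLIP similarity to a query image.
# `query` and `candidates` are placeholders for images loaded elsewhere.
rank_by_similarity = fn query, candidates ->
  candidates
  |> Enum.map(fn candidate -> {candidate, Image.ZeroShot.similarity(query, candidate)} end)
  |> Enum.sort_by(fn {_candidate, score} -> score end, :desc)
end
```

For larger collections, recomputing every pairwise score on each query gets expensive; that is the point where precomputed embeddings and a vector index start to pay off.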

## Tips for better accuracy

CLIP was trained on natural-language captions, so it understands sentences much better than bare nouns. The default prompt template `"a photo of {label}"` already does this for you. If your labels are domain-specific (medical scans, technical diagrams, document scans), a domain-tailored template helps:

```elixir
Image.ZeroShot.classify(image, ["normal", "fracture", "dislocation"],
  template: "an X-ray showing {label}")
```

If your labels are already full sentences, disable the template:

```elixir
Image.ZeroShot.classify(image,
  ["a person riding a horse", "a person walking a dog"],
  template: nil)
```