
Chroma Example

livebooks/chroma-example.livemd

Mix.install(
  [
    {:chroma, github: "3zcurdia/chroma"},
    {:req, "~> 0.3.11"},
    {:floki, "~> 0.34.3"},
    {:bumblebee, "~> 0.3.1"},
    {:exla, "~> 0.6.0"}
  ],
  config: [
    chroma: [host: "http://localhost:8000", api_base: "api", api_version: "v1"],
    nx: [default_backend: EXLA.Backend]
  ]
)

Database

To verify that your client has been configured, you can run api_url to inspect the default values.

Chroma.api_url()
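Given the configuration above, this should return a URL assembled from the host, api_base, and api_version values, roughly http://localhost:8000/api/v1 (assuming the client joins them in that order).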

You can easily fetch the current database version.

Chroma.Database.version()

The heartbeat will help you verify that the server is running correctly.

Chroma.Database.heartbeat()
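Both calls talk to the server configured above, so they will fail if nothing is listening on localhost:8000. A minimal sketch of guarding on the heartbeat before continuing (assuming ok/error tuples, as the collection calls below return; the catch-all clause handles any other shape):

case Chroma.Database.heartbeat() do
  {:ok, beat} -> IO.puts("Chroma server is up: #{inspect(beat)}")
  {:error, reason} -> IO.puts("Chroma server unreachable: #{inspect(reason)}")
  other -> IO.inspect(other, label: "unexpected heartbeat result")
end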

Collection

A collection allows you to group your embeddings under a common namespace, and you can perform all the CRUD operations on it.

{:ok, collection} =
  Chroma.Collection.get_or_create("example", %{
    type: "test",
    source: "https://example.com"
  })
Chroma.Collection.modify(collection, metadata: %{b: 2})
Chroma.Collection.list()
Chroma.Collection.delete(collection)

Embeddings

Let’s prepare an example that uses embeddings. We need a set of documents to populate our collection.

In this example, we will fetch most of the Elixir documentation as text documents.

pages = [
  "Atom",
  "Base",
  "Bitwise",
  "Date",
  "DateTime",
  "Exception",
  "Float",
  "Function",
  "Integer",
  "Module",
  "NaiveDateTime",
  "Record",
  "Regex",
  "String",
  "Time",
  "Tuple",
  "URI",
  "Access",
  "Date.Range",
  "Enum",
  "Keyword",
  "List",
  "Map",
  "MapSet",
  "Range",
  "Stream"
]

Now that we have a list of keys, let’s prepare some metadata.

metadata =
  pages
  |> Enum.map(fn page ->
    %{module: page, source_url: "https://hexdocs.pm/elixir/1.15.4/#{page}.html"}
  end)

Next, we will fetch all the pages and parse their content into a list of strings.

documents =
  metadata
  |> Task.async_stream(fn %{source_url: url} ->
    res = Req.get!(url)

    res.body
    |> Floki.parse_document!()
    |> Floki.find("#content")
    |> Floki.text()
  end)
  |> Stream.map(fn {:ok, txt} ->
    txt
    |> String.replace_prefix("SettingsView Source", "")
    |> String.replace("(Elixir v1.15.4)", ": ", global: false)
  end)
  |> Enum.to_list()
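
Before generating embeddings, it is worth spot-checking one of the parsed pages:

# Peek at the beginning of the first parsed document
documents |> List.first() |> String.slice(0, 300)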

To generate the embedding tensor for each document, we will set up Bumblebee.

{:ok, model_info} = Bumblebee.load_model({:hf, "intfloat/e5-large-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "intfloat/e5-large-v2"})

serving =
  Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
    compile: [batch_size: 8, sequence_length: 100],
    defn_options: [compiler: EXLA]
  )
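
You can run the serving on a single string first to see the shape of its output: a map holding an :embedding tensor.

# Quick smoke test of the serving; returns %{embedding: tensor}
Nx.Serving.run(serving, "hello from Livebook")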

Now that we have loaded and configured our model, we can generate the embeddings for our documents.

embeddings =
  documents
  |> Enum.map(fn doc -> Nx.Serving.run(serving, doc) end)
  |> Enum.map(fn %{embedding: embedding} -> Nx.to_list(embedding) end)
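
As a quick sanity check, there should be one embedding per document, and e5-large-v2 should produce 1024-dimensional vectors, so each inner list should hold 1024 floats:

# Number of documents, number of embeddings, and the length of the first vector
{length(documents), length(embeddings), embeddings |> List.first() |> length()}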

Now that we have everything ready, let’s add the embeddings to our collection.

{:ok, collection} = Chroma.Collection.get_or_create("elixir-lang", %{"hnsw:space" => "cosine"})

Chroma.Collection.add(
  collection,
  %{
    documents: documents,
    embeddings: embeddings,
    metadata: metadata,
    ids: pages
  }
)

To ensure our collection has been populated, you can run a count.

Chroma.Collection.count(collection)

Now that our collection has been populated, let’s query it.

%{embedding: tensor_embeddings} = Nx.Serving.run(serving, "what is an atom?")
query_embeddings = tensor_embeddings |> Nx.to_list()
{:ok, result} =
  Chroma.Collection.query(
    collection,
    results: 3,
    query_embeddings: [query_embeddings]
  )
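
The exact shape of result depends on the client, but assuming it mirrors Chroma's REST response (a map with "ids", "documents", and "distances" keys, each holding one list per query embedding), a minimal sketch for printing the matches would be:

# A sketch, assuming result follows Chroma's REST response shape
ids = result |> Map.get("ids", [[]]) |> List.first()
distances = result |> Map.get("distances", [[]]) |> List.first()

ids
|> Enum.zip(distances)
|> Enum.each(fn {id, distance} -> IO.puts("#{id} (distance: #{distance})") end)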