Chroma Example
Mix.install(
  [
    {:chroma, github: "3zcurdia/chroma"},
    {:req, "~> 0.3.11"},
    {:floki, "~> 0.34.3"},
    {:bumblebee, "~> 0.3.1"},
    {:exla, "~> 0.6.0"}
  ],
  config: [
    chroma: [host: "http://localhost:8000", api_base: "api", api_version: "v1"],
    nx: [default_backend: EXLA.Backend]
  ]
)
Database
To verify that your client has been configured, you can run api_url
to inspect the default values:
Chroma.api_url()
You can easily fetch the current database version:
Chroma.Database.version()
The heartbeat helps you verify that the server is running correctly:
Chroma.Database.heartbeat()
Collection
A collection allows you to group your embeddings under a common namespace, and you can perform all CRUD operations on it:
{:ok, collection} =
  Chroma.Collection.get_or_create("example", %{
    type: "test",
    source: "https://example.com"
  })
Chroma.Collection.modify(collection, metadata: %{b: 2})
Chroma.Collection.list()
Chroma.Collection.delete(collection)
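After deleting, listing the collections again should confirm that "example" is gone:
Chroma.Collection.list()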
Embeddings
Let’s prepare an example that uses embeddings. We need a set of documents to populate our collection.
In this example we will fetch several pages of the Elixir documentation as text documents.
pages = [
"Atom",
"Base",
"Bitwise",
"Date",
"DateTime",
"Exception",
"Float",
"Function",
"Integer",
"Module",
"NaiveDateTime",
"Record",
"Regex",
"String",
"Time",
"Tuple",
"URI",
"Access",
"Date.Range",
"Enum",
"Keyword",
"List",
"Map",
"MapSet",
"Range",
"Stream"
]
Now that we have a list of module names, let's prepare some metadata:
metadata =
  pages
  |> Enum.map(fn page ->
    %{module: page, source_url: "https://hexdocs.pm/elixir/1.15.4/#{page}.html"}
  end)
Next, we will fetch each page and parse its content into a list of strings:
documents =
  metadata
  |> Task.async_stream(fn %{source_url: url} ->
    # Fetch the page and extract the main content area as plain text
    res = Req.get!(url)

    res.body
    |> Floki.parse_document!()
    |> Floki.find("#content")
    |> Floki.text()
  end)
  |> Stream.map(fn {:ok, txt} ->
    # Strip the hexdocs page chrome from the extracted text
    txt
    |> String.replace_prefix("SettingsView Source", "")
    |> String.replace("(Elixir v1.15.4)", ": ", global: false)
  end)
  |> Enum.to_list()
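As a quick sanity check (a minimal sketch using only the standard library), you can confirm that one document was fetched per page and preview the beginning of the first one:
# One cleaned text document per module page; preview the start of the first one
IO.inspect(length(documents) == length(pages), label: "one document per page?")
documents |> List.first() |> String.slice(0, 120)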
To generate the embedding tensor for each document, we will set up Bumblebee:
{:ok, model_info} = Bumblebee.load_model({:hf, "intfloat/e5-large-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "intfloat/e5-large-v2"})
serving =
  Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
    compile: [batch_size: 8, sequence_length: 100],
    defn_options: [compiler: EXLA]
  )
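Before embedding the whole corpus, you can run the serving on a single short string as a smoke test and inspect the resulting tensor's shape (for e5-large-v2 this should be {1024}):
# Smoke test: embed one string and check the shape of the resulting tensor
%{embedding: sample_embedding} = Nx.Serving.run(serving, "hello world")
Nx.shape(sample_embedding)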
Now that we have loaded and configured our model, we can generate the embeddings:
embeddings =
  documents
  |> Enum.map(fn doc -> Nx.Serving.run(serving, doc) end)
  |> Enum.map(fn %{embedding: embedding} -> Nx.to_list(embedding) end)
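Chroma expects documents, embeddings, metadata, and ids to line up one-to-one, so a quick length check before inserting can save a failed request:
# All four lists must have the same length
Enum.map([documents, embeddings, metadata, pages], &length/1)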
Now that everything is ready, let's add the embeddings to our collection:
{:ok, collection} = Chroma.Collection.get_or_create("elixir-lang", %{"hnsw:space" => "cosine"})
Chroma.Collection.add(
  collection,
  %{
    documents: documents,
    embeddings: embeddings,
    metadata: metadata,
    ids: pages
  }
)
To ensure our collection has been populated, you can run a count:
Chroma.Collection.count(collection)
Now that our collection has been populated, let's query it:
%{embedding: tensor_embeddings} = Nx.Serving.run(serving, "what is an atom?")
query_embeddings = Nx.to_list(tensor_embeddings)

{:ok, result} =
  Chroma.Collection.query(
    collection,
    results: 3,
    query_embeddings: [query_embeddings]
  )
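The exact shape of result depends on the client version; to see what came back (typically the ids, documents, and distances of the nearest matches), inspect it directly:
# Print the raw query response for the three nearest neighbors
IO.inspect(result, label: "query result")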