LLMs

Mix.install([
  {:bumblebee, "~> 0.5.0"},
  {:nx, "~> 0.7.0"},
  {:exla, "~> 0.7.0"},
  {:kino, "~> 0.12.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})

Introduction

In this notebook we outline the general setup for running a Large Language Model (LLM).

Llama 2

In this section we look at running Meta’s Llama model, specifically Llama 2, one of the most powerful open-source LLMs.

> Note: this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16GiB of VRAM.

In order to load Llama 2, you need to request access to the meta-llama/Llama-2-7b-chat-hf repository. Once you are granted access, generate a Hugging Face auth token and put it in an HF_TOKEN Livebook secret.

Let’s load the model and create a serving for text generation:

hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )

# Should be supervised
Kino.start_child({Nx.Serving, name: Llama, serving: serving})

Note that we load the parameters directly onto the GPU with Bumblebee.load_model(..., backend: EXLA.Backend), and with defn_options: [compiler: EXLA] we tell the serving to compile and run computations on the GPU as well.
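
If you want to be explicit about which device EXLA uses, you can also name the client when selecting the backend. Here is a minimal sketch, assuming a CUDA-enabled EXLA build (an AMD GPU would use the :rocm client instead):

# Sketch (assumes a CUDA build of EXLA): load the parameters onto the GPU
# by naming the client explicitly instead of relying on the default client.
{:ok, model_info} =
  Bumblebee.load_model(repo, type: :bf16, backend: {EXLA.Backend, client: :cuda})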

We adjust the generation config to use a non-deterministic generation strategy, so that the model is able to produce a slightly different output every time.
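
If you prefer reproducible output, you could configure a deterministic strategy instead. A minimal sketch, assuming greedy decoding is sufficient for your use case:

# Greedy search always picks the most likely next token, so the same prompt
# produces the same output on every run.
deterministic_config =
  Bumblebee.configure(generation_config, strategy: %{type: :greedy_search})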

As for the other options, we specify :compile with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable :stream to receive text chunks as the generation is progressing.
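
For comparison, if you build the serving without :stream, batched_run returns the whole result in a single map rather than a stream. A sketch of that variant (assuming a serving named Llama built with stream: false):

# Non-streaming sketch: the serving returns the generated text in a result map.
%{results: [%{text: text}]} = Nx.Serving.batched_run(Llama, "What is love?")
IO.puts(text)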

user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)

Mistral

We can easily test other LLMs; we just need to change the repository and possibly adjust the prompt template. In this example we run the Mistral model.

repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2"}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )

# Should be supervised
Kino.start_child({Nx.Serving, name: Mistral, serving: serving})

prompt = """
[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!
[INST] Do you have mayonnaise recipes? [/INST]\
"""

Nx.Serving.batched_run(Mistral, prompt) |> Enum.each(&IO.write/1)