
Local Instructor w/ llama.cpp

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:kino_shell, "~> 0.1.2"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.Llamacpp,
      llamacpp: [
        chat_template: :mistral_instruct
      ]
    ]
  ]
)

Setting up llama.cpp

llama.cpp is a great way to run models locally. Head on over to the repo and install it on your system.

Next, we'll need to download a GGUF-compatible model to run with llama.cpp. As of today, I recommend Qwen 2.5 7B Instruct. It's a great model that's small enough to run locally.

> A note on quantization: when you search for GGUF models you'll see a lot of suffixes like Q4_K_M and Q8_0. These are different compression techniques, called quantization, that let the model take up dramatically less memory at the cost of some accuracy. There are many quantization methods with different performance trade-offs, but it's generally recommended to run the largest model you can fit into your GPU's VRAM. Going above 8-bit quantization is generally unnecessary; at that point, you should be considering models with a larger number of parameters.
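>
> As a rough back-of-the-envelope check: a 7B-parameter model stored at 16 bits per weight needs about 7e9 × 2 bytes ≈ 14 GB for the weights alone, while 8-bit quantization brings that down to roughly 7 GB and 4-bit (e.g. Q4_K_M) to roughly 3.5 GB, before accounting for the KV cache and other runtime overhead. That's why a quantized 7B model fits comfortably on a single consumer GPU.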

To start the llama.cpp server, run llama-server --port 8080 -ngl 999 -hf Qwen/Qwen2.5-7B-Instruct-GGUF. This will automatically download the model weights and start a server to run them. The -ngl 999 flag tells llama.cpp how many layers of the neural network to offload to the GPU; 999 effectively means run the entire model on the GPU.

Then, with that running in the background, you can use Instructor as you normally would!

There are three ways to configure Instructor to use llama.cpp:

  1. via Mix.install([...], [instructor: [adapter: Instructor.Adapters.Llamacpp, llamacpp: [...]]])
  2. via config :instructor, adapter: Instructor.Adapters.Llamacpp, llamacpp: [...] (sketched below)
  3. At runtime via Instructor.chat_completion(..., config)
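
If you prefer the config-file route from option 2, a config/config.exs entry might look like this minimal sketch (the llamacpp options simply mirror the Mix.install block above; adjust them for your model):

import Config

# Minimal sketch: point Instructor at the llama.cpp adapter at compile time.
config :instructor,
  adapter: Instructor.Adapters.Llamacpp,
  llamacpp: [
    chat_template: :mistral_instruct
  ]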

For brevity, in this livebook, we'll configure it at runtime:

config = [
  adapter: Instructor.Adapters.Llamacpp
]

defmodule President do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end
end

Instructor.chat_completion(
  [
    response_model: President,
    mode: :json_schema,
    messages: [
      %{role: "user", content: "Who was the first president of the United States?"}
    ]
  ],
  config
)
{:ok,
 %President{first_name: "George", last_name: "Washington", entered_office_date: ~D[1789-04-01]}}
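
In application code you'll typically pattern match on the result, since chat_completion returns an error tuple when the request or the validation against the schema fails. A minimal sketch, reusing the config above (the error handling shown here is illustrative):

case Instructor.chat_completion(
       [
         response_model: President,
         mode: :json_schema,
         messages: [
           %{role: "user", content: "Who was the first president of the United States?"}
         ]
       ],
       config
     ) do
  {:ok, %President{} = president} ->
    IO.puts("#{president.first_name} #{president.last_name}")

  {:error, reason} ->
    # Could be a failed HTTP call or output that didn't validate against the schema.
    IO.inspect(reason, label: "chat_completion failed")
end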

And there you have it. You're running Instructor against a locally running large language model, at zero incremental cost to you.