
Local Instructor w/ llama.cpp

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:kino_shell, "~> 0.1.2"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.Llamacpp,
      llamacpp: [
        chat_template: :mistral_instruct
      ]
    ]
  ]
)

Setting up llama.cpp

llama.cpp is a great way to run models locally. Head on over to the repo and install it on your system.

Next, we'll need to download a GGUF-compatible model to run with llama.cpp. As of today, I recommend Qwen 2.5 7B Instruct. It's a great model that's small enough to run locally.

> A note on quantization: when you search for GGUF models you'll see a lot of suffixes like Q4_K_M and Q8_0. These are different compression techniques, called quantization, that let the model take up dramatically less memory at the cost of some accuracy. There are many quantization methods with different performance trade-offs, but it's generally recommended to run the largest model you can fit into your GPU's VRAM. Going above 8-bit quantization is generally unnecessary; at that point, you should be considering models with a larger number of parameters.
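>
> As a rough back-of-the-envelope check: a 7B-parameter model stored at 16 bits per weight needs about 7e9 × 2 bytes ≈ 14 GB for the weights alone, while 8-bit quantization brings that down to roughly 7 GB and 4-bit (e.g. Q4_K_M) to roughly 3.5 GB, before accounting for the KV cache and other runtime overhead. That's why a quantized 7B model fits comfortably on a single consumer GPU.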

To start the llama.cpp server, run llama-server --port 8080 -ngl 999 -hf Qwen/Qwen2.5-7B-Instruct-GGUF. This will automatically download the model weights and start a server to run them. The -ngl 999 flag tells llama.cpp how many layers of the neural network to offload to the GPU; 999 effectively means run the entire model on the GPU.

Then, with that running in the background, you can use Instructor as you normally would!

There are three ways to configure Instructor to use llama.cpp:

  1. via Mix.install([...], [instructor: [adapter: Instructor.Adapters.Llamacpp, llamacpp: [...]]])
  2. via config :instructor, adapter: Instructor.Adapters.Llamacpp, llamacpp: [...] (sketched below)
  3. At runtime via Instructor.chat_completion(..., config)
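
If you prefer the config-file route from option 2, a config/config.exs entry might look like this minimal sketch (the llamacpp options simply mirror the Mix.install block above; adjust them for your model):

import Config

# Minimal sketch: point Instructor at the llama.cpp adapter at compile time.
config :instructor,
  adapter: Instructor.Adapters.Llamacpp,
  llamacpp: [
    chat_template: :mistral_instruct
  ]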

For brevity, in this livebook, we'll configure it at runtime:

config = [
  adapter: Instructor.Adapters.Llamacpp
]

defmodule President do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end
end

Instructor.chat_completion(
  [
    response_model: President,
    mode: :json_schema,
    messages: [
      %{role: "user", content: "Who was the first president of the United States?"}
    ]
  ],
  config
)
{:ok,
 %President{first_name: "George", last_name: "Washington", entered_office_date: ~D[1789-04-01]}}
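
In application code you'll typically pattern match on the result, since chat_completion returns an error tuple when the request or the validation against the schema fails. A minimal sketch, reusing the config above (the error handling shown here is illustrative):

case Instructor.chat_completion(
       [
         response_model: President,
         mode: :json_schema,
         messages: [
           %{role: "user", content: "Who was the first president of the United States?"}
         ]
       ],
       config
     ) do
  {:ok, %President{} = president} ->
    IO.puts("#{president.first_name} #{president.last_name}")

  {:error, reason} ->
    # Could be a failed HTTP call or output that didn't validate against the schema.
    IO.inspect(reason, label: "chat_completion failed")
end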

And there you have it. You're running Instructor against a locally running large language model, at zero incremental cost to you.