Local Instructor w/ llama.cpp
Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:kino_shell, "~> 0.1.2"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.Llamacpp,
      llamacpp: [
        chat_template: :mistral_instruct
      ]
    ]
  ]
)
Setting up llama.cpp
llama.cpp is a great way to run models locally. Head on over to the repo and install it on your system.
Next, we’ll need to download a GGUF-compatible model to run with llama.cpp. As of today, I recommend Qwen2.5-7B-Instruct. It’s a great model that is small enough to run locally.
> A note on quantization: when you search for GGUF models, you’ll see a lot of suffixes like Q4_K_M and FP8. These are just different compression techniques, called quantization, that allow the model to take up dramatically less memory at the cost of some accuracy. There are many different quantization methods with different performance tradeoffs. However, it’s generally recommended to run the largest model you can fit into your GPU’s VRAM. Going over FP8 is generally unnecessary, and at that point you should be considering models with a larger number of parameters.
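To get a rough feel for the memory math, here’s a back-of-the-envelope sketch in Elixir. The parameter count and bits-per-weight figures are approximations I’m assuming for illustration, and real VRAM usage also includes the KV cache and runtime overhead:

params = 7.6e9          # approximate parameter count of Qwen2.5-7B
bits_per_weight = 4.85  # roughly what a Q4_K_M quantization averages out to
weights_gib = params * bits_per_weight / 8 / 1024 ** 3
# => ~4.3 GiB of weights, which fits comfortably in most modern GPUs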
To start the llama server, run:

llama-server --port 8080 -ngl 999 -hf Qwen/Qwen2.5-7B-Instruct-GGUF

This will automatically download the model weights and start a server to run them. -ngl 999 is a flag that tells llama.cpp how many layers of the neural network to offload to the GPU; 999 effectively says “run the entire model on the GPU.”
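Once the server is up, you can sanity-check it from Elixir if you like. This is an optional sketch: it assumes the server is listening on localhost:8080 and that your llama.cpp build exposes the /health endpoint (Req is already available as an Instructor dependency):

# Should return 200 once the model has finished loading (assumes localhost:8080)
Req.get!("http://localhost:8080/health").status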
Then, with that running in the background, you can use Instructor as you normally would!
There are three ways to configure Instructor to use llama.cpp:

- via Mix.install([...], [instructor: [adapter: Instructor.Adapters.Llamacpp, llamacpp: [...]]])
- via config :instructor, adapter: Instructor.Adapters.Llamacpp, llamacpp: [...] in your application config (see the sketch after this list)
- at runtime via Instructor.chat_completion(..., config)
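If you were in a Mix application rather than a livebook, the application-config route would look roughly like this, a sketch for config/config.exs that reuses only the options shown in the Mix.install block above:

import Config

# Same adapter and llamacpp options as the Mix.install config at the top of this livebook
config :instructor,
  adapter: Instructor.Adapters.Llamacpp,
  llamacpp: [
    chat_template: :mistral_instruct
  ]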
For brevity, in this livebook, we’ll configure it at runtime:
config = [
  adapter: Instructor.Adapters.Llamacpp
]
defmodule President do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end
end
Instructor.chat_completion(
  [
    response_model: President,
    mode: :json_schema,
    messages: [
      %{role: "user", content: "Who was the first president of the United States?"}
    ]
  ],
  config
)
{:ok,
%President{first_name: "George", last_name: "Washington", entered_office_date: ~D[1789-04-01]}}
And there you have it: you’re running Instructor against a locally running large language model, at zero incremental cost to you.