Local Instructor w/ llama.cpp

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:kino_shell, "~> 0.1.2"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.Llamacpp,
      llamacpp: [
        chat_template: :mistral_instruct
      ]
    ]
  ]
)

Setting up llama.cpp

The open source community has been hard at work trying to dethrone OpenAI. It turns out there are hundreds of models today that you can download from HuggingFace and run locally on your machine, if you have the right hardware. One of the main ways to run these models locally is through the great llama.cpp project. You'd be surprised what a standard MacBook or Linux machine can actually run.

Instructor is designed in a way where you can swap out the provider of the LLM. Internally, it's implemented as a behaviour, and you pick the adapter through configuration. In fact, look at the Mix.install of this Livebook to see how that's done.

config :instructor, adapter: Instructor.Adapters.Llamacpp
config :instructor, llamacpp: [chat_template: :mistral_instruct]
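
In a regular Mix project those keys would live in config/config.exs. Since they are just application environment values, here is a quick sketch of flipping the adapter at runtime, say from IEx (this assumes Instructor reads the adapter from the application environment on each call, as the Mix.install config above suggests):

# Sketch: switch Instructor to the llama.cpp adapter at runtime.
# Assumes Instructor looks up :adapter and :llamacpp from the
# application environment when you call chat_completion/1.
Application.put_env(:instructor, :adapter, Instructor.Adapters.Llamacpp)
Application.put_env(:instructor, :llamacpp, chat_template: :mistral_instruct)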

As of today, Instructor doesn't actually run the LLM inside the BEAM. Instead, it calls out to a locally running web server that ships with the llama.cpp project. Luckily, installation and configuration are easy.

Somewhere on your machine, clone the llama.cpp repo and run make:

{_, 0} =
  System.cmd(
    "bash",
    [
      "-lc",
      """
      cd /Users/thomas/code/llama.cpp
      make
      """
    ],
    into: IO.stream()
  )

:ok

I llama.cpp build info: 
I UNAME_S:   Darwin
I UNAME_P:   arm
I UNAME_M:   arm64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi
I NVCCFLAGS:  
I LDFLAGS:   -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
I CC:        Apple clang version 15.0.0 (clang-1500.1.0.2.5)
I CXX:       Apple clang version 15.0.0 (clang-1500.1.0.2.5)

make: Nothing to be done for `default'.
:ok

Next, we need to actually download a model to run. One important thing to note is that llama.cpp only runs models in the GGUF file format. However, there is a great, active open source community that is constantly porting new models over to this format. Anytime you're looking for a model to run, just Google the name of the model plus "GGUF", and you'll usually find a result from some fellow named TheBloke.

On a fairly modest machine, you should be able to run a 7B model that has been quantized. For our example, we're going to run mistral-7b-instruct-v0.2.Q5_K_S.

> A note on quantization: that Q5_K_S bit in the model name refers to the quantization of the model. Without getting into too much detail, it roughly represents how compressed the model is. The more compressed the model, the smaller the file size and the less RAM it takes to run, but at a slight loss in output quality. If you're running a MacBook, I would suggest the Q5 or Q6 version of the models.

Download the model somewhere on your hard drive, and then we can set up the local llama.cpp server to run against it.
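
If you'd rather do that from the Livebook, here is a sketch using curl. The HuggingFace repo and filename below are assumptions for illustration; grab whichever quantization you prefer from TheBloke's page:

# Sketch: download a quantized Mistral 7B Instruct GGUF with curl.
# The URL and filename are assumptions - check the model's HuggingFace
# page for the exact file you want.
model_url =
  "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q6_K.gguf"

model_path = Path.expand("~/Downloads/mistral-7b-instruct-v0.2.Q6_K.gguf")

unless File.exists?(model_path) do
  # -L follows HuggingFace's redirects; -o writes to the target path
  {_, 0} = System.cmd("curl", ["-L", "-o", model_path, model_url], into: IO.stream())
end

:ok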

frame = Kino.Frame.new() |> Kino.render()

command =
  "/Users/thomas/code/llama.cpp/server -np 4 -cb -v -m ~/Downloads/mistral-7b-instruct-v0.2.Q6_K.gguf"

child_spec =
  Task.child_spec(fn ->
    KinoShell.print_to_frame(frame, "[KinoShell]: Starting - #{command}\n")

    status_code =
      KinoShell.exec("/bin/bash", ["-lc", command], fn data ->
        KinoShell.print_to_frame(frame, data)
      end)

    color =
      if status_code == 0 do
        :yellow
      else
        :red
      end

    KinoShell.print_to_frame(frame, [
      color,
      "[KinoShell]: Command shutdown with #{status_code}
"
    ])
  end)

Kino.start_child(%{child_spec | restart: :temporary})
Kino.nothing()
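
Once the server has had a moment to load the model, you can sanity-check that it's accepting requests before pointing Instructor at it. Here is a minimal sketch using OTP's built-in :httpc, assuming the server is listening on its default port of 8080 (recent llama.cpp builds expose a /health endpoint; if yours doesn't, a request to the root path works just as well):

# Sketch: poke the locally running llama.cpp server.
# Assumes the default port 8080; adjust if you passed --port to the server.
:inets.start()

case :httpc.request(:get, {~c"http://localhost:8080/health", []}, [], []) do
  {:ok, {{_http_version, status, _reason}, _headers, _body}} ->
    IO.puts("llama.cpp server responded with HTTP #{status}")

  {:error, reason} ->
    IO.puts("llama.cpp server not reachable yet: #{inspect(reason)}")
end
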
defmodule President do
  use Ecto.Schema

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end
end

Instructor.chat_completion(
  response_model: President,
  messages: [
    %{role: "user", content: "Who was the first president of the United States?"}
  ]
)

{:ok,
 %President{first_name: "George", last_name: "Washington", entered_office_date: ~D[1789-04-30]}}

And there you have it. You're running Instructor against a locally running large language model, at zero incremental cost to you.
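
If you want to push it further, Instructor's validation and retry options work against llama.cpp too. Here is a sketch based on Instructor's validator support (use Instructor.Validator with a validate_changeset/1 callback, plus the max_retries option); ValidatedPresident is just an illustrative name mirroring the schema above:

defmodule ValidatedPresident do
  use Ecto.Schema
  use Instructor.Validator

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end

  @impl true
  def validate_changeset(changeset) do
    # Reject obviously impossible dates; a failing validation causes
    # Instructor to re-prompt the model with the errors, up to max_retries.
    Ecto.Changeset.validate_change(changeset, :entered_office_date, fn :entered_office_date, date ->
      if Date.compare(date, ~D[1789-01-01]) == :lt do
        [entered_office_date: "must be on or after 1789"]
      else
        []
      end
    end)
  end
end

Instructor.chat_completion(
  response_model: ValidatedPresident,
  max_retries: 2,
  messages: [
    %{role: "user", content: "Who was the second president of the United States?"}
  ]
)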