Local Instructor w/ vLLM
Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)}
  ]
)
Introduction
When it comes to local inference, there are three main tools people reach for: Ollama, llama.cpp, and, in professional settings, vLLM. There are obviously more, but these are the main ones. vLLM is a high-performance inference server that can handle many concurrent requests in parallel. It uses a technique called PagedAttention to use the GPU's VRAM very efficiently. On an RTX 4090, you can expect roughly 3,000 tokens/sec from a 7B-parameter model.
vLLM is a great option if you want to host a local LLM inference server on an old gaming machine. Or, in a corporate environment, you can host it on a GPU-optimized EC2 instance for completely private inference.
To install vLLM, head on over to the docs and run through their quick start guide.
Once installed, you can start an OpenAI-compatible inference server by running the following command:
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct
This will download the model and start up your inference server. Instructor will plug in seamlessly when using the vLLM adapter.
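Before wiring it into Instructor, you can sanity-check the server by hitting its OpenAI-compatible REST API directly. The snippet below is a small sketch that assumes vLLM's default address of http://localhost:8000 and uses Req (which Instructor already pulls in as a dependency); adjust the URL if your server listens elsewhere.
# Sanity check: list the models the local vLLM server is serving.
# Assumes the default http://localhost:8000 address -- adjust if needed.
Req.get!("http://localhost:8000/v1/models").body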
There are three ways to configure Instructor to use vLLM:
- via Mix.install([...], [instructor: [adapter: Instructor.Adapters.VLLM, vllm: [...]]])
- via config :instructor, adapter: Instructor.Adapters.VLLM, vllm: [...]
- at runtime via Instructor.chat_completion(..., config)
For brevity, in this livebook, we'll configure it at runtime:
config = [
  adapter: Instructor.Adapters.VLLM
]
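If your vLLM server is not running on its default local address, the same vllm options shown in the list above can point the adapter elsewhere. The exact option names are adapter-specific, so treat this as a hypothetical sketch (the :api_url key here is an assumption; check the Instructor.Adapters.VLLM docs for what it actually accepts):
# Hypothetical sketch: a config for a vLLM server on another machine.
# The :api_url key is an assumption -- check Instructor.Adapters.VLLM's
# docs for the option names it actually accepts.
remote_config = [
  adapter: Instructor.Adapters.VLLM,
  vllm: [api_url: "http://my-gpu-box:8000/v1"]
]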
defmodule President do
  use Ecto.Schema
  use Instructor

  @primary_key false
  embedded_schema do
    field(:first_name, :string)
    field(:last_name, :string)
    field(:entered_office_date, :date)
  end
end
Instructor.chat_completion(
  [
    model: "Qwen/Qwen2.5-1.5B-Instruct",
    mode: :json_schema,
    response_model: President,
    messages: [
      %{role: "user", content: "Who was the first president of the United States?"}
    ]
  ],
  config
)
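If everything is wired up correctly, Instructor.chat_completion/2 returns an {:ok, struct} tuple with the schema's fields filled in. The exact values depend on what the model generates, but for this prompt the result should look roughly like:
{:ok,
 %President{
   first_name: "George",
   last_name: "Washington",
   entered_office_date: ~D[1789-04-30]
 }}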