Converting Text into Explorer DataFrames

pages/cookbook/text-to-dataframes.livemd

Thomas Millar

@thmsmlr

instructor_ex

Share to X

Share to Bluesky

More notebooks

Converting Text into Explorer DataFrames

Mix.install(
  [
    {:instructor, path: Path.expand("../../", __DIR__)},
    {:explorer, "~> 0.7.2"},
    {:kino, "~> 0.12.0"},
    {:kino_explorer, "~> 0.1.13"}
  ],
  config: [
    instructor: [
      adapter: Instructor.Adapters.OpenAI,
      openai: [api_key: System.fetch_env!("LB_OPENAI_API_KEY")]
    ]
  ]
)

Motivation

Sometimes we come across text that can be naturally extracted into structured data. Although we don’t necessarily know how many tables, and what the schemas are represented in the data. Using Instructor we can let the LLM determine the tables and schemas for us, pulling the results into a dataframe which we can then use for further analysis.

The Schema

Notice from this example we want resulting data structure to have data frames. That’s not a default Ecto type so we’re gonna have to do things a little different here.

Luckily, Ecto allows us to define our own custom types in which we can override how the values are casted. And Instructor provides us a behavior which tells us how to represent it in a JSONSchema so the LLM can understand it.

In our case, we’re going to create a type for a table of data, which will be represented as a CSV string with respect to the LLM. Our custom type will then parse the string when it is casted by Ecto and parse the CSV into an Explorer.DataFrame.

defmodule Ecto.CSVDataFrame do
  use Ecto.Type
  use Instructor.EctoType

  def type, do: :string

  def cast(csv_str) when is_binary(csv_str) do
    df = Explorer.DataFrame.load_csv!(csv_str)
    {:ok, df}
  end

  def cast(%Explorer.DataFrame{} = df), do: {:ok, df}
  def cast(_), do: :error

  def to_json_schema(),
    do: %{
      type: "string",
      description: "A CSV representation of a data table"
    }

  def dump(x), do: {:ok, x}
  def load(x), do: {:ok, x}
end

{:module, Ecto.CSVDataFrame, <<70, 79, 82, 49, 0, 0, 17, ...>>, {:load, 1}}

Now that we have our data type, we can use it in our embedded ecto schema, just as we would have in any other example. If everything works correctly, we should get some data frames out.

defmodule Database do
  use Ecto.Schema
  use Instructor

  @llm_doc """
  The extracted database will contain one or more tables with data as csv formatted with ',' delimiters
  """
  @primary_key false
  embedded_schema do
    embeds_many :tables, DataFrame, primary_key: false do
      field(:name, :string)
      field(:data, Ecto.CSVDataFrame)
    end
  end
end

{:module, Database, <<70, 79, 82, 49, 0, 0, 43, ...>>, :ok}

Perfect. And now let’s test it out on some example text.

extract_df = fn text ->
  Instructor.chat_completion(
    model: "gpt-4o-mini",
    response_model: Database,
    messages: [
      %{
        role: "system",
        content: """
        Map this data into one or more dataframes and correctly define the columns and data
        """
      },
      %{
        role: "user",
        content: "#{text}"
      }
    ]
  )
end

{:ok, db} =
  extract_df.("""
    My name is John and I am 25 years old. I live in
    New York and I like to play basketball. His name is
    Mike and he is 30 years old. He lives in San Francisco
    and he likes to play baseball. Sarah is 20 years old
    and she lives in Los Angeles. She likes to play tennis.
    Her name is Mary and she is 35 years old.
    She lives in Chicago.
    
    On one team 'Tigers' the captain is John and there are 12 players.
    On the other team 'Lions' the captain is Mike and there are 10 players.
  """)

Kino.Layout.tabs(
  for table <- db.tables do
    {table.name, table.data}
  end
)

Other notebooks:

Yejun Su
@goofansu

ogp

ogp

ogp.livemd

tutorial intermediate ogp kino

2022-8-18
@TomBers

livebookNotes

Attractors

attractors.livemd

advanced data-science decimal vega_lite kino

2022-8-18
Kevin Pan
@feng19

spider_man

ElixirJobs

elixirjobs.livemd

tutorial advanced spider_man floki nimble_csv kino

2022-8-18
@TomBers

livebookNotes

Fun with Graphs

graphs.livemd

tutorial advanced intermediate vega_lite kino math

2022-8-18
@DockYard-Academy

curriculum

Drill: Replace `nil`s In A List With Elements From...

deprecated_drill-enum2-replace-nils.livemd

tutorial beginner jason kino youtube hidden_cell

2023-3-21
Andrey Marchenko
@anmarchenko

notebooks

Exploring Explorer

explorer.livemd

tutorial advanced data-science req explorer kino_explorer

2024-7-2
Samuel Willis
@SamuelWillis

website

Website Metrics

metrics.livemd

tutorial data-science intermediate kino kino_vega_lite kino_explorer

2024-11-4

Back