Converting Text into Explorer DataFrames
Mix.install(
[
{:instructor_lite, path: Path.expand("../../", __DIR__)},
{:req, "~> 0.5"},
{:explorer, "~> 0.7.2"},
{:kino, "~> 0.12.0"},
{:kino_explorer, "~> 0.1.13"}
]
)
Motivation
Sometimes we come across text that can be naturally extracted into structured data. Although we don’t necessarily know how many tables, and what the schemas are represented in the data. Using Instructor we can let the LLM determine the tables and schemas for us, pulling the results into a dataframe which we can then use for further analysis.
Setup
In order to run code in this notebook, you need to add your OpenAI API key as an OPENAI_KEY
Livebook secret. It will then be accessible through an environmental variable.
secret_key = System.fetch_env!("LB_OPENAI_KEY")
:ok
:ok
The Schema
Notice from this example we want resulting data structure to have data frames. That’s not a default Ecto type so we’re gonna have to do things a little different here.
Luckily, Ecto allows us to define our own custom types in which we can override how the values are casted.
In our case, we’re going to create a type for a table of data, which will be represented as a CSV string with respect to the LLM. Our custom type will then parse the string when it is casted by Ecto and parse the CSV into an Explorer.DataFrame
.
defmodule Ecto.CSVDataFrame do
use Ecto.Type
def type, do: :string
def cast(csv_str) when is_binary(csv_str) do
df = Explorer.DataFrame.load_csv!(csv_str)
{:ok, df}
end
def cast(%Explorer.DataFrame{} = df), do: {:ok, df}
def cast(_), do: :error
def dump(x), do: {:ok, x}
def load(x), do: {:ok, x}
end
{:module, Ecto.CSVDataFrame, <<70, 79, 82, 49, 0, 0, 11, ...>>, {:load, 1}}
Now that we have our data type, we can use it in our embedded Ecto schema, just as we would have in any other example. The built-in InstructorLite.JSONSchema
module won’t help us here, as it doesn’t know how to convert a custom Ecto type, so we’ll craft our own schema in the json_schema
callback. If everything works correctly, we should get some data frames out.
defmodule Database do
use Ecto.Schema
use InstructorLite.Instruction
@notes """
The extracted database will contain one or more tables with data as csv formatted with ',' delimiters
"""
@primary_key false
embedded_schema do
embeds_many :tables, DataFrame, primary_key: false do
field(:name, :string)
field(:data, Ecto.CSVDataFrame)
end
end
@impl InstructorLite.Instruction
def json_schema() do
%{
type: "object",
description: "",
title: "Database",
required: [:tables],
"$defs": %{
"DataFrame" => %{
type: "object",
description: "",
title: "DataFrame",
required: [:data, :name],
properties: %{data: %{type: "string"}, name: %{type: "string"}},
additionalProperties: false
}
},
properties: %{tables: %{type: "array", items: %{"$ref": "#/$defs/DataFrame"}}},
additionalProperties: false
}
end
end
{:module, Database, <<70, 79, 82, 49, 0, 0, 18, ...>>, {:json_schema, 0}}
Perfect. And now let’s test it out on some example text.
extract_df = fn text ->
InstructorLite.instruct(%{
model: "gpt-4o-mini",
messages: [
%{
role: "system",
content: """
Map this data into one or more dataframes and correctly define the columns and data
"""
},
%{
role: "user",
content: "#{text}"
}
]
},
response_model: Database,
adapter_context: [api_key: secret_key]
)
end
{:ok, db} =
extract_df.("""
My name is John and I am 25 years old. I live in
New York and I like to play basketball. His name is
Mike and he is 30 years old. He lives in San Francisco
and he likes to play baseball. Sarah is 20 years old
and she lives in Los Angeles. She likes to play tennis.
Her name is Mary and she is 35 years old.
She lives in Chicago.
On one team 'Tigers' the captain is John and there are 12 players.
On the other team 'Lions' the captain is Mike and there are 10 players.
""")
Kino.Layout.tabs(
for table <- db.tables do
{table.name, table.data}
end
)