Explorer vs. Pandas
# If your livebook is inside a sub directory like `lib/` or `notebooks/`:
my_app_root = Path.join(__DIR__, "..")
Mix.install(
[
{:eds, path: my_app_root, env: :dev},
{:kino_vega_lite, "~> 0.1.7"}
],
config_path: Path.join(my_app_root, "config/config.exs"),
lockfile: Path.join(my_app_root, "mix.lock")
)
DataFrame
About
A DataFrame
is a 2-dimensional representation of data. It’s one of the building blocks of data science.
{:ok, df} = Explorer.DataFrame.from_csv("data/data.csv")
Inspecting the DataFrame
Explorer.DataFrame.dtypes(df)
# Columns
df.names
# Number of Rows
Explorer.DataFrame.n_rows(df)
# Getting mean values from a column
map = Explorer.DataFrame.to_series(df)
Explorer.Series.mean(map["Pulse"])
Viewing and Selecting Data
Selecting a single column looks exactly as it would in Pandas.
{:ok, df} = Explorer.DataFrame.from_csv("data/data.csv")
df["Pulse"]
# Or, use `pull/2` to extract the series
Explorer.DataFrame.pull(df, "Pulse")
# Select the given columns
Explorer.DataFrame.select(df, ["Pulse", "Calories"])
Explorer.DataFrame.table(df)
Use limit to simulate head
In Pandas, you make frequent use of .head()
to inspect a table. Explorer.DataFrame
also has a head/2
function, but if you’re wanting to inspect a table, you can use table/2
and set the :limit
option.
Explorer.DataFrame.table(df, limit: 7)
.loc and .iloc
These don’t exist in Explorer
the same way they exist in Python’s Pandas. In Pandas, a loc
lookup will fetch any value(s) with the given label (there might be more than one!), whereas iloc
refers solely to the integer index (i.e. the position) in the series.
What is confusing here is that a Pandas series actually maintains 2 distinct lists for each series (!!): in addition to the list of data, there is also a list of “labels”, which are referred to as “indexes” (e.g. by Panda’s “index” attribute). Tricky!
Explorer.Series.at/2
operates like Pandas iloc
in that deals with the sequential numerical index (just like its cousin Enum.at/2
).
s = Explorer.Series.from_list(["dog", "cat", "bird", "pig", "snake"])
Explorer.Series.at(s, 2)
To achieve .loc
functionality in Elixir, you’d have to explicitly create and maintain 2 distinct series, e.g. within a DataFrame
. Explicit is good, because who would have thunk a separate series of “labels” was hiding inside a regular Pandas series?
Here’s an example of creating a DataFrame
with 2 series, one is the obvious data (animals), and the other is the not-so-obvious “loc” labels.
require Explorer.DataFrame
df = Explorer.DataFrame.new(animals: ["dog", "cat", "bird", "pig", "snake"], loc: [0, 3, 5, 3, 9])
Explorer.DataFrame.filter(df, loc == 3)
require Explorer.DataFrame
animals_series = Explorer.Series.from_list(["dog", "cat", "bird", "pig", "snake"])
loc_series = Explorer.Series.from_list([0, 3, 5, 3, 9])
df = Explorer.DataFrame.new(animals: animals_series, loc: loc_series)
Explorer.DataFrame.filter(df, loc == 3)
Filtering
In Pandas, you might put some filter logic in between the square brackets, e.g.
# a python DataFrame "query"
df[ df["Pulse"] < 100 ]
In Explorer
, you perform your query more explicitly. Remember that you usually must require Explorer.DataFrame
before constructing your query.
It’s important to use the col/1
helper to wrap your column names because much of the time, the column name won’t be a valid Elixir variable. E.g. capitalized words? Nope. Just use col/1
.
require Explorer.DataFrame
{:ok, df} = Explorer.DataFrame.from_csv("data/data.csv")
Explorer.DataFrame.filter(df, col("Pulse") < 100)
Crosstab
Pandas features a helpful crosstab
function which computes a simple cross tabulation of two (or more) factors.
Hmmmm…..
Groupby
With Pandas DataFrames, you can peform an analysis on a field by grouping on another. For example, if you want to see the mean odometer value broken down by the make of car:
cars.groupby(["Make"]).mean()
With Explorer
, there is a group_by/2
function, but you have to explicitly add the aggregations you want. These are, after all, essentially “new” columns in a derivative dataframe.
require Explorer.DataFrame
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales.csv")
df
|> Explorer.DataFrame.group_by("Make")
|> Explorer.DataFrame.summarise(
mean_doors: mean(col("Doors")),
mean_odometer: mean(col("Odometer (KM)"))
)
Plots and Visualizations
How about visualizing your data? Pandas offers some simple ways to graph data, e.g. produce a simple bar chart:
car_sales["Odometer (KM)"].plot()
VegaLite.new(title: "Odometer")
|> VegaLite.data_from_values(df, only: ["Odometer (KM)"])
|> VegaLite.mark(:bar)
|> VegaLite.encode_field(:y, "Odometer (KM)", type: :quantitative)
Histogram
How about a histogram?
VegaLite.new()
|> VegaLite.data_from_values(df, only: ["Odometer (KM)"])
|> VegaLite.mark(:line)
|> VegaLite.encode(:x, aggregate: :count)
|> VegaLite.encode_field(:y, "Odometer (KM)", type: :quantitative)
Mutate
Modify a column
It’s common for CSVs to contain strings, when you might need to represent numbers as integers. E.g. financial data might be listed as $4,000
instead of 4000
.
Here, we essentially need to do a mapping operation on all the values in a column, i.e. transform each value in a series.
require Explorer.DataFrame
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales.csv")
# Explorer.DataFrame.mutate_with(df, &[c: Explorer.Series.add(&1["a"], &1["b"])])
Explorer.DataFrame.mutate_with(df, fn x ->
[
{:PriceMOD,
Enum.map(
x["Price"],
fn p ->
p |> String.trim_leading("$") |> String.replace(",", "") |> Float.parse() |> elem(0)
end
)}
]
end)
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales.csv")
Explorer.DataFrame.put(
df,
"Price",
Explorer.Series.transform(
df["Price"],
fn "$" <> n -> n |> String.replace(",", "") |> Float.parse() |> elem(0) end
)
)
Downcase
Here’s how you would convert a column to lowercase. Note the weird requirement to reference "Make"
(the column name) as an atom in one case and then wrapped using col/1
in another.
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales.csv")
Explorer.DataFrame.mutate(df, Make: downcase(col("Make")))
Missing Values
There are a few ways to deal with missing values. If you want to set them all to constants, you can provide a value to the fill_missing/2
function:
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales-missing-data.csv")
df = Explorer.DataFrame.put(df, "Odometer", Explorer.Series.fill_missing(df["Odometer"], 666))
Explorer.DataFrame.table(df, limit: :infinity)
Or, you can insert a calculated value, such as the mean:
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales-missing-data.csv")
df = Explorer.DataFrame.put(df, "Odometer", Explorer.Series.fill_missing(df["Odometer"], :mean))
Explorer.DataFrame.table(df, limit: :infinity)
{:ok, df} = Explorer.DataFrame.from_csv("data/car-sales-missing-data.csv")
df = Explorer.DataFrame.drop_nil(df)
Explorer.DataFrame.table(df, limit: :infinity)