Ten Minutes to Explorer
Mix.install([
{:explorer, "~> 0.7.0"},
{:kino, "~> 0.10.0"}
])
Introduction
Explorer is a dataframe library for Elixir. A dataframe is a common data structure used in data analysis. It is a two-dimensional table composed of columns and rows similar to a SQL table or a spreadsheet.
Explorer’s aim is to provide a simple and powerful API for manipulating dataframes. It takes influences mainly
from the tidyverse
, but if you’ve used other dataframe libraries like pandas
you shouldn’t have too much trouble working with Explorer.
This document is meant to give you a crash course in using Explorer. More in-depth documentation can be found in the relevant sections of the docs.
We strongly recommend you run this livebook locally so you can see the outputs and play with the inputs!
Reading and writing data
Data can be read from delimited files (like CSV), NDJSON, Parquet, and the Arrow IPC (feather) format. You can also load in data from a map or keyword list of columns with Explorer.DataFrame.new/1
.
For CSV, your ‘usual suspects’ of options are available:
-
delimiter
- A single character used to separate fields within a record. (default:","
) -
dtypes
- A keyword list of[column_name: dtype]
. If a type is not specified for a column, it is imputed from the first 1000 rows. (default:[]
) -
header
- Does the file have a header of column names as the first row or not? (default:true
) -
max_rows
- Maximum number of lines to read. (default:nil
) -
nil_values
- A list of strings that should be interpreted as a nil values. (default:[]
) -
skip_rows
- The number of lines to skip at the beginning of the file. (default:0
) -
columns
- A list of column names to keep. If present, only these columns are read into the dataframe. (default:nil
)
Explorer
also has multiple example datasets built in, which you can load from the Explorer.Datasets
module like so:
df = Explorer.Datasets.fossil_fuels()
You’ll notice that the output looks slightly different than many dataframe libraries. Explorer
takes inspiration on this front from glimpse
in R. A benefit to this approach is that you will rarely need to elide columns.
If you’d like to see a table with your data, take a look at Kino Explorer, that provides a rich table with filtering and sorting.
Writing files is very similar to reading them. The options are a little more limited:
-
header
- Should the column names be written as the first line of the file? (default:true
) -
delimiter
- A single character used to separate fields within a record. (default:","
)
First, let’s add some useful aliases:
alias Explorer.DataFrame
alias Explorer.Series
And then write to a file of your choosing:
input = Kino.Input.text("Filename")
filename = Kino.Input.read(input)
DataFrame.to_csv(df, filename)
Working with Series
Explorer
, like Polars
, works up from the concept of a Series
. In many ways, you can think of a dataframe as a row-aligned map of Series
. These are like vectors
in R or series
in Pandas.
Explorer
supports the following Series
dtypes
:
-
:binary
- Binaries (sequences of bytes) -
:boolean
- Boolean -
:category
- Strings but represented internally as integers -
:date
- Date type that unwraps toElixir.Date
-
{:datetime, precision}
- DateTime type with millisecond/microsecond/nanosecond precision that unwraps toElixir.NaiveDateTime
-
{:duration, precision}
- Duration type with millisecond/microsecond/nanosecond precision that unwraps toExplorer.Duration
-
:float
- 64-bit floating point number -
:integer
- 64-bit signed integer -
:string
- UTF-8 encoded binary -
:time
- Time type that unwraps toElixir.Time
Series
can be constructed from Elixir basic types. For example:
s1 = Series.from_list([1, 2, 3])
s2 = Series.from_list(["a", "b", "c"])
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])
You’ll notice that the dtype
and size of the Series
are at the top of the printed value. You can get those programmatically as well.
Series.dtype(s3)
Series.size(s3)
And the printed values max out at 50:
1..100 |> Enum.to_list() |> Series.from_list()
Series are also nullable.
s = Series.from_list([1.0, 2.0, nil, nil, 5.0])
And you can fill in those missing values using one of the following strategies:
-
:forward
- replace nil with the previous value -
:backward
- replace nil with the next value -
:max
- replace nil with the series maximum -
:min
- replace nil with the series minimum -
:mean
- replace nil with the series mean
Series.fill_missing(s, :forward)
In the case of mixed numeric types (i.e. integers and floats), Series
will downcast to a float:
Series.from_list([1, 2.0])
In all other cases, Series
must all be of the same dtype
or else you’ll get an ArgumentError
.
Series.from_list([1, 2, 3, "a"])
One of the goals of Explorer
is useful error messages. If you look at the error above, you get:
> the value “a” does not match the inferred series dtype :integer
Hopefully this makes abundantly clear what’s going on.
Series
also implements the Access
protocol. You can slice and dice in many ways:
s = 1..10 |> Enum.to_list() |> Series.from_list()
s[1]
s[-1]
s[0..4]
s[[0, 4, 4]]
And of course, you can convert back to an Elixir list.
Series.to_list(s)
Explorer
comparisons return boolean series. We will talk more
about boolean series later.
s = 1..11 |> Enum.to_list() |> Series.from_list()
s1 = 11..1 |> Enum.to_list() |> Series.from_list()
Series.equal(s, s1)
Series.equal(s, 5)
Series.not_equal(s, 10)
Series.greater_equal(s, 4)
Explorer
supports arithmetic.
Series.add(s, s1)
Series.subtract(s, 4)
Series.multiply(s, s1)
Remember those helpful errors? We’ve tried to add those throughout. So if you try to do arithmetic with mismatching dtypes:
s = Series.from_list([1, 2, 3])
s1 = Series.from_list([1.0, 2.0, 3.0])
Series.add(s, s1)
Just kidding! Integers and floats will downcast to floats. Let’s try again:
s = Series.from_list([1, 2, 3])
s1 = Series.from_list(["a", "b", "c"])
Series.add(s, s1)
You can flip them around.
s = Series.from_list([1, 2, 3, 4])
Series.reverse(s)
And sort.
1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list() |> Series.sort()
Or argsort.
s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = Series.argsort(s) |> Series.to_list()
Which you can pass to Explorer.Series.slice/2
if you want the sorted values.
Series.slice(s, ids)
You can calculate cumulative values.
s = 1..100 |> Enum.to_list() |> Series.from_list()
Series.cumulative_sum(s)
Or rolling ones.
Series.window_sum(s, 4)
You can count and list unique values.
s = Series.from_list(["a", "b", "b", "c", "c", "c"])
Series.distinct(s)
Series.n_distinct(s)
And you can even get a dataframe showing the frequencies for each distinct value.
Series.frequencies(s)
Back to those boolean series returned by comparison functions like equal
and not_equal
.
These boolean series can be combined with other functions to perform conditional operations.
s1 = Series.from_list(["It", "was", "the", "best", "of", "times"])
s1 |> Series.equal("best") |> Series.select("worst", s1)
Working with DataFrames
A DataFrame
is really just a collection of Series
of the same size. Which is why you can create a DataFrame
from a Keyword
list.
DataFrame.new(a: [1, 2, 3], b: ["a", "b", "c"])
Similarly to Series
, the Inspect
implementation prints some info at the top and to the left. At the top we see the shape of the dataframe (rows and columns) and then for each column we see the name, dtype, and first five values. We can see a bit more from that built-in dataset we loaded in earlier.
DataFrame.dummies(df, "country")
You will also see grouping information there, but we’ll get to that later. You can get the info yourself directly:
DataFrame.names(df)
DataFrame.dtypes(df)
DataFrame.shape(df)
{DataFrame.n_rows(df), DataFrame.n_columns(df)}
We can grab the head.
DataFrame.head(df)
Or the tail. Let’s get a few more values from the tail.
DataFrame.tail(df, 10)
Verbs and macros
In Explorer
, like in dplyr
, we have five main verbs to work with dataframes:
- select
- filter
- mutate
- arrange
- summarise
We are going to explore then in this notebook, but first we need to “require”
the Explorer.DataFrame
module in order to load the macros needed for these verbs.
I want to take the opportunity to create a shorter alias for the DataFrame
module,
called DF
:
require DataFrame, as: DF
From now on we are using the shorter version, DF
, to refer to the required Explorer.DataFrame
module.
Select
Let’s jump right into it. We can select columns pretty simply.
DF.select(df, ["year", "country"])
But Elixir gives us some superpowers. In R there’s tidy-select
. I don’t think we need that in Elixir. Anywhere in Explorer
where you need to pass a list of column names, you can also execute a filtering callback on the column names. It’s just an anonymous function passed to df |> DataFrame.names() |> Enum.filter(callback_here)
.
DF.select(df, &String.ends_with?(&1, "fuel"))
Want all but some columns? discard/2
performs the opposite of select/2
.
DF.discard(df, &String.ends_with?(&1, "fuel"))
Filter
The next verb we’ll look at is filter
.
This is implemented using a macro, so it’s possible to use expressions like you would if comparing variables in Elixir:
DF.filter(df, country == "BRAZIL")
Using complex filters is also possible:
DF.filter(df, country == "ALGERIA" and year > 2012)
You can also write the same filter without the macro, by using the callback version function which is filter_with/2
:
DF.filter_with(df, fn ldf ->
ldf["country"]
|> Series.equal("ALGERIA")
|> Series.and(Series.greater(ldf["year"], 2012))
end)
By the way, all the Explorer.DataFrame
macros have a correspondent function that accepts a callback.
In fact, our macros are implemented using those functions.
The filter_with/2
function is going to use a virtual representation of the dataframe
that we call a “lazy frame”. With lazy frames you can´t access the
series contents, but every operation will be optimized and run only once.
Remember those helpful error messages?
DF.filter(df, cuontry == "BRAZIL")
Mutate
A common task in data analysis is to add columns or change existing ones. Mutate is a handy verb.
DF.mutate(df, new_column: solid_fuel + cement)
Did you catch that? You can pass in new columns as keyword arguments. It also works to transform existing columns.
DF.mutate(df,
gas_fuel: Series.cast(gas_fuel, :float),
gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
DataFrame.mutate/2
is flexible though. You may not always want to use keyword arguments. Given that column names are String.t()
, it may make more sense to use a map.
DF.mutate(df, %{"gas_fuel" => gas_fuel - 10})
DF.transmute/2
, which is DF.mutate/2
that only retains the specified columns, is forthcoming.
Arrange
Sorting the dataframe is pretty straightforward.
DF.arrange(df, year)
But it comes with some tricks up its sleeve.
DF.arrange(df, asc: total, desc: year)
As the examples show, arrange/2
is a macro, and therefore you can use some functions to arrange your dataframe:
DF.arrange(df, asc: Series.window_sum(total, 2))
Sort operations happen left to right. And keyword list args permit specifying the direction.
Distinct
Okay, as expected here too. Very straightforward.
DF.distinct(df, ["year", "country"])
You can specify whether to keep the other columns as well, so the first row of each distinct value is kept:
DF.distinct(df, ["country"], keep_all: true)
Rename
Rename can take either a list of new names or a callback that is passed to Enum.map/2
against the names. You can also use a map or keyword args to rename specific columns.
DF.rename(df, year: "year_test")
DF.rename_with(df, &(&1 <> "_test"))
Dummies
This is fun! We can get dummy variables for unique values.
DF.dummies(df, ["year"])
IO.inspect(df)
DF.dummies(df, ["country", "per_capita"])
Sampling
Random samples can give us a percent or a specific number of samples, with or without replacement, and the function is seedable.
DF.sample(df, 10)
DF.sample(df, 0.4)
Trying for those helpful error messages again.
DF.sample(df, 10000)
DF.sample(df, 10000, replace: true)
Pull and slice
Slicing and dicing can be done with the Access
protocol or with explicit pull/slice/take functions.
df["year"]
DF.pull(df, "year")
df[["year", "country"]]
DF.slice(df, [1, 20, 50])
Negative offsets work for slice!
DF.slice(df, -10, 5)
DF.slice(df, 10, 5)
Slice also works with ranges:
DF.slice(df, 12..42)
Pivot
We can pivot_longer/3
and pivot_wider/4
. These are inspired by tidyr.
There are some shortcomings in pivot_wider/4
related to polars
. The select
option must select only columns of numeric type.
DF.pivot_longer(df, ["year", "country"], select: &String.ends_with?(&1, "fuel"))
DF.pivot_wider(df, "country", "total", id_columns: ["year"])
Let’s make those names look nicer!
tidy_names = fn name ->
name
|> String.downcase()
|> String.replace(~r/\s/, " ")
|> String.replace(~r/[^A-Za-z\s]/, "")
|> String.replace(" ", "_")
end
df
|> DF.pivot_wider("country", "total", id_columns: ["year"])
|> DF.rename_with(tidy_names)
Joins
Joining is fast and easy. You can specify the columns to join on and how to join. Polars even supports cartesian (cross) joins, so Explorer
does too.
df1 = DF.select(df, ["year", "country", "total"])
df2 = DF.select(df, ["year", "country", "cement"])
DF.join(df1, df2)
df3 = df |> DF.select(["year", "cement"]) |> DF.slice(0, 500)
DF.join(df1, df3, how: :left)
Grouping
Explorer
supports groupby operations. They’re limited based on what’s possible in Polars, but they do most of what you need to do.
grouped = DF.group_by(df, ["country"])
Notice that the Inspect
call now shows groups
as well as rows
and columns
. You can, of course, get them explicitly.
DF.groups(grouped)
And you can ungroup explicitly.
DF.ungroup(grouped)
But what we care about the most is aggregating! Let’s see which country has the max per_capita
value.
grouped
|> DF.summarise(max_per_capita: max(per_capita))
|> DF.arrange(desc: max_per_capita)
Qatar it is.
You may noticed that we are using max/1
inside the summarise
macro. This is possible because we expose all functions from the Series
module. You can use the following aggregations inside summarise:
-
min/1
- Take the minimum value within the group. SeeExplorer.Series.min/1
. -
max/1
- Take the maximum value within the group. SeeExplorer.Series.max/1
. -
sum/1
- Take the sum of the series within the group. SeeExplorer.Series.sum/1
. -
mean/1
- Take the mean of the series within the group. SeeExplorer.Series.mean/1
. -
median/1
- Take the median of the series within the group. SeeExplorer.Series.median/1
. -
first/1
- Take the first value within the group. SeeExplorer.Series.first/1
. -
last/1
- Take the last value within the group. SeeExplorer.Series.last/1
. -
count/1
- Count the number of rows per group. -
n_unique/1
- Count the number of unique rows per group.
The API is similar to mutate
: you can use keyword args or a map and specify aggregations to use.
DF.summarise(grouped, min_per_capita: min(per_capita), min_total: min(total))
Speaking of mutate
, it’s ‘group-aware’. As are arrange
, distinct
, and n_rows
.
DF.mutate(grouped, total_window_sum: window_sum(total, 3), rows_in_group: count(country))
It’s also possible to use aggregations inside other functions:
grouped
|> DF.summarise(greater_than_9: greater(max(per_capita), 9.0), per_capita_max: max(per_capita))
|> DataFrame.arrange(desc: per_capita_max)
That’s it!
And not. This is certainly not exhaustive, but I hope it gives you a good idea of what can be done and what the ‘flavour’ of the API is like. I’d love contributions and issues raised where you find them!