Powered by AppSignal & Oban Pro
Would you like to see your link here? Contact us

Ten Minutes To Explorer

elixir/learn_explorer.livemd

Ten Minutes To Explorer

Mix.install([
  {:kino_explorer, "~> 0.1.4"}
])
:ok

Reading and Writing Data

Explore the built-in dataset of fossils

fossil_fuels = Explorer.Datasets.fossil_fuels()
require Explorer.DataFrame, as: DF
alias Explorer.Series
Explorer.Series

Now convert this DataFrame to a CSV format and write it to a file path on the local machine

input = Kino.Input.text("File Path")
file_path = Kino.Input.read(input)
fossil_fuels |> DF.to_csv(file_path)
:ok

Working with Series

Like Polars, Explorer works up from a concept of a Series. A Series is a data structure representing a single-dimensional list of data.

Explorer supports the following Series data types:

  • :float
  • :integer
  • :boolean
  • :string
  • :date - Elixir.Date
  • :datetime - Elixir.NaiveDateTime

Values within a Series must be of the same data type

s1 = Series.from_list([1, 2, 3])
s2 = Series.from_list(["a", "b", "c"])
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])

Often datasets will contain "null" values. To accomodate for such missing data, Series can also be nullable

s = Series.from_list([1.0, 2.0, nil, nil, 5.0])

Strategies for filling missing values:

  • :forward - replace nil with the previous value
  • :backward - replace nil with the next value
  • :max - replace nil with the series maximum
  • :min - replace nil with the series minimum
  • :mean - replace nil with the series mean
s |> Series.fill_missing(:forward)

Series are also “comparable” to one another

s = 1..11 |> Enum.to_list() |> Series.from_list()
s1 = 11..1 |> Enum.to_list() |> Series.from_list()
Series.equal(s, s1)
Series.equal(s, 5)
Series.not_equal(s, 10)
Series.greater(s, 5)

Arithmetic is also supported for Series

Series.add(s, s1)
Series.multiply(s, 3)

Other Series operations

1..100
|> Enum.to_list()
|> Enum.shuffle()
|> Series.from_list()
|> Series.sort()
s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = s |> Series.argsort() |> Series.to_list()
[49, 0, 60, 81, 91, 14, 39, 63, 20, 67, 45, 38, 68, 79, 6, 16, 97, 25, 85, 7, 73, 4, 83, 19, 76, 92,
 48, 43, 80, 56, 47, 96, 70, 75, 50, 55, 72, 77, 13, 46, 87, 11, 61, 17, 12, 54, 88, 37, 74, 89,
 ...]
Series.slice(s, ids)
s = ["a", "b", "c", "c", "d", "d", "e"] |> Series.from_list() |> Series.distinct()

Working with DataFrames

DataFrame is a collection of Series of the same size

This has the implication that a DataFrame can be created from a Keyword list

df = DF.new(a: [1, 2, 3], b: ["a", "b", "c"])

A DataFrame has grouping information within the data structure that can be extracted with functions in the DataFrame module

DF.names(df)
["a", "b"]
DF.shape(df)
{3, 2}
{DF.n_rows(df), DF.n_columns(df)}
{3, 2}

The DataFrame module is “more than” a set of functions to operate on DataFrame structures. It also exports expressive “verbs” or macros that can be very useful when writing programs

Verbs and Macros

The five main “verbs” to work with dataframes:

  • select
  • filter
  • mutate
  • arrange
  • summarise

Select

An explicit way to select particular columns from a DataFrame would be to pass a list of the string values, representing the column names, to the select/2 function

With the power of pattern matching in Elixir, a callback function can also be passed as the second argument to select/2 to allow for more dynamic selections of data

fossil_fuels |> DF.select(["year", "country"])
fossil_fuels |> DF.select(&String.ends_with?(&1, "fuel"))

The opposite of select/2 is discard/2

fossil_fuels |> DF.discard(&String.ends_with?(&1, "fuel"))

Filter

In the DataFrame module there is a filter/2 function but to express the filter function that should occur macros can be used to produce a very readable implementation

Here the country variable is “infered” at runtime using the column name and accessible at the top level as if the variable is defined locally

fossil_fuels |> DF.filter(country == "BRAZIL")
fossil_fuels |> DF.filter(country == "ALGERIA" and year > 2012)

The same filters can be written without macros by using the callback version of filter/2, called filter_with/2

All Explorer.DataFrame macros have a corresponding function that accepts a callback

fossil_fuels
|> DF.filter_with(fn ldf ->
  ldf["country"]
  |> Series.equal("ALGERIA")
  |> Series.and(Series.greater(ldf["year"], 2012))
end)

When using macros, if a column name is mistyped, a helpful error message is shown

fossil_fuels |> DF.filter(contry == "ALGERIA")

Mutate

A common task might be to add columns or change data within existing ones

fossil_fuels |> DF.mutate(new_column: solid_fuel + cement)
fossil_fuels
|> DF.mutate(
  gas_fuel: Series.cast(gas_fuel, :float),
  gas_and_liquid_fuel: gas_fuel + liquid_fuel
)

Arrange

Sorting a DataFrame is straightforward

fossil_fuels |> DF.arrange(year)
fossil_fuels |> DF.arrange(asc: total, desc: year)
fossil_fuels |> DF.arrange(asc: Series.window_sum(total, 2))