Ten Minutes To Explorer
Mix.install([
  {:kino_explorer, "~> 0.1.4"}
])
:ok
Reading and Writing Data
Explore the built-in fossil fuels dataset
fossil_fuels = Explorer.Datasets.fossil_fuels()
require Explorer.DataFrame, as: DF
alias Explorer.Series
Explorer.Series
Now convert this DataFrame to CSV format and write it to a file path on the local machine
input = Kino.Input.text("File Path")
file_path = Kino.Input.read(input)
fossil_fuels |> DF.to_csv(file_path)
:ok
Working with Series
Like Polars, Explorer works up from a concept of a Series. A Series is a data structure representing a single-dimensional list of data.
Explorer supports the following Series data types:
- :float
- :integer
- :boolean
- :string
- :date - Elixir.Date
- :datetime - Elixir.NaiveDateTime
Values within a Series must be of the same data type
s1 = Series.from_list([1, 2, 3])
s2 = Series.from_list(["a", "b", "c"])
s3 = Series.from_list([~D[2011-01-01], ~D[1965-01-21]])
Often datasets will contain "null" values. To accommodate such missing data, a Series can also be nullable
s = Series.from_list([1.0, 2.0, nil, nil, 5.0])
Strategies for filling missing values:
- :forward - replace nil with the previous value
- :backward - replace nil with the next value
- :max - replace nil with the series maximum
- :min - replace nil with the series minimum
- :mean - replace nil with the series mean
s |> Series.fill_missing(:forward)
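To make the :forward and :backward strategies concrete, here is a minimal sketch on plain Elixir lists. The FillSketch module and its function names are hypothetical and are not part of Explorer; the sketch only mirrors the behaviour described in the list above and says nothing about how Explorer implements it.

```elixir
defmodule FillSketch do
  # Hypothetical helper, not part of Explorer: replaces each nil with the
  # most recent non-nil value seen so far. In this sketch, leading nils
  # stay nil because there is no previous value to copy.
  def forward(list) do
    {filled, _last} =
      Enum.map_reduce(list, nil, fn
        nil, last -> {last, last}
        value, _last -> {value, value}
      end)

    filled
  end

  # Backward fill is forward fill applied to the reversed list.
  def backward(list) do
    list |> Enum.reverse() |> forward() |> Enum.reverse()
  end
end

FillSketch.forward([1.0, 2.0, nil, nil, 5.0])
# => [1.0, 2.0, 2.0, 2.0, 5.0]

FillSketch.backward([1.0, 2.0, nil, nil, 5.0])
# => [1.0, 2.0, 5.0, 5.0, 5.0]
```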
Series are also “comparable” to one another
s = 1..11 |> Enum.to_list() |> Series.from_list()
s1 = 11..1 |> Enum.to_list() |> Series.from_list()
Series.equal(s, s1)
Series.equal(s, 5)
Series.not_equal(s, 10)
Series.greater(s, 5)
Arithmetic is also supported for Series
Series.add(s, s1)
Series.multiply(s, 3)
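The comparison and arithmetic calls above work element by element, and a single value on the right-hand side is applied to every element. The following plain-Elixir sketch illustrates that idea on lists; ElementwiseSketch and apply_op/3 are hypothetical names introduced for illustration and do not exist in Explorer.

```elixir
defmodule ElementwiseSketch do
  # Hypothetical helper: applies `fun` pairwise when both operands are lists.
  def apply_op(left, right, fun) when is_list(right) do
    left
    |> Enum.zip(right)
    |> Enum.map(fn {a, b} -> fun.(a, b) end)
  end

  # When the right operand is a single value, it is reused for every element.
  def apply_op(left, scalar, fun) do
    Enum.map(left, fn a -> fun.(a, scalar) end)
  end
end

ElementwiseSketch.apply_op([1, 2, 3], [3, 2, 1], &Kernel.==/2)
# => [false, true, false]

ElementwiseSketch.apply_op([1, 2, 3], 2, &Kernel.>/2)
# => [false, false, true]

ElementwiseSketch.apply_op([1, 2, 3], [3, 2, 1], &Kernel.+/2)
# => [4, 4, 4]

ElementwiseSketch.apply_op([1, 2, 3], 3, &Kernel.*/2)
# => [3, 6, 9]
```

Comparisons produce a list of booleans, which is the same shape of result that the filtering functions later in this notebook rely on.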
Other Series operations
1..100
|> Enum.to_list()
|> Enum.shuffle()
|> Series.from_list()
|> Series.sort()
s = 1..100 |> Enum.to_list() |> Enum.shuffle() |> Series.from_list()
ids = s |> Series.argsort() |> Series.to_list()
[49, 0, 60, 81, 91, 14, 39, 63, 20, 67, 45, 38, 68, 79, 6, 16, 97, 25, 85, 7, 73, 4, 83, 19, 76, 92,
48, 43, 80, 56, 47, 96, 70, 75, 50, 55, 72, 77, 13, 46, 87, 11, 61, 17, 12, 54, 88, 37, 74, 89,
...]
Series.slice(s, ids)
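The argsort and slice pair above can be read as "find the positions that would sort the data, then pick the elements at those positions". A small plain-Elixir sketch of that idea follows; ArgsortSketch and its functions are hypothetical stand-ins written for this illustration, not Explorer's implementation.

```elixir
defmodule ArgsortSketch do
  # Hypothetical helper: returns the zero-based positions that would put
  # the list in ascending order.
  def argsort(list) do
    list
    |> Enum.with_index()
    |> Enum.sort_by(fn {value, _index} -> value end)
    |> Enum.map(fn {_value, index} -> index end)
  end

  # Picks elements by position, in the order the positions are given.
  def take(list, indices) do
    Enum.map(indices, &Enum.at(list, &1))
  end
end

values = [30, 10, 20]

ids = ArgsortSketch.argsort(values)
# => [1, 2, 0]

ArgsortSketch.take(values, ids)
# => [10, 20, 30]
```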
s = ["a", "b", "c", "c", "d", "d", "e"] |> Series.from_list() |> Series.distinct()
Working with DataFrames
A DataFrame is a collection of Series of the same size
This means that a DataFrame can be created from a keyword list
df = DF.new(a: [1, 2, 3], b: ["a", "b", "c"])
A DataFrame holds information about its own structure, such as column names and dimensions, that can be extracted with functions in the DataFrame module
DF.names(df)
["a", "b"]
DF.shape(df)
{3, 2}
{DF.n_rows(df), DF.n_columns(df)}
{3, 2}
The DataFrame module is “more than” a set of functions to operate on DataFrame structures. It also exports expressive “verbs” or macros that can be very useful when writing programs
Verbs and Macros
The five main “verbs” for working with DataFrames:
- select
- filter
- mutate
- arrange
- summarise
Select
An explicit way to select particular columns from a DataFrame is to pass a list of strings, representing the column names, to the select/2 function
Because functions are first-class values in Elixir, a callback function can also be passed as the second argument to select/2 to allow for more dynamic selections of data
fossil_fuels |> DF.select(["year", "country"])
fossil_fuels |> DF.select(&String.ends_with?(&1, "fuel"))
The opposite of select/2 is discard/2
fossil_fuels |> DF.discard(&String.ends_with?(&1, "fuel"))
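The callback form of select/2 and discard/2 can be thought of as keeping or rejecting column names with a predicate. The sketch below applies the same predicate to a plain list of names; the list is only the subset of column names that appear elsewhere in this notebook, not necessarily the full set of columns in the dataset.

```elixir
# Column names mentioned in this notebook (an illustrative subset).
names = ["year", "country", "total", "solid_fuel", "liquid_fuel", "gas_fuel", "cement"]

# Conceptually what the select/2 callback keeps.
selected = Enum.filter(names, &String.ends_with?(&1, "fuel"))
# => ["solid_fuel", "liquid_fuel", "gas_fuel"]

# Conceptually what remains after discard/2 with the same callback.
remaining = Enum.reject(names, &String.ends_with?(&1, "fuel"))
# => ["year", "country", "total", "cement"]
```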
Filter
The DataFrame module provides filter/2, and because it is a macro the filter condition can be written as a very readable expression
Here the country variable is “inferred” from the column name and is accessible at the top level as if it were a locally defined variable
fossil_fuels |> DF.filter(country == "BRAZIL")
fossil_fuels |> DF.filter(country == "ALGERIA" and year > 2012)
The same filters can be written without macros by using the callback version of filter/2, called filter_with/2
All Explorer.DataFrame macros have a corresponding function that accepts a callback
fossil_fuels
|> DF.filter_with(fn ldf ->
  ldf["country"]
  |> Series.equal("ALGERIA")
  |> Series.and(Series.greater(ldf["year"], 2012))
end)
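One way to picture the callback above is that each comparison produces a boolean mask, the masks are combined with a logical and, and only the rows where the combined mask is true are kept. The following plain-Elixir sketch shows that idea on a map of columns. The data is made up for illustration and is not taken from the fossil fuels dataset, and nothing here reflects how Explorer evaluates the query internally.

```elixir
# Hypothetical column data standing in for a DataFrame (made-up values).
columns = %{
  "country" => ["ALGERIA", "BRAZIL", "ALGERIA"],
  "year" => [2012, 2013, 2014]
}

# One boolean mask per condition.
country_mask = Enum.map(columns["country"], &(&1 == "ALGERIA"))
year_mask = Enum.map(columns["year"], &(&1 > 2012))

# Combine the masks position by position.
mask =
  country_mask
  |> Enum.zip(year_mask)
  |> Enum.map(fn {a, b} -> a and b end)

# => [false, false, true]

# Keep, in every column, only the values whose mask entry is true.
filtered =
  Map.new(columns, fn {name, values} ->
    kept =
      values
      |> Enum.zip(mask)
      |> Enum.filter(fn {_value, keep?} -> keep? end)
      |> Enum.map(fn {value, _keep?} -> value end)

    {name, kept}
  end)

# => %{"country" => ["ALGERIA"], "year" => [2014]}
```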
When using macros, if a column name is mistyped, a helpful error message is shown
fossil_fuels |> DF.filter(contry == "ALGERIA")
Mutate
A common task might be to add columns or change data within existing ones
fossil_fuels |> DF.mutate(new_column: solid_fuel + cement)
fossil_fuels
|> DF.mutate(
  gas_fuel: Series.cast(gas_fuel, :float),
  gas_and_liquid_fuel: gas_fuel + liquid_fuel
)
Arrange
Sorting a DataFrame is straightforward
fossil_fuels |> DF.arrange(year)
fossil_fuels |> DF.arrange(asc: total, desc: year)
fossil_fuels |> DF.arrange(asc: Series.window_sum(total, 2))