Data transform with Explorer
Mix.install([
{:kino_explorer, "~> 0.1.11"}
])
Introduction
To explore and transform data in Livebook we use two libraries:
-
The
explorer
package brings series (one-dimensional) and dataframes (two-dimensional) for fast data exploration to Elixir. -
The
kino_explorer
package automatically renders anExplorer.DataFrame
orExplorer.Series
as a data table.
We will make extensive use of Explorer.DataFrame
‘s functions, so it’s handy
to alias the module to something shorter. We will also
require
the Explorer.DataFrame
module to use its
querying facilities:
alias Explorer.DataFrame, as: DF
require Explorer.DataFrame
All set, let’s go.
Quick introduction to Explorer and data tables
In short, Explorer is the DataFrame library for Elixir. It brings the essential data analysis, exploration, and transformation tools to the Elixir ecosystem, while the integration provided by KinoExplorer makes it effortless to visualize and interact with data through data tables.
Data tables offers a variety of features to make viewing data more convenient. You can easily select cells, columns or even the entire table, switch between paging and infinite scrolling, sort the columns or search for specific data. Keyboard shortcuts, including copy selection, are also available. Let’s render a DataFrame to see an example:
Explorer.Datasets.fossil_fuels()
|> DF.filter(contains(country, "A") and year < 2013)
|> DF.select(["year", "country", "total"])
The Data transform smart cell
Data tables let us quickly view the raw data, without modifying it, while the Data transform cell allows us to transform it, creating insightful and flexible data pipelines and seeing the results on the fly.
Before we explore its features, we need some data to work with.
teams =
DF.new(
weekday: [
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday"
],
team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
hour: [10, 9, 10, 10, 11, 15, nil, 16, 14, 16]
)
weekdays =
teams
|> DF.mutate(hour: fill_missing(hour, :forward))
|> DF.filter(hour > 10 and team != "B")
|> DF.arrange(asc: weekday)
|> DF.pivot_wider("team", "hour")
Let’s break down what happened in the previous Data transform cell.
Currently, the Data transform cell supports several operations, such as sorting
,
filter
, pivot_wider
, and more. Each operation has its own colored card and
you can move the cards to reorder the operations and see the changes in real time.
Except for pivot_wider
, you can have multiple operations of any type. If two
similar operations are in a row, they are grouped, and a single query command is
generated for them in the code. However, you still have individual control over each.
You can also use the toggle button to enable/disable the operations. This is particularly useful for seeing the implications of each step in your pipeline without having to rewrite it.
The initial state is purely a suggestion. You can easily add and remove operations to get the pipeline that meets your needs.
Finally, the assign to
field allows you to save the resulting DataFrame in a
variable for later use in the notebook or in conjunction with other Smart Cells.
For example, to plot a chart using the Chart cell.
Shortcuts
Key Combo | Description |
---|---|
↑ ↓ ← → | Moves the currently selected cell and clears other selections |
shift + ↑ ↓ ← → | Extends the current selection range in the direction pressed |
alt + ↑ ↓ ← → | Moves the currently selected cell and retains the current selection |
ctrl + ↑ ↓ ← → | Moves the selection as far as possible in the direction pressed |
ctrl + shift + ↑ ↓ ← → | Extends the selection as far as possible in the direction pressed |
shift + home end | Extends the selection as far as possible in the direction pressed |
ctrl + A | Selects all cells |
shift + ␣ | Selects the current row |
ctrl + ␣ | Selects the current col |
esc | Clear the current selection |
ctrl + C | Copies the current selection |
ctrl + home end | Moves the selection to the first/last cell in the data table |
ctrl + shift + home end | Extends the selection to the first/last cell in the data table |