# Getting Started

```elixir
Mix.install([{:dux, "~> 0.2.0"}])
```
## Meet the Penguins

Dux is a DuckDB-native dataframe library for Elixir. Let’s explore it with a real dataset — 344 penguins from Palmer Station, Antarctica.

```elixir
require Dux

penguins = Dux.Datasets.penguins()
```
> #### Lazy by default {: .info}
>
> Nothing has hit DuckDB yet. `penguins` is a `%Dux{}` struct holding a
> source and zero operations. Everything is lazy until you call `compute/1`,
> `to_rows/1`, or `to_columns/1`.
## Your First Pipeline

Pipelines chain verbs — each returns a new `%Dux{}`. The `require Dux` above
unlocks expression macros where bare identifiers become column names.

```elixir
penguins
|> Dux.filter(species == "Gentoo" and body_mass_g > 5000)
|> Dux.select([:species, :island, :body_mass_g, :sex])
|> Dux.sort_by(desc: :body_mass_g)
|> Dux.head(5)
```
## Materialization

Pipelines are lazy — nothing executes until you materialise:

- `compute/1` — execute and return a `%Dux{}` with the data
- `to_rows/1` — execute and return a list of maps
- `to_columns/1` — execute and return a map of lists
```elixir
penguins
|> Dux.filter(species == "Adelie")
|> Dux.head(3)
|> Dux.compute()
```

```elixir
penguins
|> Dux.filter(species == "Adelie")
|> Dux.select([:island, :bill_length_mm])
|> Dux.head(3)
|> Dux.to_rows()
```

```elixir
penguins
|> Dux.filter(species == "Chinstrap")
|> Dux.select([:species, :body_mass_g])
|> Dux.head(5)
|> Dux.to_columns()
```
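To make the two output shapes concrete, here is the same data converted from the row-oriented shape to the column-oriented one using only the Elixir standard library — a sketch of the relationship between the two shapes, not Dux code (the sample values are illustrative):

```elixir
# Two records in the shape to_rows/1 returns: a list of maps.
rows = [
  %{island: "Torgersen", bill_length_mm: 39.1},
  %{island: "Torgersen", bill_length_mm: 39.5}
]

# The column-oriented equivalent, the shape to_columns/1 returns: a map of lists.
columns =
  rows
  |> Enum.flat_map(&Map.to_list/1)
  |> Enum.group_by(fn {key, _value} -> key end, fn {_key, value} -> value end)

IO.inspect(columns)
# %{bill_length_mm: [39.1, 39.5], island: ["Torgersen", "Torgersen"]}
```

Row shape is convenient for iterating over records; column shape is convenient for feeding whole columns to plotting or numeric code.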
## Aggregation

Group rows and compute summaries. The expression syntax lets you write
aggregates such as `avg(body_mass_g)` directly:
```elixir
penguins
|> Dux.drop_nil([:body_mass_g, :sex])
|> Dux.group_by([:species, :sex])
|> Dux.summarise(
  count: count(body_mass_g),
  avg_mass: avg(body_mass_g),
  max_mass: max(body_mass_g)
)
|> Dux.sort_by([:species, :sex])
|> Dux.compute()
```
> #### Raw SQL alternative {: .tip}
>
> Every expression verb has a `_with` variant that accepts raw DuckDB SQL strings:
>
> ```elixir
> Dux.summarise_with(df, avg_mass: "AVG(body_mass_g)")
> ```
>
> Use this for DuckDB functions the expression macro doesn’t cover.
## Computed Columns

`mutate/2` adds or replaces columns:

```elixir
penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(bill_ratio: bill_length_mm / bill_depth_mm)
|> Dux.select([:species, :bill_length_mm, :bill_depth_mm, :bill_ratio])
|> Dux.sort_by(desc: :bill_ratio)
|> Dux.head(5)
|> Dux.compute()
```
## Interpolation

Use `^` to interpolate Elixir values into expressions as parameter bindings
(safe from SQL injection):

```elixir
min_mass = 4500

penguins
|> Dux.filter(body_mass_g > ^min_mass)
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.sort_by(desc: :n)
|> Dux.compute()
```
## Conditional Expressions

Elixir’s `cond` compiles to SQL `CASE WHEN` — classify rows directly in your pipeline:

```elixir
penguins
|> Dux.drop_nil([:body_mass_g])
|> Dux.mutate(
  size:
    cond do
      body_mass_g > 5000 -> "large"
      body_mass_g > 3500 -> "medium"
      true -> "small"
    end
)
|> Dux.group_by(:size)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```
For simple two-branch logic, use `if/else`:

```elixir
penguins
|> Dux.drop_nil([:body_mass_g])
|> Dux.mutate(heavy: if(body_mass_g > 4500, do: "yes", else: "no"))
|> Dux.group_by(:heavy)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```
## Membership Tests

Use `in` to filter by a set of values:

```elixir
penguins
|> Dux.filter(species in ["Adelie", "Chinstrap"])
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```

Works with interpolated lists too:

```elixir
target_islands = ["Biscoe", "Dream"]

penguins
|> Dux.filter(island in ^target_islands)
|> Dux.group_by(:island)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```
## Using DuckDB Functions

All DuckDB functions work inside the query macro — they pass through as SQL function calls. Here are some useful ones:

```elixir
# String functions
penguins
|> Dux.mutate(
  species_lower: lower(species),
  first_char: left(species, 1)
)
|> Dux.select([:species, :species_lower, :first_char])
|> Dux.distinct()
|> Dux.compute()
```

```elixir
# Math and rounding
penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(
  bill_ratio: round(bill_length_mm / bill_depth_mm, 2),
  log_mass: round(ln(body_mass_g), 2)
)
|> Dux.select([:species, :bill_ratio, :log_mass])
|> Dux.head(5)
|> Dux.compute()
```

```elixir
# Null handling with coalesce
penguins
|> Dux.mutate(safe_sex: coalesce(sex, "unknown"))
|> Dux.group_by(:safe_sex)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```
> #### Any DuckDB function works {: .info}
>
> `lower()`, `upper()`, `trim()`, `year()`, `month()`, `date_diff()`,
> `regexp_matches()`, `list_extract()`, `struct_extract()`, `coalesce()`,
> `cast()`, and hundreds more.
> If DuckDB has it, you can call it in a Dux expression.
## See the SQL

Every pipeline compiles to SQL CTEs. Use `sql_preview/1` to inspect
what DuckDB will execute:

```elixir
penguins
|> Dux.filter(species == "Gentoo")
|> Dux.mutate(mass_kg: body_mass_g / 1000.0)
|> Dux.group_by(:island)
|> Dux.summarise(avg_kg: avg(mass_kg))
|> Dux.sql_preview(pretty: true)
|> IO.puts()
```
## Reading & Writing Files

Dux reads and writes CSV, Parquet, and NDJSON via DuckDB — including reads from S3 and HTTP URLs:

```elixir
# Write the filtered penguins to Parquet
path = Path.join(System.tmp_dir!(), "gentoo.parquet")

penguins
|> Dux.filter(species == "Gentoo")
|> Dux.to_parquet(path)

# Read it back
Dux.from_parquet(path)
|> Dux.head(3)
|> Dux.compute()

# Cleanup
File.rm(path)
```
## Joins

Combine datasets by matching on shared columns:

```elixir
# Create a small lookup table
habitat =
  Dux.from_list([
    %{island: "Biscoe", climate: "maritime"},
    %{island: "Dream", climate: "maritime"},
    %{island: "Torgersen", climate: "polar"}
  ])

penguins
|> Dux.join(habitat, on: :island)
|> Dux.group_by(:climate)
|> Dux.summarise(n: count(species))
|> Dux.compute()
```
## What’s Next

- Data IO — CSV, Parquet, NDJSON, S3, glob patterns
- Transformations — deep dive on `filter`, `mutate`, window functions
- Joins & Reshape — star schema joins, `pivot_wider`/`pivot_longer`
- Distributed Execution — scale across BEAM nodes with `distribute/2`
- Graph Analytics — PageRank, shortest paths, connected components