
Getting Started

Mix.install([{:dux, "~> 0.2.0"}])

Meet the Penguins

Dux is a DuckDB-native dataframe library for Elixir. Let’s explore it with a real dataset — 344 penguins from Palmer Station, Antarctica.

require Dux

penguins = Dux.Datasets.penguins()

> #### Lazy by default {: .info}
>
> Nothing has hit DuckDB yet. `penguins` is a `%Dux{}` struct holding a
> source and zero operations. Everything is lazy until you call `compute/1`,
> `to_rows/1`, or `to_columns/1`.
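To see the laziness in action, add a step and inspect the result: it is still a `%Dux{}` describing the query, not its rows (a sketch; no SQL runs here):

```elixir
lazy = penguins |> Dux.filter(body_mass_g > 4000)
# still a %Dux{} query description; DuckDB has not been called
```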

Your First Pipeline

Pipelines chain verbs — each returns a new %Dux{}. The require Dux above unlocks expression macros where bare identifiers become column names.

penguins
|> Dux.filter(species == "Gentoo" and body_mass_g > 5000)
|> Dux.select([:species, :island, :body_mass_g, :sex])
|> Dux.sort_by(desc: :body_mass_g)
|> Dux.head(5)

Materialization

Pipelines are lazy — nothing executes until you materialise:

  • compute/1 — execute and return a %Dux{} with the data
  • to_rows/1 — execute and return a list of maps
  • to_columns/1 — execute and return a map of lists

penguins
|> Dux.filter(species == "Adelie")
|> Dux.head(3)
|> Dux.compute()

penguins
|> Dux.filter(species == "Adelie")
|> Dux.select([:island, :bill_length_mm])
|> Dux.head(3)
|> Dux.to_rows()

penguins
|> Dux.filter(species == "Chinstrap")
|> Dux.select([:species, :body_mass_g])
|> Dux.head(5)
|> Dux.to_columns()

Aggregation

Group rows and compute summaries. The expression syntax lets you write sum(body_mass_g) directly:

penguins
|> Dux.drop_nil([:body_mass_g, :sex])
|> Dux.group_by([:species, :sex])
|> Dux.summarise(
  count: count(body_mass_g),
  avg_mass: avg(body_mass_g),
  max_mass: max(body_mass_g)
)
|> Dux.sort_by([:species, :sex])
|> Dux.compute()

> #### Raw SQL alternative {: .tip}
>
> Every expression verb has a `_with` variant that accepts raw DuckDB SQL strings:
>
> ```elixir
> Dux.summarise_with(df, avg_mass: "AVG(body_mass_g)")
> ```
>
> Use this for DuckDB functions the expression macro doesn’t cover.
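Applied to the penguins data, the escape hatch might look like this (a sketch: `filter_with/2` is assumed by analogy with the `_with` pattern the tip describes, and `regexp_matches` is a DuckDB function passed through verbatim):

```elixir
# Raw SQL fragments where the expression macro falls short
penguins
|> Dux.filter_with("regexp_matches(species, '^G')")
|> Dux.summarise_with(avg_mass: "AVG(body_mass_g)")
|> Dux.compute()
```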

Computed Columns

mutate/2 adds or replaces columns:

penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(bill_ratio: bill_length_mm / bill_depth_mm)
|> Dux.select([:species, :bill_length_mm, :bill_depth_mm, :bill_ratio])
|> Dux.sort_by(desc: :bill_ratio)
|> Dux.head(5)
|> Dux.compute()

Interpolation

Use ^ to interpolate Elixir values into expressions as parameter bindings (safe from SQL injection):

min_mass = 4500

penguins
|> Dux.filter(body_mass_g > ^min_mass)
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.sort_by(desc: :n)
|> Dux.compute()

Conditional Expressions

Elixir’s cond compiles to SQL CASE WHEN — classify rows directly in your pipeline:

penguins
|> Dux.drop_nil([:body_mass_g])
|> Dux.mutate(
  size: cond do
    body_mass_g > 5000 -> "large"
    body_mass_g > 3500 -> "medium"
    true -> "small"
  end
)
|> Dux.group_by(:size)
|> Dux.summarise(n: count(species))
|> Dux.compute()

For simple two-branch logic, use if/else:

penguins
|> Dux.drop_nil([:body_mass_g])
|> Dux.mutate(heavy: if(body_mass_g > 4500, do: "yes", else: "no"))
|> Dux.group_by(:heavy)
|> Dux.summarise(n: count(species))
|> Dux.compute()

Membership Tests

Use in to filter by a set of values:

penguins
|> Dux.filter(species in ["Adelie", "Chinstrap"])
|> Dux.group_by(:species)
|> Dux.summarise(n: count(species))
|> Dux.compute()

Works with interpolated lists too:

target_islands = ["Biscoe", "Dream"]

penguins
|> Dux.filter(island in ^target_islands)
|> Dux.group_by(:island)
|> Dux.summarise(n: count(species))
|> Dux.compute()

Using DuckDB Functions

All DuckDB functions work inside the query macro — they pass through as SQL function calls. Here are some useful ones:

# String functions
penguins
|> Dux.mutate(
  species_lower: lower(species),
  first_char: left(species, 1)
)
|> Dux.select([:species, :species_lower, :first_char])
|> Dux.distinct()
|> Dux.compute()

# Math and rounding
penguins
|> Dux.drop_nil([:bill_length_mm, :bill_depth_mm])
|> Dux.mutate(
  bill_ratio: round(bill_length_mm / bill_depth_mm, 2),
  log_mass: round(ln(body_mass_g), 2)
)
|> Dux.select([:species, :bill_ratio, :log_mass])
|> Dux.head(5)
|> Dux.compute()

# Null handling with coalesce
penguins
|> Dux.mutate(safe_sex: coalesce(sex, "unknown"))
|> Dux.group_by(:safe_sex)
|> Dux.summarise(n: count(species))
|> Dux.compute()

> #### Any DuckDB function works {: .info}
>
> `lower()`, `upper()`, `trim()`, `year()`, `month()`, `date_diff()`,
> `regexp_matches()`, `list_extract()`, `struct_extract()`, `coalesce()`,
> `cast()`, and hundreds more. If DuckDB has it, you can call it in a Dux
> expression.
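As a further sketch, other string functions pass through the same way as `lower/1` and `left/2` above (`upper` and `length` are standard DuckDB functions; the pass-through behaviour is as the note describes):

```elixir
penguins
|> Dux.mutate(
  island_upper: upper(island),   # DuckDB upper()
  name_len: length(species)      # DuckDB length()
)
|> Dux.select([:island_upper, :name_len])
|> Dux.distinct()
|> Dux.compute()
```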

See the SQL

Every pipeline compiles to SQL CTEs. Use sql_preview/1 to inspect what DuckDB will execute:

penguins
|> Dux.filter(species == "Gentoo")
|> Dux.mutate(mass_kg: body_mass_g / 1000.0)
|> Dux.group_by(:island)
|> Dux.summarise(avg_kg: avg(mass_kg))
|> Dux.sql_preview(pretty: true)
|> IO.puts()

Reading & Writing Files

Dux reads CSV, Parquet, and NDJSON via DuckDB — including S3 and HTTP URLs:

# Write the filtered penguins to Parquet
path = Path.join(System.tmp_dir!(), "gentoo.parquet")

penguins
|> Dux.filter(species == "Gentoo")
|> Dux.to_parquet(path)

# Read it back
Dux.from_parquet(path)
|> Dux.head(3)
|> Dux.compute()

# Cleanup
File.rm(path)
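Remote sources follow the same pattern. A hedged sketch, assuming a `from_csv/1` reader that mirrors `from_parquet/1` (the URL is a placeholder):

```elixir
# Reading a CSV over HTTP: DuckDB fetches the file itself
Dux.from_csv("https://example.com/penguins.csv")
|> Dux.head(5)
|> Dux.compute()
```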

Joins

Combine datasets by matching on shared columns:

# Create a small lookup table
habitat = Dux.from_list([
  %{island: "Biscoe", climate: "maritime"},
  %{island: "Dream", climate: "maritime"},
  %{island: "Torgersen", climate: "polar"}
])

penguins
|> Dux.join(habitat, on: :island)
|> Dux.group_by(:climate)
|> Dux.summarise(n: count(species))
|> Dux.compute()

What’s Next