Joins & Reshape
Mix.install([{:dux, "~> 0.2.0"}])
NYC Flights: A Star Schema
The nycflights13 dataset is a classic star schema — a fact table (flights) joined
to dimension tables (airlines, airports, planes). It ships with Dux’s test data.
require Dux
flights = Dux.Datasets.flights()
airlines = Dux.Datasets.airlines()
airports = Dux.Datasets.airports()
planes = Dux.Datasets.planes()
# 6,099 flights from the first week of January 2013
Dux.n_rows(flights)
Inner Join: Carrier Name Lookup
Join flights with airlines to get the full carrier name. When both tables share a column name, pass it as an atom:
flights
|> Dux.join(airlines, on: :carrier)
|> Dux.group_by(:name)
|> Dux.summarise_with(n_flights: "COUNT(*)")
|> Dux.sort_by(desc: :n_flights)
|> Dux.head(5)
|> Dux.compute()
Different Column Names
When join keys have different names, use a tuple:
flights
|> Dux.join(airports, on: [{:origin, :faa}])
|> Dux.group_by(:name)
|> Dux.summarise_with(n: "COUNT(*)")
|> Dux.sort_by(desc: :n)
|> Dux.compute()
Left Join: Keep All Flights
A left join keeps all rows from the left table, filling NULLs where there’s no match on the right:
flights
|> Dux.drop_nil([:tailnum])
|> Dux.join(planes, on: :tailnum, how: :left)
|> Dux.group_by(:manufacturer)
|> Dux.summarise_with(n: "COUNT(*)", avg_seats: "AVG(seats)")
|> Dux.sort_by(desc: :n)
|> Dux.head(5)
|> Dux.compute()
Anti Join: Missing Records
An anti join finds rows in the left that have no match on the right. Which flights have tail numbers not in the planes table?
flights
|> Dux.drop_nil([:tailnum])
|> Dux.join(planes, on: :tailnum, how: :anti)
|> Dux.select([:tailnum])
|> Dux.distinct()
|> Dux.head(5)
|> Dux.compute()
> #### All join types {: .info}
>
> Dux supports :inner (default), :left, :right, :cross, :anti, and :semi.
Star Schema: Multi-Table Join
Chain joins to build a denormalized view:
flights
|> Dux.join(airlines, on: :carrier)
|> Dux.join(airports, on: [{:dest, :faa}])
|> Dux.select([:carrier, :name, :dest])
|> Dux.head(10)
|> Dux.compute()
ASOF Join: Time Series Alignment
An ASOF join matches each left row to the nearest right row satisfying an inequality — perfect for aligning time series data:
trades = Dux.from_list([
%{symbol: "AAPL", timestamp: 1, price: 150.0},
%{symbol: "AAPL", timestamp: 3, price: 151.5},
%{symbol: "GOOG", timestamp: 2, price: 2800.0}
])
quotes = Dux.from_list([
%{symbol: "AAPL", timestamp: 0, bid: 149.5, ask: 150.5},
%{symbol: "AAPL", timestamp: 2, bid: 150.5, ask: 151.5},
%{symbol: "GOOG", timestamp: 1, bid: 2795.0, ask: 2805.0}
])
# Match each trade to the most recent quote for that symbol
Dux.asof_join(trades, quotes, on: :symbol, by: {:timestamp, :>=})
|> Dux.compute()
The by: option specifies the inequality column and operator. Each trade gets
the quote with the largest timestamp <= trade.timestamp for the same symbol.
> #### ASOF join operators {: .info}
>
> Supports >=, >, <=, <. Exactly one inequality condition per join.
> All other conditions must be equality (on: columns).
Pivot Wider: Long to Wide
Spread values across columns. Count flights per origin airport per carrier:
flights
|> Dux.group_by([:origin, :carrier])
|> Dux.summarise_with(n: "COUNT(*)")
|> Dux.pivot_wider(:carrier, :n)
|> Dux.sort_by(:origin)
|> Dux.compute()
Pivot Longer: Wide to Long
Melt columns back into rows:
wide = Dux.from_list([
%{region: "US", q1: 100, q2: 200, q3: 150, q4: 300},
%{region: "EU", q1: 80, q2: 120, q3: 90, q4: 250}
])
wide
|> Dux.pivot_longer([:q1, :q2, :q3, :q4], names_to: "quarter", values_to: "revenue")
|> Dux.sort_by([:region, :quarter])
|> Dux.compute()
Concat Rows: UNION ALL
Stack dataframes vertically:
jan1 = Dux.from_list([%{day: 1, temp: 30}])
jan2 = Dux.from_list([%{day: 2, temp: 28}])
jan3 = Dux.from_list([%{day: 3, temp: 32}])
Dux.concat_rows([jan1, jan2, jan3])
|> Dux.compute()
> #### Cross-source joins {: .tip}
>
> Dux can join across different data sources — Parquet files with Postgres
> tables, S3 data with local CSVs. See the Distributed Execution
> guide for Dux.attach/3 and Dux.from_attached/3.
What’s Next
- Getting Started — core Dux concepts
- Transformations — filter, mutate, window functions
- Distributed Execution — architecture, partitioning, distributed IO