Powered by AppSignal & Oban Pro

Joins & Reshape

guides/joins-and-reshape.livemd

Joins & Reshape

Mix.install([{:dux, "~> 0.2.0"}])

NYC Flights: A Star Schema

The nycflights13 dataset is a classic star schema — a fact table (flights) joined to dimension tables (airlines, airports, planes). It ships with Dux’s test data.

require Dux

flights = Dux.Datasets.flights()
airlines = Dux.Datasets.airlines()
airports = Dux.Datasets.airports()
planes = Dux.Datasets.planes()
# 6,099 flights from the first week of January 2013
Dux.n_rows(flights)

Inner Join: Carrier Name Lookup

Join flights with airlines to get the full carrier name. When both tables share a column name, pass it as an atom:

flights
|> Dux.join(airlines, on: :carrier)
|> Dux.group_by(:name)
|> Dux.summarise_with(n_flights: "COUNT(*)")
|> Dux.sort_by(desc: :n_flights)
|> Dux.head(5)
|> Dux.compute()

Different Column Names

When join keys have different names, use a tuple:

flights
|> Dux.join(airports, on: [{:origin, :faa}])
|> Dux.group_by(:name)
|> Dux.summarise_with(n: "COUNT(*)")
|> Dux.sort_by(desc: :n)
|> Dux.compute()

Left Join: Keep All Flights

A left join keeps all rows from the left table, filling NULLs where there’s no match on the right:

flights
|> Dux.drop_nil([:tailnum])
|> Dux.join(planes, on: :tailnum, how: :left)
|> Dux.group_by(:manufacturer)
|> Dux.summarise_with(n: "COUNT(*)", avg_seats: "AVG(seats)")
|> Dux.sort_by(desc: :n)
|> Dux.head(5)
|> Dux.compute()

Anti Join: Missing Records

An anti join finds rows in the left that have no match on the right. Which flights have tail numbers not in the planes table?

flights
|> Dux.drop_nil([:tailnum])
|> Dux.join(planes, on: :tailnum, how: :anti)
|> Dux.select([:tailnum])
|> Dux.distinct()
|> Dux.head(5)
|> Dux.compute()

> #### All join types {: .info} > > Dux supports :inner (default), :left, :right, :cross, :anti, and :semi.

Star Schema: Multi-Table Join

Chain joins to build a denormalized view:

flights
|> Dux.join(airlines, on: :carrier)
|> Dux.join(airports, on: [{:dest, :faa}])
|> Dux.select([:carrier, :name, :dest])
|> Dux.head(10)
|> Dux.compute()

ASOF Join: Time Series Alignment

An ASOF join matches each left row to the nearest right row satisfying an inequality — perfect for aligning time series data:

trades = Dux.from_list([
  %{symbol: "AAPL", timestamp: 1, price: 150.0},
  %{symbol: "AAPL", timestamp: 3, price: 151.5},
  %{symbol: "GOOG", timestamp: 2, price: 2800.0}
])

quotes = Dux.from_list([
  %{symbol: "AAPL", timestamp: 0, bid: 149.5, ask: 150.5},
  %{symbol: "AAPL", timestamp: 2, bid: 150.5, ask: 151.5},
  %{symbol: "GOOG", timestamp: 1, bid: 2795.0, ask: 2805.0}
])

# Match each trade to the most recent quote for that symbol
Dux.asof_join(trades, quotes, on: :symbol, by: {:timestamp, :>=})
|> Dux.compute()

The by: option specifies the inequality column and operator. Each trade gets the quote with the largest timestamp <= trade.timestamp for the same symbol.

> #### ASOF join operators {: .info} > > Supports >=, >, <=, <. Exactly one inequality condition per join. > All other conditions must be equality (on: columns).

Pivot Wider: Long to Wide

Spread values across columns. Count flights per origin airport per carrier:

flights
|> Dux.group_by([:origin, :carrier])
|> Dux.summarise_with(n: "COUNT(*)")
|> Dux.pivot_wider(:carrier, :n)
|> Dux.sort_by(:origin)
|> Dux.compute()

Pivot Longer: Wide to Long

Melt columns back into rows:

wide = Dux.from_list([
  %{region: "US", q1: 100, q2: 200, q3: 150, q4: 300},
  %{region: "EU", q1: 80, q2: 120, q3: 90, q4: 250}
])

wide
|> Dux.pivot_longer([:q1, :q2, :q3, :q4], names_to: "quarter", values_to: "revenue")
|> Dux.sort_by([:region, :quarter])
|> Dux.compute()

Concat Rows: UNION ALL

Stack dataframes vertically:

jan1 = Dux.from_list([%{day: 1, temp: 30}])
jan2 = Dux.from_list([%{day: 2, temp: 28}])
jan3 = Dux.from_list([%{day: 3, temp: 32}])

Dux.concat_rows([jan1, jan2, jan3])
|> Dux.compute()

> #### Cross-source joins {: .tip} > > Dux can join across different data sources — Parquet files with Postgres > tables, S3 data with local CSVs. See the Distributed Execution > guide for Dux.attach/3 and Dux.from_attached/3.

What’s Next