Beyond Session 1: Memory, Binding, and the Million Records Problem
Mix.install([
{:kino, "~> 0.12"},
{:kino_vega_lite, "~> 0.1"}
])
Introduction
In Session 1, you learned that Elixir data is immutable - when you “change” a variable, you’re actually creating a new value. But what does that really mean? Where does the old data go? And why should you care?
This interactive notebook will make these concepts concrete by showing you exactly what happens in memory.
Part 1: Variable Binding is Not Assignment
In languages like Java, = on a primitive variable is assignment - it puts a value into the variable's memory location:
// Java: x is a box, we're putting 5 in the box
int x = 5;
x = 10; // Same box, new value - the 5 is gone forever
In Elixir, = is binding (technically pattern matching) - it attaches a name to a value:
# Elixir: we're giving the value 5 the name "x"
x = 5
# Now "x" points to 10, but 5 still exists (until garbage collected)
x = 10
Assignment vs Binding: Side by Side
graph TB
subgraph java["☕ Java: Assignment (mutation)"]
direction TB
j1["int x = 5"] --> jbox1["📦 5"]
jbox1 --> |"x = 10"| jbox2["📦 10"]
jnote["Same box,
value replaced.
5 is gone forever."]
end
subgraph elixir["💧 Elixir: Binding (immutable)"]
direction TB
e1["x = 5"] --> eptr1["x → 5"]
eptr1 --> |"x = 10"| estate["x → 10
5 (orphaned, awaits GC)"]
enote["x points to new value.
Old value still exists
until garbage collected."]
end
Let’s visualize the rebinding process step by step:
How Binding Works (Visual)
graph LR
subgraph "Step 1: x = [1, 2, 3]"
x1[x] --> list1["[1, 2, 3]"]
end
graph LR
subgraph "Step 2: x = [4, 5, 6] (rebinding)"
x2[x] --> list2["[4, 5, 6]"]
orphan1["[1, 2, 3]
⚠️ orphaned"]
style orphan1 fill:#ffcccc,stroke:#cc0000
end
graph LR
subgraph "Step 3: After Garbage Collection"
x3[x] --> list3["[4, 5, 6]"]
gone["🗑️ [1, 2, 3] removed"]
style gone fill:#cccccc,stroke:#999999,stroke-dasharray: 5 5
end
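The key difference from OOP:
- In Java/Python: rebinding x to [4,5,6] also leaves [1,2,3] for the garbage collector, but lists there are mutable, so any code holding a reference can change [1,2,3] in place and every other holder sees the change
- In Elixir: x = [4,5,6] makes x point to a new list, while [1,2,3] can never be modified and still exists until GC cleans it up
This is why immutability enables safe concurrency - any code that still holds [1,2,3] (a closure, or another process working on its own copy) keeps seeing exactly that value while we’ve moved on to [4,5,6]!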
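A minimal sketch of that claim, using a closure as a stand-in for "other code that still holds the old value". The names x and read_old are only for illustration:

```elixir
x = [1, 2, 3]

# The anonymous function captures the value x is bound to right now.
read_old = fn -> x end

# Rebinding x creates a new binding; it does not touch the captured list.
x = [4, 5, 6]

IO.inspect(read_old.(), label: "captured value")
IO.inspect(x, label: "x after rebinding")
```

The closure keeps returning the original list because rebinding only changes which value the name x refers to from that point on.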
Let’s prove this with actual memory measurements.
Measuring Process Memory
Every Elixir process has its own memory heap. We can inspect it:
defmodule MemoryHelper do
@doc """
Returns the current process memory in kilobytes, with optional label for display.
"""
def measure(label \\ "Current") do
# Uncomment the next line to force garbage collection first and measure
# only memory that is still in use. Left commented out, this function
# reports memory BEFORE garbage collection.
# :erlang.garbage_collect()
{:memory, bytes} = Process.info(self(), :memory)
kb = Float.round(bytes / 1024, 2)
IO.puts("#{label}: #{kb} KB")
kb
end
@doc """
Returns memory without printing - useful for collecting data points.
"""
def memory_kb do
{:memory, bytes} = Process.info(self(), :memory)
Float.round(bytes / 1024, 2)
end
end
The Rebinding Experiment
Watch what happens when we rebind a variable to increasingly large lists:
# Baseline memory
MemoryHelper.measure("Baseline")
# Create a list with 100,000 integers
x = Enum.to_list(1..100_000)
MemoryHelper.measure("After creating x (100K integers)")
# "Rebind" x to a NEW list - the old list still exists!
x = Enum.to_list(1..100_000)
MemoryHelper.measure("After rebinding x (now 2 lists exist!)")
# Rebind again - now 3 lists exist in memory
x = Enum.to_list(1..100_000)
MemoryHelper.measure("After rebinding again (3 lists!)")
# The variable x only points to the LAST list
# But all three lists are still in memory, waiting for garbage collection
:ok
Key Insight: Each rebinding created a NEW list. The old lists weren’t deleted - they’re just “orphaned” (no variable points to them anymore).
Part 2: Garbage Collection - The Cleanup Crew
The BEAM VM has a per-process garbage collector. This is one of Elixir’s superpowers:
- Each process has its own heap and GC
- When one process collects garbage, other processes aren’t affected
- No “stop the world” pauses that affect your entire application
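The isolation in the list above can be observed directly. The sketch below uses only the standard library: it allocates a large list inside a Task (a separate process) and reports that process's memory alongside the caller's. Exact numbers will vary between machines and VM versions.

```elixir
{:memory, parent_before} = Process.info(self(), :memory)

task =
  Task.async(fn ->
    # This list lives on the task process's own heap
    list = Enum.to_list(1..200_000)
    {:memory, child_bytes} = Process.info(self(), :memory)
    {length(list), child_bytes}
  end)

# Only the small result tuple is copied back to the caller
{len, child_bytes} = Task.await(task)
{:memory, parent_after} = Process.info(self(), :memory)

IO.puts("Task built #{len} items and used #{div(child_bytes, 1024)} KB")
IO.puts("Caller: #{div(parent_before, 1024)} KB before, #{div(parent_after, 1024)} KB after")
```

When the task process exits, its entire heap is released at once, so the caller never has to collect that list.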
graph TB
subgraph "Stop-the-world GC (shared heap)"
direction TB
t1["All threads STOP"] --> t2["GC runs globally"] --> t3["All threads resume"]
tnote["❌ 'Stop the world' pauses
affect entire application"]
end
subgraph "BEAM GC (Elixir/Erlang)"
direction TB
b1["Process A"] --> b1gc["GC A"]
b2["Process B"] --> b2run["Still running!"]
b3["Process C"] --> b3run["Still running!"]
bnote["✅ Only affected process pauses
Others continue working"]
end
Demonstrating Garbage Collection
Important: GC behavior can be tricky to demonstrate because:
- The BEAM may trigger GC automatically during allocation
- The heap doesn’t always shrink immediately after GC (it keeps space for efficiency)
This demo shows the clearest case: create one large value, orphan it, then collect.
# Start clean
:erlang.garbage_collect()
baseline = MemoryHelper.memory_kb()
IO.puts("Step 1 - Baseline: #{baseline} KB")
# Create a large list (500K integers ≈ 8 MB on a 64-bit VM)
big_list = Enum.to_list(1..500_000)
after_create = MemoryHelper.memory_kb()
increase = Float.round(after_create - baseline, 2)
IO.puts("Step 2 - After creating 500K list: #{after_create} KB (+#{increase} KB)")
# Orphan the list by rebinding to something tiny
big_list = :done
IO.puts("Step 3 - Rebound big_list to :done")
IO.puts(" The 500K integer list is now orphaned (no variable references it)")
# Memory before GC - the orphan is still there!
before_gc = MemoryHelper.memory_kb()
IO.puts("Step 4 - Before GC: #{before_gc} KB (orphan still in memory)")
# Now trigger garbage collection
:erlang.garbage_collect()
after_gc = MemoryHelper.memory_kb()
reclaimed = Float.round(before_gc - after_gc, 2)
IO.puts("Step 5 - After GC: #{after_gc} KB")
IO.puts(" Reclaimed: #{reclaimed} KB")
if reclaimed > 1000 do
IO.puts("\n✅ GC successfully cleaned up the orphaned list!")
else
IO.puts("\n⚠️ GC ran but heap didn't shrink much.")
IO.puts(" This can happen if the BEAM keeps heap space allocated for efficiency.")
IO.puts(" The important thing: the DATA is gone, even if heap size stays.")
end
:ok
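If the reclaimed figure looks small, it can help to look at heap size and GC settings separately from the single :memory number. The following sketch uses Process.info/2 with the :total_heap_size and :garbage_collection items; the helper name heap_kb is made up for this example, and the values printed depend on your VM.

```elixir
word_size = :erlang.system_info(:wordsize)

# Hypothetical helper: total heap size of the current process in KB
heap_kb = fn ->
  {:total_heap_size, words} = Process.info(self(), :total_heap_size)
  Float.round(words * word_size / 1024, 2)
end

IO.puts("Heap before: #{heap_kb.()} KB")
_list = Enum.to_list(1..200_000)
IO.puts("Heap after allocation: #{heap_kb.()} KB")

:erlang.garbage_collect()
IO.puts("Heap after GC: #{heap_kb.()} KB")

# GC settings and counters for this process (a keyword list)
{:garbage_collection, gc_info} = Process.info(self(), :garbage_collection)
IO.inspect(gc_info[:minor_gcs], label: "minor_gcs")
IO.inspect(gc_info[:fullsweep_after], label: "fullsweep_after")
```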
Visualizing the GC Effect
Let’s create a visual representation of memory over time:
alias VegaLite, as: Vl
# Collect memory data points
data_points =
Enum.reduce(1..10, [{0, MemoryHelper.memory_kb(), "baseline"}], fn i, acc ->
# Create garbage
_garbage = Enum.to_list(1..50_000)
before_gc = MemoryHelper.memory_kb()
# Collect garbage
:erlang.garbage_collect()
after_gc = MemoryHelper.memory_kb()
acc ++
[
{i * 2 - 1, before_gc, "before GC"},
{i * 2, after_gc, "after GC"}
]
end)
# Format for chart
chart_data =
Enum.map(data_points, fn {step, memory, label} ->
%{step: step, memory: memory, phase: label}
end)
Vl.new(width: 600, height: 300, title: "Memory Usage: Before vs After Garbage Collection")
|> Vl.data_from_values(chart_data)
|> Vl.mark(:line, point: true)
|> Vl.encode_field(:x, "step", type: :quantitative, title: "Step")
|> Vl.encode_field(:y, "memory", type: :quantitative, title: "Memory (KB)")
|> Vl.encode_field(:color, "phase", type: :nominal)
Notice the sawtooth pattern: memory grows as we create data, then drops when GC runs.
Part 3: The Million Records Problem
Now let’s see why this matters in the real world. Imagine you’re building an API that needs to return user data. A naive implementation might load everything into memory:
defmodule FakeDatabase do
@doc """
Simulates loading records from a database.
Each "record" is a map with user data (~500 bytes each).
"""
def load_all_users(count) do
Enum.map(1..count, fn id ->
%{
id: id,
name: "User #{id}",
email: "user#{id}@example.com",
department: Enum.random(["Engineering", "Sales", "Marketing", "Finance"]),
metadata: %{
created_at: DateTime.utc_now(),
preferences: %{theme: "dark", notifications: true},
tags: ["active", "verified", "premium"]
}
}
end)
end
end
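To sanity-check the per-record size estimate, one option is :erts_debug.flat_size/1, which reports a term's size in machine words. Treat this as a rough sketch: :erts_debug is a debugging module rather than a stable public API, the sample record is a hand-written copy of the shape above, and the result will differ between VM versions.

```elixir
record = %{
  id: 1,
  name: "User 1",
  email: "user1@example.com",
  department: "Engineering",
  metadata: %{
    created_at: DateTime.utc_now(),
    preferences: %{theme: "dark", notifications: true},
    tags: ["active", "verified", "premium"]
  }
}

# Size of the term in machine words, converted to bytes
words = :erts_debug.flat_size(record)
bytes = words * :erlang.system_info(:wordsize)

IO.puts("One record: #{words} words (about #{bytes} bytes)")

# Scale up to a million records (ignores the list cells that hold them)
total_mb = Float.round(bytes * 1_000_000 / 1_048_576, 1)
IO.puts("1,000,000 records: about #{total_mb} MB")
```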
Loading 10,000 Records
MemoryHelper.measure("Before loading")
users = FakeDatabase.load_all_users(10_000)
MemoryHelper.measure("After loading 10K users")
# How big is this data?
IO.puts("Number of users loaded: #{length(users)}")
:ok
Loading 100,000 Records
:erlang.garbage_collect()
MemoryHelper.measure("Baseline (after GC)")
users = FakeDatabase.load_all_users(100_000)
MemoryHelper.measure("After loading 100K users")
IO.puts("That's a lot of memory for just user records!")
:ok
The Danger: Loading 1,000,000 Records
Warning: This cell will use significant memory. Run it to see what happens when you naively load a large dataset:
:erlang.garbage_collect()
baseline = MemoryHelper.memory_kb()
IO.puts("Baseline: #{baseline} KB")
# This simulates: Repo.all(User) on a table with 1M rows
IO.puts("\nLoading 1,000,000 records... (this may take a moment)")
users = FakeDatabase.load_all_users(1_000_000)
final = MemoryHelper.memory_kb()
IO.puts("After loading: #{final} KB")
IO.puts("Memory increase: #{Float.round(final - baseline, 2)} KB")
IO.puts("That's approximately #{Float.round((final - baseline) / 1024, 2)} MB!")
# This single process is now holding all that data
IO.puts("\nImagine 100 users hitting this endpoint simultaneously...")
:ok
Visualizing the Problem
alias VegaLite, as: Vl
# Measure memory at different record counts
record_counts = [1_000, 5_000, 10_000, 25_000, 50_000, 100_000, 250_000, 500_000]
memory_data =
Enum.map(record_counts, fn count ->
:erlang.garbage_collect()
_users = FakeDatabase.load_all_users(count)
memory = MemoryHelper.memory_kb()
# Clean up for next iteration
:erlang.garbage_collect()
%{records: count, memory_mb: Float.round(memory / 1024, 2)}
end)
Vl.new(width: 600, height: 300, title: "Memory Usage vs Record Count")
|> Vl.data_from_values(memory_data)
|> Vl.mark(:bar)
|> Vl.encode_field(:x, "records", type: :ordinal, title: "Number of Records")
|> Vl.encode_field(:y, "memory_mb", type: :quantitative, title: "Memory (MB)")
|> Vl.encode(:color, value: "#e74c3c")
Part 4: Solutions Preview
So what do we do about this? Here’s a preview of techniques you’ll learn in later sessions:
Solution 1: Streaming with Stream (Session 3)
Instead of loading everything into memory, process records one at a time.
The key difference:
- Enum.map is eager: it returns the complete result list, so every transformed item is in memory at once
- Stream.map is lazy: it only describes the transformation, and items are produced one at a time when something consumes the stream
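A small sketch of that laziness, with IO.puts calls added so the order of work is visible:

```elixir
stream =
  Stream.map(1..3, fn i ->
    IO.puts("processing #{i}")
    i * 2
  end)

IO.puts("Stream built, nothing processed yet")

# Work happens only when something consumes the stream
result = Enum.take(stream, 2)
IO.inspect(result, label: "result")
```

Because Enum.take/2 stops after two items, the third item is never processed at all.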
graph LR
subgraph enum["Enum.map (eager) - High Memory"]
direction LR
e1["1..500K"] --> e2["Build FULL list
[item1, item2, ..., item500K]"] --> e3["Then process
each item"]
style e2 fill:#ffcccc,stroke:#cc0000
end
graph LR
subgraph stream["Stream.map (lazy) - Low Memory"]
direction LR
s1["1..500K"] --> s2["Get item 1"] --> s3["Process it"] --> s4["Get item 2"] --> s5["Process it"] --> s6["..."]
style s2 fill:#ccffcc,stroke:#00cc00
style s4 fill:#ccffcc,stroke:#00cc00
end
# To see the difference, we need to HOLD the intermediate result
# not just process and discard it
defmodule StreamDemo do
@doc "Builds entire list in memory, then returns it"
def build_with_enum(count) do
1..count
|> Enum.map(fn i -> %{id: i, data: String.duplicate("x", 100)} end)
end
@doc "Returns a Stream - nothing computed yet!"
def build_with_stream(count) do
1..count
|> Stream.map(fn i -> %{id: i, data: String.duplicate("x", 100)} end)
end
@doc "Processes items one at a time, returns only the count"
def process_with_stream(count) do
1..count
|> Stream.map(fn i -> %{id: i, data: String.duplicate("x", 100)} end)
|> Enum.reduce(0, fn _item, acc -> acc + 1 end)
end
end
# DEMO 1: Enum builds everything in memory
:erlang.garbage_collect()
baseline = MemoryHelper.memory_kb()
IO.puts("=== Enum.map (eager) ===")
IO.puts("Baseline: #{baseline} KB")
# This creates 500K maps in memory AT ONCE
result = StreamDemo.build_with_enum(500_000)
after_enum = MemoryHelper.memory_kb()
IO.puts("After Enum.map: #{after_enum} KB")
IO.puts(
"Memory used: #{Float.round(after_enum - baseline, 2)} KB (#{Float.round((after_enum - baseline) / 1024, 2)} MB)"
)
IO.puts("Holding #{length(result)} items in memory\n")
# Clean up
result = nil
:erlang.garbage_collect()
# DEMO 2: Stream processes one at a time
:erlang.garbage_collect()
baseline = MemoryHelper.memory_kb()
IO.puts("=== Stream (lazy) ===")
IO.puts("Baseline: #{baseline} KB")
# This processes 500K items but only holds ONE at a time
count = StreamDemo.process_with_stream(500_000)
after_stream = MemoryHelper.memory_kb()
IO.puts("After Stream processing: #{after_stream} KB")
IO.puts("Memory used: #{Float.round(after_stream - baseline, 2)} KB")
IO.puts("Processed #{count} items without holding them all in memory!")
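The difference should be dramatic: the Enum version holds all 500K maps at once and typically costs tens of megabytes, while the Stream version's memory barely moves. Exact numbers vary by machine and VM version, and Process.info(self(), :memory) does not count the off-heap data of binaries larger than 64 bytes, so the real footprint of the Enum version is higher than reported.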
Solution 2: Pagination
Instead of loading 1M records, load them in pages:
defmodule PaginationDemo do
def load_page(_page, per_page \\ 100) do
# In real code: Repo.all(from u in User, limit: ^per_page, offset: ^((page - 1) * per_page))
FakeDatabase.load_all_users(per_page)
end
end
# Instead of loading 1M records (~500 MB at ~500 bytes per record),
# we load 100 records at a time (~50 KB)
:erlang.garbage_collect()
MemoryHelper.measure("Before pagination")
page1 = PaginationDemo.load_page(1, 100)
MemoryHelper.measure("After loading page 1 (100 records)")
IO.puts("Got #{length(page1)} records with minimal memory!")
:ok
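Pagination also works for processing a whole table without ever holding it all. The sketch below walks every page and keeps only a running count. PagedCounter and its fetch_page/2 are hypothetical stand-ins for a real paginated query, and the 1,050-row "table" is made up for the example.

```elixir
defmodule PagedCounter do
  # Pretend the table has this many rows
  @total 1_050

  # Hypothetical stand-in for a LIMIT/OFFSET query
  def fetch_page(page, per_page) do
    first = (page - 1) * per_page + 1
    last = min(page * per_page, @total)
    if first > @total, do: [], else: Enum.map(first..last, &%{id: &1})
  end

  # Walks pages lazily; only one page is held in memory at a time
  def count_all(per_page \\ 100) do
    Stream.iterate(1, &(&1 + 1))
    |> Stream.map(&fetch_page(&1, per_page))
    |> Stream.take_while(&(&1 != []))
    |> Enum.reduce(0, fn page, acc -> acc + length(page) end)
  end
end

IO.puts("Counted #{PagedCounter.count_all()} records, 100 at a time")
```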
Solution 3: DataLoader for GraphQL (Sessions 10 & 12)
When building GraphQL APIs, you often face the N+1 problem:
query {
users(first: 100) { # 1 query
name
posts { # 100 queries (one per user!)
title
}
}
}
DataLoader solves this by:
- Collecting all the IDs that need to be fetched
- Making a single batched query
- Distributing results to the right places
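The three steps above can be sketched in plain Elixir with in-memory data. This is not the Dataloader library's API: BatchSketch, posts_for_user_ids/1, and the sample posts are all invented for illustration, with a list filter standing in for the single batched query.

```elixir
defmodule BatchSketch do
  @posts [
    %{id: 1, user_id: 1, title: "Hello"},
    %{id: 2, user_id: 1, title: "Again"},
    %{id: 3, user_id: 2, title: "First post"}
  ]

  # Hypothetical stand-in for ONE query: WHERE user_id IN (...)
  def posts_for_user_ids(user_ids) do
    Enum.filter(@posts, &(&1.user_id in user_ids))
  end

  def load_posts(users) do
    # 1. Collect all the IDs that need to be fetched
    user_ids = Enum.map(users, & &1.id)

    # 2. Make a single batched "query", then group the rows by owner
    posts_by_user =
      user_ids
      |> posts_for_user_ids()
      |> Enum.group_by(& &1.user_id)

    # 3. Distribute results to the right users
    Enum.map(users, fn user ->
      Map.put(user, :posts, Map.get(posts_by_user, user.id, []))
    end)
  end
end

users = [%{id: 1, name: "Ann"}, %{id: 2, name: "Bob"}, %{id: 3, name: "Cy"}]
loaded = BatchSketch.load_posts(users)
IO.inspect(Enum.map(loaded, &{&1.name, length(&1.posts)}), label: "posts per user")
```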
# Preview of DataLoader pattern (you'll implement this in Session 12):
# Instead of:
# users |> Enum.map(fn user -> Repo.all(from p in Post, where: p.user_id == ^user.id) end)
# (100 separate queries!)
# DataLoader does:
# 1. Collects all user_ids: [1, 2, 3, ..., 100]
# 2. Single query: Repo.all(from p in Post, where: p.user_id in ^user_ids)
# 3. Groups results by user_id and returns them
IO.puts("""
DataLoader Benefits:
- Reduces N+1 queries to 2 queries
- Automatic batching and caching
- Integrates seamlessly with Absinthe GraphQL
- You'll build this in Session 12!
""")
:ok
Key Takeaways
IO.puts("""
╔══════════════════════════════════════════════════════════════════╗
║ KEY TAKEAWAYS ║
╠══════════════════════════════════════════════════════════════════╣
║ ║
║ 1. BINDING VS ASSIGNMENT ║
║ • x = 5 binds the name "x" to value 5 ║
║ • Rebinding creates NEW values, doesn't modify old ones ║
║ ║
║ 2. GARBAGE COLLECTION ║
║ • Each process has its own GC (no global pauses!) ║
║ • Orphaned data is cleaned up automatically ║
║ • You can manually trigger with :erlang.garbage_collect() ║
║ ║
║ 3. MEMORY AWARENESS ║
║ • Loading large datasets into memory is dangerous ║
║ • Use Stream for lazy processing ║
║ • Use pagination for APIs ║
║ • Use DataLoader for GraphQL batching ║
║ ║
║ 4. THE BEAM ADVANTAGE ║
║ • Per-process heaps = isolated failures ║
║ • One slow GC doesn't affect other processes ║
║ • This is why Elixir handles millions of connections ║
║ ║
╚══════════════════════════════════════════════════════════════════╝
""")
What’s Next?
In Session 2: Pattern Matching, you’ll discover that = is even more powerful than simple binding - it’s a pattern matching operator that can destructure complex data structures!
# Sneak peek:
{status, %{name: name, age: age}} = {:ok, %{name: "Alice", age: 30, role: "admin"}}
IO.puts("Status: #{status}, Name: #{name}, Age: #{age}")
# This will fail (pattern doesn't match):
# {:ok, _} = {:error, "something went wrong"}
This “Beyond” section is optional supplementary material. The core exercises in the main README are sufficient for completing the session.