Session 12: What is OTP?
Mix.install([])
Introduction
Welcome to Phase 3! In Phase 2, you built a working agent framework with:
- ProcessAgent - Agents as real processes with message loops
- AgentMonitor - Manual supervision and restart logic
- AgentRegistry - Name-based process lookup
Here’s the exciting revelation: You’ve already been doing OTP!
OTP (Open Telecom Platform) isn’t something new to learn - it’s the formalization of the exact patterns you implemented manually. The Erlang/Elixir community spent 30+ years refining these patterns, and OTP packages them into battle-tested, production-ready behaviours.
Sources for This Session
This session synthesizes concepts from:
- Elixir School - OTP Concurrency
- Learn You Some Erlang - What is OTP?
- Learn You Some Erlang - Clients and Servers
Learning Goals
By the end of this session, you’ll understand:
- What OTP is and why it exists
- How your Phase 2 code maps to OTP behaviours
- The OTP behaviour system (contracts/callbacks)
- The structure of supervision trees
Section 1: The OTP Philosophy
🤔 Reflection: Before We Begin
Before reading further, think about your Phase 2 AgentMonitor:
# Take a moment to answer these questions:
reflection_questions = [
  "How many lines of code did AgentMonitor require?",
  "What percentage of that code is 'boilerplate' vs 'your specific restart logic'?",
  "If you wanted to add a new restart strategy, how many places would you need to change?",
  "How confident are you that your monitor handles all edge cases correctly?"
]
# Jot down your thoughts:
your_reflections = """
# Lines of code in AgentMonitor: ???
# Boilerplate percentage: ???
# Places to change for new strategy: ???
# Edge case confidence (1-10): ???
"""
From Telecoms to Modern Web
OTP was born at Ericsson in the 1980s-90s for building telephone switches that needed to run 24/7 with minimal downtime. The name “Open Telecom Platform” is now somewhat misleading - as the Learn You Some Erlang book notes, it’s “not that much about telecom anymore” but rather applies principles developed for telecom-grade reliability to general software engineering.
The original requirements were:
- High availability - Systems must stay up
- Fault tolerance - Failures must be isolated and recovered
- Hot code upgrades - Update code without stopping the system
- Concurrent connections - Handle millions of simultaneous calls
These requirements led to a philosophy:
> “Let it crash” + Supervision = Reliability
The Core Insight: Generic vs. Specific
Here’s the central insight from Learn You Some Erlang that makes OTP so powerful:
> Every concurrent process follows predictable patterns: spawning, initialization, looping, and termination. By extracting these generic components into reusable libraries, developers focus exclusively on application-specific logic.
Think about it: In your ProcessAgent.loop/1, how much code is:
- Generic (receive loop, state threading, stopping) vs.
- Specific (handling :search, :remember, etc.)?
OTP separates these concerns completely.
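To make the split concrete, here is a minimal hand-rolled sketch of the idea, not how GenServer is actually implemented. GenericLoop and MemoryHandler are hypothetical names invented for this illustration: the first holds everything generic (spawn, loop, reply, stop), the second holds only the logic specific to one process.

```elixir
# Hypothetical modules for illustration only; they are not part of OTP.
defmodule GenericLoop do
  # Generic part: spawn, initialize, loop, reply, stop.
  # Written once and shared by every process built on it.
  def start(handler, arg) do
    spawn(fn -> loop(handler, handler.init(arg)) end)
  end

  def call(pid, request, timeout \\ 5_000) do
    ref = make_ref()
    send(pid, {:call, self(), ref, request})

    receive do
      {^ref, reply} -> reply
    after
      timeout -> exit(:timeout)
    end
  end

  def stop(pid), do: send(pid, :stop)

  defp loop(handler, state) do
    receive do
      {:call, from, ref, request} ->
        {reply, new_state} = handler.handle(request, state)
        send(from, {ref, reply})
        loop(handler, new_state)

      :stop ->
        :ok
    end
  end
end

defmodule MemoryHandler do
  # Specific part: only the logic unique to this process.
  def init(_arg), do: %{}

  def handle({:remember, key, value}, memory), do: {:ok, Map.put(memory, key, value)}
  def handle({:recall, key}, memory), do: {Map.get(memory, key), memory}
end

pid = GenericLoop.start(MemoryHandler, nil)
GenericLoop.call(pid, {:remember, :topic, "OTP"})
GenericLoop.call(pid, {:recall, :topic})
GenericLoop.stop(pid)
```

Adding a second kind of process means writing another small handler module; the loop, reply matching and timeout code are reused unchanged.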
🤔 Socratic Question
# Consider this scenario:
# You have 10 different GenServer-based services in your application.
# A bug is discovered in how timeouts are handled in receive loops.
# With manual loops (Phase 2 approach):
# - How many files would you need to fix?
# - How would you ensure consistency across all 10?
# With OTP GenServer:
# - How many files would you need to fix?
# - Who maintains that code?
your_answer = """
Manual approach: ???
OTP approach: ???
"""
The answer reveals why OTP matters: When someone optimizes or fixes a bug in the single OTP backend, every process using it benefits automatically.
Section 2: What You Already Know
Let’s map your Phase 2 code to OTP concepts:
| Your Phase 2 Code | OTP Equivalent | What It Does |
|---|---|---|
| ProcessAgent.loop/1 | GenServer | Process with state + message handling |
| send(pid, msg); receive do... | GenServer.call/cast | Synchronous/async messaging |
| AgentMonitor.loop/1 | Supervisor | Monitors children, restarts on crash |
| Process.monitor + restart logic | Supervisor child specs | Restart policies |
| AgentRegistry | Registry | Already OTP! |
Your ProcessAgent Loop vs GenServer
Here’s your Phase 2 code:
# Phase 2: Manual loop
defp loop(state) do
  receive do
    {:get_state, from} ->
      send(from, {:state, state})
      loop(state)

    {:remember, key, value} ->
      new_memory = Map.put(state.memory, key, value)
      loop(%{state | memory: new_memory})

    :stop ->
      :ok
  end
end
Here’s the OTP equivalent:
# Phase 3: GenServer
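def handle_call(:get_state, _from, state) do
  {:reply, state, state}
end

def handle_cast({:remember, key, value}, state) do
  {:noreply, %{state | memory: Map.put(state.memory, key, value)}}
end

# Stopping is requested by returning a :stop tuple from a handler...
def handle_cast(:stop, state) do
  {:stop, :normal, state}
end

# ...after which GenServer invokes terminate/2 for cleanup.
def terminate(_reason, _state) do
  :ok
end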
🤔 Spot the Differences
# Look carefully at both versions above. What's MISSING from the GenServer version?
# (Think about what you had to write manually that GenServer handles for you)
missing_from_genserver = [
# 1. ???
# 2. ???
# 3. ???
]
# Reveal your answers then check below:
# 1. The recursive `loop(state)` call - GenServer handles the loop automatically
# 2. The `receive do` block - GenServer dispatches to your callbacks
# 3. Managing the `from` reference for replies - GenServer tracks this for you
GenServer handles:
- The receive loop for you
- Process registration
- Timeout handling
- Debugging/tracing support
- Hot code upgrades
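A few of these can be seen in a small sketch. NamedCounter is a hypothetical module made up for this example; the name: option, the timeout argument to GenServer.call/3 and :sys.get_state/1 are standard features that apply to any GenServer.

```elixir
# Hypothetical module for illustration only.
defmodule NamedCounter do
  use GenServer

  # `name:` registers the process, so callers do not need to hold its pid.
  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial, name: __MODULE__)
  end

  def increment, do: GenServer.cast(__MODULE__, :increment)

  # The third argument to call/3 is a timeout in milliseconds.
  def value, do: GenServer.call(__MODULE__, :value, 1_000)

  @impl true
  def init(initial), do: {:ok, initial}

  @impl true
  def handle_cast(:increment, count), do: {:noreply, count + 1}

  @impl true
  def handle_call(:value, _from, count), do: {:reply, count, count}
end

{:ok, _pid} = NamedCounter.start_link(0)
NamedCounter.increment()
NamedCounter.value()

# Debugging hook that comes with any OTP-compliant process:
:sys.get_state(NamedCounter)
```

None of the registration, timeout or introspection code lives in NamedCounter itself; the module only describes how its state changes.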
Your AgentMonitor vs Supervisor
Your Phase 2 monitor (~200 lines):
defp loop(state) do
  receive do
    {:DOWN, ref, :process, pid, reason} ->
      # Excerpt: looking up `name` and `opts` for this ref, and building
      # `new_info` from the new pid and ref, is omitted here.
      case should_restart?(state.restart_policy, reason) do
        true ->
          {:ok, new_pid, new_ref} = do_start_agent(name, opts)
          loop(%{state | agents: Map.put(state.agents, name, new_info)})

        false ->
          loop(%{state | agents: Map.delete(state.agents, name)})
      end
  end
end
OTP Supervisor (~10 lines):
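def init(_opts) do
  children = [
    # Children started from the same module need distinct ids.
    Supervisor.child_spec({AgentServer, "Worker-1"}, id: :worker_1),
    Supervisor.child_spec({AgentServer, "Worker-2"}, id: :worker_2)
  ]

  Supervisor.init(children, strategy: :one_for_one)
end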
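To see a restart happen without any restart code of your own, here is a runnable sketch. DemoWorker and DemoSupervisor are hypothetical stand-ins for the AgentServer you will build later, and the 100 ms sleep is a crude way to wait for the asynchronous restart, good enough for a demo but not for production code.

```elixir
# Hypothetical modules for illustration only.
defmodule DemoWorker do
  use GenServer

  def start_link(label), do: GenServer.start_link(__MODULE__, label)

  @impl true
  def init(label), do: {:ok, label}
end

defmodule DemoSupervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    children = [
      # Two children from the same module need distinct ids.
      Supervisor.child_spec({DemoWorker, "Worker-1"}, id: :worker_1),
      Supervisor.child_spec({DemoWorker, "Worker-2"}, id: :worker_2)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  # Look up a child's current pid by id (helper for this demo only).
  def child_pid(sup, id) do
    {^id, pid, _type, _modules} = List.keyfind(Supervisor.which_children(sup), id, 0)
    pid
  end
end

{:ok, sup} = DemoSupervisor.start_link()
old_pid = DemoSupervisor.child_pid(sup, :worker_1)

# Simulate a crash. No restart code is written anywhere in this file.
Process.exit(old_pid, :kill)
Process.sleep(100)

new_pid = DemoSupervisor.child_pid(sup, :worker_1)
Process.alive?(new_pid) and new_pid != old_pid
```

Expect an error report in the log when the worker is killed; that report is the supervisor noticing the crash before it starts the replacement.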
🤔 Why This Matters
# Think about the implications:
# 1. You wrote ~200 lines for AgentMonitor
# 2. OTP Supervisor does the same in ~10 lines
# 3. OTP has been refined for 30+ years
# Question: Which version likely handles more edge cases correctly?
# Question: If you needed to support a new restart pattern, which would be easier to extend?
# But here's the deeper question:
# WHY did you need to write AgentMonitor in Phase 2?
phase_2_purpose = """
# Your answer: ???
# Hint: What did writing it manually teach you about processes,
# monitoring, and fault tolerance that you wouldn't have learned
# by just using Supervisor from the start?
"""
Section 3: OTP Behaviours Overview
A behaviour in OTP is a contract - a set of callbacks that your module must implement. Think of it like an interface in other languages.
From Learn You Some Erlang:
> The framework “takes care of” repetitive implementation details “by grouping these essential practices into a set of libraries that have been carefully engineered and battle-hardened over years.”
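Behaviours are not reserved for OTP; you can define your own. The sketch below uses a hypothetical Greeter behaviour (all names are invented for illustration) to show the two halves of the contract: @callback declares what implementers must provide, and @behaviour plus @impl mark a module as fulfilling it.

```elixir
# Hypothetical behaviour and implementations for illustration only.
defmodule Greeter do
  # The contract: any module adopting this behaviour must define greet/1.
  @callback greet(name :: String.t()) :: String.t()

  # Generic code written once against the contract.
  def welcome(impl, name), do: "[" <> impl.greet(name) <> "]"
end

defmodule FormalGreeter do
  @behaviour Greeter

  @impl Greeter
  def greet(name), do: "Good day, " <> name
end

defmodule CasualGreeter do
  @behaviour Greeter

  @impl Greeter
  def greet(name), do: "Hey " <> name
end

Greeter.welcome(FormalGreeter, "Ada")
Greeter.welcome(CasualGreeter, "Ada")
```

GenServer and Supervisor follow the same shape: the behaviour module owns the generic code, and your module supplies the callbacks.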
The Main Behaviours
┌─────────────────────────────────────────────────────────────┐
│                         Application                         │
│      (Lifecycle management - start/stop your OTP app)       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                         Supervisor                          │
│         (Manages child processes, handles restarts)         │
└─────────────────────────────────────────────────────────────┘
                               │
               ┌───────────────┼───────────────┐
               ▼               ▼               ▼
          ┌──────────┐    ┌──────────┐    ┌──────────┐
          │GenServer │    │GenServer │    │  Agent   │
          │(complex) │    │(worker)  │    │(simple)  │
          └──────────┘    └──────────┘    └──────────┘
GenServer Callbacks (from Learn You Some Erlang)
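GenServer defines six core callbacks that together cover a process's lifecycle. With use GenServer, only init/1 is expected from you; the others have default implementations that you override as needed: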
| Callback | Purpose |
|---|---|
| init/1 | Initialize state when process starts |
| handle_call/3 | Synchronous requests (caller waits for response) |
| handle_cast/2 | Asynchronous messages (fire and forget) |
| handle_info/2 | Non-GenServer messages (monitors, timers, etc.) |
| terminate/2 | Cleanup when stopping |
| code_change/3 | Hot code upgrade support |
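The least familiar of these is usually handle_info/2. As a small sketch, the hypothetical Heartbeat module below (invented for this example) schedules a plain :tick message to itself with Process.send_after/3; because that message is neither a call nor a cast, it arrives in handle_info/2.

```elixir
# Hypothetical module for illustration only.
defmodule Heartbeat do
  use GenServer

  def start_link(interval_ms), do: GenServer.start_link(__MODULE__, interval_ms)
  def beats(pid), do: GenServer.call(pid, :beats)

  @impl true
  def init(interval_ms) do
    # A plain message sent to ourselves; it arrives in handle_info/2.
    Process.send_after(self(), :tick, interval_ms)
    {:ok, %{interval: interval_ms, beats: 0}}
  end

  @impl true
  def handle_call(:beats, _from, state), do: {:reply, state.beats, state}

  @impl true
  def handle_info(:tick, state) do
    # Schedule the next tick, then record this one.
    Process.send_after(self(), :tick, state.interval)
    {:noreply, %{state | beats: state.beats + 1}}
  end
end

{:ok, pid} = Heartbeat.start_link(10)
Process.sleep(100)
Heartbeat.beats(pid)
```

The :DOWN messages your AgentMonitor received from Process.monitor would land in the same callback.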
🤔 Mapping Callbacks to Your Code
# Look at your ProcessAgent and identify which callback each pattern maps to:
callback_mapping = %{
  # Your receive pattern -> GenServer callback
  # {:get_state, from} -> which callback?
  get_state: nil,
  # {:remember, key, value} -> which callback?
  remember: nil,
  # {:DOWN, ref, :process, pid, reason} -> which callback?
  monitor_message: nil,
  # :stop -> which callback handles this?
  stop: nil
}
# Fill in your answers:
# callback_mapping = %{
# get_state: :handle_call, # Needs a response - synchronous
# remember: :handle_cast, # Fire and forget - async
# monitor_message: :handle_info, # Plain message, not a GenServer call or cast
# stop: :terminate # Cleanup callback, run after a handler returns {:stop, reason, state}
# }
Why Synchronous vs Asynchronous?
From Learn You Some Erlang on the difference:
> gen_server:call/2-3 sends synchronous requests and blocks until replies
> arrive (default 5-second timeout). gen_server:cast/2 sends asynchronous
> messages with immediate return.
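The difference is easy to observe. In this sketch, SlowEcho is a hypothetical server invented for the example; the calls themselves (GenServer.cast/2 and GenServer.call/3 with an explicit timeout) are the standard Elixir equivalents of the Erlang functions quoted above.

```elixir
# Hypothetical module for illustration only.
defmodule SlowEcho do
  use GenServer

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, [])

  @impl true
  def init(log), do: {:ok, log}

  @impl true
  def handle_cast({:note, item}, log), do: {:noreply, [item | log]}

  @impl true
  def handle_call(:log, _from, log), do: {:reply, Enum.reverse(log), log}

  def handle_call({:sleep, ms}, _from, log) do
    Process.sleep(ms)
    {:reply, :awake, log}
  end
end

{:ok, pid} = SlowEcho.start_link()

# cast returns :ok immediately, before the server has handled the message.
:ok = GenServer.cast(pid, {:note, :first})
:ok = GenServer.cast(pid, {:note, :second})

# call blocks until the reply arrives. Messages from one sender are handled in
# order, so both casts above have been applied by the time this returns.
GenServer.call(pid, :log)

# A call that outlives its timeout exits the caller.
result =
  try do
    GenServer.call(pid, {:sleep, 200}, 50)
  catch
    :exit, {:timeout, _} -> :timed_out
  end
```

Note that the timeout only stops the caller from waiting; the server still finishes the slow request.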
🤔 When Would You Use Each?
# For each operation in your agent framework, decide: call or cast?
operations = [
  {:get_agent_state, "Need to know current state to display it"},
  {:remember_fact, "Store something in memory"},
  {:send_task, "Queue a task for processing"},
  {:process_next, "Process and return the result"},
  {:broadcast_shutdown, "Tell all agents to prepare for shutdown"}
]
# Your answers:
# {:get_agent_state, :call} - We need the state back
# {:remember_fact, :cast} - Fire and forget, don't need confirmation
# {:send_task, :cast} - Queuing is async
# {:process_next, :call} - We want the result
# {:broadcast_shutdown, :cast} - Don't wait for acknowledgment
# But wait - what about :remember_fact?
# Is there a case where you WOULD want it to be a call?
remember_as_call_scenario = """
# When might you want remember/1 to be synchronous?
# Hint: Think about ordering guarantees...
"""
Section 4: The Supervision Tree
Why Trees?
From Learn You Some Erlang on supervision trees:
> A supervision tree organizes processes hierarchically where supervisors oversee workers and other supervisors, but workers should never supervise anything. This structure ensures every process can be tracked and cleanly shut down in an orderly fashion from the top down.
         Application
              │
        ┌─────┴─────┐
        ▼           ▼
   Supervisor   Supervisor
        │           │
   ┌────┼────┐      │
   ▼    ▼    ▼      ▼
Worker Worker Worker DynamicSupervisor
                          │
                     ┌────┼────┐
                     ▼    ▼    ▼
                  Agent Agent Agent
Benefits of Tree Structure
- Fault Isolation: Crashes in one branch don’t affect others
- Restart Granularity: Can restart a subtree without affecting the whole app
- Organized Shutdown: Shutdown is initiated from the root, and each supervisor stops its children before stopping itself
- Mental Model: Clear hierarchy of responsibilities
Restart Strategies Preview
| Strategy | Behavior |
|---|---|
| :one_for_one | Only restart the crashed child |
| :one_for_all | Restart ALL children if one crashes |
| :rest_for_one | Restart crashed child + all started after it |
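The table can be checked experimentally. The sketch below uses a hypothetical StrategyDemo helper (invented for this example) that starts three Agent children in the order :a, :b, :c, kills :b, and reports which children ended up with a new pid. The fixed 100 ms sleep is an assumption that the restart finishes quickly, which is fine for a demo.

```elixir
# Hypothetical helper for illustration only.
defmodule StrategyDemo do
  # Starts three Agent children (:a, :b, :c, in that order) under the given
  # strategy, kills :b, and reports which children came back with a new pid.
  def restarted_after_killing_b(strategy) do
    children =
      for id <- [:a, :b, :c] do
        Supervisor.child_spec({Agent, fn -> id end}, id: id)
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    before = pids(sup)

    Process.exit(before[:b], :kill)
    Process.sleep(100)

    after_restart = pids(sup)
    Supervisor.stop(sup)

    for id <- [:a, :b, :c], before[id] != after_restart[id], do: id
  end

  defp pids(sup) do
    for {id, pid, _type, _modules} <- Supervisor.which_children(sup), into: %{}, do: {id, pid}
  end
end

StrategyDemo.restarted_after_killing_b(:one_for_one)
# => [:b]
StrategyDemo.restarted_after_killing_b(:rest_for_one)
# => [:b, :c]
StrategyDemo.restarted_after_killing_b(:one_for_all)
# => [:a, :b, :c]
```

Start order matters for :rest_for_one: :c is restarted because it was started after :b, while :a is left alone.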
🤔 Choosing Strategies
# For each scenario, which restart strategy makes sense?
scenarios = [
  %{
    description: "5 independent worker agents, each handling different tasks",
    workers: ["ResearchAgent", "WriterAgent", "EditorAgent", "FactCheckerAgent", "PublisherAgent"],
    dependencies: "None - each works independently",
    strategy: nil # :one_for_one, :one_for_all, or :rest_for_one?
  },
  %{
    description: "A pipeline: Parser -> Validator -> Transformer -> Writer",
    workers: ["Parser", "Validator", "Transformer", "Writer"],
    dependencies: "Each depends on the previous step's output format",
    strategy: nil
  },
  %{
    description: "Database connection pool with shared connection manager",
    workers: ["ConnectionManager", "Worker1", "Worker2", "Worker3"],
    dependencies: "All workers depend on ConnectionManager being correct",
    strategy: nil
  }
]
# Think through each one before revealing answers:
# Scenario 1: :one_for_one - Independent workers, no reason to restart others
# Scenario 2: :rest_for_one - If Parser crashes with bad state, downstream may have bad data
# Scenario 3: :one_for_all - If ConnectionManager crashes, all workers have stale connections
Restart Limits
From Learn You Some Erlang:
> Supervisors accept MaxRestart and MaxTime parameters. “If more than
> MaxRestarts happen within MaxTime (in seconds), the supervisor just gives up”
> and terminates itself, allowing its parent supervisor to potentially restart it.
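In Elixir these parameters are the max_restarts and max_seconds supervisor options (defaulting to 3 restarts in 5 seconds). The sketch below, built around a hypothetical GiveUpDemo module, sets a deliberately tiny budget of one restart and keeps killing the child until the supervisor itself exits.

```elixir
# Hypothetical module for illustration only.
defmodule GiveUpDemo do
  # Kills the only child repeatedly until the supervisor exceeds its restart
  # budget, then returns the reason the supervisor itself exited with.
  def run do
    {:ok, sup} =
      Supervisor.start_link([{Agent, fn -> :ok end}],
        strategy: :one_for_one,
        max_restarts: 1,
        max_seconds: 5
      )

    # Unlink so the supervisor's exit does not take this process down too.
    Process.unlink(sup)
    ref = Process.monitor(sup)
    kill_until_down(sup, ref)
  end

  defp kill_until_down(sup, ref) do
    receive do
      {:DOWN, ^ref, :process, ^sup, reason} -> reason
    after
      50 ->
        kill_child(sup)
        kill_until_down(sup, ref)
    end
  end

  defp kill_child(sup) do
    case Supervisor.which_children(sup) do
      [{_id, pid, _type, _modules}] when is_pid(pid) -> Process.exit(pid, :kill)
      _ -> :ok
    end
  catch
    # The supervisor may already be gone by the time we ask.
    :exit, _ -> :ok
  end
end

GiveUpDemo.run()
# => :shutdown
```

In a real tree the supervisor would not be unlinked: its exit would reach its own parent supervisor, which then applies its own restart strategy one level up.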
🤔 Why Would a Supervisor “Give Up”?
# Consider: A worker keeps crashing every 100ms.
# The supervisor keeps restarting it.
# This goes on forever...
# What's the problem with this?
infinite_restart_problem = """
# 1. ???
# 2. ???
# 3. ???
"""
# Answers:
# 1. CPU/memory churn - constant spawn/crash cycle wastes resources
# 2. Log flooding - error logs become useless
# 3. Root cause obscured - the real bug isn't being investigated
# 4. Cascading failures - the crashing process might affect others each time
# This is why max_restarts exists - it's a circuit breaker!
Section 5: Interactive Exploration
Let’s explore some OTP concepts hands-on.
Exploring GenServer Behaviour
# All callbacks GenServer defines, required and optional:
GenServer.behaviour_info(:callbacks)
# What optional callbacks exist?
GenServer.behaviour_info(:optional_callbacks)
Exploring Supervisor Behaviour
# What callbacks does Supervisor need?
Supervisor.behaviour_info(:callbacks)
A Minimal GenServer
defmodule MinimalServer do
  use GenServer

  # Client API
  def start_link(initial) do
    GenServer.start_link(__MODULE__, initial)
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end

  def set(pid, value) do
    GenServer.cast(pid, {:set, value})
  end

  # Server Callbacks
  @impl true
  def init(initial) do
    {:ok, initial}
  end

  @impl true
  def handle_call(:get, _from, state) do
    {:reply, state, state}
  end

  @impl true
  def handle_cast({:set, value}, _state) do
    {:noreply, value}
  end
end
# Try it out!
{:ok, pid} = MinimalServer.start_link(0)
MinimalServer.get(pid)
MinimalServer.set(pid, 42)
MinimalServer.get(pid)
🤔 Understanding Return Tuples
# GenServer callbacks return specific tuples. Match each return to its meaning:
return_tuples = %{
  "{:ok, state}" => "From init - process started successfully",
  "{:reply, response, new_state}" => "???",
  "{:noreply, new_state}" => "???",
  "{:stop, reason, new_state}" => "???"
}
# Fill in the meanings, then check:
# "{:reply, response, new_state}" => "From handle_call - send response to caller, update state"
# "{:noreply, new_state}" => "From handle_cast/info - no response needed, update state"
# "{:stop, reason, new_state}" => "Stop the GenServer (triggers terminate callback)"
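The :stop tuple is the one return value the MinimalServer above never uses. As a sketch, the hypothetical Stoppable module below (invented for this example) returns it from handle_cast/2 and reports from terminate/2 so that the sequence can be observed from the outside.

```elixir
# Hypothetical module for illustration only.
defmodule Stoppable do
  use GenServer

  def start_link(notify_pid), do: GenServer.start_link(__MODULE__, notify_pid)

  @impl true
  def init(notify_pid), do: {:ok, notify_pid}

  @impl true
  def handle_cast(:stop, notify_pid) do
    # Returning a :stop tuple ends the loop; terminate/2 then runs.
    {:stop, :normal, notify_pid}
  end

  @impl true
  def terminate(reason, notify_pid) do
    send(notify_pid, {:terminated, reason})
    :ok
  end
end

{:ok, pid} = Stoppable.start_link(self())
GenServer.cast(pid, :stop)

receive do
  {:terminated, reason} -> reason
after
  1_000 -> :no_message
end
```

The reason passed to terminate/2 is the one given in the :stop tuple, here :normal, which a supervisor treats as a clean exit.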
The @impl Annotation
Notice the @impl true above each callback. This annotation:
- Declares that the function implements a behaviour callback
- Lets the compiler warn about typos in callback names
- Documents which functions are callbacks
# What happens if you typo a callback name?
defmodule BadServer do
  use GenServer

  @impl true
  def init(arg), do: {:ok, arg}

  # Uncomment to see the warning:
  # @impl true
  # def handle_cal(:get, _from, state), do: {:reply, state, state}
end
Section 6: Why OTP Matters for Your Agent Framework
What You’ll Gain
| Without OTP (Phase 2) | With OTP (Phase 3) |
|---|---|
| ~200 lines of supervision logic | ~10 lines with Supervisor |
| Manual message loop | Automatic with GenServer |
| Custom restart logic | Built-in restart strategies |
| No built-in debugging hooks | :observer and :sys tracing |
| No hot code reload | Supported through code_change/3 |
🤔 Final Reflection
# Before moving on, answer these questions:
final_reflection = %{
  # 1. What's the key insight behind OTP's design?
  key_insight: """
  # Your answer: ???
  # Hint: Think "generic vs. specific"
  """,
  # 2. Why did we build Phase 2 manually before learning OTP?
  phase_2_purpose: """
  # Your answer: ???
  """,
  # 3. Which OTP behaviour will your ProcessAgent become?
  process_agent_becomes: nil, # :gen_server, :supervisor, or :application?
  # 4. Which OTP behaviour will your AgentMonitor become?
  agent_monitor_becomes: nil
}
# Check your understanding:
# 1. "Separate generic server mechanics from application-specific logic"
# 2. "To understand what OTP does for us, we needed to do it ourselves first"
# 3. :gen_server (it has state and handles messages)
# 4. :supervisor (or DynamicSupervisor - it manages child processes)
Exercises
Exercise 1: Identify the Pattern
For each of these scenarios, identify which OTP behaviour you would use:
- A cache that stores key-value pairs and responds to get/set requests
- A pool of worker processes that should be restarted if they crash
- A simple counter that just needs to track a number
- The entry point for starting your entire application
exercise_1_answers = %{
  cache: nil,       # :gen_server, :agent, or :supervisor?
  worker_pool: nil, # :gen_server, :agent, or :supervisor?
  counter: nil,     # :gen_server or :agent?
  app_entry: nil    # :application or :supervisor?
}
Exercise 2: Map Your Code
Look at your Phase 2 ProcessAgent module. List 3 things that GenServer
will handle automatically that you had to code manually:
genserver_provides = [
# 1. ?
# 2. ?
# 3. ?
]
Exercise 3: Supervision Tree Design
Design a supervision tree for an agent framework with:
- 3 permanent worker agents
- A dynamic pool of temporary agents
- A shared state store
Draw it as ASCII art:
supervision_tree = """
        ?
        │
┌───────┼───────┐
?       ?       ?
"""
Key Takeaways
- OTP is patterns, not magic - You already implemented OTP concepts in Phase 2
- Generic vs. Specific - OTP handles the generic, you focus on your logic
- Behaviours are contracts - They define callbacks your module must implement
- GenServer = your loop - Handles the receive loop, state, and more
- Supervisor = your monitor - Handles process monitoring and restarts
- Trees isolate faults - Hierarchical supervision contains failures
- 30 years of refinement - OTP is battle-tested at massive scale
What’s Next?
In the next session, we’ll dive deep into GenServer:
- The full callback contract
- Converting ProcessAgent to AgentServer
- Synchronous (call) vs asynchronous (cast) patterns
- Handling system messages with handle_info
You’ll transform your manual process loop into a proper OTP GenServer!