Cluster Sampling Demo: Livebook → eXMC via :erpc

notebooks/cluster-sampling-demo.livemd

The Setup

This notebook runs inside the livebook jail (10.17.89.13). The sampling computation runs on the exmc jail (10.17.89.14). They’re connected via distributed Erlang over bastille0 — same cookie, same cluster, different jails.

No HTTP APIs. No message queues. Just :erpc.call/3.

# Verify we're in the cluster
IO.puts("I am: #{Node.self()}")
IO.puts("My peers: #{inspect(Node.list())}")

# Connect to exmc if not already connected
unless :"exmc@10.17.89.14" in Node.list() do
  Node.connect(:"exmc@10.17.89.14")
end

IO.puts("exmc connected: #{:"exmc@10.17.89.14" in Node.list()}")

Define the Model

A simple Bayesian model: estimate the mean and standard deviation of some observed data. We define the model here in Livebook, but it's just data (an IR struct) that's portable across nodes.

# The model definition, kept here as source; the IR it builds is just a data structure, no computation yet
model_code = """
alias Exmc.{Builder, Dist}

ir = Builder.new_ir()
ir = Builder.rv(ir, "mu", Dist.Normal, %{mu: Nx.tensor(0.0), sigma: Nx.tensor(10.0)})
ir = Builder.rv(ir, "sigma", Dist.HalfNormal, %{sigma: Nx.tensor(2.0)})
ir = Builder.rv(ir, "y", Dist.Normal, %{mu: "mu", sigma: "sigma"})
observations = Nx.tensor([2.1, 2.5, 1.8, 2.3, 2.7, 1.9, 2.4, 2.2, 2.6, 2.0])
ir = Builder.obs(ir, "y_obs", "y", observations)
ir
"""

IO.puts("Model defined (10 observations, estimating mu and sigma)")
IO.puts("Will sample on: exmc@10.17.89.14")

Run 4 Chains on eXMC (Remote)

Four independent MCMC chains run in parallel on the exmc node. We send the model over via :erpc, exmc runs sample_chains/3, and all 4 traces come back to Livebook as Nx tensors.

# Ship 4-chain sampling to exmc via :erpc
target = :"exmc@10.17.89.14"

{time_us, chains} = :timer.tc(fn ->
  :erpc.call(target, fn ->
    alias Exmc.{Builder, Dist}

    ir = Builder.new_ir()
    ir = Builder.rv(ir, "mu", Dist.Normal, %{mu: Nx.tensor(0.0), sigma: Nx.tensor(10.0)})
    ir = Builder.rv(ir, "sigma", Dist.HalfNormal, %{sigma: Nx.tensor(2.0)})
    ir = Builder.rv(ir, "y", Dist.Normal, %{mu: "mu", sigma: "sigma"})
    observations = Nx.tensor([2.1, 2.5, 1.8, 2.3, 2.7, 1.9, 2.4, 2.2, 2.6, 2.0])
    ir = Builder.obs(ir, "y_obs", "y", observations)

    # 4 chains, 1000 draws each, 500 warmup
    Exmc.NUTS.Sampler.sample_chains(ir, 4, num_samples: 1000, num_warmup: 500)
  end, 60_000)  # 60s timeout
end)

IO.puts("4 chains completed in #{div(time_us, 1000)}ms on #{target}")
IO.puts("Chains returned: #{length(chains)}")

Inspect the Posterior (4 chains)

Four traces came back from exmc as Nx tensors — serialized over distributed Erlang, deserialized here in Livebook. Zero ceremony.

# Summarize across all 4 chains
for {chain, i} <- Enum.with_index(chains, 1) do
  mu_samples = chain["mu"]
  sigma_samples = chain["sigma"]
  mu_mean = Nx.mean(mu_samples) |> Nx.to_number() |> Float.round(3)
  sigma_mean = Nx.mean(sigma_samples) |> Nx.to_number() |> Float.round(3)
  IO.puts("  Chain #{i}: mu=#{mu_mean}, sigma=#{sigma_mean}")
end

# Pool all chains
all_mu = chains |> Enum.map(& &1["mu"]) |> Nx.concatenate()
all_sigma = chains |> Enum.map(& &1["sigma"]) |> Nx.concatenate()

IO.puts("""

Pooled posterior (#{Nx.size(all_mu)} total samples across 4 chains):
  mu:    #{Nx.mean(all_mu) |> Nx.to_number() |> Float.round(3)} ± #{Nx.standard_deviation(all_mu) |> Nx.to_number() |> Float.round(3)}
  sigma: #{Nx.mean(all_sigma) |> Nx.to_number() |> Float.round(3)}

Computed on: exmc@10.17.89.14
Displayed on: #{Node.self()}
Transport: :erpc over distributed Erlang (bastille0 loopback)
""")

What This Demonstrates

  1. Livebook is the interactive frontend (notebook UI)
  2. eXMC is the compute backend (NUTS sampler)
  3. Connection: distributed Erlang over bastille0, shared cookie from ZFS secrets (see the cookie sketch after this list)
  4. Serialization: Nx tensors travel over :erpc transparently
  5. No infrastructure between them: no REST API, no message queue, no service mesh
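
On point 3, here is a minimal sketch of what loading a shared cookie at runtime could look like. The path and the use of Node.set_cookie/2 are illustrative assumptions; the actual delivery via ZFS secrets is outside this notebook:

# Illustrative only: load a shared Erlang cookie from a secrets file.
# The path below is hypothetical; in this setup the cookie comes from ZFS secrets.
cookie =
  "/var/run/secrets/erlang.cookie"
  |> File.read!()
  |> String.trim()
  |> String.to_atom()

Node.set_cookie(Node.self(), cookie)
IO.puts("Cookie set for #{Node.self()}")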

This is what “off docker-compose” looks like for compute workloads: the cluster IS the API.