Production Code Upgrades

Introduction

This document contains my learning notes for the Production Code Upgrades in Elixir series.

Part 1: Hot Code Reloading in Elixir

From Hot Code Reloading in Elixir

1.1 The Erlang Code Server

Upgrading Modules

This part shows that Erlang’s code server can run multiple versions of a module simultaneously.

defmodule Counter do
  def count(n) do
    :timer.sleep(1000)
    IO.puts("- #{inspect(self())}: #{n}")
    count(n + 1)
  end
end
# Run the counter in a spawned process so that it does not block the terminal.
spawn(Counter, :count, [0])

Now, we update the counter module to increment the number by 2 instead of 1.

  • Reevaluate the changed module.
  • Rerun the above code block to spawn another process.

We can now see two counters:

  • the old one is still running with the old increment
  • the new one uses the new increment.

The Erlang Code Server

  • The server can keep two versions of a module in memory: the new (current) one and the old one.
  • When a module is loaded, it becomes the current version.
  • Exported functions of the old version are replaced by those of the new version, so fully-qualified calls reach the new code.
  • If a process is already running code from a module when a new version is loaded, it stays on the old version until it makes a fully-qualified call to that module.
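
The rules above can be checked with a small experiment. This is only a sketch: ReloadDemo and ReloadDemo.Worker are hypothetical names, and the module is redefined at runtime with Code.compile_string/1 instead of reevaluating a Livebook cell. The worker answers one request from its old code and then moves to the current version through a fully-qualified call.

```elixir
defmodule ReloadDemo do
  # Returns source code for a worker module that reports the given version.
  def source(version) do
    """
    defmodule ReloadDemo.Worker do
      def version, do: #{version}

      def loop do
        receive do
          {:version, from} ->
            # Local call: answered by the version this process is running.
            send(from, {:version, version()})
            # Fully-qualified call: continues in the current version.
            ReloadDemo.Worker.loop()
        end
      end
    end
    """
  end

  # Asks the worker which version it is running.
  def ask(pid) do
    send(pid, {:version, self()})

    receive do
      {:version, version} -> version
    after
      1000 -> :timeout
    end
  end

  def run do
    # Silence the "redefining module" warning for this experiment.
    Code.compiler_options(ignore_module_conflict: true)

    Code.compile_string(source(1))
    pid = spawn(ReloadDemo.Worker, :loop, [])
    before_upgrade = ask(pid)

    # Load version 2 while the worker is still waiting in version 1.
    Code.compile_string(source(2))
    first_after_upgrade = ask(pid)
    second_after_upgrade = ask(pid)

    Process.exit(pid, :kill)
    {before_upgrade, first_after_upgrade, second_after_upgrade}
  end
end

IO.inspect(ReloadDemo.run())
```

The first element is 1 and the last is 2. The middle element is normally 1, because the worker was already waiting inside the old code when version 2 was loaded and only switched after handling that message.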

1.2 Hot Reloading GenServers

Hot Reloading GenServers

defmodule CountServer do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, 0)
  end

  def init(state) do
    Process.send_after(self(), :increment, 1000)
    {:ok, state}
  end

  def handle_info(:increment, n) do
    incremented = n + 1
    IO.puts("- #{inspect(self())}: #{incremented}")

    Process.send_after(self(), :increment, 1000)

    {:noreply, incremented}
  end
end

Start the counter

{:ok, pid} = CountServer.start_link()
# Run this only when you are done observing the counter, since it stops the process:
# Process.exit(pid, :kill)

Observe

  • Change the increment in handle_info.
  • Recompile the module.
  • The already running process starts using the new increment immediately.
  • We do not need to start a new GenServer.

Why can the GenServer start using the new value immediately?

  • The GenServer’s callback module and its state are separated: the state lives in a process that runs the generic GenServer loop.
  • The state, which is kept in the GenServer process, is updated by calling out to the CountServer module.
  • External function calls, like the GenServer process calling out to the CountServer module, are always made to the current version of the module.

See Deconstructing Elixir’s GenServers for more details.
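
To make the separation concrete, here is a heavily simplified sketch of a generic server loop. It is not how GenServer is actually implemented; TinyServer and TinyServer.Doubler are hypothetical names used only for illustration. The point is that the loop owns the state and reaches the callback module through a call on the module name, which is resolved against the current version of that module every time.

```elixir
defmodule TinyServer do
  # Starts a process that runs the generic loop on behalf of `callback_module`.
  def start(callback_module, state) do
    spawn(fn -> loop(callback_module, state) end)
  end

  # Reads the state held by the loop process.
  def get(pid) do
    send(pid, {:get, self()})

    receive do
      {:state, state} -> state
    after
      1000 -> :timeout
    end
  end

  defp loop(callback_module, state) do
    receive do
      {:get, from} ->
        send(from, {:state, state})
        loop(callback_module, state)

      message ->
        # Call through the module name: dispatched to the current version
        # of the callback module for every message.
        {:noreply, new_state} = callback_module.handle_info(message, state)
        loop(callback_module, new_state)
    end
  end
end

defmodule TinyServer.Doubler do
  # The callback module only transforms the state; it does not hold it.
  def handle_info(:tick, n), do: {:noreply, n * 2}
end

pid = TinyServer.start(TinyServer.Doubler, 1)
send(pid, :tick)
IO.inspect(TinyServer.get(pid))
```

If TinyServer.Doubler were reloaded with a different handle_info/2, the next :tick would already be handled by the new version, while the state kept in the loop process would survive.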

1.3 Transforming State

Although the state in the GenServer example got transformed correctly by the reloaded version of the CountServer module, there’s one more scenario to look at.

What happens when the new version of the implementation requires a different state?
We need to update the state when we upgrade the module to the new version.

defmodule CountServer do
  use GenServer

  def start_link do
    GenServer.start_link(__MODULE__, 0)
  end

  def init(state) do
    Process.send_after(self(), :increment, 1000)
    {:ok, state}
  end

  def handle_info(:increment, n) do
    incremented = n + 2
    IO.puts("- #{inspect(self())}: #{incremented}")

    Process.send_after(self(), :increment, 1000)

    {:noreply, incremented}
  end

  # ===> added
  def code_change(_old_vsn, state, _extra) when rem(state, 2) == 1 do
    {:ok, state - 1}
  end

  def code_change(_old_vsn, state, _extra) do
    {:ok, state}
  end
end
{:ok, pid} = CountServer.start_link()
# If we test the module from iex, we may need these commands to better show the result.
# In Livebook, they are not needed since we can reevaluate each block.
:sys.suspend(pid)
:sys.change_code(pid, CountServer, nil, [])
:sys.resume(pid)

About Backward Compatibility

  • When we update a GenServer, we need to keep a clause that accepts the old messages, so that the new version stays backward compatible with the previous one and the upgrade is clean.
  • For instance, if we change the handle_info callback clause without leaving the previous clause in place, our GenServer will crash and report “no function clause matching”.
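
As a minimal sketch of that advice, assume the upgrade introduces a new {:increment, step} message while processes may still have the old :increment message queued. CompatCounter and the new message shape are hypothetical and not taken from the articles.

```elixir
defmodule CompatCounter do
  use GenServer

  @impl true
  def init(n), do: {:ok, n}

  # New message format introduced by the (hypothetical) upgrade.
  @impl true
  def handle_info({:increment, step}, n), do: {:noreply, n + step}

  # Old message format, kept so that messages sent before the upgrade
  # do not crash the server with a FunctionClauseError.
  def handle_info(:increment, n), do: {:noreply, n + 1}
end

{:ok, pid} = GenServer.start_link(CompatCounter, 0)
send(pid, :increment)
send(pid, {:increment, 2})
IO.inspect(:sys.get_state(pid))
```

Both message shapes are accepted, so the state ends at 3. The old clause can be removed in a later release, once no old-format messages can be in flight.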

1.4 Code Purge in Elixir

This section tests the example from: Should You Code Purge in Elixir?

CodePurge.pi()

We can recompile from Elixir code itself in Livebook.

# This uses a simple module from `elixir_horizion`,
# so we are using an attached node running from mix.
# Recompile the project in a separate shell
ExecCmd.run("mix compile")
# In our iex shell, we reload the module code:
:code.load_file(CodePurge)

When we rerun the function, we can see that the pi value has changed.

CodePurge.pi()

If we reload this module once more, we get an error, {:error, :not_purged}, because the code server still holds an old version of the module.

:code.load_file(CodePurge)

To solve the problem, we can use :code.purge and :code.soft_purge.

  • They deal with processes that are still running old code.
  • :code.purge/1 removes the old code and kills any processes still running it.
  • :code.soft_purge/1 removes the old code only if no process is running it; otherwise it returns false and purges nothing.
:code.purge(CodePurge)
:code.load_file(CodePurge)
:code.soft_purge(CodePurge)
:code.load_file(CodePurge)
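
The two calls can be combined into a cautious reload helper. This is only a sketch: SafeReload is a hypothetical module, and the :old_code_in_use error atom is my own choice. It reloads a module only when no process would have to be killed.

```elixir
defmodule SafeReload do
  # Reloads `module` from the code path, but only if the old version can be
  # purged without killing any process.
  def reload(module) do
    if :code.soft_purge(module) do
      # No old code is left, so loading cannot fail with :not_purged.
      :code.load_file(module)
    else
      # Some process still runs old code; leave everything untouched.
      {:error, :old_code_in_use}
    end
  end
end

# A module name that has no .beam file on the code path simply fails to load.
IO.inspect(SafeReload.reload(:no_such_module_for_demo))
```

In the attached node, SafeReload.reload(CodePurge) would take the place of the purge and load_file pair above. Whether to fall back to :code.purge/1 when it reports :old_code_in_use is a decision for the caller.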

1.5 How Not to Do a Code Upgrade

# lib/code_purge/pi.ex
defmodule CodePurgeV2.Pi do
  def start_link do
    spawn_link(&server/0)
  end

  def server do
    receive do
      {:get, from} ->
        send(from, {:ok, 3.14})
        CodePurgeV2.Pi.server()
    end
  end

  def get(pid) do
    send(pid, {:get, self()})

    receive do
      {:ok, value} ->
        {:ok, value}
    after
      1000 ->
        :error
    end
  end
end
pid = CodePurgeV2.Pi.start_link()
CodePurgeV2.Pi.get(pid)
# Now, reload the module once (without any actual changes to functions) and 
# try to purge it so that you can do the next 'upgrade':

:code.load_file(CodePurgeV2.Pi)
# This kills the server process, which is still running old code. Because the server is
# linked to the shell process, the shell is terminated too (the Livebook code block shows "Aborted").
:code.purge(CodePurgeV2.Pi)

This happens because the server was waiting in receive and never got a message, so it never made the fully-qualified call that would have moved it to the new code after the first upgrade.

Your code needs to follow OTP behaviours to be safely upgraded. That’s why you should generally avoid spawn or spawn_link for home-brewed servers or other long-running processes that don’t use OTP.

(see Demystifying processes in Elixir)

1.6 How To Do a Code Upgrade Using GenServer

defmodule CodePurgeV3 do
  use GenServer

  def start_link(value \\ 3.14) do
    GenServer.start_link(__MODULE__, value)
  end

  def init(value) do
    {:ok, value}
  end

  def handle_call(:get, _from, value) do
    {:reply, value, value}
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end
end

Try to upgrade/purge the code of a running process several times.

{:ok, pid} = CodePurgeV3.start_link()
CodePurgeV3.get(pid)
:code.load_file(CodePurgeV3)
:code.purge(CodePurgeV3)

1.7 Keep updating the states of GenServer processes

When we update a GenServer, we also need to update its state. This is covered in the section above: “1.3 Transforming State”.

TODO: The main challenge is to keep updating the states of GenServer processes, so that any new code can work.

Part 2: Using Supervisors to Organize Your Elixir Application

From Using Supervisors to Organize Your Elixir Application.

In the previous chapter of this series, we looked at hot code reloading in Elixir and why we should use GenServer to implement long-running processes.

But to organize a whole application, we need one more building block — supervisors. Let’s take a look at supervisors in detail.

defmodule CounterV4 do
  use GenServer
  require Logger

  @interval 100

  def start_link(start_from, opts \\ []) do
    GenServer.start_link(__MODULE__, start_from, opts)
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end

  def init(start_from) do
    st = %{
      current: start_from,
      timer: :erlang.start_timer(@interval, self(), :tick)
    }

    {:ok, st}
  end

  def handle_call(:get, _from, st) do
    {:reply, st.current, st}
  end

  def handle_info({:timeout, _timer_ref, :tick}, st) do
    new_timer = :erlang.start_timer(@interval, self(), :tick)
    :erlang.cancel_timer(st.timer)

    {:noreply, %{st | current: st.current + 1, timer: new_timer}}
  end
end
children = [
  {CounterV4, 10000}
]

opts = [strategy: :one_for_one, name: MyApp.Supervisor]
Supervisor.start_link(children, opts)

CounterV4 has now been started by the supervisor we created.

[{_, pid, _, _}] = Supervisor.which_children(MyApp.Supervisor)
CounterV4.get(pid)

Kill the child, which is CounterV4.

Process.exit(pid, :shutdown)

We can see that CounterV4 is restarted by the supervisor.

Supervisor.which_children(MyApp.Supervisor)
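
The restart can also be checked programmatically instead of by reading the output. The sketch below uses a hypothetical RestartDemo worker rather than CounterV4 so that it runs on its own. It polls the supervisor, because the restart happens asynchronously after the child dies.

```elixir
defmodule RestartDemo do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}

  # Polls the supervisor until its single child is a pid other than `old_pid`.
  def wait_for_restart(sup, old_pid, attempts \\ 50) do
    case Supervisor.which_children(sup) do
      [{_id, pid, _type, _modules}] when is_pid(pid) and pid != old_pid ->
        {:ok, pid}

      _ when attempts > 0 ->
        Process.sleep(10)
        wait_for_restart(sup, old_pid, attempts - 1)

      _ ->
        :timeout
    end
  end
end

{:ok, sup} = Supervisor.start_link([{RestartDemo, 0}], strategy: :one_for_one)
[{_id, old_pid, _type, _modules}] = Supervisor.which_children(sup)

# Kill the child; the supervisor should start a replacement.
Process.exit(old_pid, :kill)

{:ok, new_pid} = RestartDemo.wait_for_restart(sup, old_pid)
IO.inspect({old_pid != new_pid, Process.alive?(new_pid)})
```

The child id stays the same across the restart; only the pid changes, which is why pids should not be stored for long-lived references to supervised processes.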

Let’s see the supervision tree.

Kino.Process.render_sup_tree(MyApp.Supervisor)

2.1 Adding GenServers to Custom Supervisors

First, let us create a callback module for our custom supervisor.

defmodule CounterSup do
  use Supervisor

  def start_link(start_numbers) do
    Supervisor.start_link(__MODULE__, start_numbers, name: __MODULE__)
  end

  @impl true
  def init(start_numbers) do
    children =
      for start_number <- start_numbers do
        # We can't just use `{CounterV4, start_number}`
        # because we need different ids for children

        Supervisor.child_spec({CounterV4, start_number}, id: start_number)
      end

    Supervisor.init(children, strategy: :one_for_one)
  end
end

Update how we start MyApp.Supervisor. It no longer starts CounterV4 directly. Instead, it starts CounterSup.

children = [
  {CounterSup, [10000, 20000]}
]

opts = [strategy: :one_for_one, name: MyApp.SupervisorV2]
Supervisor.start_link(children, opts)
Supervisor.which_children(MyApp.SupervisorV2)
Supervisor.which_children(CounterSup)

Let’s visualize the supervision tree.

Kino.Process.render_sup_tree(MyApp.SupervisorV2)

Now, let’s add the 3rd child and visualize the supervision tree.

new_children_spec = Supervisor.child_spec({CounterV4, 30000}, id: 30000)
Supervisor.start_child(CounterSup, new_children_spec)
Supervisor.which_children(CounterSup)
Kino.Process.render_sup_tree(MyApp.SupervisorV2)

We can also remove a child from the supervision tree.

How do I programmatically get a pid from a string? (see: How to convert a pid string from logs into pid values?)

problem_child = IEx.Helpers.pid("0.859.0")
Process.alive?(problem_child)

However, this is not the child_id. The child_id is the one we specified using Supervisor.child_spec.

# We have to first terminate a running child then delete it
Supervisor.terminate_child(CounterSup, 10000)
Supervisor.delete_child(CounterSup, 10000)
# Visualize again
Kino.Process.render_sup_tree(MyApp.SupervisorV2)

Let’s add another supervision tree under MyApp.SupervisorV2

children_specs = for n <- [10000, 20000, 30000], do: Supervisor.child_spec({CounterV4, n}, id: n)
second_sup_spec = %{
  id: CraftSub,
  start: {Supervisor, :start_link, [children_specs, [strategy: :one_for_one]]},
  type: :supervisor,
  restart: :permanent,
  shutdown: 5000
}
Supervisor.start_child(MyApp.SupervisorV2, second_sup_spec)
Supervisor.which_children(MyApp.SupervisorV2)
Kino.Process.render_sup_tree(MyApp.SupervisorV2)

2.2 Examples of Custom Supervisor Usage

Stop all Counter instances by stopping their supervisor

Consider a situation in which we want to stop all started Counter instances.

We can do this by stopping their supervisor.

defmodule CounterV5 do
  use GenServer
  require Logger

  @interval 100

  def start_link(start_from, opts \\ []) do
    GenServer.start_link(__MODULE__, start_from, opts)
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end

  def init(start_from) do
    # Updated here
    Process.flag(:trap_exit, true)

    st = %{
      current: start_from,
      timer: :erlang.start_timer(@interval, self(), :tick)
    }

    {:ok, st}
  end

  def handle_call(:get, _from, st) do
    {:reply, st.current, st}
  end

  def handle_info({:timeout, _timer_ref, :tick}, st) do
    new_timer = :erlang.start_timer(@interval, self(), :tick)
    :erlang.cancel_timer(st.timer)

    {:noreply, %{st | current: st.current + 1, timer: new_timer}}
  end

  # Updated here
  def terminate(reason, st) do
    Logger.info("terminating with #{inspect(reason)}, counter is #{st.current}")
  end
end
defmodule CounterSupV2 do
  use Supervisor

  def start_link(start_numbers) do
    Supervisor.start_link(__MODULE__, start_numbers, name: __MODULE__)
  end

  @impl true
  def init(start_numbers) do
    children =
      for start_number <- start_numbers do
        # We can't just use `{CounterV5, start_number}`
        # because we need different ids for children

        Supervisor.child_spec({CounterV5, start_number}, id: start_number)
      end

    Supervisor.init(children, strategy: :one_for_one)
  end
end
children = [
  {CounterSupV2, [10000, 20000]}
]

opts = [strategy: :one_for_one, name: MyApp.SupervisorV3]
Supervisor.start_link(children, opts)
Kino.Process.render_sup_tree(MyApp.SupervisorV3)
Supervisor.stop(MyApp.SupervisorV3, :normal)

How to stop all Counters gracefully.

Here, graceful means that a counter keeps counting until it reaches a number divisible by 10 (10, 20, 30, etc.) before it shuts down.

Of course, in our simple example, we could just send ticks to count to the nearest number divisible by 10 in terminate.
Instead, imagine that these events are external and emulate some metrics that we would prefer to aggregate consistently.

To achieve that, we need to do the following modification.

defmodule CounterV6 do
  use GenServer
  require Logger

  @interval 100

  def start_link(start_from) do
    GenServer.start_link(__MODULE__, start_from)
  end

  def get(pid) do
    GenServer.call(pid, :get)
  end

  def stop_gracefully(pid) do
    GenServer.call(pid, :stop_gracefully)
  end

  def init(start_from) do
    Process.flag(:trap_exit, true)

    st = %{
      current: start_from,
      timer: :erlang.start_timer(@interval, self(), :tick),
      terminator: nil
    }

    {:ok, st}
  end

  def handle_call(:get, _from, st) do
    {:reply, st.current, st}
  end

  def handle_call(:stop_gracefully, from, st) do
    if st.terminator do
      {:reply, :already_stopping, st}
    else
      {:noreply, %{st | terminator: from}}
    end
  end

  def handle_info({:timeout, _timer_ref, :tick}, st) do
    :erlang.cancel_timer(st.timer)

    new_current = st.current + 1

    if st.terminator && rem(new_current, 10) == 0 do
      # we are terminating
      GenServer.reply(st.terminator, :ok)
      {:stop, :normal, %{st | current: new_current, timer: nil}}
    else
      new_timer = :erlang.start_timer(@interval, self(), :tick)
      {:noreply, %{st | current: new_current, timer: new_timer}}
    end
  end

  def terminate(reason, st) do
    Logger.info("terminating with #{inspect(reason)}, counter is #{st.current}")
  end
end

Let’s test if it works for a single process.

{:ok, pid} = CounterV6.start_link(10000)
CounterV6.stop_gracefully(pid)

After confirming that the Counter module can stop gracefully, let’s modify the supervisor to make it work with our Counter’s new feature.

In Elixir, an application has the prep_stop callback. What is the similar callback for a supervisor?

The prep_stop callback is used in applications to perform graceful termination before the application shuts down. Supervisors, on the other hand, don’t have a built-in callback specifically for this purpose.

When a supervisor is stopped, either due to an error or a clean shutdown, it will stop all its child processes as well. This behavior is based on the linked supervision tree where a supervisor is linked to its child processes, and if a supervisor terminates, its children will also be terminated.

If you need to perform some cleanup or graceful termination logic for a supervised process before it is stopped, you can define terminate/2 within the supervised process itself. If the process traps exits, terminate/2 runs when the supervisor sends it the shutdown signal, so it can perform any necessary cleanup before exiting.

For how to define a custom termination function, see the usage of terminate/2 above.

Please note that if you have child supervisors within your main supervisor, they will also follow the same linked supervision model, and their children will be terminated in a similar way. Therefore, you can implement custom termination logic within individual processes or child supervisors as needed.
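
The role of trap_exit can be seen in a small sketch. CleanupDemo is a hypothetical worker whose terminate/2 reports to the process that started the supervisor. One child traps exits and the other does not.

```elixir
defmodule CleanupDemo do
  use GenServer

  def start_link({trap?, report_to}) do
    GenServer.start_link(__MODULE__, {trap?, report_to})
  end

  @impl true
  def init({trap?, _report_to} = state) do
    # Only a process that traps exits gets terminate/2 called on shutdown.
    Process.flag(:trap_exit, trap?)
    {:ok, state}
  end

  @impl true
  def terminate(reason, {trap?, report_to}) do
    send(report_to, {:terminated, trap?, reason})
  end
end

children = [
  Supervisor.child_spec({CleanupDemo, {true, self()}}, id: :trapping),
  Supervisor.child_spec({CleanupDemo, {false, self()}}, id: :not_trapping)
]

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)
:ok = Supervisor.stop(sup)

# Only the trapping child had a chance to run its cleanup.
receive do
  {:terminated, trap?, reason} -> IO.inspect({trap?, reason})
after
  100 -> IO.puts("no cleanup ran")
end
```

The trapping child reports {true, :shutdown}. The other child is stopped by the same exit signal without ever entering terminate/2, so it sends nothing.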

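This means we have to create a real application for testing.
The supervisor module is the same as in the previous example, except for a new name and the restart: :transient option, which keeps a counter that stopped with reason :normal from being restarted.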

defmodule CounterSupV4 do
  use Supervisor

  def start_link(start_numbers) do
    Supervisor.start_link(__MODULE__, start_numbers, name: __MODULE__)
  end

  @impl true
  def init(start_numbers) do
    children =
      for start_number <- start_numbers do
        # We can't just use `{CounterV6, start_number}`
        # because we need different ids for children

        Supervisor.child_spec(
          {CounterV6, start_number},
          id: start_number,
          restart: :transient
        )
      end

    Supervisor.init(children, strategy: :one_for_one)
  end
end

However, the application callback module below cannot be tested by evaluating it in Livebook.
We need to either modify our existing application or create a new one to test it.

defmodule App do
  use Application

  @impl true
  def start(_type, _args) do
    # Define the children we need to start. When App starts, we want to start CounterSupV4.
    children = [
      {CounterSupV4, [10000, 20000]}
    ]

    opts = [strategy: :one_for_one, name: MyCounterAppSupervisor]
    Supervisor.start_link(children, opts)
  end

  @impl true
  def prep_stop(st) do
    stop_tasks =
      for {_, pid, _, _} <- Supervisor.which_children(CounterSupV4) do
        Task.async(fn ->
          :ok = CounterV6.stop_gracefully(pid)
        end)
      end

    Task.await_many(stop_tasks)
    st
  end
end

Recap: steps to add a supervisor to an application.

  • In your application’s mix.exs file, define the callback module to use, such as

    def application do
      [
        extra_applications: [:logger],
        mod: {App, []}
      ]
    end
  • In your application callback module, start a supervision tree. That is the application callback’s job.

    • Add use Application.
    • Choose which supervisor to start by implementing start/2.
  • Make sure the supervisors are configured properly. We can start a supervisor in two ways:

    • Use Supervisor.start_link, with children and opts.
    • Call the supervisor module’s start_link directly.

For more details, see the documentation on the application callback.

Part 3: Application Code Upgrades in Elixir

From Application Code Upgrades in Elixir

Must also read through this: Learn You Some Erlang – Leveling Up in The Process Quest