
Supervisor restart strategies


Restart Strategies

In the previous chapter, we learned about supervision strategies, which determine how a supervisor restarts its child processes when one of them crashes. However, it is worth asking when a process should actually count as "crashed": processes can terminate gracefully, in which case we might not want to restart them, or they can crash due to errors.

To address this, restart strategies come into play. Unlike supervision strategies that apply to the entire supervision tree, restart strategies can be defined for each individual child process, allowing for more fine-grained control over their behavior.

Restart values can be specified either in the child spec or when creating a GenServer.

In the child spec, the restart value can be set as follows:

children = [
  %{
    id: :stack_1,
    start: {Stack, :start_link, []},
    restart: :temporary  # <================= Here
  }
]

Modifying default child spec

Alternatively, when creating a GenServer, the restart value can be specified using the use GenServer macro:

use GenServer, restart: :transient

As we learned earlier, GenServers provide a default child_spec/1 function that automatically generates the child specification. By passing options directly to use GenServer, we can customize the child_spec/1 function.
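
As a quick illustration (a minimal sketch using a hypothetical TransientStack module, not part of this chapter's code), inspecting the generated child_spec/1 shows the restart value baked into the returned map:

defmodule TransientStack do
  # Passing restart: :transient here customizes the generated child_spec/1
  use GenServer, restart: :transient

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg), do: {:ok, arg}
end

# The generated spec now carries the restart value
TransientStack.child_spec([])
# => %{id: TransientStack, restart: :transient, start: {TransientStack, :start_link, [[]]}}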

To understand the behavior of different restart options, let’s create a simple GenServer and add it to the supervision tree with various restart options.

defmodule CrashDummyServer do
  use GenServer

  def start_link(name) do
    random_state = System.unique_integer([:positive])
    GenServer.start_link(__MODULE__, {random_state, name}, name: name)
  end

  ## Callbacks

  @impl true
  def init({random_value, name}) do
    IO.inspect("#{name} starting up!")
    {:ok, random_value}
  end

  @impl true
  def handle_cast(:stop_gracefully, state) do
    # Returning this value makes the GenServer stop gracefully with :normal reason
    # If reason is neither :normal, :shutdown, nor {:shutdown, term} an error is logged.
    {:stop, :normal, state}
  end

  @impl true
  def handle_cast(:crash, state) do
    process_pid = self() |> inspect()
    raise "BOOM! CrashDummyServer process: #{process_pid} crashed!"
    {:noreply, state}
  end
end

defmodule CrashDummySupervisor do
  use Supervisor

  def start_link() do
    Supervisor.start_link(__MODULE__, :noop, name: __MODULE__)
  end

  @impl true
  def init(_) do
    # Supervision tree, start multiple instances of our genserver with different restart options
    children = [
      child_spec(:permanent_dummy, :permanent),
      child_spec(:temporary_dummy, :temporary),
      child_spec(:transient_dummy, :transient)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp child_spec(name, restart_strategy) do
    Supervisor.child_spec(
      {CrashDummyServer, name},
      id: name,
      # Specifying the restart strategy
      restart: restart_strategy
    )
  end
end

In the code snippet above, we have created a simple GenServer that crashes when it receives the :crash message and stops gracefully when it receives the :stop_gracefully message.

Within the Supervisor, we start three instances of this GenServer with three different restart strategies:

  • :permanent: This is the default restart strategy, where the child process is always restarted regardless of whether it crashes or is gracefully shut down.
  • :temporary: With this restart strategy, the child process is never restarted, even in the case of abnormal termination such as a crash. Any termination, even if it is abnormal, is considered successful.
  • :transient: The child process is restarted only if it terminates abnormally, meaning it exits with an exit reason other than :normal, :shutdown, or {:shutdown, term}.

Now, let’s test these restart strategies in action.


:permanent restart strategy

{:ok, supervisor_pid} = CrashDummySupervisor.start_link()
Supervisor.which_children(supervisor_pid)
# Test graceful termination of child with `:permanent` restart strategy
# Notice how the GenServer is restarted
GenServer.cast(:permanent_dummy, :stop_gracefully)
# Test abnormal termination of child with `:permanent` restart strategy
# Notice how the GenServer is restarted
GenServer.cast(:permanent_dummy, :crash)

:temporary restart strategy

# Test graceful termination of child with `:temporary` restart strategy
# Notice how the GenServer is NOT restarted
GenServer.cast(:temporary_dummy, :stop_gracefully)
# Notice how temporary_dummy is no longer present in the list of children
Supervisor.which_children(supervisor_pid) |> IO.inspect(label: "Supervisor's children")
# Restart the Supervisor so that all child processes are started again
Supervisor.stop(supervisor_pid)
{:ok, supervisor_pid} = CrashDummySupervisor.start_link()
# Test abnormal termination of child with `:temporary` restart strategy
# Notice how the GenServer is NOT restarted
GenServer.cast(:temporary_dummy, :crash)
# Notice how temporary_dummy is no longer present in the list of children
Supervisor.which_children(supervisor_pid) |> IO.inspect(label: "Supervisor's children")

:transient restart strategy

# Restart the Supervisor so that all child processes are started again
Supervisor.stop(supervisor_pid)
{:ok, supervisor_pid} = CrashDummySupervisor.start_link()
Supervisor.which_children(supervisor_pid) |> IO.inspect(label: "Supervisor's children")

# Test graceful termination of child with `:transient` restart strategy
# Notice how the GenServer is NOT restarted since it was stopped gracefully
GenServer.cast(:transient_dummy, :stop_gracefully)
# Notice how the transient child has pid :undefined since it's no longer running
Supervisor.which_children(supervisor_pid)
# Restart the Supervisor so that all child processes are started again
Supervisor.stop(supervisor_pid)
{:ok, supervisor_pid} = CrashDummySupervisor.start_link()
# Test abnormal termination of child with `:transient` restart strategy
# Notice how the GenServer is restarted since it was stopped abnormally
GenServer.cast(:transient_dummy, :crash)
# Notice how transient_dummy was restarted and all children are running
Supervisor.which_children(supervisor_pid) |> IO.inspect(label: "Supervisor's children")

The Max Restarts option

So far, we have covered various options used when defining supervisors and their child specifications, such as :id, :strategy, :name, and :restart. Now, let's look at two more options. Unlike :restart, these are configured on the supervisor itself rather than on an individual child.

Two important options are:

  • :max_restarts: This option sets the maximum number of restarts allowed within a specified time frame. By default, it is set to 3.

  • :max_seconds: This option defines the time frame in which the :max_restarts limit applies. The default value is 5 seconds.

Together, these options determine the maximum restart intensity of a supervisor: how many restarts may occur within a given time interval. They are part of the supervisor's built-in mechanism for keeping restarts under control, and both are passed as options to Supervisor.init/2 (or Supervisor.start_link/2).
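
For example (a sketch adapting the CrashDummySupervisor.init/1 callback from above; the limits shown are arbitrary), both options are passed to Supervisor.init/2 alongside the supervision strategy:

@impl true
def init(_) do
  children = [
    child_spec(:permanent_dummy, :permanent)
  ]

  # Tolerate at most 5 restarts within any 10-second window
  Supervisor.init(children, strategy: :one_for_one, max_restarts: 5, max_seconds: 10)
end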

If the number of restarts exceeds the :max_restarts limit within the last :max_seconds seconds, the supervisor terminates all its child processes and itself. In this case, the termination reason for the supervisor is :shutdown.

When a supervisor terminates, the next higher-level supervisor takes action. It either restarts the terminated supervisor or terminates itself.

The restart mechanism is designed to prevent a scenario where a process repeatedly crashes for the same reason, only to be restarted again and again.
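
To see the limit in action, here is a rough sketch reusing CrashDummySupervisor with its default limits (3 restarts in 5 seconds); the pauses are an arbitrary choice to give the supervisor time to restart the child between crashes:

# Stop any instance left over from the earlier examples
if pid = Process.whereis(CrashDummySupervisor), do: Supervisor.stop(pid)

# Trap exits so this (linked) process survives the supervisor's :shutdown
Process.flag(:trap_exit, true)
{:ok, supervisor_pid} = CrashDummySupervisor.start_link()

# Four crashes within 5 seconds exceed the default max_restarts of 3
for _ <- 1..4 do
  GenServer.cast(:permanent_dummy, :crash)
  Process.sleep(100)
end

# The supervisor has given up and terminated itself with reason :shutdown
Process.alive?(supervisor_pid) |> IO.inspect(label: "Supervisor still alive?")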

Shutdown strategy

When defining child specifications for a supervisor, we have the option to include the :shutdown option, which determines how the supervisor shuts down its child processes.

The :shutdown option has three possible values:

  • An integer greater than or equal to 0: This specifies the amount of time in milliseconds that the supervisor will wait for its children to terminate after sending a Process.exit(child, :shutdown) signal. If the child process does not trap exits, it will be terminated immediately upon receiving the :shutdown signal. If the child process traps exits, it has the specified amount of time to terminate. If it fails to terminate within the specified time, the supervisor will forcefully terminate the child process using Process.exit(child, :kill).

  • :brutal_kill: This option causes the child process to be unconditionally and immediately terminated using Process.exit(child, :kill). That is, the supervisor will not wait for the child process to terminate gracefully but will kill it right away.

  • :infinity: With this option, the supervisor will wait indefinitely for the child process to terminate.

By default, the :shutdown option is set to 5_000 (5 seconds) for worker processes (and to :infinity for children that are themselves supervisors), which means the supervisor will wait for a maximum of 5 seconds for a worker to shut down gracefully. If the child process does not terminate within this time frame, the supervisor will forcefully terminate it using Process.exit(child, :kill).
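
As with :restart, the value is set per child. Here is a small sketch reusing CrashDummyServer (the child names and ids are made up for illustration):

children = [
  # Give this child up to 10 seconds to clean up after the :shutdown signal
  Supervisor.child_spec({CrashDummyServer, :patient_dummy}, id: :patient_dummy, shutdown: 10_000),
  # Kill this child immediately, without waiting for a graceful exit
  Supervisor.child_spec({CrashDummyServer, :doomed_dummy}, id: :doomed_dummy, shutdown: :brutal_kill)
]

Supervisor.start_link(children, strategy: :one_for_one)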

These options provide flexibility in managing the shutdown behavior of child processes in a supervisor.

Key points

  • During startup, a supervisor processes all child specifications and starts each child in the order they are defined. This is achieved by invoking the function specified under the :start key in the child specification, typically start_link/1.

  • When a supervisor initiates shutdown, it terminates its children in the reverse order of their listing. This termination process involves sending a shutdown exit signal, Process.exit(child_pid, :shutdown), to each child process and waiting for a specified time interval for them to terminate. The default interval is 5000 milliseconds.

  • If a child process is not trapping exits, it will shut down immediately upon receiving the first exit signal. If it is trapping exits, its terminate callback is invoked, and it must exit within the configured :shutdown time before the supervisor forcefully terminates it (see the sketch after this list).

  • When an Elixir application exits, the termination propagates down the supervision tree. Supervisors always trap exits, so upon receiving an exit signal they attempt to stop all of their children: each child is sent an exit signal individually and given a timeout period before the supervisor resorts to a brutal termination with :kill. The duration of this timeout is determined by the :shutdown option in the child specification.

  • Once all children are stopped, the supervisor itself stops as well, resulting in the orderly shutdown of the supervision tree.
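
To make the exit-trapping point concrete, here is a rough sketch of a hypothetical GracefulWorker (not part of this chapter's code) that opts into the graceful-shutdown window:

defmodule GracefulWorker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg)

  @impl true
  def init(arg) do
    # Trap exits so the supervisor's :shutdown signal invokes terminate/2
    # instead of killing the process outright
    Process.flag(:trap_exit, true)
    {:ok, arg}
  end

  @impl true
  def terminate(reason, _state) do
    # Clean-up work goes here; it must finish within the :shutdown timeout,
    # otherwise the supervisor falls back to Process.exit(pid, :kill)
    IO.puts("GracefulWorker terminating with reason: #{inspect(reason)}")
    :ok
  end
end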

These points highlight the startup and shutdown behavior of supervisors, the termination process for child processes, and the flow of exit signals within the supervision tree.
