
Building a distributed system

Distribution primitives

In Erlang, a distributed system consists of multiple nodes, running on the same or on different machines, that are connected to each other.

A cluster can be started by passing --sname:

iex --sname node1@localhost

This starts a BEAM node with node1 as its name. Internally, the name is represented as an atom.

This makes it easy to connect two nodes: start another node, then connect it to the first one.

Calling node/0 prints the current node’s name:

node()

Starting a second node and connecting it back to the first:

iex --sname node2@localhost
Node.connect(:node1@localhost)

Calling Node.list/0 lists all connected nodes.

And Node.list([:this, :visible]) lists all available nodes, including the current one.

:this means you want to include the current node, and :visible means you want all visible nodes (it’s possible to start a node as hidden).

It’s not uncommon for a node to stop working and disconnect. That’s why nodes periodically send ticks to each other to verify the connection is still healthy. A node is considered disconnected after four unanswered ticks, and there is no automatic attempt to reconnect after that.

Notifications can be received when this happens using Node.monitor/2, or :net_kernel.monitor_nodes/1 to monitor all connections and disconnections.
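
A minimal sketch, run from an IEx shell on node1, assuming node2@localhost from above is reachable:

# subscribe the calling process to node up/down notifications
:net_kernel.monitor_nodes(true)
Node.connect(:node2@localhost)
# flush/0 in IEx prints the received messages, e.g. {:nodeup, :node2@localhost}
# and, after a disconnect, {:nodedown, :node2@localhost}
flush()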

Knowing this, nodes can cooperate. One way is Node.spawn/2: it takes a node name and a lambda, then spawns a new process on that node and runs the lambda in it.

Node.spawn(
  :node2@localhost,
  fn -> IO.puts("Hello from #{node()}") end
)

Starting a process on another node works just like starting one on the same node, as the following shows:

caller = self()
Node.spawn(
  :node2@localhost,
  fn -> send(caller, {:response, 1+2}) end
)
flush()

This runs the lambda in a new process inside node2. Notice that the caller’s pid is still passed to the new process, and a message is received back from it, just as when done locally on one node.

Keep in mind that a message sent to another node is encoded with :erlang.term_to_binary/1 and decoded with :erlang.binary_to_term/1.
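
For illustration, the same round trip can be done manually:

# encode a term to the external term format, then decode it back
binary = :erlang.term_to_binary({:response, 1 + 2})
:erlang.binary_to_term(binary)
# => {:response, 3}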

It’s better to avoid spawning lambdas on other nodes, as this can cause issues, for example when the application has been updated on one server but not yet on the other.

It is therefore preferred to use Node.spawn/4, which takes an MFA (module, function, arguments).
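
For example, the earlier greeting expressed as an MFA call (a sketch reusing the node names from above):

# runs IO.puts("Hello via MFA") in a new process on node2
Node.spawn(:node2@localhost, IO, :puts, ["Hello via MFA"])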

Another thing to keep in mind in a multi-node setup: registering a name on a node only makes it available on that node. This means the same name can be registered on each of the other nodes.

Calling send(name, message), where name is a registered atom, only sends to a process on the local node.

Calling send({name, other_node_name}, message) sends to a process registered on a remote node.
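
A sketch with the two nodes from above (:mailbox is an arbitrary name chosen for illustration):

# on node1: register the shell process under a local name
Process.register(self(), :mailbox)

# on node2: address the process registered as :mailbox on node1
send({:mailbox, :node1@localhost}, :ping)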

Also worth mentioning are GenServer.abcast/3 and GenServer.multi_call/4 for sending a request to locally registered processes across a list of nodes.
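
A sketch, assuming a GenServer locally registered as :todo_server on each node (:todo_server, :refresh, and :get_state are hypothetical names):

# cast :refresh to the :todo_server registered on this node and all visible nodes
GenServer.abcast([node() | Node.list()], :todo_server, :refresh)

# multi_call sends a synchronous call instead and collects the replies
{replies, bad_nodes} =
  GenServer.multi_call([node() | Node.list()], :todo_server, :get_state)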

As discussed before, for communication to happen there needs to be a way to discover process pids. The previously discussed Registry works well locally, but it is not viable in a cluster.

For cluster usage there is the :global module, which lets us register names globally.

For example, node1 can register itself as responsible for bob’s todo list:

:global.register_name({:todo_list, "bob"}, self())

To find the process pid:

:global.whereis_name({:todo_list, "bob"})

Global registration requires some chatter between the connected nodes, while lookups are fast and cheap, since they are performed locally.

A pid has the form <X.Y.Z>. When a process is local, X is 0 (Z is non-zero only when there are many processes on the node). If X is non-zero, you can be sure it’s a remote process: X is an internal reference to the node the process runs on.

The node where a process is running can be determined with Kernel.node/1.
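
For example, a sketch reusing node2 from above:

# spawn a long-lived process remotely, then ask which node its pid lives on
pid = Node.spawn(:node2@localhost, fn -> Process.sleep(:infinity) end)
node(pid)
# => :node2@localhost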

Global registration can be used with GenServer as well:

GenServer.start_link(
  __MODULE__,
  arg,
  name: {:global, some_global_alias}
)
GenServer.call({:global, some_global_alias}, ...)

If a process fails, it is automatically unregistered on all other nodes.

Another concept is process groups, used to group multiple processes under the same global name. For this, the :pg (process groups) module is used:

:pg.start_link()

:pg needs to be initialized on all nodes:

Node.connect(:node1@localhost)
:pg.join({:todo_list, "bob"}, self())

This adds the process to the group. The same needs to be done for every other process, on any node, that should be part of the group.

After this, call :pg.get_members/1 to get all registered processes:

:pg.get_members({:todo_list, "bob"})

This can be useful for updating all instances of a process responsible for a certain task, for example with GenServer.multi_call/4.
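
Since GenServer.multi_call/4 addresses locally registered names rather than pids, a pid-based fan-out over the group can also be sketched with a plain loop (:refresh is a hypothetical request):

# broadcast an update to every current member of the group
for pid <- :pg.get_members({:todo_list, "bob"}) do
  GenServer.cast(pid, :refresh)
end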

Links and monitors work across remote nodes too. An exit signal or a :DOWN message is sent on:

  • crash of the linked or monitored process
  • crash of the BEAM instance it runs on
  • loss of the network connection

Node.spawn can be used to call functions on another node, but these are fire-and-forget calls.

:rpc.call/4 can be used to call an MFA and wait for the answer.
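
For example, a sketch reusing node2 from above:

# call abs(-5) on node2 and wait for the result
:rpc.call(:node2@localhost, Kernel, :abs, [-5])
# => 5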

When calling remote functions, it’s important to keep in mind the concept of locks.

Calling :global.set_lock({:some_resource, self()}) acquires a cluster-wide lock on that id and blocks it for other nodes until :global.del_lock({:some_resource, self()}) is called on the same node to release it. If another node asks for the lock, it waits until the lock is released. This looks like a synchronization call on a local node, and it works almost the same.

:global.trans/2 is a helper function that takes a lock id and a lambda; it acquires the lock, runs the lambda, and releases the lock when the lambda finishes.
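
A minimal sketch, reusing the :some_resource id from above:

# the lambda runs while the cluster-wide lock is held; the lock is
# released when the lambda returns (the result is the lambda's value)
:global.trans({:some_resource, self()}, fn ->
  IO.puts("in the critical section on #{node()}")
end)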

Locking is usually avoided, as it can block processes and lead to deadlocks, livelocks, or starvation. However, there are cases where it improves performance.

Building a fault-tolerant cluster

  • The basic registration facility is a local registration that allows you to use a simple atom as an alias to the single process on a node.
  • Registry extends this by letting you use rich aliases—any term can be used as an alias.
  • :global allows you to register a cluster-wide alias.
  • :pg is useful for registering multiple processes behind a cluster-wide alias (process group), which is usually suitable for distributed pub-sub scenarios.

To connect nodes on different servers, they need to agree on a cookie (a passphrase). Local nodes agree on it automatically, because the first instance generates a cookie and stores it on disk, and the others read it from there when starting. On different servers, however, the cookie needs to be set explicitly, either with Node.set_cookie/1 or by passing it when starting a node: iex --sname node1@localhost --cookie another_cookie. The current cookie can be obtained with Node.get_cookie/0. Cookies are represented as atoms, and they contribute to the security of the system.
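
For example:

# read the cookie of the current node (an atom)
Node.get_cookie()

# set the cookie every node in the cluster must agree on
Node.set_cookie(:another_cookie)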

Sometimes you want to start a node but keep it hidden from the rest of the cluster. This can be done by starting the node with --hidden.

Hidden nodes can be listed with Node.list([:hidden])

Calling Node.list([:connected]) will list all visible and hidden nodes

Calling Node.list([:visible]) will show only visible ones

It’s advised to operate on visible nodes; services like :global, :rpc, and :pg all ignore hidden nodes.

Firewall

When a node tries to connect to another node, it first contacts the Erlang Port Mapper Daemon (EPMD), which then returns the port of the target node. EPMD listens on port 4369, and this port must be accessible.

Nodes listen on a random port when started. This is not ideal, as it makes it impossible to set up firewall rules in advance.

Erlang provides settings, passed at startup, for configuring the range of ports nodes may listen on:

iex \
--erl '-kernel inet_dist_listen_min 10000' \
--erl '-kernel inet_dist_listen_max 10100' \
--sname node1@localhost

The min and max can be set to the same number, though that means only one node can be started on the machine or there will be a port clash.

It’s possible to inspect the names and ports of all nodes on a system using :net_adm.names/0.
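
:net_adm.names/0 returns what is registered with the local EPMD (the ports below are illustrative):

:net_adm.names()
# => {:ok, [{~c"node1", 10000}, {~c"node2", 10001}]}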

In summary, port 4369 and the ports set up for the nodes need to be accessible.

A BEAM instance can access the entire system if given the privileges, so it’s good practice to start BEAM instances with the lowest privileges possible and to avoid exposing them to the internet.