
Advanced Erlang Data Structures and Storage

Introduction

While Erlang, and by extension Elixir, are functional programming languages, they also offer a number of mutable data structures and storage mechanisms optimized for certain use cases. For example, if you have some data that is read frequently but rarely ever updated, then :persistent_term might be what you are looking for.

With that being said, let’s take a look at some of the mutable data structures and storage mechanisms that we have access to right out of the box with Erlang and OTP.

All the content presented in this Livebook comes from an upcoming publication that we are working on. If you find the content here useful, you should consider checking out our book as there is a ton more to learn!

:digraph and :digraph_utils Modules

In computer science, graphs are abstract data types used to represent data where the elements in a dataset have relationships with other elements in that dataset. Some common examples of real-world problems that can be described as graphs include:

  • Workflow engines
  • Social networks
  • Build systems
  • Routing and mapping
  • Recommendation engines

In each of these problems you have a collection of vertices (also called nodes) connected to one another via edges. For example, in routing and mapping applications, cities and towns would be your vertices, and the roads that connect them would be your edges. Once you construct your graph with all of your vertices and edges, you can traverse it and ascertain all kinds of information (like the shortest path from city A to city B in a mapping application).
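
For instance, :digraph (covered next) can answer the shortest-path question directly via :digraph.get_short_path/3. Here is a minimal sketch; the city names and road layout are made up purely for illustration:

# Build a tiny road network; the cities and roads are invented for this example
roads = :digraph.new()

for city <- [:athens, :patras, :larissa, :thessaloniki] do
  :digraph.add_vertex(roads, city)
end

:digraph.add_edge(roads, :athens, :patras)
:digraph.add_edge(roads, :athens, :larissa)
:digraph.add_edge(roads, :larissa, :thessaloniki)

# Shortest path (fewest edges) from one city to another
:digraph.get_short_path(roads, :athens, :thessaloniki)
|> IO.inspect(label: "Shortest path")

:digraph.delete(roads)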

In order to get comfortable with the digraph module and the accompanying utility module, let’s construct a directed graph that represents a series of steps that we need to perform when a new user signs up for our SaaS service. The workflow steps are connected in a directed graph like so:

graph TD;
  create_user(Create user in database)
  upload_image(Upload image to S3)
  bill_user(Bill credit card)
  welcome_email(Send welcome email)

  create_user --> upload_image
  create_user --> bill_user

  upload_image --> welcome_email
  bill_user --> welcome_email

In order to build this graph using :digraph, you can do the following:

# Create a new graph
my_workflow = :digraph.new([:acyclic])

# Dummy work function
do_work = fn step ->
  fn ->
    IO.puts("Running the following step: " <> step)

    # Simulate load
    Process.sleep(500)
  end
end

# Create the vertices
:digraph.add_vertex(my_workflow, :create_user, do_work.("Create user in database"))
:digraph.add_vertex(my_workflow, :upload_avatar, do_work.("Upload image to S3"))
:digraph.add_vertex(my_workflow, :charge_card, do_work.("Bill credit card"))
:digraph.add_vertex(my_workflow, :welcome_email, do_work.("Send welcome email"))

# Create the edges
:digraph.add_edge(my_workflow, :create_user, :upload_avatar)
:digraph.add_edge(my_workflow, :create_user, :charge_card)
:digraph.add_edge(my_workflow, :upload_avatar, :welcome_email)
:digraph.add_edge(my_workflow, :charge_card, :welcome_email)

Once the graph is constructed, you can perform a wide array of operations on it:

# Print info about graph
:digraph.info(my_workflow) |> IO.inspect(label: "Graph info")

# Which vertices are starting points into the graph
:digraph.source_vertices(my_workflow) |> IO.inspect(label: "Graph entrypoints")

# Which vertices are terminal points in the graph
:digraph.sink_vertices(my_workflow) |> IO.inspect(label: "Graph exit points")

# Are there any cycles in the graph?
:digraph_utils.is_acyclic(my_workflow) |> IO.inspect(label: "Is acyclical?")

# Traverse the graph in topological order (each vertex is visited only after
# all of the vertices it depends on have been visited) and run the anonymous
# function at each vertex
my_workflow
|> :digraph_utils.topsort()
|> Enum.each(fn vertex ->
  {_vertex, work_function} = :digraph.vertex(my_workflow, vertex)
  work_function.()
end)

:digraph.delete(my_workflow)

Before moving on to ETS, it is important to note a few things about the :digraph module:

  • Under the hood, Erlang uses ETS to store the graph data. That data will only be cleared if the process that created the graph terminates (as ETS tables are tied to processes), or if :digraph.delete/1 is called.
  • Since :digraph leverages ETS, it is in turn a mutable data structure. You’ll notice that in the previous code snippets the my_workflow variable is never updated, but rather passed to each :digraph and :digraph_utils function as a reference to the graph.
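
To make that second point concrete, here is a small sketch showing that the graph is mutated in place: a vertex added inside a function is visible to the caller without any rebinding.

graph = :digraph.new()

# The anonymous function mutates the graph through its reference
add_node = fn g -> :digraph.add_vertex(g, :added_inside_a_function) end
add_node.(graph)

# The new vertex is visible even though `graph` was never rebound
:digraph.vertices(graph) |> IO.inspect(label: "Vertices after mutation")

:digraph.delete(graph)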

:ets Module

Erlang Term Storage, or ETS for short, is a highly performant key-value store that can be used for storing large amounts of data in memory. A big part of what makes it so fast is that it is a mutable key-value store. Maps, as implemented in Erlang, are not flat key-value data structures: under the hood, larger maps are represented as trees in order to support immutability.

Let’s start by experimenting with ETS by creating an ETS table that does not support duplicate keys and running some queries against the table:

# Create a new ETS table
unique_ets_table = :ets.new(:my_table, [:set])

# Add data to the ETS table with the format `{ID, USER_DATA_MAP}`
user_1 = {1, %{first_name: "Alex", last_name: "Koutmos", favorite_lang: :elixir}}
user_2 = {2, %{first_name: "Hugo", last_name: "Barauna", favorite_lang: :elixir}}
user_3 = {3, %{first_name: "Joe", last_name: "Smith", favorite_lang: :go}}

:ets.insert(unique_ets_table, user_1)
:ets.insert(unique_ets_table, user_2)
:ets.insert(unique_ets_table, user_3)

# Some simple ETS query functions
:ets.lookup(unique_ets_table, 2) |> IO.inspect(label: "Get by ID")
:ets.member(unique_ets_table, 100) |> IO.inspect(label: "Is member?")

# Perform a MatchSpec query against the ETS table to retrieve only
# the entries that match the MatchSpec
unique_ets_table
|> :ets.select([
  {
    {:"$1", %{first_name: :"$2", last_name: :"$3", favorite_lang: :elixir}},
    [],
    [%{id: :"$1", first_name: :"$2", last_name: :"$3"}]
  }
])
|> IO.inspect(label: "Only users with Elixir as their favorite language")

# Count the results from the MatchSpec select
unique_ets_table
|> :ets.select_count([
  {
    {:"$1", %{favorite_lang: :"$2"}},
    [{:==, :"$2", :elixir}],
    [true]
  }
])
|> IO.inspect(label: "Number of users with Elixir as their favorite language")

# Delete the ETS table
:ets.delete(unique_ets_table)

There is a bit to unpack here, so let’s take it step by step. Firstly, entries in an ETS table have the format {KEY, VALUE}, and as you can see from the :ets.lookup/2 and :ets.member/2 calls, you can fetch values by their KEY. The more complicated parts are the calls to :ets.select/2 and :ets.select_count/2, so let’s explain those in more detail.

The second argument to both of the select functions is called a match specification (check out the Erlang docs for more information). A match specification describes which entries in the table should match and what shape the returned results should take. Let’s dive deeper into the match specification passed to :ets.select/2, this time shaping the output as tuples instead of maps, in order to better understand what is going on:

unique_ets_table
|> :ets.select([
  {
    # The first element in the match specification tuple defines the match
    # for the data in the ETS table. If you remember from the insert statements,
    # the first element in the tuple was the ID and the second element was the
    # data associated with that ID. The match below mirrors that structure,
    # and uses :"$N" variables to denote parts of the ETS entry that are pattern
    # matched without any filtering (and that can be referenced in later
    # portions of the match specification). We are able to filter for users who have
    # Elixir as their favorite language because it is hard-coded in the
    # match pattern.
    {:"$1", %{first_name: :"$2", last_name: :"$3", favorite_lang: :elixir}},
    # The second argument is used to define any guards that should be used during the
    # match specification. For example, if the ID (or the matched :"$1" variable in this
    # case) needs to be less than 10, you would have an entry like so in the guard list:
    # `{:"<", :"$1", 10}`. Take a look at the `:ets.select_count/2` call in the previous
    # example, as it makes use of a guard to check that the favorite programming
    # language is Elixir.
    [],
    # The third argument defines the output from the select call. In this case since
    # we matched on the id ($1), first name ($2), and last name ($3), our output
    # looks like so:
    # [{1, "Alex", "Koutmos"}, {2, "Hugo", "Barauna"}]
    [{{:"$1", :"$2", :"$3"}}]
  }
])
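
If a full match specification is more power than you need, :ets.match_object/2 accepts a simpler pattern (using :_ as a wildcard) and returns every object that matches it. Below is a small self-contained sketch with its own throwaway table, since the table from above may already have been deleted at this point:

langs_table = :ets.new(:langs_example, [:set])

:ets.insert(langs_table, {1, :elixir})
:ets.insert(langs_table, {2, :elixir})
:ets.insert(langs_table, {3, :go})

# Return every {key, value} object whose value is :elixir
langs_table
|> :ets.match_object({:_, :elixir})
|> IO.inspect(label: "Entries whose value is :elixir")

:ets.delete(langs_table)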

Now that we have an understanding of how to use ETS as a key-value store with unique keys, let’s take a look at using it as a key-value store where multiple keys can exist in the same table:

# Create a new table
user_actions_table = :ets.new(:my_table, [:bag])

# User 1 actions
:ets.insert(
  user_actions_table,
  {"user-1-id", %{action: "create_account", timestamp: NaiveDateTime.utc_now()}}
)

:ets.insert(
  user_actions_table,
  {"user-1-id", %{action: "update_account", timestamp: NaiveDateTime.utc_now()}}
)

:ets.insert(
  user_actions_table,
  {"user-1-id", %{action: "update_account", timestamp: NaiveDateTime.utc_now()}}
)

:ets.insert(
  user_actions_table,
  {"user-1-id", %{action: "log_out", timestamp: NaiveDateTime.utc_now()}}
)

# User 2 actions
:ets.insert(
  user_actions_table,
  {"user-2-id", %{action: "log_out", timestamp: NaiveDateTime.utc_now()}}
)

# Get all the entries for user 1
user_actions_table
|> :ets.lookup("user-1-id")
|> IO.inspect(label: "User 1 actions")

:ets.delete(user_actions_table)

As this previous example shows, by creating the ETS table with the :bag type we are able to store multiple values under the same key, in this case a list of actions per user. When we call :ets.lookup/2, we retrieve all of the entries in the table for that particular user.
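
One detail worth calling out: a :bag table silently drops an object that is identical to one already stored under the same key (our repeated update_account inserts survived only because their timestamps differed). If you need to keep exact duplicates, the :duplicate_bag table type does so, as this small sketch shows:

bag = :ets.new(:bag_example, [:bag])
dup_bag = :ets.new(:dup_bag_example, [:duplicate_bag])

for table <- [bag, dup_bag] do
  # Insert the exact same object twice under the same key
  :ets.insert(table, {"user-1-id", %{action: "log_out"}})
  :ets.insert(table, {"user-1-id", %{action: "log_out"}})
end

:ets.lookup(bag, "user-1-id") |> IO.inspect(label: ":bag entries")
:ets.lookup(dup_bag, "user-1-id") |> IO.inspect(label: ":duplicate_bag entries")

:ets.delete(bag)
:ets.delete(dup_bag)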

:atomics and :counters Modules

Similarly to the :digraph module, the :atomics and :counters modules are mutable. In fact, they are not so much data structures as purpose-built data storage mechanisms, intended for use cases that need to be highly optimized. With that disclaimer out of the way, let’s see what these modules do!

The :atomics and :counters modules are purpose-built, hardware-accelerated modules for adding to and subtracting from arrays of integers in the most performant way possible. As the name implies, the :atomics module performs these operations atomically, so there will be no data inconsistencies even when a large number of changes happen concurrently. The :counters module, on the other hand, lets you choose (via its :write_concurrency option) to trade read consistency for additional write performance.

Let’s play around with :atomics module first and see how it works in a high throughput environment:

existing_user_index = 1
new_user_index = 2

# Create a new atomics array with unsigned values.
# Index 1 will count existing user interactions
# while index 2 in the atomics array will count
# new user interactions in this example
my_atomic = :atomics.new(2, signed: false)

1..100_000
|> Task.async_stream(
  fn _ ->
    user_type = Enum.random([:existing_user, :new_user])

    case user_type do
      :existing_user -> :atomics.add(my_atomic, existing_user_index, 1)
      :new_user -> :atomics.add(my_atomic, new_user_index, 1)
    end
  end,
  max_concurrency: 500
)
|> Stream.run()

num_existing_users = :atomics.get(my_atomic, existing_user_index)
num_new_users = :atomics.get(my_atomic, new_user_index)

IO.inspect(num_existing_users, label: "Number of existing user requests")
IO.inspect(num_new_users, label: "Number of new user requests")
IO.inspect(num_existing_users + num_new_users, label: "Total number of requests")

In the above example, you created a new :atomics array of size two. The first position signifies the counts for existing user requests and the second position signifies the counts for new user requests. By leveraging Task.async_stream/2, you spawn a maximum of 500 processes in order to simulate concurrent load on the atomic counter. As you can see, in the end, the total count is equal to our expected number of requests.
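
The :atomics module also offers a handful of operations beyond add/3 and get/2. For example, exchange/3 swaps in a new value and returns the old one, while compare_exchange/4 only performs the swap if the current value matches what you expect. Here is a small sketch:

flags = :atomics.new(1, signed: false)

# Set a new value and get the previous one back
:atomics.exchange(flags, 1, 42) |> IO.inspect(label: "Previous value")

# Only swap if the current value is the expected one; returns :ok on success
# or the actual current value on failure
:atomics.compare_exchange(flags, 1, 42, 100) |> IO.inspect(label: "Swap when expecting 42")
:atomics.compare_exchange(flags, 1, 42, 200) |> IO.inspect(label: "Swap when expecting 42 again")

:atomics.get(flags, 1) |> IO.inspect(label: "Final value")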

Let’s try the same thing, but with the :counters module:

existing_user_index = 1
new_user_index = 2

# Create a new counter with two slots.
# Index 1 will count existing user interactions
# while index 2 in the counter array will count
# new user interactions in this example
my_counter = :counters.new(2, [:write_concurrency])

1..100_000
|> Task.async_stream(
  fn _ ->
    user_type = Enum.random([:existing_user, :new_user])

    case user_type do
      :existing_user -> :counters.add(my_counter, existing_user_index, 1)
      :new_user -> :counters.add(my_counter, new_user_index, 1)
    end
  end,
  max_concurrency: 500
)
|> Stream.run()

num_existing_users = :counters.get(my_counter, existing_user_index)
num_new_users = :counters.get(my_counter, new_user_index)

IO.inspect(num_existing_users, label: "Number of existing user requests")
IO.inspect(num_new_users, label: "Number of new user requests")
IO.inspect(num_existing_users + num_new_users, label: "Total number of requests")

As you can see, we are able to yield the same results with the :counters module as we did with the :atomics module. While these modules are not something that you should reach for on a day-to-day basis (given they are not immutable data structures), it is good to know they exist and are there if you need the extra performance.
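
If you are curious about the trade-off mentioned earlier, a counter can also be created with the :atomics option (the default, fully consistent like :atomics) instead of :write_concurrency (faster concurrent writes at the cost of potentially stale reads), and :counters.info/1 reports the size and memory footprint of each. A quick sketch:

consistent_counter = :counters.new(2, [:atomics])
fast_counter = :counters.new(2, [:write_concurrency])

# Compare the size and memory usage of the two flavors
:counters.info(consistent_counter) |> IO.inspect(label: ":atomics-backed counter")
:counters.info(fast_counter) |> IO.inspect(label: ":write_concurrency counter")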

:persistent_term Module

Similarly to the :digraph, :atomics, and :counters modules, the :persistent_term module is another mutable data storage option, this one purpose-built for read performance. It is an excellent option when you have data that rarely changes but is read often. As a warning, if an entry in :persistent_term is updated or erased, a global garbage collection pass is run across all processes in order to clean up any references to the old term.

With that little warning out of the way, let’s see how you can use :persistent_term to store your data. You can think of :persistent_term as a globally namespaced key-value store. In other words, if you know the key, you can access the data regardless of what process you are calling from. That’s why it is good to namespace your keys like in the example below:

# :persistent_term is used by Erlang and Elixir for certain
# things as soon as the BEAM starts up
:persistent_term.get() |> IO.inspect(limit: 4)

# Put some data into :persistent_term
:persistent_term.put({:my_app, :my_key}, %{some: "Data", that: "Rarely", ever: "Changes"})

# Get some data out of :persistent_term
{:my_app, :my_key}
|> :persistent_term.get()
|> Map.fetch(:some)

As you can see, by using :persistent_term.get/0, you can get all of the data currently stored in :persistent_term, and by using :persistent_term.put/2 and :persistent_term.get/1 you can write and read from the global data store.
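
A couple of other functions worth knowing about: :persistent_term.info/0 reports how many terms are stored and how much memory they use, and :persistent_term.erase/1 removes an entry. Keep in mind that erasing (like updating) an entry triggers the global garbage collection described above, so use it sparingly:

# How many terms are stored and how much memory do they consume?
:persistent_term.info() |> IO.inspect(label: "Persistent term count and memory")

# Erasing an entry returns true if the key existed
:persistent_term.erase({:my_app, :my_key}) |> IO.inspect(label: "Entry erased?")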