Powered by AppSignal & Oban Pro
Would you like to see your link here? Contact us

Regular Expressions

docs/elixir-exercism-04.livemd

Regular Expressions

Regular expressions (regex) are a powerful tool for working with strings in Elixir. Regular expressions in Elixir follow the PCRE specification (Perl Compatible Regular Expressions). String patterns representing the regular expression’s meaning are first compiled then used for matching all or part of a string.

In Elixir, the most common way to create regular expressions is using the ~r sigil. Sigils provide syntactic sugar shortcuts for common tasks in Elixir. In this case, ~r is a shortcut for using Regex.compile!/2.

Regex.compile!("test") == ~r/test/
# => true

The =~/2 operator is useful to perform a regex match on a string to return a boolean result.

"this is a test" =~ ~r/test/
# => true

Regex syntax review

  • Some characters in a regular expression pattern have special meaning, to use the character plainly it must be escaped with \, e.g. ~r/\?/.
  • Character classes (e.g. \d, \w) allow patterns to match a range of characters
  • Alternations (|) allow patterns to match one pattern or another
  • Quantifiers ({N, M}, *, ?) allow patterns to match a specified number of repeating patterns
  • Groups (()) allow parts of patterns to function as a unit

Captures

Regular expressions are also useful for extracting a portion of a string. This is called capturing. To capture a part of a string, create a group (()) for the part that you want to capture and use Regex.run.

Regex.run(~r/Weight: (\d*)g/, "Weight: 150g")
# => ["Weight: 150g", "150"]

Captures are numbered (starting at 1) and can also be used in the result when replacing parts of a string with a regular expression:

Regex.replace(~r/Weight: (\d*)g/, "Weight: 150g", "Gewicht: \\1g")
# => "Gewicht: 150g"

Captures can also be named by appending ? after the opening parenthesis. Use Regex.named_captures/3 to get a map with named captures.

Regex.named_captures(~r/Weight: (?\d*)g/, "Weight: 150g")
# => %{"weight" => "150"}

Modifiers

The behavior of a regular expression can be modified by appending special flags at the end of the regular expression, e.g. ~r/test/i.

  • caseless i - case insensitive
    "A" =~ ~r/a/
    # => false
    "A" =~ ~r/a/i
    # => true
  • unicode u - enables Unicode specific patterns like \p and causes character classes like \w etc. to also match on Unicode
    "ö" =~ ~r/^\w$/
    # => false
    "ö" =~ ~r/^\w$/u
    # => true
  • And more: dotall, multiline, extended, firstline, ungreedy

Dynamically building regular expressions

Because the ~r sigil is a shortcut for "pattern" |> Regex.escape() |> Regex.compile!(), you may also use string interpolation to dynamically build a regular expression pattern:

anchor = "$"
regex = ~r/end of the line#{anchor}/
"end of the line?" =~ regex
# => false
"end of the line" =~ regex
# => true

Regular expressions vs the String module

Although regular expressions are powerful, it is not always wise to them:

  • They must be compiled before use, this takes computation time and memory.
  • They may be slower than using plain string functions.

As a rule of thumb, it is better to use the functions from the String module whenever possible.

# Don't use regular expressions to check a suffix:
if "YELLING!" =~ ~r/!$/, do: "Whoa, chill out!"

# Use a string function:
if String.ends_with?("YELLING!", "!"), do: "Whoa, chill out!"

Dates and Time

Elixir’s standard library offers 4 different modules for working with dates and time, each with its own struct.

  • The Date module. A Date struct can be created with the ~D sigil.

      ~D[2021-01-01]
  • The Time module. A Time struct can be created with the ~T sigil.

      ~T[12:00:00]
  • The NaiveDateTime module for datetimes without a timezone. A NaiveDateTime struct can be created with the ~N sigil.

      ~N[2021-01-01 12:00:00]
  • The DateTime module for datetimes with a timezone. Using this module for timezones other than UTC requires an external dependency, a timezone database. A DateTime struct can be represented with the ~U sigil, but should be created using DateTime functions instead.

      DateTime.new!(~D[2021-01-01], ~T[12:00:00], "Etc/UTC")
      # => ~U[2021-01-01 12:00:00Z]

Comparisons

To compare dates or times to one another, look for a compare or diff function in the corresponding module. Comparison operators such as ==, >, and < seem to work, but they don’t do a correct semantic comparison for those structs.

Date.compare(~D[2020-11-30], ~D[2020-12-01])
# => :lt

Time.diff(~T[13:45:00], ~T[13:46:30])
# => -90

Shifting

Dates, time, and datetimes can be shifted forwards and backwards in time using the add/2 function from the corresponding module.

# add 4 days
Date.add(~D[2021-01-01], 4)
# => ~D[2021-01-05]

# subtract 1 second
Time.add(~T[12:00:00], -1)
# => ~T[11:59:59.000000]

# add 4 days and 30 seconds
NaiveDateTime.add(~N[2021-01-01 12:00:00], 4 * 24 * 60 * 60 + 30)
# => ~N[2021-01-05 12:00:30]

Conversions

A NaiveDateTime struct can be deconstructed into a Date struct and a Time struct using NaiveDateTime.to_date/1 and NaiveDateTime.to_time/1. The opposite can be achieved with NaiveDateTime.new!/2.

NaiveDateTime.to_date(~N[2021-01-01 12:00:00])
# => ~D[2021-01-01]

NaiveDateTime.to_time(~N[2021-01-01 12:00:00])
# => ~T[12:00:00]

NaiveDateTime.new!(~D[2021-01-01], ~T[12:00:00])
# => ~N[2021-01-01 12:00:00]

Access Behaviour

Elixir uses Behaviours to provide common generic interfaces while facilitating specific implementations for each module which implements the behaviour. One of those behaviours is the Access Behaviour.

The Access Behaviour provides a common interface for retrieving key-based data from a data structure. It is implemented for maps and keyword lists.

The Access module defines the callbacks required for the interface. The Map and Keyword modules then implement the required callbacks fetch/2, get_and_update/3, and pop/2

To use the behaviour, you may follow a bound variable with square brackets and then use the key to retrieve the value associated with that key. Maps support atom and string keys, while keyword lists only atom keys.

# Atom as key
my_map = %{key: "my value"}
my_map[:key]
# => "my value"

# String as key
another_map = %{"key" => "my value"}
another_map["key"]
# => "my value"

If the key does not exist in the data structure, then nil is returned. This can be a source of unintended behavior, because it does not raise an error.

my_map = %{key: "my value"}
my_map[:wrong_key][:other_wrong_key]
# => nil

Note that nil itself implements the Access Behaviour and always returns nil for any key.

nil[:key]
# => nil

Structs do not implement the Access behaviour.

Access shortcuts

Enum

Enum is a very useful module that provides a set of algorithms for working with enumerables. It offers:

And much more! Refer to the Enum module documentation for a full list.

Enumerable

In general, an enumerable is any data that can be iterated over, a collection. In Elixir, an enumerable is any data type that implements the Enumerable protocol. Those are:

Don’t worry if you don’t know them all yet.

Anyone can implement the Enumerable protocol for their own custom data structure.

Reduce

Enum.reduce/2 allows you to reduce the whole enumerable to a single value. To achieve this, a special variable called the accumulator is used. The accumulator carries the intermediate state of the reduction between iterations. This makes it one of the most powerful functions for enumerables. Many other specialized functions could be replaced by the more general reduce. For example…

Finding the maximum value:

Enum.max([4, 20, 31, 9, 2])
# => 31

Enum.reduce([4, 20, 31, 9, 2], nil, fn x, acc ->
  cond do
    acc == nil -> x
    x > acc -> x
    x <= acc -> acc
  end
end)

# => 31

And even mapping (but it requires reversing the result afterwards):

Enum.map([1, 2, 3, 4, 5], fn x -> x + 10 end)
# => [11, 12, 13, 14, 15]

Enum.reduce([1, 2, 3, 4, 5], [], fn x, acc -> [x + 10 | acc] end)
# => [15, 14, 13, 12, 11]

Mapping maps

  • With Map.new/2:

    %{a: 1, b: 2}
    |> Map.new(fn {key, value} -> {key, value * 10} end)
  • With Enum.into/3:

    %{a: 1, b: 2}
    |> Enum.into(%{}, fn {key, value} -> {key, value * 10} end)
  • With Enum.map/2:

    %{a: 1, b: 2}
    |> Enum.map(fn {key, value} -> {key, value * 10} end)
    |> Enum.into(%{})

File

Functions for working with files are provided by the File module.

To read a whole file, use File.read/1. To write to a file, use File.write/2.

The module also provides file functions for copying, removing, renaming etc. Their names are similar to their UNIX equivalents, for example:

All the mentioned functions from the File module also have a ! variant that raises an error instead of returning an error tuple (e.g. File.read!/1). Use that variant if you don’t intend to handle errors such as missing files or lack of permissions.

File.read("does_not_exist")
# => {:error, :enoent}

File.read!("does_not_exist")
# => ** (File.Error) could not read file "does_not_exist": no such file or directory
#        (elixir 1.10.4) lib/file.ex:353: File.read!/1

Files and processes

Every time a file is written to with File.write/2, a file descriptor is opened and a new Elixir process is spawned. For this reason, writing to a file in a loop using File.write/2 should be avoided.

Instead, a file can be opened using File.open/2. The second argument to File.open/2 is a list of modes, which allows you to specify if you want to open the file for reading or for writing.

Commonly-used modes are:

  • :read - open for reading, file must exist
  • :write - open for writing, file will be created if doesn’t exist, existing content will be overwritten
  • :append - open for writing, file will be created if doesn’t exist, existing content will be preserved

For a comprehensive list of all modes, see the documentation of File.open/2.

File.open/2 returns a PID of a process that handles the file. To read and write to the file, use functions from the IO module and pass this PID as the IO device.

When you’re finished working with the file, close it with File.close/1.

file = File.open!("README.txt", [:write])
# => #PID<0.157.0>

IO.puts(file, "# README")
# => :ok

File.close(file)
# => :ok

Streaming files

Reading a file with File.read/1 is going to load the whole file into memory all at once. This might be a problem when working with really big files. To handle them efficiently, you might use File.open/2 and IO.read/2 to read the file line by line, or you can stream the file with File.stream/3. The stream implements both the Enumerable and Collectable protocols, which means it can be used both for reading and writing.

File.stream!("file.txt")
|> Stream.map(&amp;(&amp;1 <> "!"))
|> Stream.into(File.stream!("new_file.txt"))
|> Stream.run()

Paths

All the functions working on files require a file path. File paths can be absolute or relative to the current directory. For manipulating paths, use functions from the Path module. For cross-platform compatibility, use Path.join/1 to create paths. It will choose the platform-appropriate separator.

Path.expand(Path.join(["~", "documents", "important.txt"]))
"/home/user/documents/important.txt"

Tail Recursion

La partie sur la récursion d’Exercism n’est pas des plus lumineuses. Je vais l’aborder différemment.

Chaque appel de fonction consomme de la mémoire (variables locales, paramètres, etc ..) dans un espace mémoire réservé et de taille fixe appelé la pile (stack). Dans ces conditions, une fonction récursive qui traite un volume de données important peut rapidement consommer toute la mémoire disponible de la pile et planter le programme avec la fameuse erreur “Stack Overflow”.

Pour contourner ce problème, certains langages comme Elixir (et la plupart des langages fonctionnels) supportent un mécanisme appelé récursion terminale (Tail Call Recursion).
Ce mécanisme décrit une situation où l’appel récursif est la dernière opération effectuée par la fonction avant de retourner un résultat. Dans ce cadre, en schématisant, la récursion est remplacée par une boucle, évitant ainsi de consommer de la mémoire sur la pile.

La fonction suivante permet de compter les éléments d’une liste sans utiliser de Tail Call Recursion. La dernière opération est constitué par une expression 1 + count(tail) et pas uniquement par un appel récursif.

defmodule Recursion do
  def count([]), do: 0
  def count([_head | tail]), do: 1 + count(tail)
end

Recursion.count([1, 2, 3])

Pour obtenir une fonction avec récursion terminale, nous devons utiliser un accumulateur. Pour éviter d’exposer l’utilisateur de la fonction à cette complexité supplémentaire, on utilisera généralement une fonction publique sans accumulateur. Cette dernière appelle ensuite une fonction privée qui met en oeuvre la récursion terminale à l’aide d’un accumulateur.
La même fonction que précédemment utilisant cette approche devrait clarifier la démarche:

defmodule TailCallRecursion do
  # fonction publique sans accumulateur
  def count(list), do: do_count(list, 0)

  # fonctions privées mettant en oeuvre le mécanisme de récursion terminale
  defp do_count([], acc), do: acc
  defp do_count([_head | tail], acc), do: do_count(tail, acc + 1)
end

TailCallRecursion.count([1, 2, 3])
defmodule Fact do
  # Non tail recursive version
  def non_tail(0), do: 1

  def non_tail(n) when n > 0 do
    n * non_tail(n - 1)
  end

  # Tail recursive version
  def tail_rec(n), do: do_tail_rec(n, 1)
  defp do_tail_rec(0, acc), do: acc
  defp do_tail_rec(curr, acc), do: do_tail_rec(curr - 1, acc * curr)

  def run_benchmark do
    Benchee.run(%{
      "non_tail_recursive" => fn -> Fact.non_tail(200) end,
      "tail_recursive" => fn -> Fact.tail_rec(200) end
    })
  end
end

Fact.run_benchmark()
# Define the Fibonacci module with all three implementations
defmodule Fibonacci do
  # Non-tail-recursive version
  def non_tail_recursive(0), do: 0
  def non_tail_recursive(1), do: 1

  def non_tail_recursive(n) when n > 1 do
    non_tail_recursive(n - 1) + non_tail_recursive(n - 2)
  end

  # Tail-recursive version
  def tail_recursive(n), do: tail_recursive(n, 0, 1)

  defp tail_recursive(0, a, _), do: a

  defp tail_recursive(n, a, b) when n > 0 do
    tail_recursive(n - 1, b, a + b)
  end

  # Memoized Fibonacci version
  def memoized_fib(n), do: memoized_fib(n, %{0 => 0, 1 => 1})

  defp memoized_fib(0, memo), do: {0, memo}
  defp memoized_fib(1, memo), do: {1, memo}

  defp memoized_fib(n, memo) do
    case Map.fetch(memo, n) do
      {:ok, value} ->
        {value, memo}

      :error ->
        {n_minus_1, memo} = memoized_fib(n - 1, memo)
        {n_minus_2, memo} = memoized_fib(n - 2, memo)
        value = n_minus_1 + n_minus_2
        {value, Map.put(memo, n, value)}
    end
  end

  def run_benchmark do
    Benchee.run(%{
      "fb_non_tail_recursive" => fn -> Fibonacci.non_tail_recursive(30) end,
      "fb_tail_recursive" => fn -> Fibonacci.tail_recursive(30) end,
      "memoized_fib" => fn -> Fibonacci.memoized_fib(30) end
    })
  end
end

Fibonacci.run_benchmark()

Structs

Structs are special maps with a defined set of keys.

  • Structs provide compile-time checks and default values.
  • A struct is named after the module it is defined in.
  • To define a struct use the defstruct construct.
    • The construct usually immediately follows after the module definition.
  • defstruct accepts either a list of atoms (for nil default values) or keyword lists (for specified default values).
    • The fields without defaults must precede the fields with default values.
defmodule Plane do
  defstruct [:engine, wings: 2]
end

plane = %Plane{}
# => %Plane{engine: nil, wings: 2}

Accessing fields and updating

  • Most functions that work with maps will also work with structs.

  • It is recommended to use the static access operator . to access struct fields.

  • Get/fetch field values:

    plane = %Plane{}
    plane.engine
    # => nil
    Map.fetch(plane, :wings)
    # => {:ok, 2}
  • Update field values

    plane = %Plane{}
    %{plane | wings: 4}
    # => %Plane{engine: nil, wings: 4}

Enforcing field value initialization

  • The @enforce_keys module attribute creates a run-time check that specified fields must be initialized to a non-nil value when the struct is created.
  • @enforce_keys is followed by a list of the field keys (which are atoms).
  • If an enforced key is not initialized, an error is raised.
defmodule User do
  @enforce_keys [:username]
  defstruct [:username]
end

%User{}
# => (ArgumentError) the following keys must also be given when building struct User: [:username]

List Comprehensions

Comprehension provide a facility for transforming Enumerables easily and declaratively. They are syntactic sugar for iterating through enumerables in Elixir.

for s <- ["a", "b", "hello", "c"], # 1. generator
  String.length(s) == 1,           # 2. filter
  into: "",                        # 3. collectable
  do: String.upcase(s)

# => "ABC"

There are three parts to a comprehension:

  1. generators:
    • Values enumerated from structures that implement the Enumerable protocol.
    • Pattern matching expressions to destructure enumerated values.
  2. Filters: Boolean conditions, used to select which enumerated values to use when creating the new values.
  3. Collectables: A structure which implements the Collectable protocol, used to collect the new values.

There are single- and multi-line comprehensions. When more than one generator is used, a cartesian product of the values generated is enumerated. That means that each value generated by the first generator will be paired once with each value generated by the second generator.

for n <- [0, 1, 2, 3], do: n + 1
# => [1, 2, 3, 4]

for x <- [0, 1],
    y <- [0, 1] do
  {x, y}
end

# => [{0, 0}, {0, 1}, {1, 0}, {1, 1}]

The value in the do-block is inserted into the collectable for each value generated from the enumerable.

for _ <- [1, 2, 3], do: :a
# => [:a, :a, :a]

Pattern matching can occur in the comprehension, either on the left side of the <- or on their own line.

for {atom, str} <- [a: "string"], do: str
# => ["string"]

for pair <- [a: "string"],
    {atom, str} = pair do
  str
end

# => ["string"]