Working with Elixir Strings, charlists and binaries

Introduction

This Livebook aims to clarify the relationship between the various types used in Elixir to represent sequences of characters, bytes and other types of numbers.

These types are Strings, charlists, binaries and bitstrings.

There are a few definitions that are useful when talking about strings:

A “code point” is an integer used by Unicode to identify a character. Code points range from 0 to 1,114,111 (0x10FFFF).

An “ASCII character” is a code point between 0 and 127.

A “byte” is the basic unit of computer memory management. It is a group of bits, usually 8. An 8-bit byte can hold numbers between 0 and 255.

“UTF-8” is an encoding that represents each Unicode code point as one to four bytes, which solves the problem of storing code points that do not fit in a single byte.

Bitstrings and Binaries

These types deal with data as it is stored in memory.

Bitstring

A bitstring is a contiguous series of bits in memory.

The bitstring literal form specifies values and number of bits:

<<6::4>>

There are many options for constructing bitstrings.

If you supply integers that contain more bits than you indicate, the integers are truncated to the requested number of bits:

<<15::3>>
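The truncation can be checked directly. This is a minimal sketch using only Kernel functions: 15 is 0b1111, and keeping the lowest three bits leaves 0b111, which is 7.

```elixir
# 15 is 0b1111; only the lowest 3 bits (0b111 = 7) survive.
truncated = <<15::3>>

IO.inspect(bit_size(truncated))     # 3
IO.inspect(truncated == <<7::3>>)   # true
```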

Inspecting Bitstrings

You can split bitstrings into bits with a “for comprehension”:

for <<(x::1 <- <<6::4>>)>>, do: x

Binary

A binary is a bitstring whose length (in bits) is divisible by 8.

So it is a series of bytes.

The default segment size in a bitstring literal is 8 bits:

<<0, 1, 2>>

If you supply values that require more than 8 bits, they are truncated to fit in a byte:

<<1 + 256>>

If you construct a bitstring with more than 8 bits, it occupies several bytes:

<<65535::32>>
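As a small sketch of what those sizes mean in practice, the cell below compares a 32-bit segment with its byte-by-byte form, and shows that a size which is not a multiple of 8 gives a bitstring that is not a binary. The variable names are only for illustration.

```elixir
# A 32-bit segment occupies four bytes; 65535 fills the lowest two.
four_bytes = <<65535::32>>
IO.inspect(byte_size(four_bytes))             # 4
IO.inspect(four_bytes == <<0, 0, 255, 255>>)  # true

# 12 bits is not a multiple of 8, so this is a bitstring but not a binary.
twelve_bits = <<65535::12>>
IO.inspect(bit_size(twelve_bits))             # 12
IO.inspect(is_binary(twelve_bits))            # false
```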

Strings

Strings are sequences of UTF-8 encoded characters.

They can be created in various ways:

["a", ~s[a], ~S[a]]

They can even be created as a binary:

<<97>>

Strings and binaries

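Definition: A String is a binary that contains valid UTF-8. Elixir displays a binary as a String when it contains nothing but printable characters, which is what String.printable?/1 checks.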

is_binary("abc")
String.printable?("abc")
is_binary(<<97, 0>>)
String.printable?(<<97, 0>>)

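The Erlang function :unicode.bin_is_7bit/1 performs a related but different check: it returns true when every byte is below 128, that is, when the binary is pure ASCII, whether or not it is printable.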
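Being printable is a stricter property than being valid UTF-8. As a sketch, String.valid?/1 (a function not otherwise used in this Livebook) separates the two cases:

```elixir
# <<97, 0>> is valid UTF-8 (both bytes are ASCII), but the NUL byte is not printable.
IO.inspect(String.valid?(<<97, 0>>))      # true
IO.inspect(String.printable?(<<97, 0>>))  # false

# The byte 255 never appears in UTF-8, so <<255>> is neither valid nor printable.
IO.inspect(String.valid?(<<255>>))        # false
IO.inspect(String.printable?(<<255>>))    # false
```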

Of course, as they are binaries, Strings are also bitstrings:

is_bitstring("a")

As Strings are binaries, we know that they are held as contiguous portions of memory.

They are not Lists:

is_list("a")

Representing Strings

inspect/1 presents printable binaries as Strings

IO.puts(inspect([<<97>>, "a"]))

So that is how they appear in Livebook:

[<<97>>, "a"]

charlists

charlists are lists of valid code points, i.e. integers.

There are various ways of creating charlists.

[~c"abc", ~c[abc], [97, 98, 99], [?a, ?b, ?c]]

charlists are Elixir Lists, not binaries. (Elixir Lists are linked lists.)

is_list(~c"a")
is_binary(~c"a")
is_bitstring(~c"a")
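Because a charlist is an ordinary List of integers, the usual List functions apply. A short sketch (the variable names are illustrative):

```elixir
chars = ~c"abc"

# The sigil is just another way of writing a list of integers.
IO.inspect(chars == [97, 98, 99])  # true

# List operations work as they do on any other list.
[first | rest] = chars
IO.inspect(first)                  # 97
IO.inspect(length(rest))           # 2
```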

Non-ASCII Strings

Strings with non-ASCII characters have more bytes than code points.

String.length("à")

byte_size/1 counts bytes, not characters:

byte_size("à")

You can see the bytes, not the String representation, by converting to a List with :binary.bin_to_list/1

:binary.bin_to_list("à")
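Strictly speaking, String.length/1 counts graphemes (what a reader perceives as one character), which can differ from both the number of code points and the number of bytes. A sketch using "e" followed by the combining acute accent U+0301:

```elixir
# "e" + U+0301 renders as a single accented character.
combined = "e\u0301"

IO.inspect(String.length(combined))              # 1 (graphemes)
IO.inspect(length(String.codepoints(combined)))  # 2 (code points)
IO.inspect(byte_size(combined))                  # 3 (bytes)
```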

Alternatively, you can use inspect/2

inspect("à", binaries: :as_binaries)

As Strings can be created as binaries, we can create Strings with non-ASCII characters by supplying the UTF-8 encoded bytes

<<195, 160>>
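Rather than working out the bytes by hand, the ::utf8 segment type lets the bitstring literal do the encoding from a code point. A minimal sketch:

```elixir
# The raw UTF-8 bytes and the ::utf8 segment give the same String.
IO.inspect(<<195, 160>> == "à")     # true
IO.inspect(<<224::utf8>> == "à")    # true

# Without ::utf8, 224 is stored as a single byte, which is not valid UTF-8.
IO.inspect(<<224>> == "à")          # false
IO.inspect(String.valid?(<<224>>))  # false
```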

Non-ASCII charlists

When using Livebook and IEx, results are shown as the output of inspect/1.

When a charlist contains only ASCII characters, the output is as expected

~c"abc"

But inspecting a charlist with non-ASCII characters results in a list of the code points:

~c"àç"
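The difference in output comes down to whether every element is a printable ASCII character. As a sketch, List.ascii_printable?/1 performs that kind of check; treating it as the exact rule inspect/1 applies by default is an assumption here.

```elixir
IO.inspect(List.ascii_printable?(~c"abc"))  # true
IO.inspect(List.ascii_printable?(~c"àç"))   # false

# Either way, the charlist is the same list of code points.
IO.inspect(~c"àç" == [224, 231])            # true
```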

Converting

From Strings to charlists

charlists are Lists of integers of any size.

So when we convert a non-ASCII String to a charlist, we get to see its underlying Unicode codepoints

to_charlist("à")

But ASCII Strings will just be printed as normal charlists.

To guarantee that you actually see the integer values, you need to convert the individual code points to Strings:

"abc" |> to_charlist() |> Enum.map(&"#{&1}") |> Enum.join(", ") |> then(&"[#{&1}]")
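A shorter route to the same output is the :charlists option of inspect/2, which can force the list form. A minimal sketch:

```elixir
# Force the integer representation even for printable ASCII charlists.
as_text = inspect(to_charlist("abc"), charlists: :as_lists)
IO.inspect(as_text)  # "[97, 98, 99]"

# IO.inspect/2 accepts the same option.
IO.inspect(~c"abc", charlists: :as_lists)  # prints [97, 98, 99]
```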

From charlists to Strings

[
  to_string(~c"àbc"),
  to_string([224, 98, 99]),
  Kernel.to_string([224, 98, 99]),
  List.to_string([224, 98, 99])
]
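The two conversions are inverses of each other for valid Strings. A sketch of the round trip, with illustrative variable names:

```elixir
original = "àbc"

# String -> charlist gives the code points.
chars = String.to_charlist(original)
IO.inspect(chars, charlists: :as_lists)        # [224, 98, 99]

# charlist -> String encodes them as UTF-8 again.
IO.inspect(List.to_string(chars) == original)  # true
```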

Interpolation and Concatenation

Interpolation in Strings

Use #{ .. } to interpolate in "" and ~s[] Strings.

answer = "42"
"The answer is #{answer}"
~s[The answer is #{answer}]

~S[] Strings do not interpolate.

~S[The answer is #{answer}]
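Interpolation converts the value with to_string/1, so integers and charlists can be interpolated as well as Strings. A small sketch (variable names are illustrative):

```elixir
number = 42
chars = ~c"àbc"

IO.inspect("The answer is #{number}")  # "The answer is 42"

# A charlist is converted to a UTF-8 String when interpolated.
IO.inspect("#{chars}" == "àbc")        # true

# ~S keeps the characters literally, like an escaped interpolation.
IO.inspect(~S[#{number}] == "\#{number}")  # true
```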

Concatenation with Strings

<> takes two binary arguments

"a" <> "b"
"abc" <> <<0>>

Concatenation with charlists

Use the List concatenation operator ++

~c"a" ++ ~c"b"
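The two operators do not mix, so when the operands have different representations, convert one side first. A sketch, which also shows that <> can be used in a pattern when its left side is a literal String:

```elixir
# Convert first, then concatenate in the representation you need.
as_string = "a" <> List.to_string(~c"b")
as_charlist = ~c"a" ++ String.to_charlist("b")
IO.inspect(as_string)                          # "ab"
IO.inspect(as_charlist, charlists: :as_lists)  # [97, 98]

# <> also works in a pattern when the left side is a literal String.
"Pre" <> rest = "Prefix"
IO.inspect(rest)                               # "fix"
```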

Binary Pattern Matching

By default, single bytes are matched:

<<first, second>> = "ab"
{first, second}

::binary matches any number of bytes:

<<head, tail::binary>> = "Prefix"
{head, tail}

So, ::binary without a size can only be used on the last field. This match does not compile:

<<head::binary, tail::binary>> = "Prefix"

::binary-size/1 can be used to match a specific number of bytes:

<<head::binary-size(3), tail::binary>> = "Prefix"
{head, tail}

Strings can be used in binary pattern matching:

<<"Pre", rest::binary>> = "Prefix"
rest

As can nested binaries

<<(<<80, 114, 101>>), rest::binary>> = "Prefix"
rest
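Matching single bytes splits multi-byte characters. The ::utf8 segment type matches a whole code point instead. A minimal sketch contrasting the two:

```elixir
# ::utf8 consumes as many bytes as the first code point needs.
<<codepoint::utf8, rest::binary>> = "àbc"
IO.inspect(codepoint)  # 224
IO.inspect(rest)       # "bc"

# A plain segment takes only the first byte of the two-byte "à".
<<byte, _::binary>> = "àbc"
IO.inspect(byte)       # 195
```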

How UTF-8 Works

https://en.wikipedia.org/wiki/UTF-8

defmodule Utf8 do
  @moduledoc """
  This module implements conversion between UTF-8 bytes and code points.
  """

  import Bitwise

  @doc """
  Encodes a list of code points as a list of UTF-8 bytes.

  For valid code points, this gives the same result as
  `codepoints |> List.to_string() |> :binary.bin_to_list()`.
  """
  def encode([]), do: []

  def encode(codepoints) do
    [codepoint | rest] = codepoints

    bytes =
      cond do
        codepoint < 0x80 ->
          # One byte: 0xxxxxxx
          [codepoint]

        codepoint < 0x0800 ->
          # Two bytes: 110xxxxx 10xxxxxx
          [
            0b11000000 ||| codepoint >>> 6,
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]

        codepoint < 0x010000 ->
          # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
          [
            0b11100000 ||| codepoint >>> 12,
            0b10000000 ||| (codepoint >>> 6 &&& 0b00111111),
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]

        true ->
          # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          [
            0b11110000 ||| codepoint >>> 18,
            0b10000000 ||| (codepoint >>> 12 &&& 0b00111111),
            0b10000000 ||| (codepoint >>> 6 &&& 0b00111111),
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]
      end

    bytes ++ encode(rest)
  end

  @doc """
  Decodes a list of UTF-8 bytes into a list of code points.

  For valid UTF-8, this gives the same result as
  `bytes |> :binary.list_to_bin() |> to_charlist()`.
  """
  def decode([]), do: []

  def decode(bytes) do
    [first | rest] = bytes

    # The checks run from the longest prefix to the shortest,
    # so each branch only sees lead bytes the earlier ones rejected.
    cond do
      (first &&& 0b11110000) == 0b11110000 ->
        # Four bytes: 3 payload bits in the lead byte, 6 in each continuation byte
        [second, third, fourth | rest] = rest

        codepoint =
          ((first &&& 0b00000111) <<< 18) +
            ((second &&& 0b00111111) <<< 12) +
            ((third &&& 0b00111111) <<< 6) +
            (fourth &&& 0b00111111)

        [codepoint] ++ decode(rest)

      (first &&& 0b11100000) == 0b11100000 ->
        # Three bytes: 4 payload bits in the lead byte
        [second, third | rest] = rest

        codepoint =
          ((first &&& 0b00001111) <<< 12) +
            ((second &&& 0b00111111) <<< 6) +
            (third &&& 0b00111111)

        [codepoint] ++ decode(rest)

      (first &&& 0b11000000) == 0b11000000 ->
        # Two bytes: 5 payload bits in the lead byte
        [second | rest] = rest

        codepoint =
          ((first &&& 0b00011111) <<< 6) +
            (second &&& 0b00111111)

        [codepoint] ++ decode(rest)

      true ->
        # One byte
        [first] ++ decode(rest)
    end
  end
end
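To make the bit layout concrete, here is the two-byte case worked through by hand for à (code point 224, binary 11100000). It is only a sketch of the arithmetic for one character, using the same Bitwise operators as the module:

```elixir
import Bitwise

# à is U+00E0, 224 decimal, 0b11100000
codepoint = 0xE0

# Two-byte layout: 110xxxxx 10xxxxxx
# The lead byte carries the top bits, the continuation byte the low 6 bits.
lead = 0b11000000 ||| (codepoint >>> 6)
continuation = 0b10000000 ||| (codepoint &&& 0b00111111)

IO.inspect({lead, continuation})           # {195, 160}
IO.inspect(<<lead, continuation>> == "à")  # true
```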

We will use the following Unicode code points:

  • A - U+0041, 65 decimal
  • à - U+00E0, 224 decimal
  • € - U+20AC, 8364 decimal
  • 🌍 - U+1F30D, 127757 decimal

encoded = Utf8.encode([0x41, 0xE0, 0x20AC, 0x1F30D])

We can check that by creating a binary with those values:

:binary.list_to_bin(encoded)

Decoding the bytes gives back a list of code points:

decoded = Utf8.decode(encoded)

These are our original integers, shown here in hexadecimal:

Enum.map(decoded, &Integer.to_string(&1, 16))
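The same code points can be cross-checked against Elixir's built-in UTF-8 support, without the Utf8 module. This sketch uses the ::utf8 segment type to encode each code point separately, so the byte count per character is visible:

```elixir
codepoints = [0x41, 0xE0, 0x20AC, 0x1F30D]

# Encode each code point on its own and list its bytes.
bytes_per_codepoint =
  for codepoint <- codepoints, do: :binary.bin_to_list(<<codepoint::utf8>>)

IO.inspect(bytes_per_codepoint, charlists: :as_lists)
# [[65], [195, 160], [226, 130, 172], [240, 159, 140, 141]]

# Round trip through a String and back to code points.
string = List.to_string(codepoints)
IO.inspect(String.to_charlist(string) == codepoints)  # true
```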