Working with Elixir Strings, charlists and binaries

string-charlist-binary.livemd

Joe Yates

@joeyates

til


Introduction

This Livebook aims to clarify the relationship between the various types used in Elixir to represent sequences of characters, bytes and other types of numbers.

These types are Strings, charlists, binaries and bitstrings.

There are a few definitions that are useful when talking about strings:

A “code point” is an integer used by Unicode to identify a certain character. Code points range from 0 to 1,114,111 (0x10FFFF).

An “ASCII character” is a code point between 0 and 127.

“Byte” is the basic unit of management of computer memory. It is a group of bits, usually 8 bits. An 8-bit byte can hold numbers between 0 and 255.

“UTF-8” is an encoding that was invented to solve the problem of using bytes to represent Unicode code points that do not fit in a single byte. It encodes each code point as one to four bytes, and ASCII characters take exactly one byte.

Bitstrings and Binaries

These types deal with data as it is stored in memory.

Bitstring

A bitstring is a contiguous series of bits in memory.

The bitstring literal form specifies values and number of bits:

<<6::4>>

There are many options for constructing bitstrings.

If you supply integers that contain more bits than you indicate, the integers are truncated to the requested number of bits:

<<15::3>>

Inspecting Bitstrings

You can split bitstrings into bits with a “for comprehension”:

for <<(x::1 <- <<6::4>>)>>, do: x
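As a quick check of the statements above, here is a minimal sketch using only Kernel guards and functions. The variable names `nibble`, `truncated` and `bits` are illustrative and not part of the original notebook.

```elixir
# A 4-bit bitstring: it is a bitstring, but not a binary,
# because 4 is not a multiple of 8.
nibble = <<6::4>>

IO.inspect(bit_size(nibble))      # 4
IO.inspect(is_bitstring(nibble))  # true
IO.inspect(is_binary(nibble))     # false

# 15 is 0b1111; keeping only the lowest 3 bits leaves 0b111, i.e. 7.
truncated = <<15::3>>
IO.inspect(truncated == <<7::3>>) # true

# The same comprehension as above, applied to a variable.
bits = for <<bit::1 <- nibble>>, do: bit
IO.inspect(bits)                  # [0, 1, 1, 0]
```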

Binary

A binary is a bitstring whose length (in bits) is divisible by 8.

So it is a series of bytes.

8 is the default size for the bitstring literal:

<<0, 1, 2>>

If you supply values that require more than 8 bits, they are truncated to fit in a byte:

<<1 + 256>>

If you construct a bitstring segment with more than 8 bits, the value occupies several bytes:

<<65535::32>>
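The two cases above can be verified with a short sketch. It relies on the default segment being an 8-bit unsigned integer and on multi-byte integers being stored big-endian by default. The variable names are illustrative.

```elixir
# 257 is 0b1_0000_0001; only the lowest 8 bits are kept.
one_byte = <<1 + 256>>
IO.inspect(one_byte == <<1>>)                # true

# A 32-bit segment occupies four bytes, most significant byte first.
four_bytes = <<65535::32>>
IO.inspect(byte_size(four_bytes))            # 4
IO.inspect(:binary.bin_to_list(four_bytes))  # [0, 0, 255, 255]
```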

Strings

Strings are sequences of UTF-8 encoded characters.

They can be created in various ways:

["a", ~s[a], ~S[a]]

Even as a binary

<<97>>

Strings and binaries

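Definition: A String is a binary that contains valid UTF-8. inspect/1 shows a binary in double quotes only when it contains nothing but printable characters, which is what String.printable?/1 checks.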

is_binary("abc")

String.printable?("abc")

is_binary(<<97, 0>>)

String.printable?(<<97, 0>>)

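The Erlang function :unicode.bin_is_7bit/1 makes a related but different check: it returns true when every byte is below 128 (pure ASCII), whether or not the characters are printable.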
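To see how these checks relate, the following sketch runs `String.valid?/1`, `String.printable?/1` and `:unicode.bin_is_7bit/1` over a few sample binaries. The sample list and the `report` variable are assumptions chosen for illustration.

```elixir
# "\u00E0" is "à"; <<255>> is not valid UTF-8 on its own.
samples = ["abc", <<97, 0>>, "\u00E0", <<255>>]

report =
  for bin <- samples do
    # {valid UTF-8?, printable?, pure 7-bit ASCII?}
    {String.valid?(bin), String.printable?(bin), :unicode.bin_is_7bit(bin)}
  end

IO.inspect(report)
# [{true, true, true}, {true, false, true}, {true, true, false}, {false, false, false}]
```

Note that `<<97, 0>>` is valid UTF-8 and pure ASCII, yet not printable, which is why it is displayed in binary form.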

Of course, as they are binaries, Strings are also bitstrings

is_bitstring("a")

As Strings are binaries, we know that they are held as contiguous portions of memory.

They are not Lists

is_list("a")

Representing Strings

inspect/1 presents printable binaries as Strings

IO.puts(inspect([<<97>>, "a"]))

So, that’s how they appear in Livebooks

[<<97>>, "a"]

charlists

charlists are lists of valid code points, i.e. integers.

There are various ways of creating charlists.

[~c"abc", ~c[abc], [97, 98, 99], [?a, ?b, ?c]]

charlists are Elixir Lists, not binaries. (Elixir Lists are linked lists.)

is_list(~c"a")

is_binary(~c"a")

is_bitstring(~c"a")

Non-ASCII Strings

Strings with non-ASCII characters have more bytes than code points.

String.length("à")

byte_size/1 counts bytes, not characters:

byte_size("à")
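Strictly speaking there are three different counts: graphemes (what `String.length/1` returns), code points and bytes. The sketch below compares them for two ways of writing "à". The `count` helper and the decomposed example are assumptions added for illustration.

```elixir
precomposed = "\u00E0"   # "à" as the single code point U+00E0
decomposed = "a\u0300"   # "a" followed by U+0300 COMBINING GRAVE ACCENT

count = fn s ->
  {String.length(s), length(String.codepoints(s)), byte_size(s)}
end

# {graphemes, code points, bytes}
IO.inspect(count.(precomposed))  # {1, 1, 2}
IO.inspect(count.(decomposed))   # {1, 2, 3}
```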

You can see the bytes, not the String representation, by converting to a List with :binary.bin_to_list/1

:binary.bin_to_list("à")

Alternatively, you can use inspect/2

inspect("à", binaries: :as_binaries)

As Strings can be created as binaries, we can create Strings with non-ASCII characters by supplying the UTF-8 encoded bytes

<<195, 160>>

Non-ASCII charlists

In Livebook and IEx, results are shown as the output of inspect/1.

When a charlist contains only ASCII characters, the output is as expected

~c"abc"

But inspecting a charlist with non-ASCII characters results in a list of the code points

~c"àç"
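The difference is only in how the list is displayed; the data is a list of integers either way. As a small sketch, the `:charlists` option of `inspect/2` can force one form or the other:

```elixir
# The sigil and the explicit list are the same value.
IO.inspect(~c"abc" == [97, 98, 99])                     # true

# Force the integer view, even for ASCII-only charlists.
as_lists = inspect(~c"abc", charlists: :as_lists)
IO.puts(as_lists)                                       # [97, 98, 99]

# Without the option, a charlist with non-ASCII code points is shown as integers.
IO.puts(inspect([224, 231]))                            # [224, 231]
```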

Converting

From Strings to charlists

charlists are Lists of integers of any size.

So when we convert a non-ASCII String to a charlist, we get to see its underlying Unicode codepoints

to_charlist("à")

But, ASCII strings will just be printed as normal charlists

To guarantee you actually see the integer values, you need to convert the individual code points to Strings

"abc" |> to_charlist() |> Enum.map(&"#{&1}") |> Enum.join(", ") |> then(&"[#{&1}]")

From charlists to Strings

[
  to_string(~c"àbc"),
  to_string([224, 98, 99]),
  Kernel.to_string([224, 98, 99]),
  List.to_string([224, 98, 99])
]

Interpolation and Concatenation

Interpolation in Strings

Use #{ .. } to interpolate in "" and ~s[] Strings.

answer = "42"
"The answer is #{answer}"

~s[The answer is #{answer}]

~S[] Strings do not interpolate.

~S[The answer is #{answer}]

Concatenation with Strings

<> takes two binary arguments

"a" <> "b"

"abc" <> <<0>>

Concatenation with charlists

Use the List concatenation operator ++

~c"a" ++ ~c"b"

Binary Pattern Matching

By default, single bytes are matched

<<first, second>> = "ab"
{first, second}

::binary matches any number of bytes:

<<head, tail::binary>> = "Prefix"
{head, tail}

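So, ::binary can only be used on the last field:

# Does not compile: a binary field without a size is only allowed at the end
<<head::binary, tail::binary>> = "Prefix"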

::binary-size/1 can be used to match a specific number of bytes

# The size of 3 is chosen here for illustration
<<head::binary-size(3), tail::binary>> = "Prefix"
{head, tail}

Strings can be used in binary pattern matching

<<"Pre", rest::binary>> = "Prefix"
rest

As can nested binaries

<<(<<80, 114, 101>>), rest::binary>> = "Prefix"
rest
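The patterns above all work on bytes. For text, the `utf8` segment type matches one whole code point, however many bytes it occupies. A minimal sketch follows; the variable names are illustrative and `"\u00E0bc"` is "àbc".

```elixir
text = "\u00E0bc"

# A plain segment matches a single byte: the first byte of the encoded "à".
<<first_byte, _::binary>> = text
IO.inspect(first_byte)          # 195

# A utf8 segment matches the whole code point and leaves the rest intact.
<<codepoint::utf8, remainder::binary>> = text
IO.inspect({codepoint, remainder})   # {224, "bc"}
```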

How UTF-8 Works

https://en.wikipedia.org/wiki/UTF-8

defmodule Utf8 do
  @moduledoc """
  This module implements conversion between UTF-8 bytes and code points.
  It assumes valid input and does no error checking.
  """

  import Bitwise

  @doc """
  Encodes a list of code points as a list of UTF-8 bytes.

  This is the same as `codepoints |> List.to_string() |> :binary.bin_to_list()`
  """
  def encode([]), do: []

  def encode(codepoints) do
    [codepoint | rest] = codepoints

    bytes =
      cond do
        codepoint < 0x80 ->
          # One byte: 0xxxxxxx
          [codepoint]

        codepoint < 0x0800 ->
          # Two bytes: 110xxxxx 10xxxxxx
          [
            0b11000000 ||| codepoint >>> 6,
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]

        codepoint < 0x010000 ->
          # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
          [
            0b11100000 ||| codepoint >>> 12,
            0b10000000 ||| (codepoint >>> 6 &&& 0b00111111),
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]

        true ->
          # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
          [
            0b11110000 ||| codepoint >>> 18,
            0b10000000 ||| (codepoint >>> 12 &&& 0b00111111),
            0b10000000 ||| (codepoint >>> 6 &&& 0b00111111),
            0b10000000 ||| (codepoint &&& 0b00111111)
          ]
      end

    bytes ++ encode(rest)
  end

  @doc """
  Decodes a list of UTF-8 bytes into a list of code points.

  This is the same as `bytes |> :binary.list_to_bin() |> to_charlist()`
  """
  def decode([]), do: []

  def decode(bytes) do
    [first | rest] = bytes

    cond do
      (first &&& 0b11110000) == 0b11110000 ->
        # Four bytes: the lead byte carries 3 bits, each continuation byte 6
        [second | [third | [fourth | rest]]] = rest

        codepoint =
          ((first &&& 0b00000111) <<< 18) +
            ((second &&& 0b00111111) <<< 12) +
            ((third &&& 0b00111111) <<< 6) +
            (fourth &&& 0b00111111)

        [codepoint] ++ decode(rest)

      (first &&& 0b11100000) == 0b11100000 ->
        # Three bytes: the lead byte carries 4 bits, each continuation byte 6
        [second | [third | rest]] = rest

        codepoint =
          ((first &&& 0b00001111) <<< 12) +
            ((second &&& 0b00111111) <<< 6) +
            (third &&& 0b00111111)

        [codepoint] ++ decode(rest)

      (first &&& 0b11000000) == 0b11000000 ->
        # Two bytes: the lead byte carries 5 bits, the continuation byte 6
        [second | rest] = rest

        codepoint =
          ((first &&& 0b00011111) <<< 6) +
            (second &&& 0b00111111)

        [codepoint] ++ decode(rest)

      true ->
        # One byte
        [first] ++ decode(rest)
    end
  end
end

We will use the following Unicode code points:

A - U+0041, 65 decimal
à - U+00E0, 224 decimal
€ - U+20AC, 8364 decimal
🌍 - U+1F30D, 127757 decimal

encoded = Utf8.encode([0x41, 0xE0, 0x20AC, 0x1F30D])

We can check that by creating a binary with those values

:binary.list_to_bin(encoded)

decoded = Utf8.decode(encoded)

Which were our original integers

Enum.map(decoded, &Integer.to_string(&1, 16))
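As an independent cross-check, the same round trip can be sketched with Elixir's built-in `utf8` bitstring segment type instead of hand-written bit arithmetic. The variable names below are illustrative, and the expected byte values follow from the UTF-8 bit layout.

```elixir
codepoints = [0x41, 0xE0, 0x20AC, 0x1F30D]

# Encode: build a binary by appending each code point as a utf8 segment.
binary = for cp <- codepoints, into: <<>>, do: <<cp::utf8>>
builtin_bytes = :binary.bin_to_list(binary)
IO.inspect(builtin_bytes)
# [65, 195, 160, 226, 130, 172, 240, 159, 140, 141]

# Decode: pull utf8 segments back out of the binary.
builtin_decoded = for <<cp::utf8 <- binary>>, do: cp
IO.inspect(builtin_decoded == codepoints)  # true
```

Comparing `builtin_bytes` with the output of a hand-written encoder is a simple way to test it against code points of every length.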
