Ch 9: RNNs

Ch9 - RNN.livemd

Malcolm Cumming

@malcolmsgc

ml-elixir


Mix.install([
  {:scidata, "~> 0.1"},
  {:axon, "~> 0.5"},
  {:exla, "~> 0.6"},
  {:nx, "~> 0.6"},
  {:table_rex, "~> 3.1.1"},
  {:kino, "~> 0.7"}
])

Nx.default_backend(EXLA.Backend)

Get data

data = Scidata.IMDBReviews.download()

{train_data, test_data} =
  data.review
  |> Enum.zip(data.sentiment)
  |> Enum.shuffle()
  |> Enum.split(23_000)

Tokenising

Let’s peek at the data and try a simple tokenising strategy: normalise each review by downcasing it and stripping punctuation and symbols, then split on whitespace.

This won’t suit every problem, but it is good enough for this sentiment task.

{review, _sentiment} = train_data |> hd()

normalise = fn review ->
  review
  |> String.downcase()
  # remove punctuation and symbols
  |> String.replace(~r/[\p{P}\p{S}]/, "")
  |> String.split()
end

normalise.(review)
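To make the effect of normalisation concrete, here is a minimal sketch that applies the same three steps to a made-up sentence. The name `normalise_demo` and the sample text are illustrative only; neither comes from the notebook or the IMDB data.

```elixir
# Hypothetical stand-alone copy of the normalisation steps, named
# normalise_demo so it does not shadow the notebook's normalise.
normalise_demo = fn text ->
  text
  |> String.downcase()
  # drop punctuation (\p{P}) and symbols (\p{S})
  |> String.replace(~r/[\p{P}\p{S}]/, "")
  |> String.split()
end

# Made-up sample sentence, not taken from the IMDB data.
normalise_demo.("This movie was GREAT!!! Didn't expect that.")
# => ["this", "movie", "was", "great", "didnt", "expect", "that"]
```

Contractions lose their apostrophe ("didn't" becomes "didnt"), which is one reason this strategy does not suit every task.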

We’ll use a sparse representation, mapping each word to an integer index. To avoid vocabulary bloat (i.e. a ginormous index), we’ll index only the most frequent words.

frequencies =
  Enum.reduce(train_data, %{}, fn {review, _}, tokens ->
    review
    |> normalise.()
    |> Enum.reduce(tokens, &Map.update(&2, &1, 1, fn x -> x + 1 end))
  end)

# num_tokens is an arbitrary limit on the vocabulary size
num_tokens = 1024

tokens =
  frequencies
  |> Enum.sort_by(&elem(&1, 1), :desc)
  |> Enum.take(num_tokens)

tokens =
  tokens
  |> Enum.with_index(fn {token, _}, i -> {token, i + 2} end)
  |> Map.new()
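The counting and indexing steps can be traced on a toy corpus. This is a sketch under stated assumptions: the word lists, the vocabulary limit of 2, and the `toy_*` names are all made up for illustration, and `Enum.frequencies/1` is swapped in as a shortcut for the `Map.update/4` reduce.

```elixir
# Toy corpus of already-normalised word lists (made up for illustration).
toy_reviews = [
  ["good", "good", "good", "film"],
  ["good", "film", "bad"]
]

# Count occurrences across the whole corpus. Enum.frequencies/1 is a
# shortcut for the Map.update/4 reduce used in the notebook.
toy_frequencies =
  toy_reviews
  |> List.flatten()
  |> Enum.frequencies()

# => %{"bad" => 1, "film" => 2, "good" => 4}

# Keep the 2 most frequent words and index them from 2 upwards,
# leaving 0 and 1 free for the special tokens.
toy_vocab =
  toy_frequencies
  |> Enum.sort_by(&elem(&1, 1), :desc)
  |> Enum.take(2)
  |> Enum.with_index(fn {token, _count}, i -> {token, i + 2} end)
  |> Map.new()

# => %{"film" => 3, "good" => 2}
```

The least frequent word, "bad", falls outside the limit and gets no index of its own.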

Note we’ve started indexing from 2. The 0 and 1 indexes are left unassigned because they are reserved for two special tokens.

Index 0 is a padding token. Nx requires static shapes and doesn’t support ragged tensors* (tensors of non-uniform dimensions), so you need a strategy to convert all of your input reviews to a uniform shape. The most common way to do this is by padding or truncating each sequence to a fixed length.

Index 1 stands for the OOV tokens (as a category). We need to account for the words that’ll fall outside this vocabulary, the out-of-vocab (OOV) tokens.

*Some frameworks support working with ragged tensors.
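As a rough illustration of how OOV words collapse into a single category, the sketch below looks words up in a miniature vocabulary with a default index. The `demo_*` names and the word list are hypothetical, and index 1 is assumed to be the OOV index, matching the `unknown_token` value the notebook defines in the tokeniser section.

```elixir
# Hypothetical miniature vocabulary; real indexes start at 2.
demo_vocab = %{"good" => 2, "film" => 3}
# Assumed OOV index, matching the notebook's unknown_token.
demo_unknown = 1

# Map.get/3 returns the default when the key is missing, so every
# out-of-vocabulary word lands on the same index.
demo_ids =
  ["good", "film", "cinematography", "zzz"]
  |> Enum.map(&Map.get(demo_vocab, &1, demo_unknown))

# => [2, 3, 1, 1]
```

The model can no longer tell "cinematography" from "zzz", which is the price paid for a small vocabulary.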

Enum.find(tokens, "No zero or one index found", fn {_t, i} -> i == 0 or i == 1 end)

Now we have an indexed vocab.

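Next we’ll write a tokeniser function. Note that OOV tokens are replaced with the 1 index (unknown_token), while the 0 index (pad_token) is used only for padding. Every review is padded or truncated to max_seq_len.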

pad_token = 0
unknown_token = 1

max_seq_len = 64

tokenize = fn review ->
  review
  |> normalise.()
  |> Enum.map(&Map.get(tokens, &1, unknown_token))
  |> Nx.tensor()
  |> then(&Nx.pad(&1, pad_token, [{0, max_seq_len - Nx.size(&1), 0}]))
end
  
tokenize.(review)
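The single `Nx.pad` call handles both short and long reviews, because a negative padding amount trims elements rather than adding them. The sketch below shows both cases on small made-up tensors. It assumes Nx is already installed by the notebook's setup cell; `demo_len`, `pad_to`, `short` and `long` are illustrative names.

```elixir
# Assumes Nx is already available from the notebook's Mix.install cell.
# demo_len stands in for max_seq_len, kept small for readability.
demo_len = 6

pad_to = fn tensor ->
  # Padding config is {low, high, interior}: pad only at the end.
  # A negative `high` trims elements instead of adding them.
  Nx.pad(tensor, 0, [{0, demo_len - Nx.size(tensor), 0}])
end

# Shorter than demo_len: zeros are appended.
short = pad_to.(Nx.tensor([5, 9, 2]))
Nx.to_flat_list(short)
# => [5, 9, 2, 0, 0, 0]

# Longer than demo_len: the tail is cut off.
long = pad_to.(Nx.tensor([5, 9, 2, 7, 3, 8, 4, 6]))
Nx.to_flat_list(long)
# => [5, 9, 2, 7, 3, 8]
```

Truncation keeps the start of a review and discards the rest, so any sentiment expressed after the first max_seq_len words is never seen by the model.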

Input pipeline

batch_size = 64

train_pipeline =
  train_data
  |> Stream.map(fn {review, label} ->
    {tokenize.(review), Nx.tensor(label)}
  end)
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn reviews_and_labels ->
    {review, label} = Enum.unzip(reviews_and_labels)
    {Nx.stack(review), Nx.stack(label) |> Nx.new_axis(-1)}
  end)

test_pipeline =
  test_data
  |> Stream.map(fn {review, label} ->
    {tokenize.(review), Nx.tensor(label)}
  end)
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn reviews_and_labels ->
    {review, label} = Enum.unzip(reviews_and_labels)
    {Nx.stack(review), Nx.stack(label) |> Nx.new_axis(-1)}
  end)

Enum.take(train_pipeline, 1)
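The batching steps can be tried on plain lists to see what `:discard` does with a final, incomplete batch. This is a minimal sketch with made-up data and hypothetical `demo_*` names; in the real pipeline each batch should come out as a {64, 64} tensor of token indexes and a {64, 1} tensor of labels.

```elixir
# Ten made-up {features, label} pairs standing in for tokenised reviews.
demo_pairs = Enum.map(1..10, fn i -> {[i, i], rem(i, 2)} end)

demo_batches =
  demo_pairs
  # Groups of 4; :discard drops a trailing group with fewer than 4 items.
  |> Stream.chunk_every(4, 4, :discard)
  # Split each batch into a list of features and a list of labels.
  |> Enum.map(&Enum.unzip/1)

length(demo_batches)
# => 2 (the last 2 pairs do not fill a batch and are dropped)

hd(demo_batches)
# => {[[1, 1], [2, 2], [3, 3], [4, 4]], [1, 0, 1, 0]}

# Same arithmetic for the 23_000 training reviews and batch size 64:
{div(23_000, 64), rem(23_000, 64)}
# => {359, 24}
```

So each pass over the training stream yields 359 full batches and silently skips 24 reviews. Because the data is shuffled only once, before the split, the same 24 reviews are skipped every time the stream is enumerated.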
