
Ch 9: RNNs

Mix.install([
  {:scidata, "~> 0.1"},
  {:axon, "~> 0.5"},
  {:exla, "~> 0.6"},
  {:nx, "~> 0.6"},
  {:table_rex, "~> 3.1.1"},
  {:kino, "~> 0.7"}
])

Nx.default_backend(EXLA.Backend)

Get data

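# Download the IMDB reviews dataset (review texts plus sentiment labels).
data = Scidata.IMDBReviews.download()

# Pair each review with its label, shuffle, then keep the first 23,000 pairs
# for training and the remainder for testing.
{train_data, test_data} =
  data.review
  |> Enum.zip(data.sentiment)
  |> Enum.shuffle()
  |> Enum.split(23_000)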

Tokenising

Let’s have a peek at our data and a potential tokenising strategy: normalise by downcasing and stripping punctuation, then split on whitespace.

This strategy won’t suit every problem, but it is adequate for this one.

{review, _sentiment} = train_data |> hd()

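normalise = fn review ->
  review
  |> String.downcase()
  # remove punctuation and symbols; the `u` modifier makes \p{P} and \p{S}
  # match whole Unicode characters rather than individual bytes
  |> String.replace(~r/[\p{P}\p{S}]/u, "")
  |> String.split()
end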

normalise.(review)
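
To see what this normalisation produces, here is a small sketch on a made-up sentence. The sample string and the name `normalise_demo` are assumptions for illustration, not part of the notebook; the function is redefined so the cell runs on its own, and the `u` regex modifier is included so that non-ASCII punctuation is matched as whole characters.

```elixir
# Standalone copy of the normaliser so this cell runs by itself.
normalise_demo = fn review ->
  review
  |> String.downcase()
  |> String.replace(~r/[\p{P}\p{S}]/u, "")
  |> String.split()
end

# Hypothetical sample text, not taken from the IMDB data.
sample = "Hello, World! It's GREAT."

IO.inspect(normalise_demo.(sample))
# prints ["hello", "world", "its", "great"]
```

Note that contractions collapse ("it's" becomes "its"), which is one reason this strategy is not suitable for every task.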

We’ll use a sparse representation by mapping each word to an integer index. To avoid vocabulary bloat, i.e. a ginormous index, we’ll only index the most frequent words.

# Count how often each normalised token occurs across the training reviews.
frequencies =
  Enum.reduce(train_data, %{}, fn {review, _}, tokens ->
    review
    |> normalise.()
    |> Enum.reduce(tokens, &Map.update(&2, &1, 1, fn x -> x + 1 end))
  end)

# num_tokens is an arbitrary limit on the vocabulary size
num_tokens = 1024

# Keep the most frequent tokens and index them from 2 upwards.
tokens =
  frequencies
  |> Enum.sort_by(&elem(&1, 1), :desc)
  |> Enum.take(num_tokens)
  |> Enum.with_index(fn {token, _}, i -> {token, i + 2} end)
  |> Map.new()
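
The same count-then-index steps can be followed on a toy corpus. This is only an illustrative sketch: `toy_reviews`, `toy_limit` and `toy_vocab` are hypothetical names, and `Enum.frequencies/1` stands in for the explicit `Map.update/4` reduce used above.

```elixir
# Hypothetical three-"review" corpus, already lower-case and punctuation-free.
toy_reviews = ["the the the", "movie movie", "plot"]

# Count every whitespace-separated token across the corpus.
toy_frequencies =
  toy_reviews
  |> Enum.flat_map(&String.split/1)
  |> Enum.frequencies()

# Keep only the 2 most frequent tokens (a stand-in for num_tokens).
toy_limit = 2

toy_vocab =
  toy_frequencies
  |> Enum.sort_by(&elem(&1, 1), :desc)
  |> Enum.take(toy_limit)
  |> Enum.with_index(fn {token, _count}, i -> {token, i + 2} end)
  |> Map.new()

IO.inspect(toy_vocab)
# prints %{"movie" => 3, "the" => 2}
```

Here "plot" is the least frequent word, so it does not make the cut and receives no index.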

Note we’ve started indexing from 2. Indexes 0 and 1 are left unassigned because they are reserved for:

  1. a padding token (index 0)

    Nx requires static shapes and doesn’t support ragged tensors* (tensors of non-uniform dimensions), so you need a strategy to convert all of your input reviews to a uniform shape. The most common way to do this is by padding or truncating each sequence to a fixed length.

  2. the OOV token (index 1), covering all such words as a single category

    We need to account for the words that fall outside this vocabulary: out-of-vocabulary (OOV) tokens.

*Some frameworks support working with ragged tensors.

# Sanity check: returns the default message if no token was given index 0 or 1.
Enum.find(tokens, "No zero or one index found", fn {_t, i} -> i == 0 or i == 1 end)

Now we have an indexed vocab.

Next we’ll write a tokeniser function. Note that OOV tokens are replaced with index 1 (unknown_token), while index 0 (pad_token) is used for padding.

pad_token = 0
unknown_token = 1

max_seq_len = 64

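tokenize = fn review ->
  review
  |> normalise.()
  # words outside the vocabulary map to unknown_token
  |> Enum.map(&Map.get(tokens, &1, unknown_token))
  |> Nx.tensor()
  # pad short reviews up to max_seq_len; for long reviews the high padding is
  # negative, which makes Nx.pad truncate instead
  |> then(&Nx.pad(&1, pad_token, [{0, max_seq_len - Nx.size(&1), 0}]))
end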
  
tokenize.(review)
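
To make the roles of the two reserved indexes concrete, the following sketch repeats the tokeniser's logic on plain lists instead of Nx tensors. The vocabulary, the target length of 5 and the names `demo_vocab` and `to_ids` are all assumptions chosen for illustration.

```elixir
# Hypothetical tiny vocabulary; real indexes start at 2.
demo_vocab = %{"the" => 2, "movie" => 3, "was" => 4}
demo_pad = 0
demo_unknown = 1
demo_len = 5

# Map words to ids, then truncate or right-pad to exactly demo_len entries.
to_ids = fn text ->
  ids =
    text
    |> String.split()
    |> Enum.map(&Map.get(demo_vocab, &1, demo_unknown))
    |> Enum.take(demo_len)

  ids ++ List.duplicate(demo_pad, demo_len - length(ids))
end

# "dull" is out of vocabulary, and one slot is left over for padding.
IO.inspect(to_ids.("the movie was dull"))
# prints [2, 3, 4, 1, 0]

# Seven words are cut down to the first five.
IO.inspect(to_ids.("the movie was the movie was the"))
# prints [2, 3, 4, 2, 3]
```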

Input pipeline

batch_size = 64

train_pipeline =
  train_data
  |> Stream.map(fn {review, label} ->
    {tokenize.(review), Nx.tensor(label)}
  end)
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn reviews_and_labels ->
    {review, label} = Enum.unzip(reviews_and_labels)
    {Nx.stack(review), Nx.stack(label) |> Nx.new_axis(-1)}
  end)

test_pipeline =
  test_data
  |> Stream.map(fn {review, label} ->
    {tokenize.(review), Nx.tensor(label)}
  end)
  |> Stream.chunk_every(batch_size, batch_size, :discard)
  |> Stream.map(fn reviews_and_labels ->
    {review, label} = Enum.unzip(reviews_and_labels)
    {Nx.stack(review), Nx.stack(label) |> Nx.new_axis(-1)}
  end)

# Peek at the first training batch.
Enum.take(train_pipeline, 1)
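
Both pipelines pass `:discard` to `Stream.chunk_every/4`, so a trailing chunk smaller than `batch_size` is dropped and every batch has the same shape. A minimal sketch with made-up numbers (5 items, batches of 2; the name `demo_batches` is hypothetical):

```elixir
# Five stand-in examples, batched in pairs; the leftover item is dropped.
demo_batches =
  1..5
  |> Stream.chunk_every(2, 2, :discard)
  |> Enum.to_list()

IO.inspect(demo_batches)
# prints [[1, 2], [3, 4]]
```

For the 23,000 training pairs and a batch size of 64, this means 359 full batches, with the last 24 pairs left out of each pass.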