
Pretrained tokenizers

notebooks/pretrained.livemd

Mix.install([
  {:kino, "~> 0.10.0"},
  {:scidata, "~> 0.1.5"},
  {:tokenizers, "~> 0.4.0"},
  {:nx, "~> 0.5"}
])

Setup

This Livebook will demonstrate how to use Tokenizers with pretrained tokenizers available on the Hugging Face Hub.

We’ll install Kino for user input, SciData for real data to tokenize, and Nx to turn the results into tensors.

Check the notebook dependencies and setup section at the beginning of this notebook.

We’ll alias modules in Tokenizers for readability. For now, the two main entry points into Tokenizers are the Tokenizer and Encoding modules.

alias Tokenizers.Tokenizer
alias Tokenizers.Encoding

Get a tokenizer

The first thing to do is get a tokenizer from the hub. I’ve chosen bert-base-cased here as it’s commonly used in Hugging Face examples. This call will download the tokenizer from the hub and load it into memory.

{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")
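
Any repository on the Hub that ships a tokenizer.json should load the same way. As a hypothetical example (roberta-base is simply another popular checkpoint, and downloading it requires network access):

# Hypothetical: any Hub repository with a tokenizer.json should work.
{:ok, _roberta_tokenizer} = Tokenizer.from_pretrained("roberta-base")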

Save and load

You can save and load tokenizers to and from the file system. That means you can load tokenizers you may have trained locally!

You can choose the path with the Kino input below.

# In Livebook, these two lines live in separate cells, so the input can
# be filled in before Kino.Input.read/1 runs.
input = Kino.Input.text("Path")
path = Kino.Input.read(input)

# Save the tokenizer to that path, then load it back from the file.
Tokenizer.save(tokenizer, path)
{:ok, tokenizer} = Tokenizer.from_file(path)
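
If we want to convince ourselves the round trip worked, here’s a quick sketch (the sample string is arbitrary): load the file into a second variable and compare the ids both tokenizers produce.

# Sketch: the tokenizer loaded from disk should produce the same ids
# as the one we started with.
{:ok, reloaded} = Tokenizer.from_file(path)
{:ok, original_encoding} = Tokenizer.encode(tokenizer, "Hello there!")
{:ok, reloaded_encoding} = Tokenizer.encode(reloaded, "Hello there!")
Encoding.get_ids(original_encoding) == Encoding.get_ids(reloaded_encoding)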

Check the tokenizer

Let’s see what we can do with the tokenizer. First, let’s have a look at the vocab. It’s represented as a map of tokens to ids.

vocab = Tokenizer.get_vocab(tokenizer)
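
The full map is large, so it helps to peek at just a few entries (maps are unordered, so which pairs you see is arbitrary):

# Show a handful of {token, id} pairs from the vocab map.
Enum.take(vocab, 5)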

We can look up an id in the vocab map directly, but we don’t actually need to extract the vocab: Tokenizer.token_to_id/2 does the job for us.

vocab["Jaguar"]
Tokenizer.token_to_id(tokenizer, "Jaguar")

And if we want to go back the other way…

Tokenizer.id_to_token(tokenizer, 21694)
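
The two lookups are inverses of each other, which we can check with a quick round trip:

# Round trip: token -> id -> token should give back the original token.
id = Tokenizer.token_to_id(tokenizer, "Jaguar")
Tokenizer.id_to_token(tokenizer, id) == "Jaguar"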

We can also see the vocab size.

Tokenizer.get_vocab_size(tokenizer)
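
That number should line up with the vocab map we extracted earlier, though depending on the library version the count may or may not include added special tokens, so treat this comparison as a sketch:

# Cross-check the reported size against the extracted vocab map.
map_size(vocab)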

Encode and decode

When you tokenize some text, you get an encoding, represented as a Tokenizers.Encoding.t(). Because Tokenizers is built on Rust bindings, the encoding itself is an opaque reference rather than a plain Elixir data structure.

{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")

However, we can get the ids for the encoding as an Elixir list.

ids = Encoding.get_ids(encoding)

And we can decode those back into tokens.

Tokenizer.decode(tokenizer, ids)
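
Ids aren’t the only view of an encoding. For example, Tokenizers.Encoding also exposes the string tokens and the attention mask (for BERT, the tokens typically include the [CLS] and [SEP] special tokens the model adds):

# Other views of the same encoding.
Encoding.get_tokens(encoding)
Encoding.get_attention_mask(encoding)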

Passing a batch of text as a list of strings returns a batch of encodings.

{:ok, encodings} = Tokenizer.encode_batch(tokenizer, ["Hello there!", "This is a test."])

And we can see the list of ids and decode them again.

list_of_ids = Enum.map(encodings, &Encoding.get_ids/1)
Tokenizer.decode_batch(tokenizer, list_of_ids)

Get a tensor

Typically the reason we’re tokenizing text is to use it as an input in a machine learning model. For that, we’ll need tensors.

In order to build a tensor, we need sequences that are all the same length. We’ll get some data from SciData and use Tokenizers.Encoding.pad/3 and Tokenizers.Encoding.truncate/3 to bring every sequence to a fixed length before handing the ids to Nx.

# Download the Yelp polarity test set and tokenize the first ten reviews.
%{review: reviews} = Scidata.YelpPolarityReviews.download_test()

{:ok, encoding_batch} =
  reviews
  |> Enum.take(10)
  |> then(&Tokenizer.encode_batch(tokenizer, &1))

# Pad short sequences and truncate long ones to a fixed length of 200,
# then build a tensor from the resulting lists of ids.
tensor =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_ids()
  end)
  |> Nx.tensor()
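
A model consuming a padded batch usually also wants an attention mask, so it can ignore the [PAD] positions. Here’s a sketch that builds a matching mask tensor with Tokenizers.Encoding.get_attention_mask/1, mirroring the pipeline above:

# Build a parallel mask tensor: 1 marks real tokens, 0 marks padding.
attention_mask =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_attention_mask()
  end)
  |> Nx.tensor()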

And we can reverse the operation, decoding the ids tensor back into text. Note the [PAD] tokens.

# Split the tensor into rows, flatten each row back into a list of ids,
# and decode the whole batch.
tensor
|> Nx.to_batched(1)
|> Enum.map(&Nx.to_flat_list/1)
|> then(&Tokenizer.decode_batch(tokenizer, &1))