Pretrained tokenizers
Mix.install([
  {:kino, "~> 0.10.0"},
  {:scidata, "~> 0.1.5"},
  {:tokenizers, "~> 0.4.0"},
  {:nx, "~> 0.5"}
])
Setup
This Livebook demonstrates how to use Tokenizers with pretrained tokenizers available on the Hugging Face Hub. We’ll install Kino for user input and SciData for real data to tokenize.
We’ll alias modules in Tokenizers for readability. For now, the two main entry points into Tokenizers are the Tokenizer and Encoding modules.
alias Tokenizers.Tokenizer
alias Tokenizers.Encoding
Get a tokenizer
The first thing to do is get a tokenizer from the hub. I’ve chosen bert-base-cased here as it’s commonly used in Hugging Face examples. This call will download the tokenizer from the hub and load it into memory.
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")
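This fetches the latest version of the tokenizer and caches it locally. If you need reproducibility, from_pretrained/2 also takes options. As a sketch, assuming your release of Tokenizers supports the :revision option (check the Tokenizers.Tokenizer docs for your version):

# NOTE: :revision is an assumed option here; it would let you pin a branch,
# tag, or commit SHA instead of the default branch.
{:ok, _pinned} = Tokenizer.from_pretrained("bert-base-cased", revision: "main")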
Save and load
You can save and load tokenizers. That means you can load tokenizers you may have trained locally!
You can choose the path with the Kino input below.
input = Kino.Input.text("Path")
path = Kino.Input.read(input)
Tokenizer.save(tokenizer, path)
{:ok, tokenizer} = Tokenizer.from_file(path)
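As a quick sanity check (a minimal sketch, using encode/2 and Encoding.get_ids/1, both covered below, with an arbitrary probe string), the reloaded tokenizer should produce the same ids as a fresh copy from the hub:

{:ok, fresh} = Tokenizer.from_pretrained("bert-base-cased")
{:ok, a} = Tokenizer.encode(fresh, "Hello there!")
{:ok, b} = Tokenizer.encode(tokenizer, "Hello there!")
# Identical ids mean the save/load round trip preserved the tokenizer.
Encoding.get_ids(a) == Encoding.get_ids(b)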
Check the tokenizer
Let’s see what we can do with the tokenizer. First, let’s have a look at the vocab. It’s represented as a map of tokens to ids.
vocab = Tokenizer.get_vocab(tokenizer)
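Since the vocab is a plain Elixir map, the usual Map and Enum functions apply:

# Peek at a few {token, id} entries (map order is arbitrary).
Enum.take(vocab, 5)
# Count the entries; this should match Tokenizer.get_vocab_size/1 below.
map_size(vocab)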
We can look an id up in the vocab map directly, but we don’t actually need to extract the vocab: Tokenizer.token_to_id/2 does the job for us.
vocab["Jaguar"]
Tokenizer.token_to_id(tokenizer, "Jaguar")
And if we want to go back the other way…
Tokenizer.id_to_token(tokenizer, 21694)
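Note that bert-base-cased is case-sensitive. If a spelling such as lowercase "jaguar" isn’t in the vocab as a single token, token_to_id/2 has nothing to return, and the tokenizer instead splits the word into WordPiece subwords when encoding. A quick sketch, peeking ahead at encode/2 and Encoding.get_tokens/1 (the exact pieces depend on the vocab):

# Likely not a single token in the cased vocab (returns nil if absent).
Tokenizer.token_to_id(tokenizer, "jaguar")
{:ok, sub_encoding} = Tokenizer.encode(tokenizer, "jaguar")
# Continuation subword pieces are prefixed with ##.
Encoding.get_tokens(sub_encoding)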
We can also see the vocab size.
Tokenizer.get_vocab_size(tokenizer)
Encode and decode
When you tokenize some text you get an encoding, represented as a Tokenizers.Encoding.t(). Because Tokenizers is built on Rust bindings, the encoding itself is opaque; its contents are read through the functions in the Encoding module.
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")
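Opaque or not, the functions in Encoding read its contents for us. For example, Encoding.get_tokens/1 returns the string tokens, including the [CLS] and [SEP] markers that BERT’s post-processor adds around the sequence:

Encoding.get_tokens(encoding)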
We can also get the ids for the encoding as an Elixir list.
ids = Encoding.get_ids(encoding)
And we can decode those back into tokens.
Tokenizer.decode(tokenizer, ids)
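The decoded string may still include those special tokens. Depending on your release of Tokenizers, decode/3 may accept a :skip_special_tokens option to drop them; a sketch, assuming that option is available:

# NOTE: :skip_special_tokens is an assumed option; check the
# Tokenizers.Tokenizer docs for your version.
Tokenizer.decode(tokenizer, ids, skip_special_tokens: true)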
Passing a batch of text as a list of strings returns a batch of encodings.
{:ok, encodings} = Tokenizer.encode_batch(tokenizer, ["Hello there!", "This is a test."])
And we can collect each encoding’s ids and decode them again.
list_of_ids = Enum.map(encodings, &Encoding.get_ids/1)
Tokenizer.decode_batch(tokenizer, list_of_ids)
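Tokenizers can also encode sequence pairs, which BERT-style models use for tasks like question answering. Assuming your version accepts a {text_a, text_b} tuple as encode input (check the encode input type in the docs), a sketch looks like this:

# A {text_a, text_b} tuple is assumed to be a valid encode input here;
# BERT's post-processor joins the two sequences with a [SEP] token.
{:ok, pair_encoding} = Tokenizer.encode(tokenizer, {"Hello there!", "General Kenobi."})
Encoding.get_tokens(pair_encoding)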
Get a tensor
Typically the reason we’re tokenizing text is to use it as input to a machine learning model. For that, we’ll need tensors. In order to get a tensor, we need sequences that are all the same length. We’ll get some data from SciData and use Tokenizers.Encoding.pad/3 and Tokenizers.Encoding.truncate/3 to yield a tensor.
%{review: reviews} = Scidata.YelpPolarityReviews.download_test()
{:ok, encoding_batch} =
  reviews
  |> Enum.take(10)
  |> then(&Tokenizer.encode_batch(tokenizer, &1))
tensor =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_ids()
  end)
  |> Nx.tensor()
And we can reverse the operation to see our data. Note the [PAD] tokens.
tensor
|> Nx.to_batched(1)
|> Enum.map(&Nx.to_flat_list/1)
|> then(&Tokenizer.decode_batch(tokenizer, &1))
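Finally, models that consume padded batches usually expect an attention mask alongside the ids so that padded positions can be ignored. Encoding.get_attention_mask/1 returns 1 for real tokens and 0 for padding. Here’s a sketch mirroring the pipeline above:

attention_mask =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_attention_mask()
  end)
  |> Nx.tensor()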