Pretrained tokenizers
Mix.install([
  {:kino, "~> 0.10.0"},
  {:scidata, "~> 0.1.5"},
  {:tokenizers, "~> 0.4.0"},
  {:nx, "~> 0.5"}
])
Setup
This Livebook demonstrates how to use Tokenizers with pretrained tokenizers available on the Hugging Face Hub. We’ll install Kino for user input and SciData for real data to tokenize.
We’ll alias modules in Tokenizers for readability. For now, the two main entry points into Tokenizers are the Tokenizer and Encoding modules.
alias Tokenizers.Tokenizer
alias Tokenizers.Encoding
Get a tokenizer
The first thing to do is get a tokenizer from the hub. I’ve chosen bert-base-cased here as it’s commonly used in Hugging Face examples. This call will download the tokenizer from the hub and load it into memory.
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")
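This fetches the latest version of the tokenizer and caches it locally. If you need reproducibility, from_pretrained/2 also takes options. As a sketch, assuming your release of Tokenizers supports the :revision option (check the Tokenizers.Tokenizer docs for your version):

# NOTE: :revision is an assumed option here; it would let you pin a branch,
# tag, or commit SHA instead of the default branch.
{:ok, _pinned} = Tokenizer.from_pretrained("bert-base-cased", revision: "main")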
Save and load
You can save and load tokenizers. That means you can load tokenizers you may have trained locally!
You can choose the path with the Kino input below.
input = Kino.Input.text("Path")
path = Kino.Input.read(input)
Tokenizer.save(tokenizer, path)
{:ok, tokenizer} = Tokenizer.from_file(path)
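As a quick sanity check (a minimal sketch, using encode/2 and Encoding.get_ids/1, both covered below, with an arbitrary probe string), the reloaded tokenizer should produce the same ids as a fresh copy from the hub:

{:ok, fresh} = Tokenizer.from_pretrained("bert-base-cased")
{:ok, a} = Tokenizer.encode(fresh, "Hello there!")
{:ok, b} = Tokenizer.encode(tokenizer, "Hello there!")
# Identical ids mean the save/load round trip preserved the tokenizer.
Encoding.get_ids(a) == Encoding.get_ids(b)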
Check the tokenizer
Let’s see what we can do with the tokenizer. First, let’s have a look at the vocab. It’s represented as a map of tokens to ids.
vocab = Tokenizer.get_vocab(tokenizer)
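Since the vocab is a plain Elixir map, the usual Map and Enum functions apply:

# Peek at a few {token, id} entries (map order is arbitrary).
Enum.take(vocab, 5)
# Count the entries; this should match Tokenizer.get_vocab_size/1 below.
map_size(vocab)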
We can look an id up in the vocab map directly, but we don’t actually need to extract the vocab: Tokenizer.token_to_id/2 does the job for us.
vocab["Jaguar"]
Tokenizer.token_to_id(tokenizer, "Jaguar")
And if we want to go back the other way…
Tokenizer.id_to_token(tokenizer, 21694)
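Note that bert-base-cased is case-sensitive. If a spelling such as lowercase "jaguar" isn’t in the vocab as a single token, token_to_id/2 has nothing to return, and the tokenizer instead splits the word into WordPiece subwords when encoding. A quick sketch, peeking ahead at encode/2 and Encoding.get_tokens/1 (the exact pieces depend on the vocab):

# Likely not a single token in the cased vocab (returns nil if absent).
Tokenizer.token_to_id(tokenizer, "jaguar")
{:ok, sub_encoding} = Tokenizer.encode(tokenizer, "jaguar")
# Continuation subword pieces are prefixed with ##.
Encoding.get_tokens(sub_encoding)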
We can also see the vocab size.
Tokenizer.get_vocab_size(tokenizer)
Encode and decode
When you tokenize some text you get an encoding, represented as a Tokenizers.Encoding.t(). Because Tokenizers is built on Rust bindings, the encoding itself is opaque; its contents are read through the functions in the Encoding module.
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")
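Opaque or not, the functions in Encoding read its contents for us. For example, Encoding.get_tokens/1 returns the string tokens, including the [CLS] and [SEP] markers that BERT’s post-processor adds around the sequence:

Encoding.get_tokens(encoding)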
We can also get the ids for the encoding as an Elixir list.
ids = Encoding.get_ids(encoding)
And we can decode those back into tokens.
Tokenizer.decode(tokenizer, ids)
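The decoded string may still include those special tokens. Depending on your release of Tokenizers, decode/3 may accept a :skip_special_tokens option to drop them; a sketch, assuming that option is available:

# NOTE: :skip_special_tokens is an assumed option; check the
# Tokenizers.Tokenizer docs for your version.
Tokenizer.decode(tokenizer, ids, skip_special_tokens: true)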
Passing a batch of text as a list of strings returns a batch of encodings.
{:ok, encodings} = Tokenizer.encode_batch(tokenizer, ["Hello there!", "This is a test."])
And we can collect each encoding’s ids and decode them again.
list_of_ids = Enum.map(encodings, &Encoding.get_ids/1)
Tokenizer.decode_batch(tokenizer, list_of_ids)
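Tokenizers can also encode sequence pairs, which BERT-style models use for tasks like question answering. Assuming your version accepts a {text_a, text_b} tuple as encode input (check the encode input type in the docs), a sketch looks like this:

# A {text_a, text_b} tuple is assumed to be a valid encode input here;
# BERT's post-processor joins the two sequences with a [SEP] token.
{:ok, pair_encoding} = Tokenizer.encode(tokenizer, {"Hello there!", "General Kenobi."})
Encoding.get_tokens(pair_encoding)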
Get a tensor
Typically the reason we’re tokenizing text is to use it as input to a machine learning model. For that, we’ll need tensors. In order to get a tensor, we need sequences that are all the same length. We’ll get some data from SciData and use Tokenizers.Encoding.pad/3 and Tokenizers.Encoding.truncate/3 to yield a tensor.
%{review: reviews} = Scidata.YelpPolarityReviews.download_test()
{:ok, encoding_batch} =
  reviews
  |> Enum.take(10)
  |> then(&Tokenizer.encode_batch(tokenizer, &1))
tensor =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_ids()
  end)
  |> Nx.tensor()
And we can reverse the operation to see our data. Note the [PAD] tokens.
tensor
|> Nx.to_batched(1)
|> Enum.map(&Nx.to_flat_list/1)
|> then(&Tokenizer.decode_batch(tokenizer, &1))
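Finally, models that consume padded batches usually expect an attention mask alongside the ids so that padded positions can be ignored. Encoding.get_attention_mask/1 returns 1 for real tokens and 0 for padding. Here’s a sketch mirroring the pipeline above:

attention_mask =
  encoding_batch
  |> Enum.map(fn encoding ->
    encoding
    |> Encoding.pad(200)
    |> Encoding.truncate(200)
    |> Encoding.get_attention_mask()
  end)
  |> Nx.tensor()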