UniHan tutorial
Mix.install([
{:unicode_unihan, "~> 0.1.0"}
])
Setup
The Unihan module lets you work with the Unihan database at three levels of granularity:
- individual characters,
- populations of characters, and
- attributes of fields within characters
This Livebook walks you through these three levels.
import Unicode.Unihan
Single Unicode Lookup
The Unihan library provides, first and foremost, fast lookups of the data in the Unihan database. The function unihan/1 accepts several kinds of input and returns the information held in the Unihan database as a parsed map.
The character “萬”, meaning ten thousand in Traditional Chinese, is used as the example throughout.
# look up by integer codepoint
unihan(33836)
# look up by string grapheme
unihan(33836) == unihan("萬")
# look up by "U+" hex string
unihan(33836) == unihan("U+842C")
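The three input forms are interchangeable because they all name the same codepoint. As a minimal sketch using only the Elixir standard library (this is not the library's own input handling), the conversions between them look like this:

```elixir
# The integer codepoint used throughout this tutorial.
codepoint = 33836

# Codepoint -> string grapheme, by encoding the integer as UTF-8.
grapheme = <<codepoint::utf8>>

# "U+842C" -> codepoint: strip the "U+" prefix, parse the rest as base 16.
"U+" <> hex = "U+842C"
from_hex = String.to_integer(hex, 16)

# String grapheme -> codepoint.
[from_string] = String.to_charlist("萬")

IO.inspect({grapheme, from_hex, from_string})
```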
The map is keyed by Unihan field names as atoms. These fields are specified in Unicode Standard Annex #38.
The values have been parsed further, often into maps of their own. The key names of these nested maps are not specified by Unihan; in general they have been chosen to be consistent with Python’s unihan-etl library. See the documentation on Hexdocs for details.
Note that these maps can be accessed using Access (square brackets []) or dot notation (.). The former returns nil when the key does not exist, whereas the latter raises a KeyError.
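The difference between the two access styles is ordinary Elixir map behaviour, so it can be tried without the library. In this sketch the map is a hand-written stand-in for a parsed Unihan map; the value 1 and the field name kNoSuchField are made up for illustration.

```elixir
# Hand-written stand-in for a parsed Unihan map (illustrative value only).
char = Map.new(kGradeLevel: 1)

# Access syntax returns nil for a missing key.
missing = char[:kNoSuchField]
IO.inspect(missing)

# Dot notation raises KeyError for a missing key.
result =
  try do
    char.kNoSuchField
  rescue
    e in KeyError -> {:error, e.key}
  end

IO.inspect(result)
```

Square brackets are therefore the safer choice when a field may be absent for some characters.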
# parses to an int
IO.inspect(unihan("萬")[:kGradeLevel])
# parses to a list
IO.inspect(unihan("萬")[:kCantonese])
# parses to a map
IO.inspect(unihan("萬")[:kTotalStrokes])
Often we would like to go from a map or a codepoint back to its string grapheme representation. The to_string/1 function lets you do that.
IO.inspect(Unicode.Unihan.to_string(33836))
map = unihan("萬")
IO.inspect(Unicode.Unihan.to_string(map))
Since population-level queries (covered in the next section) often return a list of maps, to_string/1 also accepts a list of maps.
Population-level Information
Unihan provides two functions, filter/1 and reject/1, which let you isolate a subset of codepoints from the @unihan map. Both accept a 1-arity function.
The following example selects, from the full @unihan map, the characters that Grade 1 and 2 students are expected to learn. Since kGradeLevel has been parsed into an integer, you can use the comparison operator <= directly:
filter(fn char ->
char[:kGradeLevel] <= 2
end)
|> Enum.count()
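Many characters have no kGradeLevel at all, so the predicate also has to cope with nil. The sketch below imitates the filtering step with Enum.filter/2 on a tiny hand-made map; the sample data and its grade values are hypothetical, and this is not how filter/1 is implemented. It relies on Elixir's term ordering, in which numbers sort before atoms, so a nil grade never satisfies <= 2.

```elixir
# Hypothetical miniature stand-in for the Unihan data: codepoint => parsed fields.
# The grade values are made up for illustration.
sample = %{
  19968 => %{kGradeLevel: 1},
  33836 => %{kGradeLevel: 3},
  13312 => %{}
}

# Same predicate shape as above: keep characters with kGradeLevel <= 2.
# A missing field yields nil, and nil <= 2 is false in Elixir's term ordering.
kept =
  sample
  |> Enum.filter(fn {_codepoint, char} -> char[:kGradeLevel] <= 2 end)
  |> Enum.map(fn {codepoint, _char} -> codepoint end)

IO.inspect(kept)
```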
In practice, the 1-arity function can be written more concisely with the capture syntax &, especially when several conditions are combined. In the following example, we isolate the characters that Grade 1 and 2 students are expected to learn, but only if they have tone 1 in Cantonese:
filter(
&(&1[:kGradeLevel] <= 2 and
&1[:kCantonese][:tone] == "1")
)
|> Enum.sort_by(fn {_codepoint, map} ->
map[:kTotalStrokes][:Hant]
end)
|> Enum.map(fn {_codepoint, map} ->
Unicode.Unihan.to_string(map)
end)
Here we also see to_string/1 applied to each map in the result to return its human-friendly string representation.
reject/1 works similarly:
reject(&(&1[:kTotalStrokes][:"zh-Hant"] < 60))
|> Enum.map(fn {_codepoint, map} ->
Unicode.Unihan.to_string(map)
end)
(That blob? It’s a single character built from six components: cloud, cloud, cloud, dragon, dragon, dragon. You probably guessed correctly that it means a dragon in flight.)
Unihan field parsing
Unihan fields are given as strings in the Unihan database, and each string can encode a fair amount of structure. For example, for “萬”, kHanyuPinyin is given as 53247.080:wàn. This can be parsed according to the specification:
> The 漢語拼音 Hànyǔ Pīnyīn reading(s) appearing in the edition of 《漢語大字典》 Hànyǔ Dà Zìdiǎn (HDZ) specified in the “kHanYu” property description (q.v.). Each location has the form “ABCDE.XYZ” (as in “kHanYu”); multiple locations for a given pīnyīn reading are separated by commas. The list of locations is followed by a colon, followed by a comma-separated list of one or more pīnyīn readings. Where multiple pīnyīn readings are associated with a given mapping, these are ordered as in HDZ (for the most part reflecting relative commonality).
The multi-clause private function Unicode.Unihan.Utils.decode_value/3 is used to parse the values. This gives the user access to the internal structure of each field, as you have seen in the filter example above: the tone, which was simply encoded as part of the binary, is exposed under the :tone key.
For more information about the fields and how they are parsed, see the HexDocs page on Fields.
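To make the kHanyuPinyin format quoted above concrete, here is a minimal hand-written sketch of such a parser. The module name, function name and the :locations and :readings keys are all hypothetical; this is not the library's decode_value/3, and the library's actual output shape may differ. It also assumes that multiple entries within one field value are separated by spaces.

```elixir
defmodule HanyuPinyinSketch do
  # Hypothetical helper, not part of unicode_unihan.
  # Splits a raw kHanyuPinyin value into its entries, and each entry into
  # its HDZ locations (before the colon) and pinyin readings (after it).
  def parse(value) do
    value
    |> String.split(" ", trim: true)
    |> Enum.map(fn entry ->
      [locations, readings] = String.split(entry, ":", parts: 2)

      %{
        locations: String.split(locations, ","),
        readings: String.split(readings, ",")
      }
    end)
  end
end

IO.inspect(HanyuPinyinSketch.parse("53247.080:wàn"))
```

Each location is left as an “ABCDE.XYZ” string here; breaking it down further would follow the kHanYu property description.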
Support modules
Codepoint 33836 is useful for machines, but “萬” is usually what end users want to see. We therefore provide the to_string/1 function for easy access. Several Unihan fields are similar in this regard: they use alphanumeric strings to encode data with specific semantic meaning, and we provide similar convenience functions for the parts you are likely to need.
These require domain knowledge, and CJK glyphs carry a long history across a broad geography. If you have requests, please consider reaching out.
Radicals
CJK characters are often categorized by their radical (部首). The 214 radicals in common use were codified in the KangXi Dictionary 康熙字典, published in 1716.
In Unicode, each radical is given a numerical index and maps to two Unicode codepoints whose glyphs are visually indistinguishable:
- codepoint for the radical, and
- codepoint for the radical as a CJK Unified Ideograph
An additional complication is that some radicals are represented differently in the Simplified script.
The following functions are provided for easy access to the various glyph representations.
# defaults to traditional, unified ideograph
Unicode.Unihan.Radical.radical(187)
# :script accepts :Hans and :Hant keys
Unicode.Unihan.Radical.radical(187, script: :Hans)
# to access the radical character, provide the following :glyph keyword
Unicode.Unihan.Radical.radical(
187,
script: :Hant,
glyph: :radical_character
)
# the special :all instruction gives the buddha hotdog
Unicode.Unihan.Radical.radical(187, :all)
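The two codepoints per radical can also be seen with the standard library alone. This sketch assumes the layout of Unicode's Kangxi Radicals block, where radical n sits at U+2F00 + (n - 1), and uses NFKC normalization to reach the matching CJK Unified Ideograph. It is an illustration of the distinction, not how Unicode.Unihan.Radical is implemented.

```elixir
radical_number = 187

# Kangxi Radicals block: radical n is encoded at U+2F00 + (n - 1).
radical_codepoint = 0x2F00 + radical_number - 1
radical_character = <<radical_codepoint::utf8>>

# Compatibility normalization maps the radical to its unified ideograph.
unified_ideograph = String.normalize(radical_character, :nfkc)
[unified_codepoint] = String.to_charlist(unified_ideograph)

# The glyphs look the same, but the strings and codepoints differ.
IO.inspect(radical_character == unified_ideograph)
IO.inspect({Integer.to_string(radical_codepoint, 16), Integer.to_string(unified_codepoint, 16)})
```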
Cangjie
Cangjie is a commonly used keyboard input method. Each letter of the Latin alphabet is mapped to a “construction part”, and the glyphs can be accessed as follows:
Unicode.Unihan.Cangjie.cangjie("A")
A bang equivalent, cangjie!/1, exists, which additionally accepts a list as input:
Unicode.Unihan.Cangjie.cangjie!(["A", "B", "C", "D", "E", "F", "G"])
This facilitates preparing end-user friendly representations:
"萬"
|> unihan()
|> Map.get(:kCangjie)
|> Unicode.Unihan.Cangjie.cangjie!()
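For readers unfamiliar with Cangjie, the letter-to-part mapping is a plain lookup table. The sketch below hard-codes only the first seven letters and is a stand-in written for this tutorial: the module and function names are hypothetical, and the real Unicode.Unihan.Cangjie functions may return a different shape.

```elixir
defmodule CangjieSketch do
  # Partial table of Cangjie letters and their construction parts (A to G only).
  @parts %{
    "A" => "日",
    "B" => "月",
    "C" => "金",
    "D" => "木",
    "E" => "水",
    "F" => "火",
    "G" => "土"
  }

  # Accepts a code such as "ABC" and looks up each letter in turn.
  def parts(code) when is_binary(code) do
    code |> String.graphemes() |> parts()
  end

  # Accepts a list of single letters; raises KeyError for letters outside the table.
  def parts(letters) when is_list(letters) do
    Enum.map(letters, &Map.fetch!(@parts, &1))
  end
end

IO.inspect(CangjieSketch.parts("ABC"))
```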
Jyutping
Jyutping is a modern Cantonese romanization scheme. Each reading comprises four components:
- onset
- nucleus
- coda
- tone
The nucleus and coda, concatenated together, are known as the final.
The Unicode.Unihan.Cantonese module provides functions for working with jyutping.
# checking validity of a given input
Unicode.Unihan.Cantonese.is_valid?("faan1")
Unicode.Unihan.Cantonese.to_jyutping("faan1")
A bang equivalent, to_jyutping!/1, is available.
"faan1"
|> Unicode.Unihan.Cantonese.to_jyutping!()
|> Map.get(:final)
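To make the onset, nucleus, coda and tone breakdown concrete, here is a hand-rolled sketch that splits a syllable with a regular expression. The module name and the result shape are hypothetical and this is not the library's implementation; it also deliberately ignores syllabic nasals such as "m4" and "ng5", which have no vowel nucleus.

```elixir
defmodule JyutpingSketch do
  # Optional onset, vowel nucleus, optional coda, mandatory tone digit 1 to 6.
  defp pattern do
    ~r/^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?(aa|oe|eo|yu|[aeiou])(ng|[ptkmniu])?([1-6])$/
  end

  # Returns {:ok, components} for a syllable matching the pattern, :error otherwise.
  def split(syllable) do
    case Regex.run(pattern(), syllable, capture: :all_but_first) do
      [onset, nucleus, coda, tone] ->
        # The final is the nucleus and coda concatenated together.
        {:ok, %{onset: onset, nucleus: nucleus, coda: coda, tone: tone, final: nucleus <> coda}}

      nil ->
        :error
    end
  end
end

IO.inspect(JyutpingSketch.split("faan1"))
```

An absent onset or coda is represented as an empty string in this sketch.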