
Disambiguation

Akin

Akin is a collection of string comparison algorithms for Elixir. Algorithms can be called independently or combined to return a map of metrics. This library was built to facilitate the disambiguation of names but can be used to compare any two binaries.

Algorithms

A utility is provided to return all available algorithms.

Akin.Util.list_algorithms()

Note: Hamming Distance is excluded as it only compares strings of equal length. To use the Hamming Distance algorithm, call it directly (see: Independent Algorithms).

Combined Algorithms

Metrics

Results from all algorithms are returned as a map of metrics.

Compare Strings

Experiment by changing the value of the strings.

a = "weird"
b = "wierd"

Akin.compare(a, b)

Options

Comparison accepts options in a Keyword list.

  1. algorithms: algorithms to use in the comparison. Accepts a list of algorithm names or a keyword list. Default is algorithms/0.
    1. metric: algorithm metric. Default is both.
      • “string”: uses string algorithms
      • “phonetic”: uses phonetic algorithms
    2. unit: algorithm unit. Default is both.
      • “whole”: uses algorithms best suited for whole string comparison (distance)
      • “partial”: uses algorithms best suited for partial string comparison (substring)
  2. level: level for double phonetic matching. Default is “normal”.
    • “strict”: both encodings for each string must match
    • “strong”: the primary encoding for each string must match
    • “normal”: the primary encoding of one string must match either encoding of other string (default)
    • “weak”: either primary or secondary encoding of one string must match one encoding of other string
  3. match_at: an algorithm score equal to or above this value is considered a match. Default is 0.9.
  4. ngram_size: number of contiguous letters to split strings into. Default is 2.
  5. short_length: string length that qualifies as “short” to receive a shortness boost. Used by Name Metric. Default is 8.
  6. stem: boolean representing whether to compare the stemmed versions of the strings; uses Stemmer. Default is false.

opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
Akin.compare(a, b, opts)

opts = [algorithms: [metric: "phonetic", unit: "whole"]]
Akin.compare(a, b, opts)

Akin.compare(a, b, algorithms: [metric: "string", unit: "whole"], ngram_size: 1)

n-gram Size

The default n-gram size for the algorithms is 2. You can change it by setting ngram_size in opts.

opts = [algorithms: ["sorensen_dice"]]
Akin.compare(a, b, opts)

opts = [algorithms: ["sorensen_dice"], ngram_size: 1]
Akin.compare(a, b, opts)
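
To see why the n-gram size changes the score, here is a minimal standalone sketch of a Sorensen-Dice coefficient over sets of n-grams. The module `NgramSketch` and its functions are hypothetical names for illustration only; this is not Akin's implementation, which may tokenize or count n-grams differently.

```elixir
defmodule NgramSketch do
  # Split a string into overlapping n-grams of graphemes.
  def ngrams(string, n) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Sorensen-Dice coefficient over n-gram sets: 2 * |A ∩ B| / (|A| + |B|).
  # Assumption: duplicates are ignored (sets, not multisets).
  def dice(a, b, n) do
    set_a = MapSet.new(ngrams(a, n))
    set_b = MapSet.new(ngrams(b, n))
    total = MapSet.size(set_a) + MapSet.size(set_b)

    if total == 0 do
      0.0
    else
      2 * MapSet.size(MapSet.intersection(set_a, set_b)) / total
    end
  end
end

NgramSketch.ngrams("weird", 2)
# => ["we", "ei", "ir", "rd"]
NgramSketch.dice("weird", "wierd", 2)
# => 0.25
NgramSketch.dice("weird", "wierd", 1)
# => 1.0
```

With bigrams, the transposed letters leave only "rd" in common. With single letters, the two strings share every n-gram, so the sketch scores them as identical.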

Match Level

The default match level is “normal”. You can change it by setting level in opts. Currently it only affects the outcomes of the substring_set and double_metaphone algorithms.

left = "Alice in Wonderland"
right = "Alice's Adventures in Wonderland"

Akin.compare(left, right, algorithms: ["substring_set"])
Akin.compare(left, right, algorithms: ["substring_set"], level: "weak")

left = "which way"
right = "whitch way"

Akin.compare(left, right, algorithms: ["double_metaphone"], level: "weak")
Akin.compare(left, right, algorithms: ["double_metaphone"], level: "strict")
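
The four levels described under Options can be read as rules over a pair of {primary, secondary} encodings. The sketch below is one possible reading of those rules, not Akin's code: `LevelSketch` is a hypothetical module, and the encodings are made-up placeholders rather than real double metaphone output.

```elixir
defmodule LevelSketch do
  # Each argument is a {primary, secondary} encoding tuple.

  # strict: both encodings for each string must match
  def matches?({p1, s1}, {p2, s2}, "strict"), do: p1 == p2 and s1 == s2

  # strong: the primary encodings must match
  def matches?({p1, _s1}, {p2, _s2}, "strong"), do: p1 == p2

  # normal: the primary encoding of one string matches either encoding of the other
  def matches?({p1, s1}, {p2, s2}, "normal"), do: p1 in [p2, s2] or p2 in [p1, s1]

  # weak: any encoding of one string matches any encoding of the other
  def matches?({p1, s1}, {p2, s2}, "weak"), do: Enum.any?([p1, s1], &(&1 in [p2, s2]))
end

# Placeholder encodings, chosen only to show where the levels diverge.
left = {"ABC", "XYZ"}
right = {"QRS", "XYZ"}

LevelSketch.matches?(left, right, "strict") # => false
LevelSketch.matches?(left, right, "strong") # => false
LevelSketch.matches?(left, right, "normal") # => false
LevelSketch.matches?(left, right, "weak")   # => true
```

Only the secondary encodings agree here, so the pair matches at the “weak” level and fails at every stricter one.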

Stems

Compare the stemmed version of two strings.

not_gerund = "write"
gerund = "writing"

Akin.compare(not_gerund, gerund, algorithms: ["bag_distance", "double_metaphone"])
Akin.compare(not_gerund, gerund, algorithms: ["bag_distance", "double_metaphone"], stem: true)

Preprocessing

Before being compared, strings are downcased and Unicode-normalized, whitespace is standardized, non-text characters (such as punctuation and emojis) are replaced, and accented characters are converted to their unaccented equivalents. Each string is then composed into a struct representing the corpus of data used by the comparison algorithms.

name = "Alice Liddell"

Akin.Util.compose(name)
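
The normalization steps described above can be approximated with the Elixir standard library alone. This is a rough sketch under stated assumptions, not what Akin.Util.compose/1 actually does: `PreprocessSketch` is a hypothetical module, and replacing non-text characters with a space is an assumption.

```elixir
defmodule PreprocessSketch do
  def normalize(string) do
    string
    |> String.downcase()
    # Decompose accented letters into a base letter plus combining marks.
    |> String.normalize(:nfd)
    # Drop the combining marks, leaving the base letters.
    |> String.replace(~r/\p{Mn}/u, "")
    # Replace anything that is not a letter, number, or whitespace with a space.
    |> String.replace(~r/[^\p{L}\p{N}\s]/u, " ")
    # Collapse runs of whitespace and trim both ends.
    |> String.split()
    |> Enum.join(" ")
  end
end

PreprocessSketch.normalize("  Zoë   Saldaña! ")
# => "zoe saldana"
```

Decomposition only helps for letters that Unicode defines as a base letter plus a mark. A letter such as "Ł" has no such decomposition, so a sketch like this leaves it in place and an explicit character mapping would be needed to turn it into "l".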

Accents

name_a = "Hubert Łępicki"

Akin.Util.compose(name_a)

name_b = "Hubert Lepicki"

Akin.compare(name_a, name_b)

Phonemes

Akin.phonemes(name)
Akin.phonemes("wonderland")

Independent Algorithms

Each algorithm can be called directly. Module names are CamelCase versions of the snake_case algorithm names returned by list_algorithms/0.

a = Akin.Util.compose("weird")
b = Akin.Util.compose("wierd")
Akin.BagDistance.compare(a, b)
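
For intuition about what a bag distance measures, here is a standalone sketch that works on plain strings. `BagSketch` is a hypothetical module, and normalizing by the longer string's length is an assumption, so its scores may not equal those of Akin.BagDistance.

```elixir
defmodule BagSketch do
  # Bag distance: treat each string as a multiset of graphemes and take the
  # larger of the two multiset differences.
  def distance(a, b) do
    ga = String.graphemes(a)
    gb = String.graphemes(b)
    # List subtraction removes one occurrence per element, so it behaves
    # like a multiset difference.
    max(length(ga -- gb), length(gb -- ga))
  end

  # Assumed normalization: 1.0 means the bags are identical.
  def similarity(a, b) do
    longest = max(String.length(a), String.length(b))
    if longest == 0, do: 1.0, else: 1 - distance(a, b) / longest
  end
end

BagSketch.distance("weird", "wierd")    # => 0
BagSketch.similarity("weird", "wierd")  # => 1.0
BagSketch.distance("write", "writing")  # => 3
```

Because letter order is ignored, a transposition such as "weird" and "wierd" costs nothing under this measure.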

Hamming Distance is excluded from list_algorithms/0 and the combined algorithm metrics as it only compares strings of equal length. To use the Hamming Distance algorithm, call it directly.

Akin.Hamming.compare("weird", "wierd")
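
The equal-length restriction follows from the definition: Hamming distance counts the positions at which two strings differ, which is only defined when every position has a counterpart. A minimal sketch, using a hypothetical `HammingSketch` module whose return shape is an assumption and may differ from Akin.Hamming:

```elixir
defmodule HammingSketch do
  # Count the positions where the graphemes differ.
  # Returns an error tuple when the lengths differ.
  def distance(a, b) do
    ga = String.graphemes(a)
    gb = String.graphemes(b)

    if length(ga) == length(gb) do
      {:ok, ga |> Enum.zip(gb) |> Enum.count(fn {x, y} -> x != y end)}
    else
      {:error, :unequal_length}
    end
  end
end

HammingSketch.distance("weird", "wierd")
# => {:ok, 2}
HammingSketch.distance("weird", "weirdo")
# => {:error, :unequal_length}
```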