
XML Data Files

Intro

During our conversation with the authors of the original Lexin, they gave us access to the SVN repo that contains the original XML files with dictionary data: one file per supported language, plus an .xsd schema file.

We will not commit these files to our repository, but we want to find out whether we can use them to avoid redundant requests to the main API – this would also make our application faster, since we’d have a local cache of the dictionary data.

XML Tips & Tricks

Validation

First things first, we wanted to ensure that the data files we downloaded are valid XML files, and that they validate against the given XSD schema.

We can do that with the xmllint tool (installing it is left up to you):

$ xmllint --noout --schema LexinSchema.xsd */*.xml

In our case we got a lot of errors with the “tigrinska” (swe_tir.xml) data file, and a few more with the “svenska” (swe_swe.xml) one; the rest of the languages are valid.

Beautify

To re-format (pretty-print) an XML file, we can also use xmllint:

$ xmllint --format InputFile.xml > OutputFile.xml

Read from Language Files

Let’s try to use the “raw” XML language files we’ve downloaded to look up words. Let’s install some libraries to search and operate on the files.

Mix.install([
  {:exqlite, "~> 0.8.4"},
  {:meeseeks, "~> 0.16.1"}
])

Let’s read!

> A useful XPath playground: https://extendsclass.com/xpath-tester.html

defmodule LanguageReader do
  import Meeseeks.XPath

  # NB: the whole ~18MB file is read and parsed again on every call
  defp raw_xml(), do: File.read!("experiments/swe_rus_f.xml")
  defp parse(), do: raw_xml() |> Meeseeks.parse(:xml)

  def lookup(word) do
    parse()
    |> Meeseeks.all(xpath("//Word[@Value='#{word}']"))
  end

  def all_words() do
    parse()
    |> Meeseeks.all(xpath("//Word"))
  end

  # Collects the VariantID attribute of every Word element
  def all_ids() do
    for word <- Meeseeks.all(parse(), xpath("//Word")) do
      Meeseeks.attr(word, "VariantID")
    end
  end
end

# {
#   LanguageReader.all_words() |> Enum.count(),
#   LanguageReader.lookup("abc-bok")
# }

ids = LanguageReader.all_ids()

{
  ids |> Enum.count(),
  ids |> Enum.uniq() |> Enum.count()
}

SQLite

It takes a lot of time (~5 seconds) to look up a word directly in the raw XML file (18MB), so we want to find another approach to indexing and looking up words. We can try SQLite and convert these XML files into a new format, so that we have a database index over the available vocabulary.

We are limited in the datatypes we can use in an SQLite database (essentially INTEGER, REAL, TEXT, and BLOB), so we need to be inventive.

Every Word in the original XML file has ID and VariantID attributes. While ID sometimes repeats, VariantID seems to be unique (see LanguageReader.all_ids/0 and the counts above).
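
To double-check the ID side of that claim, here is a quick sketch of ours (not part of the original notebook) that reuses the same Meeseeks calls as above and counts duplicates among both attributes:

import Meeseeks.XPath

doc = File.read!("experiments/swe_rus_f.xml") |> Meeseeks.parse(:xml)
words = Meeseeks.all(doc, xpath("//Word"))

ids = Enum.map(words, &Meeseeks.attr(&1, "ID"))
variant_ids = Enum.map(words, &Meeseeks.attr(&1, "VariantID"))

{
  # > 0 here means some IDs repeat
  Enum.count(ids) - Enum.count(Enum.uniq(ids)),
  # expected to be 0 – VariantIDs look unique
  Enum.count(variant_ids) - Enum.count(Enum.uniq(variant_ids))
}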

defmodule SQLiteConverter do
  import Meeseeks.XPath

  defp xml(), do: File.read!("experiments/swe_rus.xml") |> Meeseeks.parse(:xml)

  # Returns one {variant_id, [{index_value, index_type}], definition} tuple
  # per Word element; the definition is the serialized markup of the whole
  # Word node, which we can store in a single TEXT column.
  def all_words() do
    xml()
    |> Meeseeks.all(xpath("//Word"))
    |> Enum.map(fn word ->
      id = Meeseeks.attr(word, "VariantID")

      indexes =
        for index <- Meeseeks.all(word, xpath("//Index")) do
          {
            Meeseeks.attr(index, "Value"),
            Meeseeks.attr(index, "type")
          }
        end

      definition = Meeseeks.html(word)

      {id, indexes, definition}
    end)
  end
end

SQLiteConverter.all_words()
{:ok, conn} = Exqlite.Sqlite3.open("swe_rus.sqlite")

# Create the structure
vocabulary_table = """
BEGIN;
DROP TABLE IF EXISTS "vocabulary";
CREATE TABLE "vocabulary" ("id" INTEGER PRIMARY KEY, "definition_id" INTEGER, "word" TEXT, "type" TEXT);
COMMIT;
"""

# CREATE INDEX "index_word" ON "vocabulary" ("word");
:ok = Exqlite.Sqlite3.execute(conn, vocabulary_table)

definitions_table = """
BEGIN;
DROP TABLE IF EXISTS "definitions";
CREATE TABLE "definitions" ("id" INTEGER PRIMARY KEY, "definition" TEXT);
COMMIT;
"""

:ok = Exqlite.Sqlite3.execute(conn, definitions_table)

SQLiteConverter.all_words()
|> Enum.each(fn {id, indexes, definition} ->
  # One "vocabulary" row per Index entry, all pointing to the same definition
  Enum.each(indexes, fn {word, type} ->
    word_sql = "INSERT INTO vocabulary (definition_id, word, type) VALUES (?1, ?2, ?3)"
    {:ok, statement} = Exqlite.Sqlite3.prepare(conn, word_sql)
    :ok = Exqlite.Sqlite3.bind(conn, statement, [id, word, type])
    :done = Exqlite.Sqlite3.step(conn, statement)
  end)

  # ...and a single "definitions" row keyed by the Word's VariantID
  definition_sql = "INSERT INTO definitions (id, definition) VALUES (?1, ?2)"
  {:ok, statement} = Exqlite.Sqlite3.prepare(conn, definition_sql)
  :ok = Exqlite.Sqlite3.bind(conn, statement, [id, definition])
  :done = Exqlite.Sqlite3.step(conn, statement)
end)

A quick look at the resulting file shows that it is bigger than the original XML, but not by much (~18MB of XML converts into ~26MB of SQLite). The conversion of the 18MB file takes around 100 seconds.

Word lookups became tremendously faster: 10–12ms instead of ~5s.

> We don’t notice any significant difference whether the “word” column is indexed or not.
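
To reproduce that experiment, the commented-out index from the table definition above can be created separately and the lookup re-run (a sketch, using the same conn as before):

# Optional: add the index from the commented-out CREATE INDEX line above
:ok = Exqlite.Sqlite3.execute(conn, ~s{CREATE INDEX "index_word" ON "vocabulary" ("word")})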

We can look up definitions like this:

SELECT d.id, d.definition FROM definitions d
JOIN vocabulary v ON d.id = v.definition_id
WHERE v.word LIKE 'fordon'
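
For completeness, here is a sketch of running the same query from Elixir, built only on the Exqlite calls we already used above; the Lookup module and its step_all/3 helper are ours, not part of the original notebook:

defmodule Lookup do
  # Collects all result rows by stepping the prepared statement until :done
  def step_all(conn, statement, acc \\ []) do
    case Exqlite.Sqlite3.step(conn, statement) do
      {:row, row} -> step_all(conn, statement, [row | acc])
      :done -> Enum.reverse(acc)
    end
  end

  def definitions(conn, word) do
    sql = """
    SELECT d.id, d.definition FROM definitions d
    JOIN vocabulary v ON d.id = v.definition_id
    WHERE v.word LIKE ?1
    """

    {:ok, statement} = Exqlite.Sqlite3.prepare(conn, sql)
    :ok = Exqlite.Sqlite3.bind(conn, statement, [word])
    step_all(conn, statement)
  end
end

# Wrapping the call in :timer.tc lets us verify the lookup timings above
:timer.tc(fn -> Lookup.definitions(conn, "fordon") end)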