XML Data Files
Intro
During our conversation with authors of original Lexin, they’ve gave us access to the SVN repo
that contains original XML files with dictionary data. One file per supported language plus
.xsd
schema file.
We will not commit these files to our repository, but we want to try to find a way if we can use them to avoid redundant requests to the main API – this will also make our application faster, if we provide a local cache for the dictionary data.
XML Tips & Tricks
Validation
First things first, we wanted tu ensure that the data files we downloaded are valid XML files, and if they valid against the given XSD schema.
There is a way how we can do that with xmllint
tool (find how to install it by yourself):
$ xmllint --noout --schema LexinSchema.xsd */*.xml
In our case we’ve got a lot of errors with “tigrinska” (swe_tir.xml
) data file, and a few
more with “svenska” (swe_swe.xml
) one; the rest of languages are valid.
Beautify
To re-format XML, we can also use xmllint
, like here:
$ xmllint --format InputFile.xml > OutputFile.xml
Read from Language Files
We have to try to use “raw” XML language files we’ve downloaded to lookup for the words. Let’s install some XML reading library to search and operate with the files.
Mix.install([
{:exqlite, "~> 0.8.4"},
{:meeseeks, "~> 0.16.1"}
])
Let’s read?
> Useful XPath-playground: https://extendsclass.com/xpath-tester.html
defmodule LanguageReader do
import Meeseeks.XPath
defp raw_xml(), do: File.read!("experiments/swe_rus_f.xml")
defp parse(), do: raw_xml() |> Meeseeks.parse(:xml)
def lookup(word) do
parse()
|> Meeseeks.all(xpath("//Word[@Value='#{word}']"))
end
def all_words() do
parse()
|> Meeseeks.all(xpath("//Word"))
end
def all_ids() do
for word <- Meeseeks.all(parse(), xpath("//Word")) do
Meeseeks.attr(word, "VariantID")
end
end
end
# {
# LanguageReader.all_words() |> Enum.count(),
# LanguageReader.lookup("abc-bok")
# }
ids = LanguageReader.all_ids()
{
ids |> Enum.count(),
ids |> Enum.uniq() |> Enum.count()
}
SQLite
It takes a lot of time (~5 seconds) to directly lookup a word in the raw XML file (18MB), so we want to find another approach to index words and lookup them. We can try to use SQLite and convert these XML files into new format, so we will have a database index over available vocabulary.
We are limited in the number of datatypes we can use in an SQLite database, so we need to be inventive.
Every Word
in the original XML file has ID
and VariantID
attributes. While ID
sometimes
repeats, VariantID
seems to be unique (see section and LanguageReader.all_ids/0
above).
defmodule SQLiteConverter do
import Meeseeks.XPath
defp xml(), do: File.read!("experiments/swe_rus.xml") |> Meeseeks.parse(:xml)
def all_words() do
xml()
|> Meeseeks.all(xpath("//Word"))
|> Enum.map(fn word ->
id = Meeseeks.attr(word, "VariantID")
indexes =
for index <- Meeseeks.all(word, xpath("//Index")) do
{
Meeseeks.attr(index, "Value"),
Meeseeks.attr(index, "type")
}
end
definition = Meeseeks.html(word)
{id, indexes, definition}
end)
end
end
SQLiteConverter.all_words()
{:ok, conn} = Exqlite.Sqlite3.open("swe_rus.sqlite")
# Create the structure
vocabulary_table = """
BEGIN;
DROP TABLE IF EXISTS "vocabulary";
CREATE TABLE "vocabulary" ("id" INTEGER PRIMARY KEY, "definition_id" INTEGER, "word" TEXT, "type" TEXT);
COMMIT;
"""
# CREATE INDEX "index_word" ON "vocabulary" ("word");
:ok = Exqlite.Sqlite3.execute(conn, vocabulary_table)
definitions_table = """
BEGIN;
DROP TABLE IF EXISTS "definitions";
CREATE TABLE "definitions" ("id" INTEGER PRIMARY KEY, "definition" TEXT);
COMMIT;
"""
:ok = Exqlite.Sqlite3.execute(conn, definitions_table)
SQLiteConverter.all_words()
|> Enum.map(fn {id, indexes, definition} ->
Enum.map(indexes, fn {word, type} ->
word_sql = "INSERT INTO vocabulary (definition_id, word, type) VALUES (?1, ?2, ?3)"
{:ok, statement} = Exqlite.Sqlite3.prepare(conn, word_sql)
:ok = Exqlite.Sqlite3.bind(conn, statement, [id, word, type])
:done = Exqlite.Sqlite3.step(conn, statement)
end)
definition_sql = "INSERT INTO definitions (id, definition) VALUES (?1, ?2)"
{:ok, statement} = Exqlite.Sqlite3.prepare(conn, definition_sql)
:ok = Exqlite.Sqlite3.bind(conn, statement, [id, definition])
:done = Exqlite.Sqlite3.step(conn, statement)
end)
Quick overview of the resulting file shows that its filesize is getting bigger (in comparison to original XML), but not much (~18MB of XML converts into ~26MB of SQLite). Conversion of 18MB file takes around 100 seconds.
Time to word lookups tremendously faster: 10-12ms instead of ~5s.
> We don’t notice any significant difference when we put an index on the “word” column, or when > it is not indexed.
We can lookup definitions like this:
SELECT d.id, d.definition FROM definitions d
JOIN vocabulary v ON d.id = v.definition_id
WHERE v.word LIKE "fordon"