Elasticlunr
Description
Elasticlunr is a small, full-text search library for use in the Elixir environment. It indexes JSON documents and provides a friendly search interface to retrieve documents.
The library is built for web applications that do not require the deployment complexities of popular search engines while taking advantage of the Beam capabilities.
Imagine how much is gained when the search functionality of your application resides in the same environment (Beam VM) as your business logic; search resolves faster, the number of services (Elasticsearch, Solr, and so on) to monitor reduces.
Getting Started
Mix.install([
{:kino, "~> 0.4"},
{:elasticlunr, "~> 0.6"}
])
What’s an Index?
An index is a collection of structured data that is referred to when looking for results that are relevant to a specific query.
In RDBMS, a table can be likened to an index, meaning that you can store, update, delete and search documents in an index. But the difference here is that an index has a pipeline that every JSON document passes through before it becomes searchable.
alias Elasticlunr.{Index, Pipeline}
# the library comes with a default set of pipeline functions
pipeline = Pipeline.new(Pipeline.default_runners())
index = Index.new(pipeline: pipeline)
The above code block creates a new index with a pipeline of default functions that work with the English language.
The new index does not define the expected structure of the JSON documents to be indexed. To fix this, let’s assume we are building an index of blog posts, and each post consists of the author
, content
, category
, and title
attributes.
index =
index
|> Index.add_field("title")
|> Index.add_field("author")
|> Index.add_field("content")
|> Index.add_field("category")
Indexing Documents
Following our example or use-case above, to make the blog posts searchable we need to add them to the index so that they can be analyzed and transformed appropriately.
documents = [
%{
"id" => 1,
"author" => "Mark Ericksen",
"title" => "Saving and Restoring LiveView State using the Browser",
"category" => "elixir liveview browser",
"content" =>
"There are multiple ways to save and restore state for your LiveView processes. You can use an external cache like Redis, your database, or even the browser itself. Sometimes there are situations where you either can’t or don’t want to store the state on the server. In situations like that, you do have the option of storing the state in the user’s browser. This post explains how you use the browser to store state and how your LiveView process can get it back later. We’ll go through the code so you can add something similar to your own project. We cover what data to store, how to do it securely, and restoring the state on demand."
},
%{
"id" => 2,
"author" => "Mika Kalathil",
"title" => "Creating Reusable Ecto Code",
"category" => "elixir ecto sql",
"content" =>
"Creating a highly reusable Ecto API is one of the ways we can create long-term sustainable code for ourselves, while growing it with our application to allow for infinite combination possibilites and high code reusability. If we write our Ecto code correctly, we can not only have a very well defined split between query definition and combination/execution using our context but also have the ability to re-use the queries we design individually, together with others to create larger complex queries."
},
%{
"id" => 3,
"author" => "Mark Ericksen",
"title" => "ThinkingElixir 079: Collaborative Music in LiveView with Nathan Willson",
"category" => "elixir podcast liveview",
"content" =>
"In episode 79 of Thinking Elixir, we talk with Nathan Willson about GEMS, his collaborative music generator written in LiveView. He explains how it’s built, the JS sound library integrations, what could be done by Phoenix and what is done in the browser. Nathan shares how he deployed it globally to 10 regions using Fly.io. We go over some of the challenges he overcame creating an audio focused web application. It’s a fun open-source project that pushes the boundaries of what we think LiveView apps can do!"
},
%{
"id" => 4,
"title" => "ThinkingElixir 078: Logflare with Chase Granberry",
"author" => "Mark Ericksen",
"category" => "elixir podcast logging logflare",
"content" =>
"In episode 78 of Thinking Elixir, we talk with Chase Granberry about Logflare. We learn why Chase started the company, what Logflare does, how it’s built on Elixir, about their custom Elixir logger, where the data is stored, how it’s queried, and more! We talk about dealing with the constant stream of log data, how Logflare is collecting and displaying metrics, and talk more about Supabase acquiring the company!"
}
]
index = Index.add_documents(index, documents)
Search Index
The search results is a list of maps and each map contains specific keys, matched
, positions
, ref
, and score
. See the definitions below:
- matched: this field tells the number of attributes where the given query matches
- score: the value shows how well the document ranks compared to other documents
- ref: this is the document id
- positions: this is a map that shows the positions of the matching words in the document
search_query = Kino.Input.text("Search", default: "elixir")
search_query = Kino.Input.read(search_query)
results = Index.search(index, search_query)
NB: Don’t forget to fiddle with the search input.
Nested Document Attributes
As seen in the earlier example all documents indexed were without nested attributes. But Imagine a situation where your data source returns documents with nested attributes, and you want to search by these attributes - it’s possible with Elasticlunr by specifying the top-level attribute.
Let’s say our data source returns a list of users with their address which is an object and you want to index this information so that you can query them.
# the library comes with a default set of pipeline functions
pipeline = Pipeline.new(Pipeline.default_runners())
users_index =
Index.new(pipeline: pipeline)
|> Index.add_field("name")
|> Index.add_field("address")
|> Index.add_field("education")
Automatically, Elasticlunr will flatten the nested attributes to the level that when using the advanced query DSL you can use dot notation to filter the search results. Now, let’s add a few user objects to the index:
documents = [
%{
"id" => 1,
"name" => "rose mary",
"education" => "BSc.",
"address" => %{
"line1" => "Brooklyn Street",
"line2" => "4181",
"city" => "Portland",
"state" => "Oregon",
"country" => "USA"
}
},
%{
"id" => 2,
"name" => "jason richard",
"education" => "Msc.",
"address" => %{
"line1" => "Crown Street",
"line2" => "2057",
"city" => "St Malo",
"state" => "Quebec",
"country" => "CA"
}
},
%{
"id" => 3,
"name" => "peters book",
"education" => "BSc.",
"address" => %{
"line1" => "Murry Street",
"line2" => "2285",
"city" => "Norfolk",
"state" => "Virginia",
"country" => "USA"
}
},
%{
"id" => 4,
"name" => "jason mount",
"education" => "Highschool",
"address" => %{
"line1" => "Aspen Court",
"line2" => "2057",
"city" => "Boston",
"state" => "Massachusetts",
"country" => "USA"
}
}
]
users_index = Index.add_documents(users_index, documents)
search_query = Kino.Input.text("Search users", default: "jason murry")
search_query = Kino.Input.read(search_query)
Index.search(users_index, search_query)
Index Manager
The manager includes different CRUD functions to help you manage your index after mutating the state. First of all, let’s get indexes to manage by the manager:
alias Elasticlunr.IndexManager
IndexManager.loaded_indices()
As seen above the list is empty. Now let’s add an index:
IndexManager.save(users_index)
IndexManager.loaded_indices()
|> Enum.any?(&(&1 == users_index.name))
|> IO.inspect(label: :users_index_exists)
IndexManager.loaded_indices()
The manager now has the users_index
in memory for access.
Query DSL
Like every other search engine, you can make more advanced search queries depending on your requirements, and I’m pleased to tell you that Elasticlunr has not left out such capabilities. So, in the proceeding parts of this docs, I will be highlighting the available query types provided by the library and how you can use them.
It’s important to note that Elasticlunr tries to replicate popular Query DSL (Domain Specific Language)
with the same behavior as Elasticsearch, which means the learning curve reduces if you have
experience using the search engine. For Elasticlunr, there are the bool
, match
, match_all
,
not
, and terms
query types you can use to retrieve insights about an index.
Bool
The bool
query is used with a combination of queries to retrieve documents matching the boolean
combinations of clauses. Consider these clauses to be everything that comes after the SELECT
statement in relational databases.
The bool
query is built using one or more clauses to achieve desired results, and each clause
has its type, see below:
Clause | Description
—|—
must
| The clause must appear in the matching documents, and this affects the document’s score.
must_not
| The clause must not appear in the matching document. Scoring is ignored because the clause is executed in the filter context.
filter
| Like must
, the clause must appear in the matching documents but scoring is ignored for the query.
should
| The clause should appear in the matching document.
It’s important to note that only scores from the must
and should
clauses contribute to the
final score of the matching document.
Index.search(index, %{
"query" => %{
"bool" => %{
"must" => %{
"terms" => %{"content" => "use"}
},
"should" => %{
"terms" => %{"category" => "elixir"}
},
"filter" => %{
"match" => %{
"id" => 3
}
},
"must_not" => %{
"match" => %{
"author" => "mika"
}
},
"minimum_should_match" => 1
}
}
})
You can use the minimum_should_match parameter to specify the number or percentage of should clauses returned documents must match. If the bool query includes at least one should clause and no must or filter clauses, the default value is 1. Otherwise, the default value is 0.
Match
The match
query is the standard query used for full-text search, including support for fuzzy
matching. The provided text is analyzed before matching it against documents.
Index.search(index, %{
"query" => %{
"match" => %{
"content" => %{
"query" => "liveview browser"
}
}
}
})
A match
query accepts one or more top-level fields you wish to search, in the example above,
it’s the content
field. Note that when you have more than one top-level fields, the match
query is rewritten to a bool
query internally by the library. Now, let’s see what parameters
are accepted by the match
query below:
Parameter | Description
—|—
query
| String you wish to find in the provided field.
expand
| Increase token recall, see token expansion.
fuzziness
| Maximum edit distance allowed for matching.
boost
| Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0.
operator
| The boolean operator used to interpret the query
value. Available values for the operator
option are or
and and
. Defaults to or
.
minimum_should_match
| Minimum number of clauses that a document must match for it to be returned.
Match All
The most simple query, which matches all documents, gives them a score of 1.0 each.
Parameter | Description
—|—
boost
| Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0.
Index.search(index, %{
"query" => %{
"match_all" => %{}
}
})
Not
The not
query inverts the result of the nested query giving the matched documents a score of
1.0 each.
Index.search(index, %{
"query" => %{
"not" => %{
"match" => %{
"content" => %{
"query" => "elixir"
}
}
}
}
})
Terms
The query return documents that contain the exact terms in a given field. The terms
query should
be used to find documents based on a precise value such as a price, a product ID, or a username.
Index.search(index, %{
"query" => %{
"terms" => %{
"content" => %{
"value" => "think"
}
}
}
})
A terms
query accepts one or more top-level fields you wish to search, in the example above,
it’s the content
field. Note that when you have more than one top-level fields, the terms
query is rewritten to a bool
query internally by the library. Now, let’s see what parameters
are accepted by the terms
query below:
Parameter | Description
—|—
value
| A term you wish to find in the provided field. The term must match exactly the field value to return a document.
boost
| Floating point number used to decrease or increase the relevance scores of a query. Defaults to 1.0.