Localized search on SUMO
My primary project at work is SUMO, the Firefox support site. It consists of a few parts, including a wiki, a question/answer forum, and a customized Twitter client for helping people with Firefox. It is also a highly localized site, with support in the code for over 80 languages. We don't have a community in all of those languages, but should one emerge, we are ready to embrace it. In other words, we take localization seriously.
Until recently, however, this embrace of multilingual coding didn't extend to our search engine. The search engine (based on ElasticSearch) assumed that all of the wiki documents, questions and answers, and forum posts were in English, and applied English-based tricks to improve search. No more! On Monday, I flipped the switch to enable locale-specific analyzer and query support in the search engine, and now many languages have improved search. Here, I will explain just what happened, and how we did it.
Background
Currently, we use two major tricks to improve search: stemming and stop words. These help the search engine behave in a way that is more consistent with how we generally understand language.
Stemming
Stemming is recognizing that words like "jump", "jumping", "jumped", and "jumper" are all related. They all stem from the common word "jump". In our search engine, this is done by enabling the ElasticSearch Snowball analyzer, which uses the Porter stemming algorithm.
Unfortunately, Porter is English-specific, because it stems algorithmically based on patterns in English, such as removing trailing "ing", "ed", or "er". The algorithm is much more complicated than that, but the point is, it really only works for English.
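Just to illustrate what stemming does (this isn't part of SUMO; it's the same Snowball stemming, exposed in Python through NLTK):

# Illustration only: the same kind of Snowball stemming that ES's Snowball
# analyzer applies, here via NLTK's Python implementation.
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["jump", "jumping", "jumped", "jumper"]:
    print(word, "->", stemmer.stem(word))
# For example, "jumping" and "jumped" both reduce to "jump".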
Stop Words
Stop words are words like "a", "the", "I", or "we" that generally carry little information for search purposes. ES includes a list of these words and intelligently removes them from documents and search queries during analysis.
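As a toy illustration (this is not how ES implements it, and the stop word list here is made up and much shorter than the real one):

# A toy version of stop word removal; ES's real English stop word list is longer.
STOP_WORDS = {"a", "an", "the", "to", "i", "we"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["how", "to", "delete", "cookies"]))
# ['how', 'delete', 'cookies']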
Analysis
ES is actually a very powerful system that can be used for many different kinds of search tasks (as well as other data slicing and dicing). One of the more interesting features that makes it more than just full-text search is its analyzers. There are many built-in analyzers, and there are ways to recombine analyzers and parts of analyzers to build custom behavior. If you really need something special, you could even write a plugin to add a new behavior, but that requires writing Java, so let's not go there.
The goal of analysis is to take a stream of characters and create a stream of tokens out of it. Stemming and stop words are things that can play into this process. These modifications to analysis actually change the document that gets inserted into the ES index, so we will have to take that into account later. If we insert a document containing "the dog jumped" into the index, it would get indexed as something like
[{ "token": "dog", "start": 4, "end": 7 }, { "token": "jump", "start": 8, "end": 14 }]
This isn't really what ES would return, but it is close enough. Note how the tokens inserted are the post-analysis versions, which include the changes made by the stop word and stemming token filters. That means the analysis process is language-specific, so we need to change the analyzer depending on the language. Easy, right? Actually, yes. This consists of a few parts.
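If you want to see for yourself what an analyzer does to a piece of text, ES has an _analyze API you can poke at directly. Here is a rough sketch in Python (the host and port are assumptions about where ES is running; the response keys are the real, slightly more verbose ones):

import requests

# Ask ES how the snowball analyzer tokenizes "the dog jumped".
resp = requests.get(
    "http://localhost:9200/_analyze",
    params={"analyzer": "snowball", "text": "the dog jumped"},
)
for token in resp.json()["tokens"]:
    print(token["token"], token["start_offset"], token["end_offset"])
# dog 4 7
# jump 8 14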
Choosing a language
SUMO is a Django app, so in settings.py, we define a map of languages to ES analyzers, like this (except with a lot more languages):
ES_LOCALE_ANALYZERS = {
    'en-US': 'snowball',
    'es': 'snowball-spanish',
}
Note: snowball-spanish is simply the normal Snowball analyzer with an option of {"language": "Spanish"}.
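For reference, defining an analyzer like that in the index settings looks roughly like this. This is the stock ES way to declare a custom Snowball analyzer, not necessarily the exact settings block from our code:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "snowball-spanish": {
                    "type": "snowball",
                    "language": "Spanish"
                }
            }
        }
    }
}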
Then we use this helper function to pick the right analyzer based on a locale, with a fallback. This also takes into account the possibility that some ES analyzers are provided by plugins which may not be available.
from django.conf import settings


def es_analyzer_for_locale(locale, fallback="standard"):
    """Pick an appropriate analyzer for a given locale.

    If no analyzer is defined for `locale`, return `fallback` instead,
    which defaults to the ES analyzer named "standard".
    """
    analyzer = settings.ES_LOCALE_ANALYZERS.get(locale, fallback)

    if (not settings.ES_USE_PLUGINS and
            analyzer in settings.ES_PLUGIN_ANALYZERS):
        analyzer = fallback

    return analyzer
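With the settings map above (and plugins out of the picture), the helper behaves like this. The 'xx' locale is made up, just to show the fallback:

es_analyzer_for_locale('es')     # 'snowball-spanish'
es_analyzer_for_locale('en-US')  # 'snowball'
es_analyzer_for_locale('xx')     # 'standard', the fallback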
Indexing
Next, the mapping needs to be modified. Prior to this change, we explicitly listed the analyzer for every analyzed field, such as the document content or document title. Now we leave the analyzer off, which causes those fields to use the default analyzer.
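A rough sketch of what such a mapping can look like. The field names match the example below, but the real mapping has many more fields, and marking locale as not_analyzed is my assumption here (it's typical for filter fields). The important part is that the analyzed string fields no longer specify an "analyzer":

{
    "document": {
        "properties": {
            "document_title": {"type": "string"},
            "document_content": {"type": "string"},
            "locale": {"type": "string", "index": "not_analyzed"}
        }
    }
}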
Finally, we can set the default analyzer on a per-document basis by setting the _analyzer field when indexing it into ES. This ends up looking something like this (this isn't the real code, because the real code is much longer for uninteresting reasons):
def extract_document(obj):
    return {
        'document_title': obj.title,
        'document_content': obj.content,
        'locale': obj.locale,
        '_analyzer': es_analyzer_for_locale(obj.locale),
    }
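So for a hypothetical Spanish wiki document (spanish_doc here is made up for illustration), the dict handed to ES would look something like:

extract_document(spanish_doc)
# {'document_title': 'Borrando Cookies',
#  'document_content': '...',
#  'locale': 'es',
#  '_analyzer': 'snowball-spanish'}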
Searching
This is all well and good, but what use is an index of documents if you can't query it correctly? Let's consider an example. If there is a wiki document with the title "Deleting Cookies", and a user searches for "how to delete cookies", here is what happens:
First, the document would have been indexed and analyzed, producing this:
[{ "token": "delet", "start": 0, "end": 8 }, { "token": "cooki", "start": 9, "end": 16 }]
So now, if we try to query "how to delete cookies", nothing will match! That is because we need to analyze the search query as well (ES does this by default). Analyzing the search query results in:
[
{ "token": "how", "start": 0, "end": 3 },
{ "token": "delet", "start": 7, "end": 13 },
{ "token": "cooki", "start": 14, "end": 21 }
]
Excellent! This will match the document's title pretty well. Remember that ElasticSearch doesn't require 100% of the query to match. It simply finds the best matches available, which can be confusing in edge cases, but in the normal case it works out quite well.
There is an issue though. Let's try this example in Spanish. Here is the document title "Borrando Cookies", as analyzed by our analysis process from above.
[{ "token": "borr", "start": 0, "end": 8 }, { "token": "cooki", "start": 9, "end": 16 }]
and the search "como borrar las cookies":
[
{ "token": "como", "start": 0, "end": 4 },
{ "token": "borrar", "start": 5, "end": 11 },
{ "token": "las", "start": 12, "end": 15 },
{ "token": "cooki", "start": 16, "end": 23 }
]
... Not so good. In particular, 'borrar', which is another verb form of 'Borrando' in the title, got analyzed as English, and so didn't get stemmed correctly. It won't match the token "borr" that was generated in the analysis of the document. So clearly, searches need to be analyzed in the same way as documents.
Luckily, in SUMO we know what language the user (probably) wants, because searches usually match the interface language. So if the user has a Spanish interface, we assume that the search is written in Spanish.
The original query that we use to do searches looks something like this much abbreviated sample:
{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies"
            }
        }
    }
}
The new query includes an analyzer field on the text match:
{
    "query": {
        "text": {
            "document_title": {
                "query": "como borrar las cookies",
                "analyzer": "snowball-spanish"
            }
        }
    }
}
This will result in the correct analyzer being used at search time.
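In code, building that query boils down to something like the following sketch. This is not the real kitsune code; the function name is made up, and I'm assuming the locale is handed in from the request. It reuses the same helper as indexing, so query-time analysis always matches index-time analysis:

def build_title_query(query_text, locale):
    """Build a text query that analyzes query_text with the locale's analyzer."""
    return {
        "query": {
            "text": {
                "document_title": {
                    "query": query_text,
                    # Same helper as indexing, so both sides agree.
                    "analyzer": es_analyzer_for_locale(locale),
                }
            }
        }
    }

# With a Spanish interface the locale is 'es', so the analyzer is 'snowball-spanish':
print(build_title_query("como borrar las cookies", "es"))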
Conclusion
This took me about three weeks, off and on, to develop, plus some chats with ES developers on the subject. Most of that time was spent researching and thinking about the best way to do localized search. Alternatives include having lots of fields, like document_title_es, document_title_de, etc., which seems icky to me, or using multifields to achieve a similar result. Another proposed idea was to use a different ES index for each language. Ultimately I decided on the approach outlined above.
For the implementation, modifying the indexing method to insert the right data into ES was the easy part, and I knocked it out in an afternoon. The difficult part was modifying our search code: working with the library we use to interact with ES to get it to support search analyzers, testing everything, and debugging the code that broke when the change was made. Overall, the task was easier than we had expected when we wrote it down in our quarterly goals, and I think it went well.
For more nitty-gritty details, you can check out the two commits to the mozilla/kitsune repo that I made these changes in: 1212c97 and 0040e6b.