This is the first of a two-part post on how we built DoneDone’s global search catalog. In this post, we’ll take a look at the software we use to power our global search, how queries work, and how we structure the records needed to perform a search. In next week’s post, we’ll dive into how the search catalog is built and actively maintained. So, let’s get going!
In 2013, we released global search for DoneDone. This feature allows you to search for any set of words across all issues you have access to. How did we build our search engine and how does it stay up-to-date? I hope this two-part post is a helpful guide for those of you wondering and looking to implement something similar.
Rather than add full-text cataloging to our master SQL Server database, we built the search database separately using Lucene.Net—a port of the Lucene search software originally written in Java by Doug Cutting. Isolating the search feature away from our master database was a pragmatic decision. It let us focus on building out search without having to worry about affecting performance on our master database.
A (quick) introduction to Lucene
Lucene is an open-source document database that comes with its own code library and querying language optimized for searching against text.
A document database takes the traditional approach of a NoSQL database: the entirety of the data concerning a business object lives inside one record, called a document. In contrast, our master database is relational: we tend to store pieces of information for a business object in multiple records across multiple tables to keep data normalized. In Lucene, we don’t care about normalization. We flatten out (and sometimes repeat) data so it’s optimized for search and retrieval.
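To make the contrast concrete, here is a minimal Python sketch of flattening normalized rows into one self-contained document. The table shapes and sample values are hypothetical, invented for illustration; only the field names (which come up later in this post) are ours.

```python
# Hypothetical normalized rows, roughly as a relational database might hold them.
issues = {7: {"number": 188, "title": "Login page times out",
              "project_id": 3, "priority_id": 2}}
priorities = {2: "High"}
histories = [
    {"id": 501, "issue_id": 7, "comment": "Happens after 30 seconds of inactivity."},
]

def flatten(history):
    """Build one self-contained search document from a history row."""
    issue = issues[history["issue_id"]]
    return {
        "IssueTitle": issue["title"],          # repeated on every document for the issue
        "IssueNum": str(issue["number"]),      # field values are stored as strings
        "Description": history["comment"],
        "Priority": priorities[issue["priority_id"]],
        "ProjectID": str(issue["project_id"]),
    }

doc = flatten(histories[0])
```

Everything a search result needs now travels with the document, so no joins are needed at query time.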
A document consists of a set of fields. Each field has a unique string name and a string value, along with a store type and an index type. We’ll touch upon the index type a bit later.
At a high level, search is straightforward. Once you’ve created a collection of documents within Lucene, you can run a query against the fields within these documents. Lucene will then return matching documents (i.e. a set of records) in order of relevance. You can then access the fields within these documents to display search results any way you want.
Lucene’s low-level magic
At a lower level, there’s a good amount of magic (read: code we don’t have to write) that goes on behind the scenes. With the querying language, you can tell Lucene which matches are required. You can also weight matches against certain fields over others. As a simple example, if you have a Title field and a Description field in your documents, a search query like the one below will weight a match on the title twice as heavily as a match on the description.
Title:"Hello world"^2 Description:"Hello world"
The benefit of weighting is order. With the right selection of weights, you can push more relevant matches to the top of your search results.
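In practice, a query like this tends to be assembled from user input rather than typed by hand. The sketch below shows one way to build the boosted query string in Python. The helper name `field_clause` is hypothetical, and the escaping covers only backslashes and double quotes inside a phrase.

```python
def field_clause(field, phrase, boost=None):
    """Return a Lucene phrase clause such as Title:"Hello world"^2."""
    # Escape backslashes first, then double quotes, so the phrase stays quoted.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    clause = f'{field}:"{escaped}"'
    if boost is not None:
        clause += f"^{boost}"      # the boost follows the phrase with no space
    return clause

query = " ".join([
    field_clause("Title", "Hello world", boost=2),
    field_clause("Description", "Hello world"),
])
print(query)  # Title:"Hello world"^2 Description:"Hello world"
```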
Lucene also takes care of highlighting and fragmenting descriptions where the best matches occur. Suppose we perform a search for “lazy dog” in the Description field of a document with this value:
“The quick brown fox jumps over the lazy dog. It is a very rainy day, so the fox is lucky that it didn’t slip when it jumped. The lazy dog was, as you might expect, none the wiser. The lazy dog is, after all, a lazy dog.”
In Lucene’s code library, the GetBestFragments() method will return a tailored string. You can tell Lucene how to stylize relevant matches, how to concatenate matches within long strings, and how many matches to return. In this case, we tell it to display matches as bold text, use ellipses to separate fragments, and return only two matches:
“The quick brown fox jumps over the lazy dog…The lazy dog was, as you might expect, none the wiser.”
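To give a feel for what the highlighter is doing, here is a toy Python approximation. It is not Lucene’s algorithm: it splits on sentence boundaries and takes matches in document order, whereas Lucene fragments and scores the text itself. The function name and the `<b>` markup are assumptions made for the example.

```python
import re

def best_fragments(text, phrase, max_fragments=2, separator="…"):
    """Return up to max_fragments matching sentences with the phrase in bold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    matching = [s for s in sentences if pattern.search(s)]
    highlighted = [
        pattern.sub(lambda m: f"<b>{m.group(0)}</b>", s)
        for s in matching[:max_fragments]
    ]
    return separator.join(highlighted)

description = (
    "The quick brown fox jumps over the lazy dog. It is a very rainy day, "
    "so the fox is lucky that it didn't slip when it jumped. The lazy dog was, "
    "as you might expect, none the wiser. The lazy dog is, after all, a lazy dog."
)
snippet = best_fragments(description, "lazy dog")
```

Here `snippet` keeps the first and third sentences, drops the sentence about the rainy day, and wraps each kept match in bold tags.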
There are plenty more magical bits to Lucene. As is the case with most open-source projects, documentation is a bit hard to find (here’s one good resource). But, the library itself provides detailed comments for you to go hunting around.
Defining our document structure
Here’s what a typical search result looks like in DoneDone:
So, what does the anatomy of a document in our search database look like? Ideally, we want everything we need to display, search, filter, and manage a search result comprised within one single document. That includes the data exposed in our search results as well as data we need behind-the-scenes to manage each document.
From just looking at the data we display, we already know a few fields we’ll need in each document: the issue title, the issue number, the project, the date, and the description. We’ll also need the priority level (so that we can display the appropriate priority color) and the status type (we strike through the issue title if that issue is closed or fixed).
Here’s a breakdown of how we named those fields in our documents, whether they are used in the search query, and how they are displayed in the search results:
|Field|Searchable?|How it’s used|
|IssueTitle|Partial text matching or No*|Displayed in search results|
|IssueNum|Exact match or No*|Displayed in search results|
|Description|Partial text matching|Fragments of the description are displayed in search results|
|(date field)|No|Displayed in search results|
|(status type field)|No|If “closed” or “fixed”, the issue title will display with a strikethrough|
|Priority|No|Used to display the correct priority color on the issue number and title|
|ProjectID|No|Maps the master database ID of a project to its name so it can be displayed|
*-Search-ability depends on the document, which we’ll explain below.
How we index fields
As I mentioned earlier, a field consists of not only a name and a value, but also an index type. The index type tells Lucene if and how to index that field in the database. For our purposes, we use one of three options: no index, ANALYZED, or NOT_ANALYZED.
If you don’t index a field, you won’t be able to search against the value of that field. If you choose an ANALYZED index, Lucene will be able to partially match against the value of the field; you can specify whether the query requires some or all of the text to match in order for a document to be returned. A NOT_ANALYZED index requires an exact match on that field’s value for the document to be returned.
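The difference between the three options can be modeled in a few lines of Python. This is a toy model, not Lucene code: lowercasing and splitting on whitespace is a crude stand-in for a real analyzer, and the `matches` helper is hypothetical.

```python
def matches(index_type, field_value, query_text):
    """Toy model of how each index type decides whether a field matches."""
    if index_type == "NO":
        return False                               # not indexed, never searchable
    if index_type == "NOT_ANALYZED":
        return field_value == query_text           # the whole value, verbatim
    if index_type == "ANALYZED":
        terms = set(field_value.lower().split())   # crude stand-in for an analyzer
        # "Some of the text" semantics: any query word may match.
        return any(word in terms for word in query_text.lower().split())
    raise ValueError(f"unknown index type: {index_type}")

partial = matches("ANALYZED", "The lazy dog sleeps", "lazy")      # True
exact = matches("NOT_ANALYZED", "42", "42")                       # True
near_miss = matches("NOT_ANALYZED", "The lazy dog sleeps", "lazy")  # False
unindexed = matches("NO", "High", "High")                         # False
```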
In our case, we want to place an ANALYZED index on the Description field. This lets us do all the Lucene magic of partial text matching we discussed earlier. In contrast, we don’t place an index on display-only fields such as the date, status type, and Priority. Those fields simply come along for the ride if a document matches on the other fields, so they can be used in the displayed results.
We place a NOT_ANALYZED index on the ProjectID field. This field serves three purposes:
- First, we use it to map to a list of project ID/name pairs available in memory after the search executes. This allows us to display the project name alongside the search result.
- Second, we allow users to filter results by project by clicking on the “Viewing…” link. When a specific project is selected, we add an additional query parameter that tells Lucene to only return documents whose ProjectID value exactly matches the value from the incoming request.
- Lastly, and along similar lines, the ProjectID field also ensures the user doing the search has permission to see that search result. Since our search catalog is global, rather than partitioned by account, we need a way to ensure a user doesn’t get results from a project they don’t belong to. Along with each request, we pass in a list of ProjectIDs that a user has access to in DoneDone. That list of IDs gets passed into the query. If a document’s ProjectID doesn’t match any of the ProjectIDs a user has access to, it doesn’t get returned.
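The second and third purposes can be sketched as query construction. The Python below builds a query string in Lucene’s classic syntax, where a leading `+` marks a required clause and `ProjectID:(3 8 15)` matches any of the listed values. Expressing the filter as a string is an assumption made for illustration (the same thing could be done by composing query objects in code), and `build_search_query` is a hypothetical helper.

```python
def build_search_query(text_query, allowed_project_ids, selected_project_id=None):
    """Wrap a user's text query with permission and project filters."""
    if not allowed_project_ids:
        raise ValueError("user has no accessible projects")
    clauses = [f"+({text_query})"]                    # the text itself must match
    allowed = " ".join(str(pid) for pid in allowed_project_ids)
    clauses.append(f"+ProjectID:({allowed})")         # permission filter, always applied
    if selected_project_id is not None:
        clauses.append(f"+ProjectID:{selected_project_id}")  # optional "Viewing…" filter
    return " ".join(clauses)

query = build_search_query('Description:"lazy dog"', [3, 8, 15], selected_project_id=8)
print(query)  # +(Description:"lazy dog") +ProjectID:(3 8 15) +ProjectID:8
```

Because the permission clause is always required, selecting a project the user cannot access simply yields no results.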
For IssueTitle and IssueNum, things get a little more interesting.
Variable indexing on an issue’s title and number
In DoneDone, an issue starts with a title and a description. After that, there might be edits and various comments on the issue. We display this additional dialogue chronologically on the issue detail page. Internally, we store each update to an issue (including the initial creation of the issue) as an issue history record.
We want to break down the searchable pieces of an issue in a similar fashion. If we stuffed the contents of an entire issue into one search document, we’d lose the flexibility of better contextual matching. For instance, you might have a dozen matching results for a single issue spread across five different comments from five different people. We want to list those as five separate search results rather than one result. Doing this also lets us directly link to the matching comment for each result (via an in-page anchor), rather than to the top of the issue detail page.
In order to get this granularity in the search results, each issue history record in our master database corresponds to a single document in our search database. If an issue has 12 histories (including the creation of the issue), there will be 12 corresponding documents in our search database. The comment added to each history corresponds to the Description field in the search document.
However, this also presents a conundrum. We include the IssueTitle and IssueNum in each search document. At first glance, we might want to add an ANALYZED index on the IssueTitle, just as we do for the Description. We also might want to add a NOT_ANALYZED index on the IssueNum (this allows users to search for matches by issue number, e.g. #188).
But if we applied these indexes to every search document, a match on an issue’s title would return all of the documents for that issue. If an issue’s title matched a search query and the issue had 12 histories, 12 documents for that issue would be returned.
Instead, we only apply an index on IssueTitle and IssueNum if the corresponding issue history record has a type of CREATION. For all other histories (status updates, priority updates, fixer and tester reassignments, general comments or edits), we don’t index these two fields at all; they are merely used for display purposes. The ability to index a field for only certain documents lets you get pretty creative with your search logic!
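The rule boils down to a small decision made when each document is built. Here is a minimal Python sketch of it. Only the CREATION type name comes from our system; the other history type names and the `index_types_for` helper are hypothetical.

```python
def index_types_for(history_type):
    """Choose an index type per field based on the history record's type."""
    is_creation = history_type == "CREATION"
    return {
        "IssueTitle": "ANALYZED" if is_creation else "NO",
        "IssueNum": "NOT_ANALYZED" if is_creation else "NO",
        "Description": "ANALYZED",   # every history's comment stays searchable
    }

# Three histories for one issue produce three documents,
# but only one of them is searchable by title or number.
history_types = ["CREATION", "COMMENT", "STATUS_UPDATE"]
title_searchable = [t for t in history_types
                    if index_types_for(t)["IssueTitle"] != "NO"]
```

With this in place, a title match returns a single result for the issue, while comment matches still surface individually.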
Rounding out the document structure
So far, we’ve only discussed the fields in a search document that directly affect how a result displays. But, as I mentioned earlier, we also need a few identifiers behind the scenes to manage additions, updates, and deletions of existing documents. Here they are; we’ll explain why each one is needed next:
|Field|Searchable?|How it’s used|
|IssueID|Exact match|Used to update or delete documents if the corresponding issue record is updated or deleted in the master database|
|IssueHistoryID|Exact match|Used to update a document if the corresponding history record is updated in the master database|
|IssueHistoryType|No|Determines whether the document’s IssueTitle and IssueNum are indexed|
We include the IssueHistoryID (a search document’s corresponding issue history ID in the master database) for two reasons. First, it lets us create the URL for each result, which includes an in-page anchor to the specific comment where the search query matched. Second, we leverage this ID if an issue history is updated. That’s why we put a NOT_ANALYZED index on this field.
We include the IssueHistoryType so that we can track whether the IssueTitle and IssueNum should be indexed (as described earlier). When the document is initially added to the search database, we don’t need to access this field. But when an issue is updated, we will. More on this in our next post.
Finally, we include the IssueID (the ID of an issue in the master database). If the title of an issue is ever updated, we’d need to update the title on the documents for all issue histories related to that issue. That’s also why we put a NOT_ANALYZED index on this field.
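As a rough illustration of why the exact-match IssueID matters, the Python sketch below retitles every document belonging to one issue. An in-memory list stands in for the search database here, and `retitle_issue` is a hypothetical helper; how the real catalog is kept up to date is the subject of the next post.

```python
def retitle_issue(documents, issue_id, new_title):
    """Return a new catalog with IssueTitle replaced on every document of one issue."""
    updated = []
    for doc in documents:
        if doc["IssueID"] == issue_id:               # exact match on the identifier
            doc = {**doc, "IssueTitle": new_title}   # build a replacement document
        updated.append(doc)
    return updated

catalog = [
    {"IssueID": "7", "IssueHistoryID": "501", "IssueTitle": "Login times out"},
    {"IssueID": "7", "IssueHistoryID": "502", "IssueTitle": "Login times out"},
    {"IssueID": "9", "IssueHistoryID": "610", "IssueTitle": "Typo on pricing page"},
]
catalog_after = retitle_issue(catalog, "7", "Login page times out")
```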
So, now you know a bit more about Lucene.Net—the engine behind DoneDone’s global search. You also learned how we structure documents in our search database as well as how fields are indexed. All of this is great in theory, but how do we actually create and update these documents in real-time? Let’s find out!