How we built and maintain our global search catalog (Part 1)


This is the first of a two-part post on how we built DoneDone’s global search catalog. In this post, we’ll take a look at the software we use to power our global search, how queries work, and how we structure the records needed to perform a search. In next week’s post, we’ll dive into how the search catalog is built and actively maintained. So, let’s get going!

In 2013, we released global search for DoneDone. This feature allows you to search for any set of words across all issues you have access to. How did we build our search engine and how does it stay up-to-date? I hope this two-part post is a helpful guide for those of you wondering and looking to implement something similar.

Rather than add full-text cataloging to our master SQL Server database, we built the search database separately using Lucene.Net—a port of the Lucene search software originally written in Java by Doug Cutting. Isolating the search feature from our master database was a pragmatic decision. It let us focus on building out search without having to worry about affecting performance on our master database.

A (quick) introduction to Lucene

Lucene is an open-source document database that comes with its own code library and querying language optimized for searching against text.

A document database takes the approach typical of NoSQL databases: the entirety of the data for a business object lives inside one record, called a document. In contrast, our master database is relational: we tend to store pieces of information for a business object in multiple records across multiple tables to keep the data normalized. In Lucene, we don't care about normalization. We flatten out (and sometimes repeat) data so it's optimized for search and retrieval.

A document consists of a set of fields. Each field has a unique string name, a string value, a store type, and an index type. We'll touch on the index type a bit later.

At a high level, search is straightforward. Once you’ve created a collection of documents within Lucene, you can run a query against the fields within these documents. Lucene will then return matching documents (i.e. a set of records) in order of relevance. You can then access the fields within these documents to display search results any way you want.

Lucene’s low-level magic

At a lower level, there's a good amount of magic (read: code we don't have to write) going on behind the scenes. With the querying language, you can tell Lucene which matches are required. You can also weight matches on certain fields more heavily than others. As a simple example, if your documents have Title and Description fields, a query like the one below will weight a match on the title twice as much as one on the description.

Title:"Hello world"^2 Description:"Hello world"

The benefit of weighting is order. With the right selection of weights, you can push more relevant matches to the top of your search results.
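To make the weighting syntax concrete, here's a minimal Python sketch that assembles a query string like the one above. It's purely illustrative: `build_weighted_query` is a hypothetical helper, not part of Lucene's API.

```python
def build_weighted_query(phrase, field_boosts):
    """Assemble a Lucene-style query string that boosts some fields
    over others. field_boosts maps field name -> boost factor
    (a boost of 1 adds no ^ suffix)."""
    clauses = []
    for field, boost in field_boosts.items():
        clause = '{}:"{}"'.format(field, phrase)
        if boost != 1:
            clause += "^{}".format(boost)
        clauses.append(clause)
    return " ".join(clauses)

print(build_weighted_query("Hello world", {"Title": 2, "Description": 1}))
# Title:"Hello world"^2 Description:"Hello world"
```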

Lucene also takes care of highlighting and fragmenting descriptions where the best matches occur. Suppose we perform a search for “lazy dog” in the Description field of a document with this value:

“The quick brown fox jumps over the lazy dog. It is a very rainy day, so the fox is lucky that it didn’t slip when it jumped. The lazy dog was, as you might expect, none the wiser. The lazy dog is, after all, a lazy dog.”

In Lucene’s code library, the Highlighter class’s GetBestFragments() method will return a tailored string. You can tell Lucene how to stylize relevant matches, how to concatenate matches within long strings, and how many matches to return. In this case, we tell it to display matches as bold text, use ellipses to separate fragments, and return only two matches:

“The quick brown fox jumps over the lazy dog…The lazy dog was, as you might expect, none the wiser.”
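Lucene's Highlighter does all of this for you. Purely to illustrate the idea, here's a much-simplified Python stand-in that splits text on sentence boundaries, wraps each match in bold tags, and joins the best fragments with a separator. Lucene's actual fragmenting and relevance scoring are far more sophisticated.

```python
import re

def best_fragments(text, phrase, max_fragments=2, separator="..."):
    """Crude stand-in for Lucene's Highlighter.GetBestFragments():
    return up to max_fragments sentences containing the phrase,
    with each match wrapped in <b> tags."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    fragments = []
    for sentence in sentences:
        if pattern.search(sentence):
            fragments.append(
                pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", sentence))
        if len(fragments) == max_fragments:
            break
    return separator.join(fragments)
```

Run against the sample paragraph above, this returns the first two sentences containing "lazy dog", with each match bolded and an ellipsis between the fragments.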

There are plenty more magical bits to Lucene. As is the case with most open-source projects, documentation is a bit hard to find (here’s one good resource). But, the library itself provides detailed comments for you to go hunting around.

Defining our document structure

Here’s what a typical search result looks like in DoneDone:

DoneDone's global search feature

So, what does the anatomy of a document in our search database look like? Ideally, we want everything we need to display, search, filter, and manage a search result contained within a single document. That includes the data exposed in our search results as well as the data we need behind the scenes to manage each document.

From just looking at the data we display, we already know a few fields we’ll need in each document: the issue title, the issue number, the project, the date, and the description. We’ll also need the priority level (so that we can display the appropriate priority color) and the status type (we strike through the issue title if that issue is closed or fixed).

Here’s a breakdown of how we named those fields in our documents, whether they are used in the search query, and how they are displayed in the search results:

| Field Name | Searched? | Use |
| --- | --- | --- |
| IssueTitle | Partial text matching or no* | Displayed in search results |
| IssueNum | Exact match or no* | Displayed in search results |
| Description | Partial text matching | Fragments of the description are displayed in search results |
| CreatedOn | No | Displayed in search results |
| Status | No | If "closed" or "fixed", the issue title will display with a strikethrough |
| Priority | No | Used to display the correct priority color next to the issue number and title |
| ProjectID | No | Maps to the master database ID of a project so its name can be displayed |

* Searchability depends on the document, which we'll explain below.

How we index fields

As I mentioned earlier, a field not only consists of a name and a value, but an index type. The index type tells Lucene if and how to index that field in the database. For our purposes, we use one of three options: NO, ANALYZED, or NOT_ANALYZED.

If you don't index a field, you won't be able to search against its value. If you choose an ANALYZED index, Lucene can partially match against the value of the field; you can specify whether the query requires some or all of the text to match in order for a document to be returned. A NOT_ANALYZED index requires the query to match the field's value exactly for the document to be returned.
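As a rough mental model (not Lucene's actual implementation, which involves tokenizing, stemming, and relevance scoring), the three options behave something like this Python sketch:

```python
# The three index options we use, mirroring Lucene's Field.Index settings.
NO, ANALYZED, NOT_ANALYZED = "NO", "ANALYZED", "NOT_ANALYZED"

def field_matches(value, index_type, query):
    """Rough matching semantics for a single field value.
    Lucene's real analysis is far richer than this."""
    if index_type == NO:
        return False                  # unindexed fields never match
    if index_type == NOT_ANALYZED:
        return value == query         # the whole value must match exactly
    # ANALYZED: token-level matching against the field's words
    return query.lower() in value.lower().split()
```

For example, `field_matches("The lazy dog", ANALYZED, "lazy")` matches, while the same lookup under `NOT_ANALYZED` does not, because the query isn't the entire field value.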

In our case, we want to place an ANALYZED index on the Description field. This lets us do all the Lucene magic of partial text matching we discussed earlier. In contrast, we don’t place an index on the CreatedOn, Status, or Priority fields. Those fields simply come along for the ride if a document matches on the other fields, so they can be used in the displayed results.

We place a NOT_ANALYZED index on the ProjectID field. This field serves three purposes:

  • First, we use it to map to a list of project ID/name pairs available in memory after the search executes. This allows us to display the project name alongside the search result.
  • Secondly, we allow users to filter results by project by clicking on the “Viewing…” link. When a specific project is selected, we add an additional query parameter that tells Lucene to only return documents whose ProjectID value exactly matches the value from the incoming request.
  • Lastly, and along similar lines, the ProjectID field also ensures the user doing the search has permission to that search result. Since our search catalog is global, rather than partitioned by account, we need a way to ensure a user doesn’t get results from a project they don’t belong to. Along with each request, we pass in a list of ProjectIDs that a user has access to in DoneDone. That list of IDs gets passed into the query. If a document’s ProjectID doesn’t match any ProjectIDs a user has access to, it doesn’t get returned.
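Conceptually, the project clause acts like an exact-match filter over candidate results. In reality this happens inside the Lucene query itself, but a simplified Python sketch of the behavior (with hypothetical record shapes) looks like this:

```python
def filter_by_project(results, allowed_project_ids, selected_project_id=None):
    """Keep only results from projects the user can access; if the user
    has picked a specific project, narrow to that one. Each result is a
    dict with a ProjectID field, matched exactly (as with a NOT_ANALYZED
    index)."""
    allowed = set(allowed_project_ids)
    if selected_project_id is not None:
        allowed &= {selected_project_id}
    return [r for r in results if r["ProjectID"] in allowed]
```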

With IssueTitle and IssueNum, things get a little more interesting.

Variable indexing on an issue’s title and number

In DoneDone, an issue starts with a title and a description. After that, there might be edits and various comments on the issue. We display this additional dialogue chronologically on the issue detail page. Internally, we store each update to an issue (including the initial creation of the issue) as an issue history record.

Issue detail pages are composed of a series of issue history records

We want to break down the searchable pieces of an issue in a similar fashion. If we stuffed the contents of an entire issue into one search document, we’d lose the flexibility of better contextual matching. For instance, you might have a dozen matching results for a single issue spread across five different comments from five different people. We want to list those as five separate search results rather than one result. Doing this also lets us directly link to the matching comment for each result (via an in-page anchor), rather than to the top of the issue detail page.

In order to get this granularity in the search results, each issue history record in our master database corresponds to a single document in our search database. If an issue has 12 histories (including the creation of the issue), there will be 12 corresponding documents in our search database. The comment added to each history corresponds to the Description field in the search document.
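In sketch form, building the documents for one issue looks something like the Python below. The field names follow the tables in this post, but the record shapes and the helper itself are hypothetical, not our actual code.

```python
def documents_for_issue(issue, histories):
    """Produce one search document per issue history record, so each
    comment or edit becomes its own search result."""
    return [{
        "IssueID": issue["id"],
        "IssueNum": issue["number"],
        "IssueTitle": issue["title"],
        "ProjectID": issue["project_id"],
        "IssueHistoryID": h["id"],
        "IssueHistoryType": h["type"],
        "Description": h["comment"],
        "CreatedOn": h["created_on"],
    } for h in histories]
```

An issue with 12 history records yields 12 documents, each carrying the issue's title and number alongside its own history's comment.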

However, this also presents a conundrum. We include the IssueTitle and IssueNum for each search document. At first glance, we might want to add an ANALYZED index on the IssueTitle, just as we do for the Description. We also might want to add a NOT_ANALYZED index for the IssueNum (this allows users to search for matches by issue number — e.g. #188).

But if we applied the index to all search documents, then a match on an issue's title would return every document for that issue. If an issue's title matched a search query and the issue had 12 histories, all 12 documents would be returned.

Instead, we only apply an index on IssueTitle and IssueNum if the corresponding issue history record has a type of CREATION. For all other histories (status updates, priority updates, fixer and tester reassignments, general comments or edits), we don’t apply an index at all. Instead, they are merely used for display purposes. The ability to index a field for certain documents lets you get pretty creative with your search logic!
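That decision can be sketched as a small Python helper (the index-type names mirror the options above; the helper itself is hypothetical):

```python
CREATION = "CREATION"

def title_and_num_index_types(history_type):
    """Only the document for the issue's creation gets a searchable
    title and issue number; every other history document stores them
    unindexed, purely for display."""
    if history_type == CREATION:
        return {"IssueTitle": "ANALYZED", "IssueNum": "NOT_ANALYZED"}
    return {"IssueTitle": "NO", "IssueNum": "NO"}
```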

Rounding out the document structure

So far, we've only discussed the fields in a search document that directly affect how a result displays. But, as I mentioned earlier, we also need a few more fields to correctly update and manage existing documents.

Behind the scenes, there are a few other identifiers we need within the document to be able to manage additions, updates and deletions. We’ll explain why they’re needed next:

| Field Name | Searched? | Use |
| --- | --- | --- |
| IssueID | Exact match | Used to update or delete documents if the corresponding issue record is updated or deleted in the master database |
| IssueHistoryID | Exact match | Used to update a document if the corresponding history record is updated in the master database |
| IssueHistoryType | No | Determines whether the document's IssueTitle and IssueNum are searchable |

We include the IssueHistoryID (a search document’s corresponding issue history ID in the master database) for two reasons. First, it lets us create the URL for each result which includes an in-page anchor to the specific comment where the search query matched. Second, we leverage this ID if an issue history is updated. That’s why we put a NOT_ANALYZED index on this field.

We include the IssueHistoryType so that we can track whether the IssueTitle and IssueNum should be indexed (as described earlier). When the document is initially added to the search database, we don't need to access this field. But when an issue is updated, we will. More on this in our next post.

Finally, we include the IssueID (the ID of an issue in the master database). If the title of an issue is ever updated, we’d need to update the title on all issue histories related to that issue. That’s also why we put a NOT_ANALYZED index on this field.
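For example, when a title changes, the update conceptually looks like the sketch below. In practice this is a Lucene query on the NOT_ANALYZED IssueID field followed by document rewrites; here, the documents are plain dicts for illustration.

```python
def update_issue_title(documents, issue_id, new_title):
    """Rewrite IssueTitle on every search document whose IssueID
    matches, so all history documents for the issue show the new
    title. Returns the number of documents touched."""
    updated = 0
    for doc in documents:
        if doc["IssueID"] == issue_id:
            doc["IssueTitle"] = new_title
            updated += 1
    return updated
```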

--------

So, now you know a bit more about Lucene.Net—the engine behind DoneDone's global search. You also learned how we structure documents in our search database and how their fields are indexed. All of this is great in theory, but how do we actually create and update these documents in real time? We'll dig into that in next week's post!
