Channel: Ramblings on life & code » James Tan

Efficient partial keyword searches

There are many ways to model and index data in MongoDB for efficient querying. This is straightforward in many cases, but in others it may require a bit more thought and insight to get an optimal solution.

In this post, we first take a look at the common use cases and corresponding indexing patterns. Then we examine the challenge of efficient partial and case-insensitive keyword searches in MongoDB, along with a proposed solution.

Single field and compound indexes

Simple use cases, such as key-value lookups, require only a single field index on the appropriate fields and MongoDB takes care of the rest. It (generally) does not matter if the field contains integers, strings, dates, arrays, or other data types.

Things are more complicated if compound indexes are needed (e.g. when there are multiple fields in the query and/or sorting is required), as the order of the fields in the index can make a big difference in performance. It is best to measure and compare the relative performance on a representative data set to determine the best option, as it varies with data distribution. For details, see Jesse’s blog post.

Dynamic attributes

Some use cases are hard to optimize with compound indexes alone, as the field names may not be known upfront and/or require an impractically large number of indexes (which can significantly reduce write performance). Fortunately, this can be tackled with the attribute lookup strategy, which is described in more detail on Asya’s blog.

Keywords – exact matches

Another common use case is to add a set of keywords or tags to individual documents in MongoDB for searching. This is explained in the keyword search pattern and works great for exact matches, but does not scale well when case-insensitivity and/or partial matches are needed.

For example, consider a collection called items with the following schema:

> db.items.findOne()
{ "_id" : ObjectId("55a9352c3a8670cdc9acd7c7"),
  "keys": [ "pYqCoxht", "6t9WDot0" ],
  ...
}

If we create an index on the keys field, we can then do exact matches on it very efficiently [1]:

// Create the index (createIndex supersedes the deprecated ensureIndex alias)
> db.items.createIndex({ keys: 1 })

// Exact match query
> db.items.find({ keys: "pYqCoxht" }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 0,
    "totalKeysExamined": 1,
    "totalDocsExamined": 1,
    ...
  }
}

Note that .explain(1) is shorthand for .explain("allPlansExecution"), which actually executes the query (MongoDB 3.0 and above) in order to obtain the execution statistics. For details, see cursor.explain().

Keywords – case insensitive searches

However, case-insensitive searches (using regular expressions) are far more expensive, as MongoDB cannot use the index effectively and ends up examining every index key. In this example, the items collection holds 8 million documents, each with 2 items in the keys array field:

// Case insensitive regular expression query
> db.items.find({ keys: /^pYqCoxht$/i }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 50984,
    "totalKeysExamined": 16000000,
    "totalDocsExamined": 8000000,
    ...
  }
}

Option #1: Convert the strings in keys to upper case before storing them in MongoDB, and then perform the same conversion on the query value so that an exact match can be used instead. For example:

// Strings in the 'keys' field are all uppercased 
> db.items.findOne()
{ "_id" : ObjectId("55a9352c3a8670cdc9acd7c7"),
  "keys": [ "PYQCOXHT", "6T9WDOT0" ],
  ...
}

// Convert the user-specified value to uppercase before executing the exact match query
> db.items.find({ keys: "pYqCoxht".toUpperCase() }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 0,
    "totalKeysExamined": 1,
    "totalDocsExamined": 1,
    ...
  }
}
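
Option #1 only works if the same conversion is applied on every write and every query. A minimal sketch of how that could be centralized in the application is shown below. The helper names normalizeKeys and exactMatchQuery are hypothetical and are not part of MongoDB:

```javascript
// Hypothetical helper: uppercase every keyword before the document is stored.
function normalizeKeys(keys) {
    return keys.map(function(key) {
        return key.toUpperCase();
    });
}

// Hypothetical helper: build an exact match query using the same conversion.
function exactMatchQuery(userValue) {
    return { keys: userValue.toUpperCase() };
}

var normalizedDoc = { keys: normalizeKeys([ "pYqCoxht", "6t9WDot0" ]) };
var exactQuery = exactMatchQuery("pyqcoxht");
// In the mongo shell, normalizedDoc would be inserted into db.items and
// exactQuery passed to db.items.find().
```

Keeping both helpers next to each other makes it harder for the stored values and the query values to drift apart.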

Option #2: Use text indexes, which are case insensitive. For example:

// Create a text index instead of a regular one
> db.items.createIndex({ keys: "text" })

// Use the text index with $text
> db.items.find({ $text: { $search: "pYqCoxht" } }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 0,
    "totalKeysExamined": 1,
    "totalDocsExamined": 1,
    ...
  }
}

However, take note of the restrictions and the fact that text indexes use stemming to determine the root word, so the search results may differ from those of an exact match.

Keywords – partial (and case insensitive) searches

Partial keyword searches can be performed with regular expressions, but unless the expression is left anchored (i.e. a "starts with" match such as /^abc/), MongoDB will again not be able to use the index efficiently. For example, using the same search value as the previous example but with the first and last characters removed:

// Unanchored partial search with regular expression
> db.items.find({ keys: /YqCoxh/ }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 50984,
    "totalKeysExamined": 16000000,
    "totalDocsExamined": 8000000,
    ...
  }
}

Performance-wise, this is similar to the case-insensitive regular expression search discussed in the previous section. Text indexes do not match partial words or substrings, so they cannot be used here. What can we do to make this faster?

One solution is to precompute all (upper cased) suffixes of each keyword and store them, so that a partial search becomes an efficient left anchored regular expression search. To do so, one can use the following reference JavaScript function, which keeps suffixes of at least 3 characters:

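function makeSuffixes(values) {
    var results = [];
    // Work on a copy so the caller's array is not reordered. Reverse sorting
    // puts a value before any of its own prefixes.
    values.slice().sort().reverse().forEach(function(val) {
        var tmp, hasSuffix;
        // Generate every suffix that is at least 3 characters long
        for (var i=0; i<val.length-2; i++) {
            tmp = val.substr(i).toUpperCase();
            hasSuffix = false;
            // Skip this suffix if it is a prefix of one already stored, as a
            // left anchored search would match the stored one anyway
            for (var j=0; j<results.length; j++) {
                if (results[j].indexOf(tmp) === 0) {
                    hasSuffix = true;
                    break;
                }
            }
            if (!hasSuffix) results.push(tmp);
        }
    });
    return results;
}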

This can be copied and pasted into the MongoDB shell, and then executed. For example, using the original keys values from the initial example, we can compute the suffixes:

> makeSuffixes([ "pYqCoxht", "6t9WDot0" ])
["PYQCOXHT",
 "YQCOXHT",
 "QCOXHT",
 "COXHT",
 "OXHT",
 "XHT",
 "6T9WDOT0",
 "T9WDOT0",
 "9WDOT0",
 "WDOT0",
 "DOT0",
 "OT0"
]
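
The reason this works is that any substring of a keyword is a prefix of one of its suffixes. The sketch below illustrates this in plain JavaScript with a simplified variant (no de-duplication across values). The names simpleSuffixes and matchesAnySuffix are hypothetical and only serve this illustration:

```javascript
// Simplified variant for illustration: all uppercased suffixes of one value
// that have at least minLength characters.
function simpleSuffixes(value, minLength) {
    var out = [];
    for (var i = 0; i + minLength <= value.length; i++) {
        out.push(value.slice(i).toUpperCase());
    }
    return out;
}

// True if 'term' is a prefix of at least one stored suffix, which is the
// check a left anchored regular expression on the suffixes performs.
function matchesAnySuffix(suffixes, term) {
    var needle = term.toUpperCase();
    return suffixes.some(function(s) {
        return s.indexOf(needle) === 0;
    });
}

var stored = simpleSuffixes("pYqCoxht", 3);
// A substring from the middle of the keyword is found...
var middleHit = matchesAnySuffix(stored, "YqCoxh");
// ...but the 2-character tail is not, as suffixes shorter than 3 are not stored.
var tailHit = matchesAnySuffix(stored, "ht");
```

The minimum suffix length is therefore a trade-off: a smaller value finds shorter trailing fragments but stores more entries per keyword.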

We can now add these computed suffixes to the documents accordingly. For example:

> db.items.findOne()
{ "_id"     : ObjectId("55a9352c3a8670cdc9acd7c7"),
  "keys"    : [ "pYqCoxht", "6t9WDot0" ],
  "suffixes": [ "PYQCOXHT", "YQCOXHT", "QCOXHT", "COXHT", "OXHT", "XHT", "6T9WDOT0",
                      "T9WDOT0", "9WDOT0", "WDOT0", "DOT0", "OT0" ],
  ... 
}

One can now perform partial and case insensitive searches efficiently with left anchored regular expressions. For example:

// Left anchored regular expression query
> db.items.find({ suffixes: /^YQCOXH/ }).explain(1)
{ ...
  "executionStats": {
    "nReturned": 1,
    "executionTimeMillis": 0,
    "totalKeysExamined": 2,
    "totalDocsExamined": 1,
    ...
  }
}

This approach naturally increases the document and index size, but the trade-off is well worth it, as it hugely speeds up such partial searches.
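
When the search value comes from a user, it needs to be upper cased and have any regular expression metacharacters escaped before it is turned into a left anchored expression. A minimal sketch follows. The helper name partialMatchQuery is hypothetical, and it assumes the suffixes field described above:

```javascript
// Hypothetical helper: turn raw user input into a left anchored query on
// the 'suffixes' field.
function partialMatchQuery(userValue) {
    // Escape regular expression metacharacters so the input is matched literally.
    var escaped = userValue.toUpperCase().replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return { suffixes: new RegExp("^" + escaped) };
}

var partialQuery = partialMatchQuery("yqcoxh");
// In the mongo shell, partialQuery would be passed to db.items.find().
```

Without the escaping step, input such as "a.b" would be treated as a pattern rather than as literal text.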

Try it out yourself

You can try this (and your own variants) by using the sample data generator suffix-generator.js. For example, save this locally and run it with the default settings:

$ mongo --quiet suffix-generator.js
THREADS    = 4
BATCH_SIZE = 2000
COUNT      = 100
COLL_NAME  = items

Dropped collection
Inserting 800000 docs...
3.17% (25378 inserts/sec)
...
100.00% (27586 inserts/sec)
Took 00:29

Creating index:
{ "keys" : 1 }
Took 00:08

Creating index:
{ "suffixes" : 1 }
Took 00:35

This will generate 800,000 documents using 4 threads in the test database, items collection. Each document has two random strings in the keys array field, with the computed suffixes in the suffixes field. Single field indexes on the fields keys and suffixes respectively are also created.

With the WiredTiger storage engine, you should get a collection with statistics similar to the following:

> db.items.stats(1024*1024)
{ "count" : 800000,
  "size" : 186,
  "avgObjSize" : 243,
  "storageSize" : 96,
  "nindexes" : 3,
  "totalIndexSize" : 118,
  "indexSizes" : {
    "_id_" : 6,
    "keys_1" : 21,
    "suffixes_1" : 90
  },
  ...
}

Note that the index on suffixes is about 4x larger than the index on keys.
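
A rough back-of-the-envelope check of this ratio can be done by counting the characters stored per keyword. The sketch below assumes 8-character keywords (as in the example document) and suffixes of at least 3 characters. It ignores per-entry overhead and storage engine compression, so it is only an approximation. The name suffixStats is hypothetical:

```javascript
// Count the suffixes generated for one keyword of 'length' characters, and
// the total number of characters they contain.
function suffixStats(length, minLength) {
    var count = 0, chars = 0;
    for (var n = length; n >= minLength; n--) {
        count += 1;   // one suffix of n characters
        chars += n;
    }
    return { count: count, chars: chars, ratio: chars / length };
}

var stats = suffixStats(8, 3);
// stats.count is 6, stats.chars is 33 and stats.ratio is 4.125
```

Under these assumptions, each keyword contributes roughly 4x as many characters to the suffixes index as to the keys index, which is in the same ballpark as the measured sizes.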

The default values for THREADS, BATCH_SIZE, COUNT (number of batches per thread), and COLL_NAME (collection name) can be overridden. For example:

$ mongo --quiet suffix-generator.js --eval "var THREADS=1; var COUNT=10"

Do let me know in the comments or via a pull request if any amendments or further improvements should be made.

