JMP


Full Text Search with IndexedDB

singpolyma@singpolyma.net

While working on Borogove there has been a desire to have full-text search of locally-stored chat message history. For web-based apps the main storage engine in use is IndexedDB, which is a rather low-level system and certainly doesn’t have easy facilities for this kind of operation. So what’s the simplest, performant way to implement a full-text search on IndexedDB?

In case you are wondering “what is full-text search?”: for our purposes we are going to be looking at a search where we want to find any message that contains all the words the user has typed, in any order.

Table Scan

While this won’t be our final solution, it is almost always the right place to start. If your data is small (say, under 10k documents) then a table scan is pretty quick, and definitely the simplest option.

// Helper so we can use promises with IndexedDB
// (onsuccess/onerror cover requests, oncomplete/onabort cover transactions)
function promisifyRequest(request) {
	return new Promise((resolve, reject) => {
	  request.oncomplete = request.onsuccess = () => resolve(request.result);
	  request.onabort = request.onerror = () => reject(request.error);
	});
}

const stopwords = ["and", "if", "but"];
function tokenize(s) {
	return s.toLowerCase().split(/\s*\b/)
		.filter(w => w.length > 1 && w.match(/\w/) && !stopwords.includes(w));
}

function stemmer(s) {
	// Optional, https://www.npmjs.com/package/porter2 or similar.
	// As a no-op placeholder, just return the token unchanged.
	return s;
}

async function search(q) {
	const qTerms = new Set(tokenize(q).map(stemmer));
	const tx = db.transaction(["messages"], "readonly");
	const store = tx.objectStore("messages");
	const cursor = store.openCursor();

	const result = [];
	while (true) {
		const cresult = await promisifyRequest(cursor);
		if (!cresult?.value) break;

		if (new Set(tokenize(cresult.value.text).map(stemmer)).isSupersetOf(qTerms)) {
			result.push(cresult.value);
		}
		cresult.continue();
	}

	return result;
}

Even though we aren’t doing anything fancy with the database yet, there are still a lot of important building blocks here

First we tokenize the query. This means chunking it up into “words”. Words don’t have to be exactly real words; they can be consistent parts of words or anything else, so long as you tokenize your query and your document text in the same way. Here we use a simple strategy: trust the \b word-break regex pattern, strip any extra whitespace, ignore any empty words or words made of only non-word characters (again, as determined by \w, nothing fancy), and also ignore any stopwords. A “stopword” is a very common word that is not useful to include in the search query, such as “and”. Mostly, stopwords are useful to avoid blowing up the index size later, but we include them here for consistency.
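To make the tokenizer’s behaviour concrete, here it is again as a stand-alone snippet with a sample input (same code as above, just run on example data):

```javascript
// Same stopword list and tokenizer as above
const stopwords = ["and", "if", "but"];
function tokenize(s) {
	return s.toLowerCase().split(/\s*\b/)
		.filter(w => w.length > 1 && w.match(/\w/) && !stopwords.includes(w));
}

console.log(tokenize("The flying and the quick!"));
// → ["the", "flying", "the", "quick"]
// "and" is dropped as a stopword, "!" as a one-character non-word token
```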

Next we stem the tokens. This is optional and depends on the kind of search you’re trying to build. The purpose of a stemmer is to make searches for e.g. “flying” also match messages containing the word “fly”.
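If you want to experiment without pulling in a package, a crude suffix-stripping function shows the idea (a toy sketch only; a real stemmer such as porter2 handles far more cases, and this naive version will mangle words like “sing”):

```javascript
// Toy stemmer: strip a few common English suffixes.
// A real implementation (e.g. the porter2 package) is much more careful.
function stemmer(s) {
	return s.replace(/(?:ing|ed|es|s)$/, "");
}

console.log(stemmer("flying"));  // → "fly"
console.log(stemmer("matches")); // → "match"
```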

Then we iterate over all the items in the object store. If you wanted the results in a particular order, you could instead iterate over an index on the store in that order. We tokenize and stem the text from each item and check if it contains all the “words” from qTerms; if so, it is part of the results.
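One caveat with the code above: Set.prototype.isSupersetOf only shipped across browsers in 2024, so on older runtimes you may want a small stand-in (a sketch):

```javascript
// Fallback superset check for runtimes without Set.prototype.isSupersetOf
function isSupersetOf(superset, subset) {
	for (const item of subset) {
		if (!superset.has(item)) return false;
	}
	return true;
}

const docTerms = new Set(["fly", "quick", "brown"]);
console.log(isSupersetOf(docTerms, new Set(["fly", "quick"]))); // → true
console.log(isSupersetOf(docTerms, new Set(["fly", "slow"])));  // → false
```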

Use an Index

Now, what if you have many messages (one million or more, perhaps) and the simple table scan is just too slow? How can we speed this up? With IndexedDB we only get one kind of index: an ordered index, usually built on a B-tree. This is not exactly what many full-text indexes are built on, but we can still get a very big performance boost without too much more complexity.

First in the onupgradeneeded migrator we need to create the index:

tx.objectStore("messages").createIndex("terms", "terms", { multiEntry: true });

This is a “multiEntry” index which means that the index will get one entry for each item in the array stored at this key, rather than indexing on the array as a whole piece. So when we store a new message we need to include the terms as an array:

tx.objectStore("messages").put({ text, timestamp, terms: [...new Set(tokenize(text).map(stemmer))] });

Now, this does not let us search by our full query, but rather only by one word. How does this help us? Well, we can iterate over only documents which match at least one of the terms in our query. Which one should we pick? Counting items is pretty fast, so let’s pick whichever one has the smallest number of results:

async function search(q) {
	const qTerms = new Set(tokenize(q).map(stemmer));
	const tx = db.transaction(["messages"], "readonly");
	const store = tx.objectStore("messages");
	const index = store.index("terms");
	// Figure out which search term matches the fewest messages
	let probeTerm = null;
	let probeScore = null;
	for (const term of qTerms) {
		const score = await promisifyRequest(index.count(IDBKeyRange.only(term)));
		if (!probeTerm || score < probeScore) {
			probeTerm = term;
			probeScore = score;
		}
	}
	// Using the smallest list of messages that match one term
	// Find the ones that match every term
	const result = [];
	const cursor = index.openCursor(IDBKeyRange.only(probeTerm));
	while (true) {
		const cresult = await promisifyRequest(cursor);
		if (!cresult?.value) break;
		if (new Set(cresult.value.terms).isSupersetOf(qTerms)) {
			result.push(cresult.value);
		}
		cresult.continue();
	}

	// Sort results
	return result.sort((a, b) => a.timestamp < b.timestamp ? -1 : (a.timestamp > b.timestamp ? 1 : 0));
}

The operation to count the index for each term is pretty fast, but if you found this prohibitive, you could also store these counts in their own keys and update them as you insert new messages. Once we know which is the smallest, we then do the same scan as before, but only over that much smaller subset. The tokenized and stemmed terms are stored so we can compare against those directly here rather than doing it again.
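The two-phase plan is easier to see without IndexedDB in the way. Here is an in-memory sketch of the same strategy, using a plain Map of term to matching message ids as a stand-in for the multiEntry index (the data here is made up for illustration):

```javascript
// Stand-in for the multiEntry index: term -> ids of messages containing it
const index = new Map([
	["fly",   [1, 2, 3, 4]],
	["quick", [2, 3]],
	["brown", [1, 2, 3, 4, 5]],
]);
// Stand-in for the object store: id -> stored term set
const messages = new Map([
	[1, new Set(["fly", "brown"])],
	[2, new Set(["fly", "quick", "brown"])],
	[3, new Set(["fly", "quick", "brown"])],
	[4, new Set(["fly", "brown"])],
	[5, new Set(["brown"])],
]);

function search(qTerms) {
	// Phase 1: probe for the term that matches the fewest messages
	let probeTerm = null;
	for (const term of qTerms) {
		const len = (index.get(term) ?? []).length;
		if (probeTerm === null || len < (index.get(probeTerm) ?? []).length) {
			probeTerm = term;
		}
	}
	// Phase 2: scan only that term's matches, keep ones containing every term
	return (index.get(probeTerm) ?? [])
		.filter(id => [...qTerms].every(t => messages.get(id).has(t)));
}

console.log(search(new Set(["fly", "quick"]))); // → [2, 3]
```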

Now if we want it sorted we have to do it ourselves, just like a DB engine would with this kind of query where the order we want does not match the index order.

On a test set of one million messages, this simple index was enough to take the performance from unusable grinding to almost instant responses, since the number of messages in the smallest term is usually still well under 10k.


Computing International Call Rates with a Trie

singpolyma@singpolyma.net

A few months ago we launched International calling with JMP.  One of the big tasks leading up to this launch was computing the rate card: that is, how much calls to different destinations would cost per minute.  While there are many countries in the world, there are even more calling destinations.  Our main carrier partner for this feature lists no fewer than 59881 unique phone number prefixes in the rates they charge us.  This list is, quite frankly, incomprehensible.  One can use it to compute the cost of a call to a particular number, but it gives no confidence about the cost of calls in general.  Many items on this list are similar, and so I set out to create a better list.

My first attempt was a simple one-pass algorithm.  This would record each prefix with its price and then, if a longer prefix with a different price were discovered, it would add that as well.  This removes the most obvious effectively-duplicate data, but still left a very large list.  I added our markup and various rounding rules (since increments of whole cents are easier to understand in most cases anyway, for example), which did cut things down a bit further, but it became clear that one pass was not going to be sufficient.  Consider:

  1. +00 at $0.01
  2. +0010 at $0.02
  3. +0011 at $0.02
  4. +0012 at $0.02
  5. +0013 at $0.02
  6. +0014 at $0.02
  7. +0015 at $0.02
  8. +0016 at $0.02
  9. +0017 at $0.02
  10. +0018 at $0.02
  11. +0019 at $0.02

There are many sets of prefixes that look like this in the data.  Of course the right answer here is that +001 is $0.02, which is much easier to understand than this list, but the algorithm cannot know that until it has seen all 10 overlapping prefixes.  Even worse:

  1. +00 at $0.01
  2. +0010 at $0.02
  3. +0011 at $0.02
  4. +0012 at $0.02
  5. +0013 at $0.02
  6. +0014 at $0.02
  7. +0015 at $0.03
  8. +0016 at $0.02
  9. +0017 at $0.02
  10. +0018 at $0.02
  11. +0019 at $0.02

From this input we would like:

  1. +00 at $0.01
  2. +001 at $0.02
  3. +0015 at $0.03

So just checking if the prefixes we have so far are a fully-overlapped set is not enough.  Well, no problem, it’s not that much data, perhaps I can implement a brute-force approach and be done with it.

Brute force is very slow.  On this data it did complete, but as I kept wanting to tweak the rounding rules and other parts of the overlap detection, the speed became really problematic.  So I went searching for a non-brute-force approach that would be optimal across all prefixes and fast enough to re-run often, in order to play with the effects of rounding rules.

Trie

As I was discussing the problem with a co-worker, trying to speed up lookups, we got to thinking about trees.  Maybe a tree where traversal to the next level is determined by the next digit of the prefix?  As we explored what this would look like, it became obvious that we were inventing a Trie.  So I grabbed a gem and started monkeypatching things.

Most Trie implementations are about answering yes/no questions and don’t store anything but the prefix in the tree.  I wanted to be able to “look down” from any node in the tree to see if the data was overlapping, and so storing rates right in the nodes seemed useful:

def add_with(chars, rate)
    if chars.empty? # leaf node for this prefix
        @rate = rate
        terminal!
    else
        add_to_children_tree_with(chars, rate)
    end
end

But sometimes we have a level that doesn’t have a rate, so we need to compute its rate from the majority-same rate of its children:

def rate
    # This level has a known rate already
    return @rate if @rate

    groups =
        children_tree.each_value.to_a         # Immediate children
        .select { |x| x.rate }                # That have a rate
        .combination(2)                       # Pairwise combinations
        .select { |(x, y)| x.rate == y.rate } # That are the same
        .group_by { |x| x.first.rate }        # Group by rate
    unless groups.empty?
        # Whichever rate has the most entries in the children is our rate
        @rate = groups.max_by { |(_, v)| v.length }.first
        return @rate
    end

    # No rate here or below
    nil
end

This algorithm is naturally recursive on the tree, so even if the immediate children don’t have a rate they will compute from their children, etc.  And finally a traversal to turn this all back into the flat list we want to store:

def each
    if rate
        # Find the rate of our parent in the tree,
        # possibly computed in part by asking us
        up = parent
        while up
            break if up.rate
            up = up.parent
        end

        # Add our prefix and rate to the list unless parent has it covered
        yield [to_s, rate] unless up&.rate == rate
    end

    # Add rates from children also
    children_tree.each_value do |child|
        child.each { |x| yield x }
    end
end
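The whole pipeline can also be sketched without the gem.  Here is a self-contained toy version of the same idea — nested hashes as trie nodes, a node adopting the most common child rate when at least two children agree, and a collapse pass — run on the worked example from above (this is an illustration of the approach, not the production monkeypatch):

```ruby
# Toy trie: each node is { rate: Float or nil, children: { "digit" => node } }
def add(node, digits, rate)
  if digits.empty?
    node[:rate] = rate
  else
    child = (node[:children][digits[0]] ||= { rate: nil, children: {} })
    add(child, digits[1..], rate)
  end
end

# A node with no rate of its own adopts the most common child rate,
# but only when at least two children agree (mirroring the pairwise check)
def rate_of(node)
  return node[:rate] if node[:rate]
  child_rates = node[:children].each_value.map { |c| rate_of(c) }.compact
  rate, count = child_rates.tally.max_by { |_, n| n } || []
  node[:rate] = rate if count && count >= 2
  node[:rate]
end

# Emit [prefix, rate] pairs, skipping any node whose nearest rated
# ancestor already covers it
def flatten(node, prefix = "", parent_rate = nil, out = [])
  r = rate_of(node)
  if r && r != parent_rate
    out << [prefix, r]
    parent_rate = r
  end
  node[:children].each { |d, c| flatten(c, prefix + d, parent_rate, out) }
  out
end

root = { rate: nil, children: {} }
add(root, "00", 0.01)
("0".."9").each { |d| add(root, "001" + d, d == "5" ? 0.03 : 0.02) }

p flatten(root, "+")
# => [["+00", 0.01], ["+001", 0.02], ["+0015", 0.03]]
```

This reproduces exactly the collapsed list we wanted from the second worked example: the ten overlapping prefixes fold into +001, with +0015 kept as the lone exception.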

This (with rounding rules, etc) cut the list from our original of 59881 down to 4818.  You can browse the result.  It’s not as short as I was hoping for, but many destinations are manageable now, and thanks to a little bit of Computer Science we can tweak it in the future and just rerun this quick script.

Creative Commons Attribution ShareAlike