Every once in a while I come across a problem that becomes a fun challenge. This time it was fuzzy searching.
This article is about me discovering a long used algorithm in text search. None of this is new, just new to me. It is the product of having a meaningful conversation with a coworker at MongoDB.
Understanding the problem is simple. Given the search term gtwdgshw, return all matches containing all search characters in order but allowing for space in-between each search character. For example, both gtkwidgetshow and gtkwidgetsethaswindow should match this search term.
Due to the number of similar results, scoring each result is important to make the mostly likely candidate the first. In this case, I probably want to see gtkwidgetshow above the result for gtkwidgetsethaswindow.
What I have seen others due in the past is fairly complex. It comes down to checking for sub-strings of the search term in every string in the corpus. Let's take a look at what that might look like.
That would conceptually look something like the following.
Is "gtwdgshw" in "gtk_widget_show"? No. Is "gtwdgsh" in "gtk_widget_show"? No. Is "gtwdgs" in "gtk_widget_show"? No. Is "gtwdg" in "gtk_widget_show"? No. Is "gtwd" in "gtk_widget_show"? No. Is "gt" in "gtk_widget_show"? Yes. Is "wdgshw" in "k_widget_show"? No. ...
Clearly, this is the brute force approach.
I could simplify this a bit by going from small sub-strings to larger ones, but conceptually it is the same design.
Now lets do that for every string in our corpus. Face palm. This could take forever.
Imagine I'm searching through a corpus containing every GNOME function symbol, type name, property name, and more. Add to that my desire to run this search for every key-press-event in my application. If I want this to feel interactive I only have about 100 milliseconds from the moment of key-press to displaying results.
This is just the wrong algorithm for the job.
This could get out of hand really quick.
The more correct algorithm is broken into three parts. Indexing, Searching, and Scoring. However, Searching and Scoring will happen at the same time.
In the basic form the algorithm consists of indexing the position of every character in every string in our corpus. So for each character of each string, we have a two-part tuple consisting of (String Identifier, Character Position).
The two-part tuple is stored in the index appropriated for that character. So if I want to find all of the strings containing the letter a I simply get the index consisting of every string and position where the letter a occurs.
If we keep each of these indexes sorted, first by String Identifier and then by Character Position, we allow ourselves a neat trick in the searching portion of our algorithm.
I won't give away the trick just yet, but it looks something like one of those horse racing games you find at carnivals.
The first step is indexing the corpus. I've explained the information I need to index above, but lets visualize it quick.
Given the corpus
0: gtk_widget_show 1: gtk_widget_hide 2: gtk_widget_set_has_window
I would have created the following indexes
_: (0,3) (0,10) (1,3) (1,10) (2,3) (2,10) (2,14) (2,18) a: (2,16) d: (0,6) (1,6) (1,13) (2,6) (2,22) e: (0,8) (1,8) (1,14) (2,8) (2,12) g: (0,0) (0,7) (1,0) (1,7) (2,0) (2,7) h: (0,12) (1,11) (2,15) i: (0,5) (1,5) (1,12) (2,5) (2,20) k: (0,2) (1,2) (2,2) n: (2,21) o: (0,13) (2,23) s: (0,11) (2,11) (2,17) t: (0,1) (0,9) (1,1) (1,9) (2,1) (2,9) (2,13) w: (0,4) (0,14) (1,4) (2,4) (2,19) (2,24)
Notice the ordering of the above table. It is sorted first by String Identifier and then by Character Position. This property allows us to search the index always making forward progress.
This is an implementation detail.
To store a copy of all of the strings I use a contiguous region of memory. Simply store each of the strings one after another and return an offset within the heap. This is a bit more ideal for a large corpus than what GStringChunk does because it is wasteful. I should note that this does have the overhead of lots of realloc() calls but inserts are less frequent than look-ups.
Now for the fun part, I'm off to the races!
Imagine the carnival game where the horses are running across a horizontal path. Each horse is competing with the others.
I find this image illustrative because of how we make forward progress on the search.
We start by making an array of indexes. If we have the search term gtwdgshw, then we will have an array of 8 indexes. One for each character in the search term.
0: <g> (0,0) (0,7) (1,0) (1,7) (2,0) (2,7) 1: <t> (0,1) (0,9) (1,1) (1,9) (2,1) (2,9) (2,13) 2: <w> (0,4) (0,14) (1,4) (2,4) (2,19) (2,24) 3: <d> (0,6) (1,6) (1,13) (2,6) (2,22) 4: <g> (0,0) (0,7) (1,0) (1,7) (2,0) (2,7) 5: <s> (0,11) (2,11) (2,17) 6: <h> (0,12) (1,11) (2,15) 7: <w> (0,4) (0,14) (1,4) (2,4) (2,19) (2,24)
Additionally, we need an iteration variable (the often used int i) for each of the indexes. This is because we never need go backwards while searching each of the indexes!
So our initial iteration state looks like the following.
0: 0 1: 0 2: 0 3: 0 4: 0 5: 0 6: 0 7: 0
Every time we find a match within an index we will traverse down to the next index, continue where we left off. Since the index is sorted, the match within the word we are looking for MUST be at or after the current state.
So if we start by looking through the g index, we come across the first element.
g index: (0, 0) (String Identifier 0 at Character Position 0) g state: increment g state to 1
Now we can move down to the second index. The index for t.
We are looking for an item in the index that contains String Identifier 0 at a Character Position greater than 0. We find a match immediately at element 0 of the t index. If we did not, we would continue to increment the state for t until we reached an element containing a String Identifier greater than 0 and then work our way back up.
t index: (0, 1) (string Identifier 0 at Character Position 1) t state: increment t state to 1.
We keep doing this until we run out of indexes. When we do, we have found a match and can go back to the top, our index for the first character, g.
While searching the indexes, I chose to track the distance between each discovered character for use in scoring. So using the search term gtk, all strings in the corpus containing gtk consecutively would get a score of 2. While getchar_unlocked would have a score of 13.
The lower the score, the better the match.
This is not enough if you have a large corpus of similar strings such as all gtk_*() functions. So we add the length of the string to the score as well. This means that longer strings are less of a match than shorter ones.
Add a few divisions here and there to give us a value between 0.0 and 1.0 as well as inverting scores so larger numbers are better and we have something useful.
You can find the implementation described above in my git repository. It is implemented in C and requires the GLib library.
-- Christian Hergert 2013-09-13
Back to Index