domingo, mayo 04, 2014

Substring matching in Python (run between naive, Boyer-Moore, and Suffix Array)

A  few days ago I found this very interesting problem: given a list of strings L, write a function that returns the elements of L which contains some substring S.

For example, given L=["Casa", "Perro", "Gato", "Onomatopeya", "internacionalizacion", "Om nom nom"] and S="nom", we want the result of find(L, S) = ['Onomatopeya', 'Om nom nom'].

Naïve Version

On Python this sounds simple enough, and we can write:
def find1(L, S):
    return [x for x in (L) if S in x]
However, for a big enough L and S we can see the runtime of this function depends not only on the size of L but also on the size of S. That's it, the runtime complexity of find1 in BigOh notation is: O(n.m.s), where:
  • n=len(L)
  • m=max([len(x) for x in L])
  • s=len(S)
The plot of time for find1 for a fixed L and where we increase the size of S looks like this:


Boyer-Moore-Horspool

It turns out that since version 2.5, Python's "in" operator is implemented internally using a modified version of Boyer-Moore algorithm for substring searching. The details are here.

We can take advantage of this detail by creating a temporary structure for faster lookups. We pre-process L so we can make fast queries.

The idea is to construct a big string W with the concatenation of all the elements of L, using a special char as separator, a char that is not present on S nor any element of L. For example:
L = ["Casa", "Perro", "Gato", ...]
W = "Casa\nPerro\nGato\n..."
Then, finding if a substring S is present in any of the elements of L can be answered by just writing: 
S in W
This allows us to answer whether a substring is present or not. To actually construct the resulting list of elements of L which contain S we need another helper structure. We build T, a list of integers that, for every elements in L, equals the starting index of this element in W. Continuing with the example:
T = [0, 5, 11, 16, ... ]
This means the first element, "Casa", starts at index 0 in W; the second element "Perro" starts at index 5 in W, etc. And this structure allows us to quickly determine the index in W for every element in, and we lookup the index by doing a binary search on T.

The runtime complexity for constructing this intermediate index is O(n), with O(n) memory usage.

Our new find function should then:
  • find the first position of S in W as p
  • determine for which element of L this index relates to, by doing a binary search on T
  • from p+1 onwards, find again the next S in W.
Since an element of L can contain many times the same substring S we may jump to the next word on W.

On code, the find2 function looks like:
def find2(L, S):
    # Using the native Boyer-Moore implementation of the "in" operator
    R = []
    i = W.find(S)
    while i != -1:
        p = bisect.bisect_right(T, i) - 1
        e = L[p]
        #assert S in e
        R.append(e)
        i = W.find(S, T[p] + len(e))
    return R
The runtime complexity of this new find function is: O(n.m). We still need to take into account the length of each element of L since BMH algorithm is (mostly) linear on the W string. 

The plot of runtime for find1 vs. find2 looks like the following graphic. Again, we are leaving a fixed L and increasing the size of S:



Suffix Array

There is a third way to solve this problem, by means of constructing a suffix array

This amazing data structure offers a runtime complexity of O(log N) for suffix lookups, where N is the length of the string. Incidentally, it also allows to lookup for substrings, since we just lookup until a suffix on the SA has S as prefix.

Again, we need to construct an intermediate index, which is again very simple: sort all the possible suffixes on W. The trick is how to do it: we shall not keep every possible suffix as an string, but just a list of starting positions for every suffix, and sort this list by the actual string of the suffix.

In code, the construction of the SA table is really simple:
# Suffix Array Table
SL = list(range(len(W)))
SL.sort(key=lambda x: W[x:x+100])
The runtime complexity for constructing this intermediate index is O(n.log n), with O(n) memory usage.

To find a specific suffix we should binary search the SA table, using the element on SL to determine where in W the suffix starts.

Since a substring may appear many times on many elements of S, we may have many sufixes starting with S. The good news is, since the list of suffixes is sorted, all this suffixes will be one after another on the SL table. But since we are doing a binary search on the list of suffixes, we can't be sure on where the middle pointer will jump in this contiguous sequence of suffixes, all starting with S. 

Therefore, when we find the position of some suffix we should go back a little to make sure we are starting on the first suffix on the sequence of suffixes that start with S.

On code, our new find3 function looks like:
def find3(L, S):
    # Suffix array
    start = 0
    end = len(SL)
    while start < end:
        mid = start + (end - start) // 2
        pa = SL_key_fn(W, SL[mid], 100)
        pb = SL_key_fn(S, 0, len(S))
        if pa < pb:
            start = mid + 1
        elif pb < pa:
            end = mid
        else:
            # A word may contain the same S multiple times
            R = set()
            while mid > 0 and W.startswith(S, SL[mid]):
                mid = mid - 1
            if not W.startswith(S, SL[mid]):
                mid = mid + 1
            while mid < len(SL) and W.startswith(S, SL[mid]):
                p = bisect.bisect_right(T, SL[mid]) - 1
                e = L[p]
                assert S in e
                R.add(p)
                mid = mid + 1
            return [(L[i]) for i in R]
    return []
The SL_key_fn function was a failed experiment to enhance the performance of the lookups. This function today is:
def SL_key_fn(data, x, llen):
    return data[x:x+llen]
Which is the same as the key on the SA table sorter.

The runtime performance of the find3 function is: O(log (n.m)), and the plot of the three functions looks like this:



Drawbacks

This SA implementation in Python is using a lot of temporary memory for sorting the table. My implementation on my laptop is using 2.4GB of RAM to sort an L of 150k elements. There's been some discussion about this memory issue on this blog post and in this Stack Overflow question.

Special thanks

Python Argentina community is a great place to look for help for all your spanish Python programming needs. 

Carpintería de fin de semana