Multiple String Matching Problem

A data structure that is often used for string matching is the trie. The problem is as follows.

Write a function that takes in a long string and an array of short strings, all of which are shorter than the long string. The function should return an array of booleans, where each boolean indicates whether the corresponding short string is contained in the long string.

The challenge is to not use any built-in string searching methods, such as .find in Python.

Example
Input: "this is a string, hahahaha", ["this", "is", "LOL", "wrong", "Joker", "hahaha", "haha"]
Output: [True, True, False, False, False, True, True]
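
The expected output above can be reproduced with a short reference check. This sketch deliberately uses Python's built-in in operator, which the challenge forbids in the solution itself, so it is only meant as an oracle for testing the solutions that follow. The names long_string, short_strings and expected are assumptions made for this illustration.

```python
# Reference oracle only: the built-in `in` operator is NOT allowed in the
# actual solutions, but it is handy for checking their output.
long_string = "this is a string, hahahaha"
short_strings = ["this", "is", "LOL", "wrong", "Joker", "hahaha", "haha"]

expected = [short in long_string for short in short_strings]
print(expected)
# [True, True, False, False, False, True, True]
```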

First Solution

The first solution that comes to mind is to iterate through the array of short strings and compare each one letter by letter to the long string.

def multiStringMatcher(longString, shortStrings):
    return [isInLongString(longString, shortString) for shortString in shortStrings]

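def isInLongString(longString, shortString):
    # Try every position of the long string as a potential start of a match.
    for i in range(len(longString)):
        shortStrPtr = 0
        for longStrPtr in range(i, len(longString)):
            # Stop at the first mismatch and move on to the next start position.
            if shortString[shortStrPtr] != longString[longStrPtr]:
                break
            shortStrPtr += 1
            # Every letter of the short string has been matched.
            if shortStrPtr == len(shortString):
                return True
    return False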

The solution first iterates through the array of short strings.

For each short string, it treats every position of the long string as a possible starting point and compares the two strings letter by letter from there.

If every letter of the short string is matched, the function isInLongString returns True.

If a letter does not match, the search starts again from the next starting position in the long string and the first letter of the short string.

If no starting position produces a full match, isInLongString returns False. The process is repeated for every short string.

The time and space complexity are rather straightforward. The space complexity is O(n), where n is the number of short strings in the array of short strings, because the only extra storage is the output array of booleans.

The time complexity is O(mno) where m is the length of the long string, n is the number of short strings in the array of short strings and o is the length of the longest string in the array of short strings.

The time complexity is O(mno) because, for every short string, we try every starting position in the long string. From each starting position, we compare at most o letters before we either find a full match or hit a mismatch.
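
One way to see the O(mo) cost per short string is to count letter comparisons on a near-worst-case input. The sketch below is an instrumented variant of the brute-force scan, not the original function; count_comparisons and its variable names are assumptions made for this illustration.

```python
def count_comparisons(long_string, short_string):
    """Brute-force scan that also counts letter comparisons (hypothetical helper)."""
    comparisons = 0
    for start in range(len(long_string)):
        matched = 0
        for pos in range(start, len(long_string)):
            comparisons += 1
            if short_string[matched] != long_string[pos]:
                break
            matched += 1
            if matched == len(short_string):
                return True, comparisons
    return False, comparisons


# Near-worst case: almost every start position matches o - 1 letters
# before failing on the last one.
long_string = "a" * 20        # m = 20
short_string = "a" * 4 + "b"  # o = 5
found, comparisons = count_comparisons(long_string, short_string)
print(found, comparisons)
# False 90
```

Here 90 comparisons are made, which stays under m * o = 100. Repeating the scan for each of the n short strings gives the O(mno) bound.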

Second Solution

A second solution would be to build a suffix trie (an uncompressed suffix tree) that contains every suffix of the long string. For the long string “string”, the root node of the suffix trie would point to the following chains of nodes…

g → *, where * represents the end symbol  
n → g → *  
i → n → g → *  
r → i → n → g → *, and so on until...  
s → t → r → i → n → g → *  

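Then, iterating through the array of short strings, we can check each one by starting at the root and following its letters one at a time down the trie. If a letter is missing from the current node, the short string is not contained in the long string. If every letter is found, it is contained, since every substring of the long string is a prefix of one of its suffixes.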

def multiStringMatcher(longString, shortStrings):
    trie = SuffixTrie()
    trie.insert(longString)
    return [isInSuffixTrie(longString, shortString, trie) for shortString in shortStrings]

def isInSuffixTrie(longString, shortString, suffixTrie):
    currNode = suffixTrie.root
    for i in range(len(shortString)):
        currLetter = shortString[i]
        if shortString[i] in currNode:
            currNode = currNode[currLetter]
        else:
            return False
    return True
    
class SuffixTrie:
    def __init__(self):
        self.root = {}
        self.endSymbol = "*"
        
    def insert(self, string):
        for i in range(len(string)):
            currNode = self.root
            for j in range(i, len(string)):
                if string[j] not in currNode:
                    currNode[string[j]] = {}
                currNode = currNode[string[j]]
            currNode[self.endSymbol] = True

What is the time and space complexity for this solution?

We need to build a suffix trie, which takes O(m²) time, where m is the length of the long string: we iterate through every position of the long string and insert the suffix that starts at that position. Once that is done, we iterate through the array of short strings and check whether each one is present in the trie. This takes O(no) time, where n is the number of strings in the array of short strings and o is the length of the longest string in the array.

The resulting time complexity would be O(m² + no). The space complexity would be O(m² + n). The m² comes from the trie, as we are storing every suffix of the long string, and the n comes from the output array.
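
To make the m² term more concrete, the following sketch counts the nodes in a suffix trie. It is a compact re-implementation for illustration rather than the SuffixTrie class above; build_suffix_trie and count_nodes are hypothetical helpers, and the code assumes the end symbol "*" never appears in the input string.

```python
def build_suffix_trie(string, end_symbol="*"):
    """Insert every suffix of `string` into a nested-dict trie."""
    root = {}
    for i in range(len(string)):
        node = root
        for ch in string[i:]:
            node = node.setdefault(ch, {})
        node[end_symbol] = True
    return root


def count_nodes(node, end_symbol="*"):
    """Count letter nodes below `node`, ignoring end-symbol markers."""
    total = 0
    for key, child in node.items():
        if key == end_symbol:
            continue
        total += 1 + count_nodes(child, end_symbol)
    return total


# All letters distinct: no suffix shares a prefix, so 6 + 5 + ... + 1 nodes.
print(count_nodes(build_suffix_trie("string")))  # 21
# One repeated letter: every suffix lies on the same chain.
print(count_nodes(build_suffix_trie("aaaaaa")))  # 6
```

With all-distinct letters the trie holds m(m + 1)/2 nodes, which is where the quadratic worst case comes from. Inputs with a lot of repetition can be much smaller.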

Third Solution

The last solution would be to build a trie from the array of short strings. We then iterate through the long string and, from every starting position, walk down the trie to see if there are any matches.

def multiStringMatcher(bigString, smallStrings):
    trie = Trie()
    result = {}
    for i in range(len(smallStrings)):
        trie.insert(smallStrings[i])
    for i in range(len(bigString)):
        currNode = trie.root
        for j in range(i, len(bigString)):
            currChar = bigString[j]
            if currChar in currNode:
                currNode = currNode[currChar]
            else:
                break
            if "*" in currNode:
                result[currNode["*"]] = True
    return [string in result for string in smallStrings]

class Trie:
    def __init__(self):
        self.root = {}
        self.endSymbol = "*"
        
    def insert(self, string):
        curr = self.root
        for i in range(len(string)):
            if string[i] not in curr:
                curr[string[i]] = {}
            curr = curr[string[i]]
        curr[self.endSymbol] = string
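
To see what this trie looks like in memory, the sketch below builds one from three short strings and compares it to the nested dictionary it should produce. build_trie is a hypothetical stand-alone helper that mirrors the insert logic above, and the input strings are chosen only for illustration.

```python
def build_trie(strings, end_symbol="*"):
    """Build a nested-dict trie; each end marker stores the full string."""
    root = {}
    for string in strings:
        node = root
        for ch in string:
            node = node.setdefault(ch, {})
        node[end_symbol] = string
    return root


trie = build_trie(["is", "in", "i"])

# "is", "in" and "i" share the node for the letter "i".
expected_shape = {
    "i": {
        "s": {"*": "is"},
        "n": {"*": "in"},
        "*": "i",
    }
}
print(trie == expected_shape)
# True
```

Storing the full string under the end symbol is what lets the search record which short string was matched without rebuilding it from the path.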

The time complexity for this solution would be O(no + mo), where n is the number of short strings in the array of short strings, o is the length of the longest string in the array of short strings and m is the length of the long string.

The O(no) comes from building the trie from the array of short strings. The O(mo) comes from iterating through the long string: from each starting position we walk down the trie, which is at most o levels deep. The space complexity would be O(no + n), O(no) for the trie and O(n) for the results, which simplifies to O(no).

Conclusion

Which solution is the best one if we are only concerned with time complexity? First compare the third solution, O(mo + no), with the second solution, O(m² + no). The problem specifies that every short string is shorter than the long string, so o < m and therefore mo < m². The third solution is therefore faster than the second solution.

Next compare the first solution, O(mno), with the third solution, O(mo + no). Factoring out o gives o(mn) for the first solution and o(m + n) for the third. Since mn is at least m + n whenever m and n are both at least 2, and far larger for large inputs, the third solution is the best in terms of time complexity.
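
As a rough sanity check, the three bounds can be evaluated for some sample sizes. The helper operation_bounds and the chosen values of m, n and o are assumptions made for this illustration. The numbers are the bound expressions themselves, not measured running times.

```python
def operation_bounds(m, n, o):
    """Evaluate the three complexity expressions for given input sizes."""
    return {
        "first": m * n * o,       # brute force: O(mno)
        "second": m * m + n * o,  # suffix trie: O(m^2 + no)
        "third": m * o + n * o,   # trie of short strings: O(mo + no)
    }


bounds = operation_bounds(m=1000, n=100, o=10)
print(bounds)
# {'first': 1000000, 'second': 1001000, 'third': 11000}
```

For these sizes the third expression is roughly ninety times smaller than the other two, which is consistent with the comparison above.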