The KMP Algorithm is a powerful pattern matching algorithm that executes in linear complexity.

Here’s what that means -

Let’s say I have a source text I’m trying to find a certain substring in. Let this source text/string have a length n and let my substring I’m searching for in the source text, have a length m.

Now, the naive simple nested-for-loop code will give you a complexity of O (m*(n-m+1))

The KMP Algorithm does this in time O (m+n)

Oh, one drawback though, it also has a space complexity of O (m) because there’s some pre-processing involved.

So now that you know how powerful it is, let’s dive in! I’ll be splitting this article into 2 sections -

  • Pre-Processing
  • String Matching


So let’s start off with the pre-processing – the longest prefix suffix array.


The Pre-Processing

This is the part of the algorithm which fetches a space complexity of O (m) and a time complexity of also O (m)

The pre-processing phase is where we identify certain properties of the given pattern and store certain information we’ll find handy later on.

This “information” we store is the length of the longest prefix which is also the suffix.

I’ll elaborate.

Let lps[ i ] represent the length of the longest suffix that is also a prefix, in the first (i) characters. Here’s an example.

Let my pattern string be P =“abcdabcde”

Now here are all the prefixes, with the longest prefix suffix, highlighted in bold -

a
ab
abc
abcd
abcda
abcdab
abcdabc
abcdabcd
abcdabcde

And they’re corresponding lps [ ] values are = { 0, 0, 0, 0, 1, 2, 3, 4, 0 } since we want the length of the bold suffix prefix.

For now, let’s just operate under the assumption we need this longest prefix suffix array. You’ll see why we need it in the second phase.

But for now our task is to generate this lps array optimally.

Obviously, an O(|m|^2) solution is evident where we check the length of matching suffixes for every prefix but we can’t have that. We need to perform this in O(|m|) and that’s the key part of this pre-processing.

For the duration of this article I’ll coin the term “prefix-suffix” which would refer to the existence of a substring that is both a prefix AND suffix of the string.

The way we accomplish this is by doing the following -

  • We maintain two integers i (initialized to 1) and j (initialized to 0)
  • Now j keeps repeatedly being pointed back to the potential start of a suffix which is a prefix-suffix with relevance to the current character (that i points to) - Remember, i is always ahead of j. Now visualize and re-read this point. (i.e. in abcdabcd, when i equals 4, j will point to 1, as index 0 is the same character as that of index 4 and the next alternate potential matching character will be 0+1 = 1)
  • Now, we iterate over the given pattern to construct the lps array and we do this as follows -
  • If pat[ i ] == pat[ j ], we assign lps[ i ] to (j+1) since we know that our current character pat[ i ] could also serve as pat[ j ] and so when we're doing the actual string matching in the next phase, the (i+1)th character could refer to even the (j+1)th character as ambiguity arises in where to locate the string when the first character matches. You'll see in the next point.

  • If pat[ i ] != pat[ j ] then we have two cases.
  • If j != 0 then we update j as j = lps[ j-1 ] because we may be referring to a different occurrence of the previous character in the given pattern
    and there's still a chance a common prefix suffix can exist and we were looking at the wrong place. (The case of ambiguity mentioned above)
  • If j == 0, well then this symbol is occurring for the first time or shares no common prefix-suffix, for example the last characters in "abcd" or "abcae" have the length o
    f their longest prefix suffix as zero. Proceed to the next character, meaning i++



And that’s it! We’re done with the pre-processing phase this way!

We now have our lps array ready in O(length of pattern). Here’s how that goes in implementation -

vector<int> build(string pattern){
     int m=pattern.length(); //Find pattern length
     vector<int> lps(m,0);  //Declare lps array 
     int i=1;  //init forward pointer
     int j=0;  //init backward pointer
     while(i < m){
          if(pattern[i]==pattern[j]){  //if forward equals backward
               lps[i]=j+1;  //the current character at forward can also lead to the next backward character 
               i++;  //go to next forward
               j++;  //go to next backward
          }
          else{ //if forward does not equal backward
               if(j!=0)  //if length of prefix isn't zero, we go to prev backward prefix of same character 
                    j=lps[j-1]; 
               else  //Else if j equals zero, i.e. if there is no prefix
                    lps[i++]=0;  //assign corresponding length and go to next forward
          }
     return lps;
}

Phase 1 complete.



The string matching

Okay, so now we’ve created our lps array in linear time. Now let’s leverage that to find if our pattern appears in our given string.
We do this by following the algorithm below -

  • Declare two integers i and j and initialize both of them to zero. i here points to the text and j here points to the pattern's index.
  • Now as we iterate over the text, we have two cases -
  • Case 1: pattern[ j ] == text[ i ] where in we just go to the next characters in both the text and the pattern. i++; j++;
  • Now if j == pattern.length() then we have traversed the entire pattern meaning we have just come across an entire occurrence of the pattern. Yay! Also, now we reinitialize j to lps[ j-1 ] in the event that the last character of the pattern is part of a prefix of the same pattern (i.e. the next occurrence of the pattern has already begun.)
  • Case 2: the current pattern character does not match the text character, then if j is not zero, we update it to lps[ j-1 ] and if j is zero, just proceed to the next text character.


And that’s it! Ultimately, you’ll exit your loop and at which point you’ll have come across every potential occurrence of the pattern in the given string.
That’s how you match patterns in linear time! :)
And to wrap up, here’s my C++ implementation.

int check(string pat, string txt, int n, int m){
     int i, j; 
     i = 0; j = 0; 
     vector<int> lps = build(pat); 
     while(i < n) { 
          if(pat[j] == txt[i]){ 
               i++; j++; 
          } 
          if(j==m){ 
               cout<<"Found!\n"; 
               j = lps[j-1]; 
          } 
          if(j < m & i < n && pat[j]!=txt[i]){ 
               if(j!=0)j = lps[j-1]; 
               else i++; 
          } 
     } 
     return 0; 
}


That’s that, it takes a while to understand the flow of the code but imitating it on pen and paper definitely helps.

Cheers,
Aditya Ramesh