KMP String Search Algorithms
by Isai Damier

/*********************************************************************
 * Author: Isai Damier
 * Title: Knuth Morris Pratt String Search Algorithms
 * Project: geekviewpoint
 * Package: algorithm.search.string
 *
 * Statement:
 *   You are probably familiar with the figure of speech "Needle in a
 *   haystack." All the same, the analogy is not difficult to
 *   understand: it refers to the difficult task of finding something
 *   small in a much larger space.
 *
 *   The string-matching problem is a formal equivalent of the
 *   "needle in a haystack" metaphor. Given a text T[1...n] and a
 *   pattern P[1...m] such that both T and P are comprised by the
 *   same alphabet, check whether P is a substring of T.
 *
 * Sample Input: Text="The universe is a software simulation"
 *               Pattern="verse"
 * Sample Output: 7
 *
 *   Time complexity of Solution:
 *   Worst: Worst Case O(n+m) where |text|=n and |pattern|=m
 *
 * Approach:
 *
 *   The Knuth-Morris-Pratt (KMP) String Search Algorithm was proposed
 *   by Prof. Donald Knuth, Vaughan Pratt, and independently by
 *   James H. Morris. Unlike a naive string searching algorithm that
 *   uses for-loops to test every shift of a given text against a
 *   given pattern for a match; the KMP algorithm intelligently skips
 *   non-promising shifts of the text.
 *
 *   If you are familiar with integrated circuit design or finite
 *   automata theory, then the KMP String Matching algorithm is very
 *   easy: for a given text and a given pattern, turn the character
 *   sequence of the pattern into a finite automaton (i.e. "transition
 *   diagram" to chip designers) and then check if the automaton
 *   accepts the text. You can learn or review about state transition
 *   functions/diagrams at www.teahlab.com.
 *
 *   Since most programmers are not familiar with state transition
 *   functions, we will explain KMP by comparing it with the Naive
 *   String Matching algorithm. The Naive String Matcher, as shown
 *   below, uses the variable t to shift through the text and the
 *   variable p to compare each shift of the text to the pattern.
 *   Because the algorithm does not discriminate against the shifts,
 *   it uses two vanilla for-loops to shift through every key of the
 *   text and of the pattern.
 *
 *     int tSize = text.length();
 *     int pSize = pattern.length();
 *     for(int t=0; t<tSize; t++)//shift thru text
 *     for(int p=0; p<pSize && t + p < tSize; p++)//shift thru pattern
 *       if(text.charAt(t+p) != pattern.charAt(p)) break;
 *       else if(p == pattern.length()-1) return t;
 *
 *   Similarly to the naive algorithm, KMP uses the variable t to
 *   shift through the text and the variable p to compare a given
 *   shift of the text to the pattern. However, KMP understands that
 *   not every shift of the text can be a successful match against the
 *   pattern. Hence, instead of increasing t by one every time and
 *   restarting p at 0 every time, KMP uses information stored in a
 *   prefix table to increment t and modify p.
 *
 *   The KMP algorithm starts by examining the needle/pattern itself
 *   for possible locations where a match may begin. The results of
 *   the examination are stored in a table S. Then as the haystack/text
 *   is searched, the data in S is used to skip characters in the
 *   haystack/text where a match could not possibly begin.
 *   See the prefix table function for further details.
 *
 ********************************************************************/ 
 public int KMPMatcher(String haystack, String needle) throws Exception {
  //preliminaries
  if (null == haystack || haystack.isEmpty()) {
    throw new Exception("invalid haystack");
  }
  if (null == needle || needle.isEmpty()) {
    throw new Exception("invalid needle");
  }
  //algorithm
  int[] S = new int[needle.length()];
  prefixTable(needle.toCharArray(), S);
  int t = 0, p = 0;

  while (p + t < haystack.length()) {
    if (haystack.charAt(t + p) != needle.charAt(p)) {
      t = t + p - S[p];
      p = -1 < S[p] ? S[p] : 0;
    } else if (p++ == needle.length() - 1) {
      return t;
    }
  }
  return -1;
}
package algorithm.search.string;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;
import static org.junit.Assert.*;

/***
 * The haystacks are packed from the works of poet
 * Thomas Stearns Eliot (September 26, 1888 – January 4, 1965)
 ***/
public class StringSearchAlgorithmsTest {

  /**
   * Test of KMPMatcher method, of class StringSearchAlgorithms.
   */
  @Test
  public void testKMPMatcher() throws Exception {
    System.out.println("KMPMatcher");
    
    String haystack = 
       "Under a juniper-tree the bones sang, scattered and shining "
     + "We are glad to be scattered, we did little good to each other, "
     + "Under a tree in the cool of the day, with the blessing of sand, "
     + "Forgetting themselves and each other, united "
     + "In the quiet of the desert. This is the land which ye "
     + "Shall divide by lot. And neither division nor unity "
     + "Matters. This is the land. We have our inheritance.";

    String needle = "the blessing of sand";
    
    StringSearchAlgorithms KMP = new StringSearchAlgorithms();
    int expResult = 164;
    int result = KMP.KMPMatcher(haystack, needle);
    assertEquals(expResult, result);
    
    
    needle = "our inheritance.";
    expResult = 372;
    result = KMP.KMPMatcher(haystack, needle);
    assertEquals(expResult, result);
    
    needle = "..";
    expResult = -1;
    result = KMP.KMPMatcher(haystack, needle);
    assertEquals(expResult, result);

    needle = "Under";
    expResult = 0;
    result = KMP.KMPMatcher(haystack, needle);
    assertEquals(expResult, result);

    needle = "under";
    expResult = -1;
    result = KMP.KMPMatcher(haystack, needle);
    assertEquals(expResult, result);
  }
}