# KMP: Find All Matches of Substringby Isai Damier, Android Engineer @ Google

/*********************************************************************
* Author: Isai Damier
* Title: Knuth Morris Pratt: Find All Occurrences of Substring
* Project: geekviewpoint
* Package: algorithm.search.string
*
* Statement:
*   You are probably familiar with the figure of speech "Needle in a
*   haystack." All the same, the analogy is not difficult to
*   understand: it refers to the difficult task of finding something
*   small in a much larger space.
*
*   The string-matching problem is a formal equivalent of the
*   "needle in a haystack" metaphor. Given a text T[1...n] and a
*   pattern P[1...m] such that both T and P are comprised by the
*   same alphabet, find all P's in T.
*
* Sample Input: Text="Weialala leia Wallala leialala"
*               Pattern="eia"
* Sample Output: 1, 10, 23
*
*   Time complexity of Solution:
*   Worst: Worst Case O(n+m) where |text|=n and |pattern|=m
*
* Approach:
*
*   The Knuth-Morris-Pratt (KMP) String Search Algorithm was proposed
*   by Prof. Donald Knuth, Vaughan Pratt, and independently by
*   James H. Morris. Unlike a naive string searching algorithm that
*   uses for-loops to test every shift of a given text against a
*   given pattern for a match; the KMP algorithm intelligently skips
*   non-promising shifts of the text.
*
*   If you are familiar with integrated circuit design or finite
*   automata theory, then the KMP String Matching algorithm is very
*   easy: for a given text and a given pattern, turn the character
*   sequence of the pattern into a finite automaton (i.e. "transition
*   diagram" to chip designers) and then check if the automaton
*   accepts the text. You can learn or review about state transition
*   functions/diagrams at www.teahlab.com.
*
*   Since most programmers are not familiar with state transition
*   functions, we will explain KMP by comparing it with the Naive
*   String Matching algorithm. The Naive String Matcher, as shown
*   below, uses the variable t to shift through the text and the
*   variable p to compare each shift of the text to the pattern.
*   Because the algorithm does not discriminate against the shifts,
*   it uses two vanilla for-loops to shift through every key of the
*   text and of the pattern.
*
*     int tSize = text.length();
*     int pSize = pattern.length();
*     for(int t=0; t<tSize; t++)//shift thru text
*     for(int p=0; p<pSize && t + p < tSize; p++)//shift thru pattern
*       if(text.charAt(t+p) != pattern.charAt(p)) break;
*       else if(p == pattern.length()-1) visit(t);
*
*   Similarly to the naive algorithm, KMP uses the variable t to
*   shift through the text and the variable p to compare a given
*   shift of the text to the pattern. However, KMP understands that
*   not every shift of the text can be a successful match against the
*   pattern. Hence, instead of increasing t by one every time and
*   restarting p at 0 every time, KMP uses information stored in a
*   prefix table to increment t and modify p.
*
*   The KMP algorithm starts by examining the needle/pattern itself
*   for possible locations where a match may begin. The results of
*   the examination are stored in a table S. Then as the haystack/text
*   is searched, the data in S is used to skip characters in the
*   haystack/text where a match could not possibly begin.
*   See the prefix table function for further details.
*
********************************************************************/
List<Integer> KMP_all_matches(String haystack, String needle)throws Exception{
//preliminaries
if (null == haystack || haystack.isEmpty()) {
throw new Exception("invalid haystack");
}
if (null == needle || needle.isEmpty()) {
throw new Exception("invalid needle");
}

List<Integer> resultSet = new ArrayList<Integer>();
if (haystack.length() < needle.length()) {
return resultSet;
}

//algorithm
int[] S = new int[needle.length()];
prefixTable(needle.toCharArray(), S);
int t = 0, p = 0;

while (p + t < haystack.length()) {
if (haystack.charAt(t + p) != needle.charAt(p)) {
t = t + p - S[p];
p = -1 < S[p] ? S[p] : 0;
} else if (p++ == needle.length() - 1) {
t++;
p = 0;
}
}
return resultSet;
}
package algorithm.search.string;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;
import static org.junit.Assert.*;

/***
* The haystacks are packed from the works of poet
* Thomas Stearns Eliot (September 26, 1888 – January 4, 1965)
***/
public class StringSearchAlgorithmsTest {

/**
* Test of KMP_all_matches method, of class StringSearchAlgorithms.
*/
@Test
public void testKMP_all_matches() throws Exception {
System.out.println("KMP_all_matches");
String haystack = "I that was near your heart was removed therefrom "
+ "To lose beauty in terror, terror in inquisition. "
+ "I have lost my passion: why should I need to keep it "
+ "Since what is kept must be adulterated? "
+ "I have lost my sight, smell, hearing, taste, and touch: "
+ "How should I use them for your closer contact?";

String needle = "I have";
StringSearchAlgorithms KMP = new StringSearchAlgorithms();
List<Integer> expResult = Arrays.asList(98, 191);
List<Integer> result = KMP.KMP_all_matches(haystack, needle);
assertEquals(expResult, result);

haystack = "Dry bones can harm no one. "
+ "Only a cock stood on the rooftree "
+ "Co co rico co co rico ";
needle = "co ";
expResult = Arrays.asList(64, 69, 72, 75, 80);
result = KMP.KMP_all_matches(haystack, needle);
assertEquals(expResult, result);
}
}