#
Longest Common Subsequence

/*********************************************************************** * Author: Isai Damier * Title: Longest Common Subsequence * Project: geekviewpoint * Package: algorithms * * Statement: * Given two sequences, find the longest common subsequence * between them. * * Time Complexity: O(n*m) * where m and n are the size of the input arrays, respectively * * Sample Input: awaited, alpine * Sample Output: aie * * DEFINITION OF SUBSEQUENCE: * A sequence is a particular order in which related objects follow * each other (e.g. DNA, Fibonacci). A sub-sequence is a sequence * obtained by omitting some of the elements of a larger sequence. * * SEQUENCE SUBSEQUENCE OMISSION * {3,1,2,5,4} {1,2} 3,5,4 * {3,1,2,5,4} {3,1,4} 2,5 * * SEQUENCE NOT SUBSEQUENCE REASON * {3,1,2,5,4} {4,2,5} 4 should follow 5 * * STRATEGY: * An easy way to find a longest common subsequence of characters between * two words is to first track the lengths of all the common sequences * and then from those lengths pick a maximum; finally, from that * maximum length, trace out the actual longest common subsequence * between the two words. * * Let's use a table C (i.e. array C) to track the lengths of all common * sub-sequences between the two input words. Naturally, C starts with * all cells set to zero. Then to fill up C, we walk thru the two words, * comparing characters. Each time we find a match, we mark the * corresponding cell in C as one plus whatever maximum we had seen so * far for that particular sub-sequence. * * When we are done filling up C, we find a maximum length z in C. Then * using the indices of z, we trace that actual sequence and return it * as the answer. * * The ensuing code shows the details of how we fill C and then from C * extract the actual LCS. * * ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ * * The remaining paragraphs contain details that you may not particularly * care for. You may skip them and go right to the code. * * Finding the LCS of two sequences X and Y, i.e. LCS(X,Y), is similar * in structure to finding the Fibonacci of some number n, i.e. fib(n). * Both problems exhibit optimal substructures. Which is to say, * for instance, the solution to fib(5) contains the solution to fib(4) * which in turn contains fib(3), and so on. Similarly, the solution * of LCS(X,Y) contains the solution of LCS(X-1,Y), which in turn * contains the solution of LCS(X-1,Y-1), and so on. Whenever a problem * exhibits "optimal substructure", we can use recursion to solve it: * * _ * | fib(n-1)+fib(n-2) if n > 1 * fib(n) = | n if n == 1 or n == 0 * |_ * * For the LCS recurrence relation, a few notes are necessary: * Let C(i,j) be the length of the prefix of LCS(X,Y) given by * * C(i,j) = |LCS(X[1..i],Y[1..j])| * * so that if the length of X is m and the length of Y is n, * * C(m,n) = |LCS(X,Y)| * * Then the recurrence relation for filling C is * _ * | C(i-1,j-1)+1 if X(i) == Y(j) * C(i,j) = | max{C(i-1,j),C(i,j-1)} otherwise * |_ * * In addition to "optimal substructures", the LCS(X,Y) like the fib(n) * contains overlapping subproblems. What we mean is this: In solving * f(5), we meet f(2) both as a subproblem of f(4) and as a subproblem * of f(3). * * fib(5) * / \ * fib(4) fib(3) :fib(5)=fib(4)+fib(3) * / \ / \ * fib(3) fib(2) fib(2) fib(1) :fib(5)=fib(3)+fib(2)+fib(2)+fib(1) * * Whenever a problem exhibits "overlapping subproblems", if you use * recursion as a strategy you will end up solving the same subproblems * more than once. Therefore, it makes more sense to track subproblems * using a table so that once you solve a subproblem, you can simply * lookup it's solution next time you need. * * CONCLUSION: * In the foregoing "strategy" section, we essentially described * dynamic programming. * * Dynamic Programming: (aka Dynamic Tabling) is a strategy for solving * problems, which exhibit optimal substructures and overlapping * subproblems, by using a table to track the solution of subproblems * that have already been solved so as not to solve a subproblem more * than once. This tracking or note taking is usually referred to as * memoization, meaning writing in a memo. * **********************************************************************/ package algorithm.programming.dynamic; import java.util.ArrayList; import java.util.List; public class LongestCommonSubsequence { /**** * Space Complexity: O(n*m) * where m and n are the size of the input arrays, respectively * * This solution first computes a table of lengths and then from that * table computes an actual longest common subsequence. ****/ public List<Integer> LCS(int[] A, int[] B) { int[][] m = lcsTable(A, B); List<Integer> sequence = actualLCS(m, A, B); return sequence; } /**** * LCS table of lengths ****/ private int[][] lcsTable(int[] A, int[] B) { int[][] m = new int[A.length + 1][B.length + 1]; for (int x = 1; x <= A.length; x++) { for (int y = 1; y <= B.length; y++) { if (A[x - 1] == B[y - 1]) { m[x][y] = 1 + m[x - 1][y - 1]; } else { m[x][y] = Math.max(m[x][y - 1], m[x - 1][y]); } } } return m; } /**** * Extract an actual LCS from the length table, which effectively * contains all the possible longest common sequences. ****/ private List<Integer> actualLCS(int[][] m, int[] A, int[] B) { List<Integer> result = new ArrayList<Integer>(); int x = A.length, y = B.length; while (x > 0 && y > 0) { if (A[x - 1] == B[y - 1]) { result.add(A[x - 1]); x--; y--; } else if (m[x][y - 1] > m[x - 1][y]) { y--; } else { x--; } } return result; } }

package algorithm.programming.dynamic; import java.util.Arrays; import java.util.List; import org.junit.Test; import static org.junit.Assert.*; public class LongestCommonSubsequenceTest { /** * Test of LCS method, of class LongestCommonSubsequence. */ @Test public void testLCS() { System.out.println("LCS"); int[] fibonacci = {0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987}; int[] prime = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103}; int[] rand = {0, 1, 21, 45, 144, 610, 987}; int[] bad = {102, 104, 106, 108, 110, 112, 114}; Integer[] exp1 = {89, 13, 5, 3, 2}; Integer[] exp2 = {987, 610, 144, 21, 1, 0}; //no shared end points LongestCommonSubsequence seq = new LongestCommonSubsequence(); List<Integer> actual = seq.LCS(fibonacci, prime); assertTrue(Arrays.equals(exp1, actual.toArray(new Integer[0]))); //share both end points actual = seq.LCS(fibonacci, rand); assertTrue(Arrays.equals(exp2, actual.toArray(new Integer[0]))); //share no points actual = seq.LCS(fibonacci, bad); assertEquals(0, actual.size()); //share all points actual = seq.LCS(fibonacci, fibonacci); assertEquals(fibonacci.length, actual.size()); int i = fibonacci.length; for (int e : actual) { assertEquals(e, fibonacci[--i]); } } }