Longest Common Subsequence
by Isai Damier, Android Engineer @ Google
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 | /*********************************************************************** * Author: Isai Damier * Title: Longest Common Subsequence * Project: geekviewpoint * Package: algorithms * * Statement: * Given two sequences, find the longest common subsequence * between them. * * Time Complexity: O(n*m) * where m and n are the size of the input arrays, respectively * * Sample Input: awaited, alpine * Sample Output: aie * * DEFINITION OF SUBSEQUENCE: * A sequence is a particular order in which related objects follow * each other (e.g. DNA, Fibonacci). A sub-sequence is a sequence * obtained by omitting some of the elements of a larger sequence. * * SEQUENCE SUBSEQUENCE OMISSION * {3,1,2,5,4} {1,2} 3,5,4 * {3,1,2,5,4} {3,1,4} 2,5 * * SEQUENCE NOT SUBSEQUENCE REASON * {3,1,2,5,4} {4,2,5} 4 should follow 5 * * STRATEGY: * An easy way to find a longest common subsequence of characters between * two words is to first track the lengths of all the common sequences * and then from those lengths pick a maximum; finally, from that * maximum length, trace out the actual longest common subsequence * between the two words. * * Let's use a table C (i.e. array C) to track the lengths of all common * sub-sequences between the two input words. Naturally, C starts with * all cells set to zero. Then to fill up C, we walk thru the two words, * comparing characters. Each time we find a match, we mark the * corresponding cell in C as one plus whatever maximum we had seen so * far for that particular sub-sequence. * * When we are done filling up C, we find a maximum length z in C. Then * using the indices of z, we trace that actual sequence and return it * as the answer. * * The ensuing code shows the details of how we fill C and then from C * extract the actual LCS. * * ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ * * The remaining paragraphs contain details that you may not particularly * care for. You may skip them and go right to the code. * * Finding the LCS of two sequences X and Y, i.e. LCS(X,Y), is similar * in structure to finding the Fibonacci of some number n, i.e. fib(n). * Both problems exhibit optimal substructures. Which is to say, * for instance, the solution to fib(5) contains the solution to fib(4) * which in turn contains fib(3), and so on. Similarly, the solution * of LCS(X,Y) contains the solution of LCS(X-1,Y), which in turn * contains the solution of LCS(X-1,Y-1), and so on. Whenever a problem * exhibits "optimal substructure", we can use recursion to solve it: * * _ * | fib(n-1)+fib(n-2) if n > 1 * fib(n) = | n if n == 1 or n == 0 * |_ * * For the LCS recurrence relation, a few notes are necessary: * Let C(i,j) be the length of the prefix of LCS(X,Y) given by * * C(i,j) = |LCS(X[1..i],Y[1..j])| * * so that if the length of X is m and the length of Y is n, * * C(m,n) = |LCS(X,Y)| * * Then the recurrence relation for filling C is * _ * | C(i-1,j-1)+1 if X(i) == Y(j) * C(i,j) = | max{C(i-1,j),C(i,j-1)} otherwise * |_ * * In addition to "optimal substructures", the LCS(X,Y) like the fib(n) * contains overlapping subproblems. What we mean is this: In solving * f(5), we meet f(2) both as a subproblem of f(4) and as a subproblem * of f(3). * * fib(5) * / \ * fib(4) fib(3) :fib(5)=fib(4)+fib(3) * / \ / \ * fib(3) fib(2) fib(2) fib(1) :fib(5)=fib(3)+fib(2)+fib(2)+fib(1) * * Whenever a problem exhibits "overlapping subproblems", if you use * recursion as a strategy you will end up solving the same subproblems * more than once. Therefore, it makes more sense to track subproblems * using a table so that once you solve a subproblem, you can simply * lookup it's solution next time you need. * * CONCLUSION: * In the foregoing "strategy" section, we essentially described * dynamic programming. * * Dynamic Programming: (aka Dynamic Tabling) is a strategy for solving * problems, which exhibit optimal substructures and overlapping * subproblems, by using a table to track the solution of subproblems * that have already been solved so as not to solve a subproblem more * than once. This tracking or note taking is usually referred to as * memoization, meaning writing in a memo. * **********************************************************************/ package algorithm.programming.dynamic; import java.util.ArrayList; import java.util.List; public class LongestCommonSubsequence { /**** * Space Complexity: O(n*m) * where m and n are the size of the input arrays, respectively * * This solution first computes a table of lengths and then from that * table computes an actual longest common subsequence. ****/ public List<Integer> LCS( int [] A, int [] B) { int [][] m = lcsTable(A, B); List<Integer> sequence = actualLCS(m, A, B); return sequence; } /**** * LCS table of lengths ****/ private int [][] lcsTable( int [] A, int [] B) { int [][] m = new int [A.length + 1 ][B.length + 1 ]; for ( int x = 1 ; x <= A.length; x++) { for ( int y = 1 ; y <= B.length; y++) { if (A[x - 1 ] == B[y - 1 ]) { m[x][y] = 1 + m[x - 1 ][y - 1 ]; } else { m[x][y] = Math.max(m[x][y - 1 ], m[x - 1 ][y]); } } } return m; } /**** * Extract an actual LCS from the length table, which effectively * contains all the possible longest common sequences. ****/ private List<Integer> actualLCS( int [][] m, int [] A, int [] B) { List<Integer> result = new ArrayList<Integer>(); int x = A.length, y = B.length; while (x > 0 && y > 0 ) { if (A[x - 1 ] == B[y - 1 ]) { result.add(A[x - 1 ]); x--; y--; } else if (m[x][y - 1 ] > m[x - 1 ][y]) { y--; } else { x--; } } return result; } } |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 | package algorithm.programming.dynamic; import java.util.Arrays; import java.util.List; import org.junit.Test; import static org.junit.Assert.*; public class LongestCommonSubsequenceTest { /** * Test of LCS method, of class LongestCommonSubsequence. */ @Test public void testLCS() { System.out.println( "LCS" ); int [] fibonacci = { 0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34 , 55 , 89 , 144 , 233 , 377 , 610 , 987 }; int [] prime = { 2 , 3 , 5 , 7 , 11 , 13 , 17 , 19 , 23 , 29 , 31 , 37 , 41 , 43 , 47 , 53 , 59 , 61 , 67 , 71 , 73 , 79 , 83 , 89 , 97 , 101 , 103 }; int [] rand = { 0 , 1 , 21 , 45 , 144 , 610 , 987 }; int [] bad = { 102 , 104 , 106 , 108 , 110 , 112 , 114 }; Integer[] exp1 = { 89 , 13 , 5 , 3 , 2 }; Integer[] exp2 = { 987 , 610 , 144 , 21 , 1 , 0 }; //no shared end points LongestCommonSubsequence seq = new LongestCommonSubsequence(); List<Integer> actual = seq.LCS(fibonacci, prime); assertTrue(Arrays.equals(exp1, actual.toArray( new Integer[ 0 ]))); //share both end points actual = seq.LCS(fibonacci, rand); assertTrue(Arrays.equals(exp2, actual.toArray( new Integer[ 0 ]))); //share no points actual = seq.LCS(fibonacci, bad); assertEquals( 0 , actual.size()); //share all points actual = seq.LCS(fibonacci, fibonacci); assertEquals(fibonacci.length, actual.size()); int i = fibonacci.length; for ( int e : actual) { assertEquals(e, fibonacci[--i]); } } } |