Space Efficient LCS
by Isai Damier, Android Engineer @ Google
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 | #======================================================================= # Author: Isai Damier # Title: Longest Common Subsequence # Project: geekviewpoint # Package: algorithms # # Statement: # Given two sequences, find the longest common subsequence # between them. # # Time Complexity: O(n*m) # where m and n are the size of the input lists, respectively # # Sample Input: awaited, alpine # Sample Output: aie # # DEFINITION OF SUBSEQUENCE: # A sequence is a particular order in which related objects follow # each other (e.g. DNA, Fibonacci). A sub-sequence is a sequence # obtained by omitting some of the elements of a larger sequence. # # SEQUENCE SUBSEQUENCE OMISSION # [3,1,2,5,4] [1,2] 3,5,4 # [3,1,2,5,4] [3,1,4] 2,5 # # SEQUENCE NOT SUBSEQUENCE REASON # [3,1,2,5,4] [4,2,5] 4 should follow 5 # # STRATEGY: # An easy way to find a longest common subsequence of characters between # two words is to first track the lengths of all the common sequences # and then from those lengths pick a maximum; finally, from that # maximum length, trace out the actual longest common subsequence # between the two words. # # Let's use a table C (i.e. list C) to track the lengths of all common # sub-sequences between the two input words. Naturally, C starts with # all cells set to zero. Then to fill up C, we walk thru the two words, # comparing characters. Each time we find a match, we mark the # corresponding cell in C as one plus whatever maximum we had seen so # far for that particular sub-sequence. # # When we are done filling up C, we find a maximum length z in C. Then # using the indices of z, we trace that actual sequence and return it # as the answer. # # The ensuing code shows the details of how we fill C and then from C # extract the actual LCS. # # ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ ~~~~ # # The remaining paragraphs contain details that you may not particularly # care for. You may skip them and go right to the code. # # Finding the LCS of two sequences X and Y, i.e. LCS(X,Y), is similar # in structure to finding the Fibonacci of some number n, i.e. fib(n). # Both problems exhibit optimal substructures. Which is to say, # for instance, the solution to fib(5) contains the solution to fib(4) # which in turn contains fib(3), and so on. Similarly, the solution # of LCS(X,Y) contains the solution of LCS(X-1,Y), which in turn # contains the solution of LCS(X-1,Y-1), and so on. Whenever a problem # exhibits "optimal substructure", we can use recursion to solve it: # # _ # | fib(n-1)+fib(n-2) if n > 1 # fib(n) = | n if n == 1 or n == 0 # |_ # # For the LCS recurrence relation, a few notes are necessary: # Let C(i,j) be the length of the prefix of LCS(X,Y) given by # # C(i,j) = |LCS(X[1..i],Y[1..j])| # # so that if the length of X is m and the length of Y is n, # # C(m,n) = |LCS(X,Y)| # # Then the recurrence relation for filling C is # _ # | C(i-1,j-1)+1 if X(i) == Y(j) # C(i,j) = | max[C(i-1,j),C(i,j-1)] otherwise # |_ # # In addition to "optimal substructures", the LCS(X,Y) like the fib(n) # contains overlapping subproblems. What we mean is this: In solving # f(5), we meet f(2) both as a subproblem of f(4) and as a subproblem # of f(3). # # fib(5) # / \ # fib(4) fib(3) :fib(5)=fib(4)+fib(3) # / \ / \ # fib(3) fib(2) fib(2) fib(1) :fib(5)=fib(3)+fib(2)+fib(2)+fib(1) # # Whenever a problem exhibits "overlapping subproblems", if you use # recursion as a strategy you will end up solving the same subproblems # more than once. Therefore, it makes more sense to track subproblems # using a table so that once you solve a subproblem, you can simply # lookup it's solution next time you need. # # CONCLUSION: # In the foregoing "strategy" section, we essentially described # dynamic programming. # # Dynamic Programming: (aka Dynamic Tabling) is a strategy for solving # problems, which exhibit optimal substructures and overlapping # subproblems, by using a table to track the solution of subproblems # that have already been solved so as not to solve a subproblem more # than once. This tracking or note taking is usually referred to as # memoization, meaning writing in a memo. #======================================================================= #======================================================================= # Space Complexity: O(min(n,m)) # where m and n are the size of the input lists, respectively. # # There are a number of reduced space LCS algorithms. This is just # one approach. #======================================================================= def spaceEfficientLCS_length( A, B ): if len ( A ) > len ( B ): return LCS_length( A, B ) return LCS_length( B, A ); def LCS_length( B, A ): m, n = len ( A ) , len ( B ) T = [[ 0 for y in range ( n + 1 )] for x in range ( 2 )] for i in range ( m - 1 , - 1 , - 1 ): ndx = i & 1 ; # odd or even for j in range ( n - 1 , - 1 , - 1 ): if A[i] = = B[j]: T[ndx][j] = 1 + T[ 1 - ndx][j + 1 ] else : T[ndx][j] = max ( T[ 1 - ndx][j], T[ndx][j + 1 ] ) return T[ 0 ][ 0 ] |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 | import unittest from dynamic_programming import LongestCommonSubsequence as subsequence class Test( unittest.TestCase ): def testSpaceEfficientLCS_length( self ): fibonacci = [ 0 , 1 , 1 , 2 , 3 , 5 , 8 , 13 , 21 , 34 , 55 , 89 , 144 , 233 , 377 , 610 , 987 ] prime = [ 2 , 3 , 5 , 7 , 11 , 13 , 17 , 19 , 23 , 29 , 31 , 37 , 41 , 43 , 47 , 53 , 59 , 61 , 67 , 71 , 73 , 79 , 83 , 89 , 97 , 101 , 103 ] rand = [ 0 , 1 , 21 , 45 , 144 , 610 , 987 ] bad = [ 102 , 104 , 106 , 108 , 110 , 112 , 114 ] exp1, exp2 = 5 , 6 # no shared end points actual = subsequence.spaceEfficientLCS_length( fibonacci, prime ) self .assertEquals( exp1, actual ) # share both end points actual = subsequence.spaceEfficientLCS_length( fibonacci, rand ) self .assertEquals( exp2, actual ) # share no points actual = subsequence.spaceEfficientLCS_length( fibonacci, bad ) self .assertEquals( 0 , actual ) # share all points actual = subsequence.spaceEfficientLCS_length( fibonacci, fibonacci ) self .assertEquals( len ( fibonacci ), actual ) |