Longest Common Subsequence
by Isai Damier, Android Engineer @ Google

Code
Unit Test
Sponsors

#=======================================================================
# Author: Isai Damier
# Title: Longest Common Subsequence
# Project: geekviewpoint
# Package: algorithms
#
# Statement:
# Given two sequences, find the longest common subsequence
# between them.
#
# Time Complexity: O(n*m)
#  where m and n are the size of the input lists, respectively
#
# Sample Input: awaited, alpine
# Sample Output: aie
#
# DEFINITION OF SUBSEQUENCE:
# A sequence is a particular order in which related objects follow
# each other (e.g. DNA, Fibonacci). A sub-sequence is a sequence
# obtained by omitting some of the elements of a larger sequence.
#
# SEQUENCE       SUBSEQUENCE     OMISSION
# [3,1,2,5,4]     [1,2]            3,5,4
# [3,1,2,5,4]     [3,1,4]          2,5
#
# SEQUENCE       NOT SUBSEQUENCE   REASON
# [3,1,2,5,4]     [4,2,5]           4 should follow 5
#
# STRATEGY:
# An easy way to find a longest common subsequence of characters between
# two words is to first track the lengths of all the common sequences
# and then from those lengths pick a maximum; finally, from that
# maximum length, trace out the actual longest common subsequence
# between the two words.
#
# Let's use a table C (i.e. list C) to track the lengths of all common
# sub-sequences between the two input words. Naturally, C starts with
# all cells set to zero. Then to fill up C, we walk thru the two words,
# comparing characters. Each time we find a match, we mark the
# corresponding cell in C as one plus whatever maximum we had seen so
# far for that particular sub-sequence.
#
# When we are done filling up C, we find a maximum length z in C. Then
# using the indices of z, we trace that actual sequence and return it
# as the answer.
#
# The ensuing code shows the details of how we fill C and then from C
# extract the actual LCS.
#
#        ~~~~    ~~~~    ~~~~    ~~~~    ~~~~    ~~~~    ~~~~
#
# The remaining paragraphs contain details that you may not particularly
# care for. You may skip them and go right to the code.
#
# Finding the LCS of two sequences X and Y, i.e. LCS(X,Y), is similar
# in structure to finding the Fibonacci of some number n, i.e. fib(n).
# Both problems exhibit optimal substructures. Which is to say,
# for instance, the solution to fib(5) contains the solution to fib(4)
# which in turn contains fib(3), and so on. Similarly, the solution
# of LCS(X,Y) contains the solution of LCS(X-1,Y), which in turn
# contains the solution of LCS(X-1,Y-1), and so on. Whenever a problem
# exhibits "optimal substructure", we can use recursion to solve it:
#
#             _
#            | fib(n-1)+fib(n-2)    if n > 1
#   fib(n) = | n                    if n == 1 or n == 0
#            |_
#
# For the LCS recurrence relation, a few notes are necessary:
# Let C(i,j) be the length of the prefix of LCS(X,Y) given by
#
#   C(i,j) = |LCS(X[1..i],Y[1..j])|
#
# so that if the length of X is m and the length of Y is n,
#
#   C(m,n) = |LCS(X,Y)|
#
# Then the recurrence relation for filling C is
#             _
#            | C(i-1,j-1)+1            if X(i) == Y(j)
#   C(i,j) = | max[C(i-1,j),C(i,j-1)]  otherwise
#            |_
#
# In addition to "optimal substructures", the LCS(X,Y) like the fib(n)
# contains overlapping subproblems.  What we mean is this: In solving
# f(5), we meet f(2) both as a subproblem of f(4) and as a subproblem
# of f(3).
#
#             fib(5)
#           /        \
#       fib(4)        fib(3)       :fib(5)=fib(4)+fib(3)
#       /    \       /     \
#   fib(3) fib(2)  fib(2)  fib(1)  :fib(5)=fib(3)+fib(2)+fib(2)+fib(1)
#
# Whenever a problem exhibits "overlapping subproblems", if you use
# recursion as a strategy you will end up solving the same subproblems
# more than once. Therefore, it makes more sense to track subproblems
# using a table so that once you solve a subproblem, you can simply
# lookup it's solution next time you need.
#
# CONCLUSION:
# In the foregoing "strategy" section, we essentially described
# dynamic programming.
#
# Dynamic Programming: (aka Dynamic Tabling) is a strategy for solving
#   problems, which exhibit optimal substructures and overlapping
#   subproblems, by using a table to track the solution of subproblems
#   that have already been solved so as not to solve a subproblem more
#   than once. This tracking or note taking is usually referred to as
#   memoization, meaning writing in a memo.
#======================================================================= 
 #====================================================================
# Space Complexity: O(n*m)
# where m and n are the size of the input lists, respectively
#
# This solution first computes a table of lengths and then from that
# table computes an actual longest common subsequence.
#====================================================================
def LCS( A, B ):
    m = lcsTable( A, B )
    return actualLCS( m, A, B )


#=======================================================================
# LCS table of lengths
#=======================================================================
def lcsTable( A, B ):
    m = [[0 for y in range( len( B ) + 1 )] for x in range( len( A ) + 1 )]
    for x in range( 1, len( A ) + 1 ):
      for y in range( 1, len( B ) + 1 ):
        if A[x - 1] == B[y - 1]:
          m[x][y] = 1 + m[x - 1][y - 1]
        else:
          m[x][y] = max( m[x][y - 1], m[x - 1][y] )

    return m



#=======================================================================
# Extract an actual LCS from the length table, which effectively
# contains all the possible longest common sequences.
#=======================================================================

def actualLCS( m, A, B ):
    result = []
    x , y = len( A ), len( B )
    while x > 0 and y > 0:
      if A[x - 1] == B[y - 1]:
        result.append( A[x - 1] )
        x -= 1
        y -= 1
      elif m[x][y - 1] > m[x - 1][y]:
        y -= 1
      else:
        x -= 1

    return result

import unittest
from dynamic_programming import LongestCommonSubsequence as subsequence

class Test( unittest.TestCase ):

  def testLCS( self ):
    fibonacci = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
                  610, 987]
    prime = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53,
              59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
    rand = [0, 1, 21, 45, 144, 610, 987]
    bad = [102, 104, 106, 108, 110, 112, 114]

    exp1 = [89, 13, 5, 3, 2]
    exp2 = [987, 610, 144, 21, 1, 0]

    # no shared end points
    actual = subsequence.LCS( fibonacci, prime )
    self.assertEquals( exp1, actual )

    # share both end points
    actual = subsequence.LCS( fibonacci, rand )
    self.assertEquals( exp2, actual )

    # share no points
    actual = subsequence.LCS( fibonacci, bad )
    self.assertEquals( 0, len( actual ) )

    # share all points
    actual = subsequence.LCS( fibonacci, fibonacci )
    actual.reverse()
    self.assertEquals( fibonacci, actual )

Longest Common Subsequence by Isai Damier, Android Engineer @ Google

Longest Common Subsequence
by Isai Damier, Android Engineer @ Google