Longest Common Subsequence
by Isai Damier, Android Engineer @ Google

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
/***********************************************************************
 * Author: Isai Damier
 * Title: Longest Common Subsequence
 * Project: geekviewpoint
 * Package: algorithms
 *
 * Statement:
 *   Given two sequences, find the longest common subsequence
 *   between them.
 *
 *  Time Complexity: O(n*m)
 *    where m and n are the size of the input arrays, respectively
 *
 * Sample Input: awaited, alpine
 * Sample Output: aie
 *
 * DEFINITION OF SUBSEQUENCE:
 *   A sequence is a particular order in which related objects follow
 *   each other (e.g. DNA, Fibonacci). A sub-sequence is a sequence
 *   obtained by omitting some of the elements of a larger sequence.
 *
 *   SEQUENCE       SUBSEQUENCE     OMISSION
 *   {3,1,2,5,4}     {1,2}            3,5,4         
 *   {3,1,2,5,4}     {3,1,4}          2,5
 *
 *   SEQUENCE       NOT SUBSEQUENCE   REASON
 *   {3,1,2,5,4}     {4,2,5}           4 should follow 5
 *    
 * STRATEGY:
 *   An easy way to find a longest common subsequence of characters between
 *   two words is to first track the lengths of all the common sequences
 *   and then from those lengths pick a maximum; finally, from that
 *   maximum length, trace out the actual longest common subsequence
 *   between the two words.
 *
 *   Let's use a table C (i.e. array C) to track the lengths of all common
 *   sub-sequences between the two input words. Naturally, C starts with
 *   all cells set to zero. Then to fill up C, we walk thru the two words,
 *   comparing characters. Each time we find a match, we mark the
 *   corresponding cell in C as one plus whatever maximum we had seen so
 *   far for that particular sub-sequence.
 *
 *   When we are done filling up C, we find a maximum length z in C. Then
 *   using the indices of z, we trace that actual sequence and return it
 *   as the answer.
 *
 *   The ensuing code shows the details of how we fill C and then from C
 *   extract the actual LCS.
 *
 *          ~~~~    ~~~~    ~~~~    ~~~~    ~~~~    ~~~~    ~~~~
 *
 *   The remaining paragraphs contain details that you may not particularly
 *   care for. You may skip them and go right to the code.
 *
 *   Finding the LCS of two sequences X and Y, i.e. LCS(X,Y), is similar
 *   in structure to finding the Fibonacci of some number n, i.e. fib(n).
 *   Both problems exhibit optimal substructures. Which is to say,
 *   for instance, the solution to fib(5) contains the solution to fib(4)
 *   which in turn contains fib(3), and so on. Similarly, the solution
 *   of LCS(X,Y) contains the solution of LCS(X-1,Y), which in turn
 *   contains the solution of LCS(X-1,Y-1), and so on. Whenever a problem
 *   exhibits "optimal substructure", we can use recursion to solve it:
 *  
 *               _
 *              | fib(n-1)+fib(n-2)    if n > 1
 *     fib(n) = | n                    if n == 1 or n == 0
 *              |_
 *             
 *   For the LCS recurrence relation, a few notes are necessary:
 *   Let C(i,j) be the length of the prefix of LCS(X,Y) given by
 *
 *     C(i,j) = |LCS(X[1..i],Y[1..j])|
 *
 *   so that if the length of X is m and the length of Y is n,
 *  
 *     C(m,n) = |LCS(X,Y)|
 *
 *   Then the recurrence relation for filling C is
 *               _
 *              | C(i-1,j-1)+1            if X(i) == Y(j)
 *     C(i,j) = | max{C(i-1,j),C(i,j-1)}  otherwise
 *              |_
 *
 *   In addition to "optimal substructures", the LCS(X,Y) like the fib(n)
 *   contains overlapping subproblems.  What we mean is this: In solving
 *   f(5), we meet f(2) both as a subproblem of f(4) and as a subproblem
 *   of f(3).
 *
 *               fib(5)
 *             /        \
 *         fib(4)        fib(3)       :fib(5)=fib(4)+fib(3)
 *         /    \       /     \
 *     fib(3) fib(2)  fib(2)  fib(1)  :fib(5)=fib(3)+fib(2)+fib(2)+fib(1)
 *
 *   Whenever a problem exhibits "overlapping subproblems", if you use
 *   recursion as a strategy you will end up solving the same subproblems
 *   more than once. Therefore, it makes more sense to track subproblems
 *   using a table so that once you solve a subproblem, you can simply
 *   lookup it's solution next time you need.
 *
 * CONCLUSION:
 *   In the foregoing "strategy" section, we essentially described
 *   dynamic programming.
 *
 *   Dynamic Programming: (aka Dynamic Tabling) is a strategy for solving
 *     problems, which exhibit optimal substructures and overlapping
 *     subproblems, by using a table to track the solution of subproblems
 *     that have already been solved so as not to solve a subproblem more
 *     than once. This tracking or note taking is usually referred to as
 *     memoization, meaning writing in a memo.
 *
 **********************************************************************/
 package algorithm.programming.dynamic;
 
import java.util.ArrayList;
import java.util.List;
 
public class LongestCommonSubsequence {
 
  /****
   * Space Complexity: O(n*m)
   *    where m and n are the size of the input arrays, respectively
   *
   * This solution first computes a table of lengths and then from that
   * table computes an actual longest common subsequence.
   ****/
  public List<Integer> LCS(int[] A, int[] B) {
    int[][] m = lcsTable(A, B);
    List<Integer> sequence = actualLCS(m, A, B);
    return sequence;
  }
 
  /****
   * LCS table of lengths
   ****/
  private int[][] lcsTable(int[] A, int[] B) {
    int[][] m = new int[A.length + 1][B.length + 1];
    for (int x = 1; x <= A.length; x++) {
      for (int y = 1; y <= B.length; y++) {
        if (A[x - 1] == B[y - 1]) {
          m[x][y] = 1 + m[x - 1][y - 1];
        } else {
          m[x][y] = Math.max(m[x][y - 1], m[x - 1][y]);
        }
      }
    }
    return m;
  }
 
  /****
   * Extract an actual LCS from the length table, which effectively
   * contains all the possible longest common sequences.
   ****/
  private List<Integer> actualLCS(int[][] m, int[] A, int[] B) {
    List<Integer> result = new ArrayList<Integer>();
    int x = A.length, y = B.length;
    while (x > 0 && y > 0) {
      if (A[x - 1] == B[y - 1]) {
        result.add(A[x - 1]);
        x--;
        y--;
      } else if (m[x][y - 1] > m[x - 1][y]) {
        y--;
      } else {
        x--;
      }
    }
    return result;
  }
}