A FAST PARALLEL ALGORITHM FOR FINDING THE LARGEST COMMON 4-CONNECTED COMPONENT FROM TWO MATRICES

Original scientific paper We describe a new design of parallel algorithm for solving the two-dimensional longest common substring (2D LCS) problem, taking advantage of the multi-core graphic processing unit architecture offered by Compute Unified Device Architecture (CUDA). In this article we also define the 2D LCS problem as finding the largest common 4-connected component from two input matrices and present an algorithm which can exactly solve this problem in 0 (mnst/P) time with a P-core GPU.


Introduction
Finding the longest string that is a substring common to two given strings is a classic problem in string analysis.This is the LCS [1] problem (different from the longest common subsequence [2] problem, which finds the longest subsequence common to two sequences).Many applications rely on the availability of efficient LCS routines as a basis for their own efficiency.Biological applications such as DNA microarrays [3] often need to compare the DNA sequences of two different genes.File comparison (UNIX diff command, CVS, file fragments predicting [4]) is fundamentally a LCS or longest common subsequence problem.The LCS routine is also a building block of string operations and string matching applications like spell checkers.It is therefore important to continue to explore efficient LCS techniques on emerging computing architecture.
Compute Unified Device Architecture (CUDA) [5] is a parallel computing platform and programming model introduced by NVIDIA.Since GPUs have been at the leading edge of chip-level parallelism for a long time, CUDA provides a general-purpose platform for programmers to exploit the massive parallel computational power of GPU.GPUs have a parallel throughput architecture taking a different approach with CPUs that emphasizes executing many concurrent threads, rather than executing a single thread quickly.Although increasing the CPU clock speed has always been a reliable source for improved performance, the novel GPU approach manages to get rid of the long sophisticated computing pipeline and makes a great breakthrough in multi-core architecture.Current NVIDIA GPU, for example GeForce GTX TITAN, contains up to 2688CUDA cores running at 0,837 GHz, and in contrast to the corresponding consumer product in CPUs which is the Intel® Core™ i7-3970X with 6 cores running at 3,5 GHz, GPUs have a tremendous advantage in parallelism, especially when targeting a data-parallelism problem [6].
In this paper, we present a detailed method with analysis of what we believe is the first algorithm which exactly solves the two-dimensional longest common substring (2D LCS) problem and also benefits from multicore GPUs using CUDA.There has been an increasing motivation for the 2D LCS operation.For example, computer vision, pattern recognition and computational biology [7,8] have begun their research in twodimensional space.

Related work
For the sake of clarity, we first give the definition of original LCS problem.
Definition 2.1: Find the longest substring common to two given strings.
Input: Given string  =  1  2 ⋯   of length m and string  =  1  2 ⋯   of length n.
Output: The result string  =  1  2 ⋯   of length  is a common substring of  and  and  is maximized.
The classical method for the LCS problem based on the dynamic programming principle is as follows: The idea is to find all the lengths of the longest common suffix (LCSuff) for all pairs of prefixes of the strings and store these lengths in a matrix.Construct a score matrix  of size ( + 1) × ( + 1) , in which [, ](1 ≤  ≤ , 1 ≤  ≤ ) records the length of the longest common suffix for two prefixes  1… and  1… .The length is filled by the following recurrence equation: The maximum of these lengths is the length of the result string  (LCSubstr): where 1 ≤  ≤ , 1 ≤  ≤  .
After we find out the value of  and point (, ) corresponding to the particular point (, ) where can reach its maximum, the result string  can be reconstructed by tracing the diagonal from the corresponding point to upper left in matrix .
Because the main operation of this algorithm is filling the score matrix and each operation takes constant time, its time complexity is (mn).

Figure1
The dynamic programming LCS score matrix  of input strings "EFEF" and "FEEFE"

2D LCS Problem
Extending the longest common substring problem to a two dimensional problem, the input should be two matrices which can be not only character matrices [8] but also number matrices (images).Unlike the original problem, the result of the 2D LCS problem does not need to have the same structure as its inputs, which is actually the largest 4-connected component [9] that is common to two input matrices.As we solve the 2D LCS problem by the dynamic programming principle (in contrast to 2D Suffix Tree [10]), the algorithm can be used to process images directly (number matrices).The formal definition of 2D LCS is given below.

Algorithm 4.1 Method
The main idea of our algorithm for solving 2D LCS problem is based on the dynamic programming principle.We extend the classic method mentioned in the previous section to 2D space and enhance its performance by parallelism.The main data structure of our algorithm is a data cube instead of a matrix, in which the two input matrices stay outside the top and left faces of the data cube and the data cube itself stores the corresponding comparison result of the two matrices.
However, we cannot find the largest 4-connected component common to  and  by only analysing this data cube.If we view the data cube as a stack of matrices from back to front as in Fig. 2(b), then each matrix is the score matrix of the strings in the corresponding columns of  and  .Thus we cannot compare strings from different columns of two matrices, and it would lead to only one optimal substructure of the original problem as  2 in Fig. 2

Step (1)
Before going to step (1), we need to ensure the number of columns in matrix  is larger than that in .If . < ., exchange matrix  with matrix .
We can now shift matrix  through  like Fig. 3. Assuming matrix  as the stable surface and matrix  as the moving surface that shifts one column upward for each iteration, there are three conditions depending on the relative positions of matrix  and  in 2D space as Fig. 4. Definingℎ as the iterator, it also represents the relative position of  and , which is approximately the distance between the last column of  and the first column of .Copying one element from one matrix to another takes constant time so that the time for extracting one matrix depends on its size.If we execute step (1) on CPU, at each iteration it takes at most ( + ) = ( × max (, )) time and particularly () if  is much bigger than  because in condition (b) we only need to extract matrix  .Since iterator ℎ = ( + ) = (max(, ) = () (. > ., thus  > ), the total running time for step (1) is �ℎ( + )� = ( × max (, )) and () if either  is much bigger than  or s approaches ().
We can also execute step (1) on GPU by CUDA programming architecture, and it is very trivial to parallelize step (1).We set up the thread number for running COPY (CUDA kernel for extracting one matrix) as the size of its target matrix, and each thread in the GPU copies one element where its thread ID is corresponding to one in the target matrix.Assuming the GPU has  cores, the total running time on GPU for step (1) is (/).

Step (2)
We can now extend the original algorithm for LCS problem to 2D space, but for the purpose of easy-toparallelize, the algorithm would be decomposed into 2 parts.
First part.In this part, the algorithm would compare the temporary matrix  with  and fill the data cube with the comparison result.Define the data cube as  and the top-left-front corner of the data cube as the index origin which + axis points downwards, + axis points backwards and + axis points rightwards.The equation for filling the data cube is described below.The time complexity of this part is () for each iteration.The total time complexity is (ℎ) = ().We can also parallelize this part by CUDA.Set up the thread number for running COMPARE (CUDA kernel for the first part algorithm) as  × .Then each thread starts filling the cube one element a time from left face to right face ( 0 <  ≤ . ) and manipulates only m elements.Each element in matrix  ejects one thread with a thread ID corresponding to the (, ) position in .Assuming the GPU has  cores, the total running time on GPU for the first part algorithm is (/).
Second part.In the second part, the goal of the algorithm is to find out the largest 4-connected component common to matrix  and .In the original LCS problem, the mirror line as in Fig. 1 which contains information about the common substring of two input matrices has become a mirror plane as in Fig. 5 after adding one more dimension to the LCS problem.After comparing and contrasting Eq. ( 9) and Eq. ( 1), the data cube has not been calculating the size of each common 4connected component.Thus we have to finish the statistics on the size of each common 4-connected component in every mirror inclined plane inside the data cube.The statistics work is very similar to the process of connect component labelling.It searches the mirror plane line-to-line, top to bottom, to calculate the size of every connect component it met instead of assigning blob labels.The algorithm consists of two procedures: one is called SEARCH_PIECE which searches the plane pixel by pixel and marks it as "visited", but when it comes a pixel belonged to a connect component and the pixel has not been visited yet, it calls EAT_PIECE; another procedure is called EAT_PIECE which tries to figure out the size of the connect component that the pixel belongs in and marks every pixel in this connect component as "visited".
It also updates the temporary result for the plane that stores the size of connected component if the new size is bigger.

EAT_PIECE
Define the index for the top-left pixel in the mirror plane as (, ) and the value of the corresponding cell in data cube  as (, ).The value of data cube consists of two bits: one stands for visited or unvisited; the other stands for comparison result.The process of the whole second part algorithm can also be run on GPU.Set up the thread number for running SEARCH_PIECE (CUDA version) as ( + ) .Then each thread searches one mirror plane in the data cube and all mirror planes in the cube can be searched concurrently.Assuming the GPU has  cores, the total running time on GPU for the second part algorithm is ( × max(, ) /).

Notations and time analysis
At last, the result extraction is trivial.One way is to store the temporary largest component and its size among the components extracted by EAT_PIECE whenever each thread for the mirror plane comes across.Then compare the results among all threads at each iteration by the CUDA REDUCTION scheme [11].Another method is similar to the first one, but only stores the index that is the trigger element of EAT_PIECE and size of the temporary largest component.After all iterations, we only need to run EAT_PIECE to extract the result component from the last remaining index which belongs to one element of the result largest 4-connected component.
The most time consuming part is the second part of step (2) of the whole 2D LCS algorithm.Thus the algorithm has a time requirement of () if  approaches ().Assuming the GPU has  cores, the algorithm can run on GPU in (/) time.If  ≥  + , the algorithm takes () time.

Experiments
In the experiment, we contrast our algorithm with MatrixMatchMaker (MMMvII [12,13]) algorithm which finds the largest common submatrices between pairs of phylogenetic distance matrices to detect protein coevolution.Although the MMMvII algorithm has different result structure that is more relaxed than the 2D LCS definition, it is still a good candidate for comparison.After this experiment, our algorithm will be more functional and believable.We believe that it is the first algorithm which can find the largest common 4-connected component from two matrices.
Because we only need to compare the time consumption of two algorithms that actually solve two different problems, the input test data is simplified as two square matrices which are generated randomly.We test ten sizes of square matrix which vary from 110 × 110 to 200 × 200.The MMMvII results are collected by summiting the input test data to the official website [14].The test platform of our algorithm is NVIDA GeForce GTX 550 Ti with 192 CUDA cores.In Fig. 6, it is easy to figure out that our algorithm has an average 2,27 speedup in contrast to MMMVII even when it is solving a much more complex problem.

A kind of variation
The algorithm can still be improved the degree of parallelism, but the variation version requires more hardware resource.It needs  1 GPUs and each GPU has  2 cores.The main idea is to unfold the iteration of the algorithm by multithreading.
In step (1) of the algorithm, we create one thread at the beginning of each iteration and deploy the rest of algorithm to the new thread.This new thread would pick up one idle GPU to finish the rest of the algorithm.Thus this variation algorithm takes (/ 1  2 ) time.If  1 ≥ h and  2 ≥  + , it takes () time which is also the theoretical limit for the 2D LCS problem.

Conclusions
We give the definition of the 2D LCS problem as finding the largest 4-connected component from two matrices and give the first algorithm which can achieve this task in (/) time.We also introduce a variation of this algorithm working at (/ 1  2 ) time.This algorithm can be used as a basic building block or parallel scheme of many biological and image problems and when the developer has a well-defined problem size and enough hardware resource, the variation algorithm can work at () time.

Figure 2
Figure 2 The 2D LCS Data Cube

Definition 3 . 1 :
Find the largest 4-connected component common to two given matrices.Input: Give matrix  with  rows and  columns and matrix  with  rows and  columns.Output: The result 4-connected component  of size  is a common component of  and  and  is maximized.The definition of the 4-connected component is slightly different from its original meaning in binary valued digital imaging.The new definition is given below.Definition 3.2: A set of matrix elements common to  and , , is a 4-connected component if for every pair of elements   and   in  , there exists a sequence of elements   , … ,   such that: (a) All elements in the sequence are in set  i.e. are matrix elements common to  and .(b) Every 2 elements that are adjacent in the sequence are 4-neighbors.

Figure 3
Figure 3 Shift A through B

Figure 4
Figure 4 Three conditions of shifting

Figure 6
Figure 6 Experiment result of the MMMvII and 2D LCS algorithm because of missing one degree of freedom.Moreover, the data cube cannot deal with the situation that  and  have different number of columns.Thus, the algorithm needs three basic steps to solve 2D LCS problem: (i) Shift matrix  through  by columns and extract corresponding matrices , .(ii) Compare matrix  and  and find the size of the largest 4-connected component common to ,  in the exact same columns.(iii) Repeat last two steps until  and  have no more cross columns, and the 4-connected component common that is the largest component in size among all of them is the result.