A HYBRID BACKTRACKING ALGORITHM FOR AUTOMATIC TEST DATA GENERATION

Original scientific paper

As a fundamental issue in software testing, automatic test data generation is of crucial importance. It is essentially a constraint satisfaction problem that is solved by search algorithms. In our previous research, branch and bound was proposed as the constraint solver and the look-ahead methods were elaborated. Based on interval arithmetic and symbolic execution, this paper focuses on the look-back (backtracking) method, a hybridization of forward checking and conflict-directed backjumping that aims to improve the efficiency of backtracking in the search procedure. The closures of variables are used to help localize the conflicts that cause dead ends. Experiments demonstrate the effectiveness of the proposed hybrid backtracking method and its applicability in engineering.


Introduction
As a key stage in ensuring software reliability, software testing plays an indispensable part in software development. Automating software testing is an active research topic and is of practical value to industry [1]. As a fundamental issue in software testing, path-wise test data generation has long attracted attention, because a large number of problems in software testing can be converted into it in one way or another.
Theoretically, automatic path-wise test data generation is a constraint satisfaction problem (CSP) [2], where the path is made up of a node sequence in a control flow graph (CFG) [3]. To be exact, $X=\{x_1, x_2, \dots, x_n\}$ is a set of variables; $R=\{r_1, r_2, \dots, r_k\}$ is a finite set of constraints or relations between the variables along the path; $D=\{D_1, D_2, \dots, D_n\}$ is a set of domains, and $D_i \in D$ ($i=1, 2, \dots, n$) is a finite set of possible values for $x_i$. For the path concerned (denoted as $p$), $D$ is defined on the basis of the acceptable ranges of the variables. A solution to the problem is a set of values that instantiates each variable within its domain, denoted as $\{\langle x_1, V_1\rangle, \langle x_2, V_2\rangle, \dots, \langle x_i, V_i\rangle, \langle x_{i+1}, V_{i+1}\rangle, \dots, \langle x_n, V_n\rangle\}$ ($V_i \in D_i$), and makes $p$ feasible, which means that each constraint in $R$ is satisfied [4].
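The CSP formulation above can be made concrete with a small sketch. The two-variable path, its domains, and the helper name `is_solution` are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from typing import Callable, Dict, List, Set

Assignment = Dict[str, int]
Constraint = Callable[[Assignment], bool]


def is_solution(assignment: Assignment,
                domains: Dict[str, Set[int]],
                constraints: List[Constraint]) -> bool:
    """Return True if every variable takes a value from its domain D_i
    and every constraint in R holds for the assignment."""
    in_domain = all(assignment.get(x) in dom for x, dom in domains.items())
    return in_domain and all(r(assignment) for r in constraints)


# Hypothetical path with two branch predicates: x1 > x2 and x1 + x2 == 10.
domains = {"x1": set(range(0, 11)), "x2": set(range(0, 11))}
constraints = [
    lambda a: a["x1"] > a["x2"],
    lambda a: a["x1"] + a["x2"] == 10,
]

feasible = is_solution({"x1": 7, "x2": 3}, domains, constraints)
infeasible = is_solution({"x1": 3, "x2": 7}, domains, constraints)
```

Here `{"x1": 7, "x2": 3}` plays the role of test data that makes the path feasible, while the second assignment violates the first predicate.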
To solve the CSP and obtain the test data, it is necessary to abstract, propagate, and solve these constraints. Many researchers have worked in this field. Korel proposed an automatic test data generation method together with the definition of the branch function [5], which has inspired many researchers. DeMillo and Offutt [6] put forward a technique that adopted bisection and algebraic constraints to solve for test data designed to find particular types of faults, but there was not enough heuristic information to guide the search. ADTEST [7], proposed by Gallagher et al., considered only one input variable or one predicate at a time and iterated the solving process, which was inefficient and inapplicable to real-world programs. Gupta et al. [8] devised a dynamic way of generating test data for a particular path. Robschink [9] statically converted a program into Static Single Assignment (SSA) form, which usually resulted in systems with a huge number of constraints that on some occasions included irrelevant variables. BINTEST [10], developed by Beydeda et al., guided the search procedure with bisection, which could discard parts of the variable domains that possibly contained solutions. Euclide, proposed by Arnaud Gotlieb [11], was a testing tool based on constraint solving that was employed to verify safety-critical C programs, combining symbolic and numerical analyses, constraint propagation, integer linear relaxation, and search-based test data generation. Pachauri et al. [12] introduced three methods of ordering branches when selecting targets for coverage testing, and conducted experiments to evaluate branch ordering through memory and elitism in order to increase the performance of test data generation.
To construct an efficient constraint solving engine for automatic test data generation, we proposed best-first-search branch and bound (BFS-BB) [13] in our previous work, a heuristic method that adopts a classical artificial intelligence algorithm, namely branch and bound [14]. As an enhancement of BFS-BB, this paper optimizes the backtracking part by combining forward checking and conflict-directed backjumping, so as to increase the search efficiency. The closures of variables are used to facilitate the backtracking process.
Using BFS-BB with forward checking (introduced in Section 2) as the backtracking method, we tested some real-world benchmarks in the engineering project aa200c, available at http://www.moshier.net/, and found that the paths with backtracking accounted for 32 % of all the paths, whereas the search time consumed on them accounted for 81 % of the total time consumption. The efficiency of backtracking is therefore a key factor in the performance of the test data generation method. Consequently, we need to design a more efficient backtracking method for BFS-BB, and we also evaluate empirically how our method influences the efficiency of backtracking.
The remainder of this paper is organized as follows. Section 2 introduces the relevant concepts of backtracking. Section 3 elaborates the hybrid backtracking algorithm and its backtracking strategies. Section 4 presents experiments that analyze and evaluate the proposed backtracking strategies. The conclusion and directions for future research are presented in Section 5.

Backtracking search
The methods used to solve CSPs are usually based on backtracking search [15]. In summary, a backtracking algorithm finds a solution by extending partial solutions. At every search stage there is a current partial solution that the backtracking algorithm attempts to extend until it becomes a full solution. In the search procedure, variables are divided into three sets: past variables (PV, already instantiated), the current variable (being instantiated currently), and future variables (FV, not yet instantiated). An inconsistency is a partial solution, itself consistent and composed of the instantiations of past variables, that cannot be extended with any further variable and therefore cannot be part of a full solution. A dead end is encountered after all the values in the domain of the current variable have been tried without success. In this situation, some instantiated variables become uninstantiated, which means that they are removed from the current partial solution.
The techniques used to improve a search algorithm are classified as look-ahead and look-back methods. Look-ahead methods take effect every time the search is ready to extend the current partial solution, and look-back methods take effect whenever a dead end occurs and the search is ready for the backtracking step. The look-ahead methods in BFS-BB were elaborated in our previous work [13]. This paper focuses on the look-back methods, which decide which variable to backtrack to by analyzing the reason for the dead end, and which new constraints to record so that the same conflict will not appear again in the search [16].
Chronological backtracking (BT) [17] is the simplest yet most widely used backtracking algorithm: when a dead end occurs, it backtracks to the most recently instantiated variable. Rather than backtracking chronologically to the last instantiated variable, backjumping (BJ) [18] goes to the deepest past variable that is in conflict with the current variable. In conflict-directed backjumping (CBJ) [19], each variable has a conflict set consisting of the past variables that failed consistency checks with its current instantiation. Unlike these backward methods, forward checking (FC) [20] conducts consistency checks forward, that is, between the current variable and the future variables. BFS-BB adopted FC as its backtracking strategy. The hybrid algorithm of forward checking and conflict-directed backjumping (FC-CBJ) draws on the advantages of both FC and CBJ, and was reported to perform best among the common backtracking algorithms [19].
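As an illustration of extending a partial solution and undoing instantiations at a dead end, the following is a minimal sketch of plain chronological backtracking. The function names, the fixed variable order, and the two-variable example are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional


def backtrack(order: List[str],
              domains: Dict[str, List[int]],
              consistent: Callable[[Dict[str, int]], bool]
              ) -> Optional[Dict[str, int]]:
    """Chronological backtracking over a fixed variable order.

    `past` holds the instantiations of the past variables (PV) plus the
    current variable; variables after `level` in `order` are the future
    variables (FV).
    """
    past: Dict[str, int] = {}

    def extend(level: int) -> bool:
        if level == len(order):
            return True                 # every variable is instantiated
        current = order[level]
        for value in domains[current]:
            past[current] = value       # try to extend the partial solution
            if consistent(past) and extend(level + 1):
                return True
        past.pop(current, None)         # dead end: uninstantiate the variable
        return False

    return dict(past) if extend(0) else None


def example_consistent(partial: Dict[str, int]) -> bool:
    """Hypothetical constraints x1 > x2 and x1 + x2 == 10, checked only
    once both variables are instantiated."""
    if "x1" in partial and "x2" in partial:
        return (partial["x1"] > partial["x2"]
                and partial["x1"] + partial["x2"] == 10)
    return True


solution = backtrack(["x1", "x2"],
                     {"x1": list(range(11)), "x2": list(range(11))},
                     example_consistent)
```

In this example every value of `x1` from 0 to 5 leads to a dead end on `x2`, so the search returns to `x1` each time before it reaches a consistent pair.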

The hybrid backtracking strategy
As mentioned in Section 2, FC-CBJ performs more efficiently than other backtracking algorithms, so we adopt it as our backtracking algorithm in this paper. Interval arithmetic [21] and symbolic execution [3, 22] are adopted to facilitate the process of FC-CBJ. In addition, the closures of variables are proposed to localize the conflict more precisely. We elaborate FC and CBJ in Sections 3.1 and 3.2, respectively. The hybrid search algorithm is proposed in Section 3.3.

Forward checking
Forward checking is accomplished by interval arithmetic. To better explain the procedure, we define the branch condition as follows.
Definition 1. Let $B$ be the set $\{\text{true}, \text{false}\}$ and $D^a$ be the set of the domains of all the variables before the $a$th branch. If there are $k$ branches along the path, the branch condition $Br(n_{qa}, n_{qa+1}): D^a \rightarrow B$ ($a \in [1, k]$), where $n_{qa}$ is a branching node, is computed by Eq. (1).

$$Br(n_{qa}, n_{qa+1}) = \begin{cases} \text{true}, & D^a \cap \tilde{D}^a \neq \varnothing \\ \text{false}, & D^a \cap \tilde{D}^a = \varnothing \end{cases} \qquad (1)$$

In Eq. (1), $D^a$ meets the previous $a-1$ branch conditions and is the input for computing the $a$th branch condition. $\tilde{D}^a$, a set of volatile domains, is the result of computing $Br(n_{qa}, n_{qa+1})$ with $D^a$, and meets the $a$th branch condition.
$D^a \cap \tilde{D}^a \neq \varnothing$ means that $D^{a+1} = D^a \cap \tilde{D}^a$ satisfies all the previous $a-1$ branch conditions and the $a$th branch condition simultaneously, so that interval arithmetic can proceed to compute the remaining branch conditions, whereas $D^a \cap \tilde{D}^a = \varnothing$ means that the domain of at least one variable (which may be a future variable) is annihilated and $R$ is not met. For the $k$ branches along the path, all $k$ branch conditions must be true for $R$ to be met. This process is demonstrated by Fig. 1.
More precisely, the input of interval arithmetic is the set of the domains of all the variables, represented as $D^1$, together with the branch condition corresponding to the branch $(n_{q1}, n_{q1+1})$, where $n_{q1}$ is the first branching node evaluated. Typically, the branch condition $Br(n_{q1}, n_{q1+1})$ cannot be satisfied by all the values in $D^1$, but only by the values in a subset $D^2 \subseteq D^1$ that ensure the traversal of the branch $(n_{q1}, n_{q1+1})$; that is, evaluating $Br(n_{q1}, n_{q1+1})$ on $D^1$ yields $D^2$. Next, the branch condition $Br(n_{q2}, n_{q2+1})$ is evaluated on the set of the domains of all the variables ($D^2$). Again, $Br(n_{q2}, n_{q2+1})$ is generally satisfied only by a subset $D^3 \subseteq D^2$. This process carries on along $p$ until all the branch conditions are satisfied and $D^{k+1}$ is returned as the set of the domains of all the variables. The process is thus the propagation of the branch conditions along $p$ through the successive sets $D^1, D^2, \dots, D^{k+1}$, where $D^1 \supseteq D^2 \supseteq D^3 \supseteq \dots \supseteq D^k \supseteq D^{k+1}$, as shown in Fig. 1(a). If, however, $Br(n_{qh}, n_{qh+1})$ ($1 \le h \le k$) is false in this procedure, indicating the detection of a conflict, then interval arithmetic is terminated, as shown in Fig. 1(b).

In our method, the calculation of each branch condition is counted as one constraint check, which is coarse-grained compared with arc consistency checking algorithms such as AC-3 [23]. In the procedure of interval arithmetic, it is always the set of the domains of all the variables that is involved in the computation, which allows the domains of all the variables, including those in FV, to be reduced or annihilated. For convenience, we suppose that the variables are instantiated in the predefined order $x_1, x_2, \dots, x_n$. In the actual implementation, however, dynamic ordering is adopted, in which the state of the search determines the variable to instantiate next. Assuming that the current variable is $x_i$, the input of forward checking is the set of the domains of all the variables, which consists of three parts. The domains of the past variables are all fixed values, each with identical upper and lower bounds, that have been verified as consistent by forward checking; the domain of the current variable is a fixed value with identical upper and lower bounds that is being verified by forward checking; and the domains of the future variables are in general ranges of values that have been filtered by the past instantiations and will be filtered by the current instantiation. Since the main purpose of forward checking is to judge whether the assignment $V_i$ for the current variable $x_i$ will lead to an inconsistency or a possible domain annihilation, we simply use forward check ($V_i$) to denote it. In our previous research [24], interval arithmetic was iterated in order to increase the precision of forward checking.
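The propagation of branch conditions over shrinking domain sets can be sketched as follows. This is only an illustration under stated assumptions: integer intervals are stored as `(low, high)` pairs, and the two narrowing rules `less_than` and `sum_equals`, as well as the function `forward_check`, are hypothetical helpers rather than the paper's interval arithmetic library.

```python
from typing import Callable, Dict, List, Optional, Tuple

Interval = Tuple[int, int]               # closed integer interval (low, high)
Domains = Dict[str, Interval]
Branch = Callable[[Domains], Domains]    # narrows the domains for one branch


def intersect(a: Interval, b: Interval) -> Interval:
    """Intersection of two intervals; the result is empty when low > high."""
    return (max(a[0], b[0]), min(a[1], b[1]))


def less_than(x: str, y: str) -> Branch:
    """Branch condition x < y over integer intervals."""
    def narrow(d: Domains) -> Domains:
        out = dict(d)
        out[x] = intersect(d[x], (d[x][0], d[y][1] - 1))
        out[y] = intersect(d[y], (d[x][0] + 1, d[y][1]))
        return out
    return narrow


def sum_equals(x: str, y: str, c: int) -> Branch:
    """Branch condition x + y == c over integer intervals."""
    def narrow(d: Domains) -> Domains:
        out = dict(d)
        out[x] = intersect(d[x], (c - d[y][1], c - d[y][0]))
        out[y] = intersect(d[y], (c - d[x][1], c - d[x][0]))
        return out
    return narrow


def forward_check(domains: Domains, path: List[Branch]) -> Optional[Domains]:
    """Propagate the branch conditions along the path in order.

    Returns the final narrowed domains, or None as soon as some domain is
    annihilated (the branch condition evaluates to false).
    """
    current = dict(domains)
    for branch in path:
        current = branch(current)
        if any(low > high for low, high in current.values()):
            return None
    return current


path = [less_than("x", "y"), sum_equals("x", "y", 10)]
initial = forward_check({"x": (0, 10), "y": (0, 10)}, path)
accepted = forward_check({"x": (3, 3), "y": (0, 10)}, path)   # x fixed to 3
rejected = forward_check({"x": (7, 7), "y": (0, 10)}, path)   # x fixed to 7
```

Fixing `x` to a single value mimics the instantiation of the current variable: with `x = 3` the future variable `y` is filtered down to 7, whereas with `x = 7` the second branch condition annihilates a domain and the check reports a conflict.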

Conflict-directed backjumping
A backtracking search can be regarded as traversing a search tree, and we give the definition of level.
Definition 2. The level of a variable marks its order of instantiation in the search process. It is exactly the number of instantiated variables. For example, level 0 marks the root of the search tree, where no variable has been instantiated; level 1 marks $x_1$, which is the first variable to be instantiated; and so on. The levels nearer to the root are shallower levels, and the levels farther from the root are deeper levels.
Fig. 2 shows an example of CBJ, in which the jump is taken between two variables that are not on neighboring levels. Since the backjumping is directed by conflict, some definitions related to conflict are needed.
The closure of a variable is a data structure storing the variable and the variables that have direct or indirect relations with it: a mapping that associates a variable with the variables bound to it by the relations along the path. All the variables mentioned in this paper are symbolic variables, and the simple example in Fig. 3 explains closure. If we attempt to generate test data for the path passing through all the if statements, then Tab. 1 gives the symbolic variable and the closure of each variable. From this analysis it can be concluded that if any past instantiation is responsible for the current inconsistency, that instantiation must have been made to one or more variables in the same closure as the current variable. The conflicting variable is defined as follows.
Definition 5. If the current instantiation $\langle x_i, V_i\rangle$ causes an inconsistency, and there is a set of past variables that are in the same closure as $x_i$ and whose corresponding domains are annihilated, then the deepest variable $x_j$ in this set is the conflicting variable of the current inconsistency.
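The notions of directly related variables, closure, and conflicting variable can be sketched as below. The paper computes the closure iteratively by its Eq. (2); the fixed-point iteration here is one plausible reading of that description, and the predicates, levels, and function names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def related_sets(predicates: List[Set[str]]) -> Dict[str, Set[str]]:
    """Map each variable to its directly related variables, i.e. the
    variables that share at least one predicate with it."""
    rel: Dict[str, Set[str]] = {}
    for pred in predicates:
        for v in pred:
            rel.setdefault(v, set()).update(pred - {v})
    return rel


def closure(x: str, rel: Dict[str, Set[str]]) -> Set[str]:
    """Variables reachable from x through direct or indirect relations,
    including x itself, computed by iterating until nothing new is added."""
    result = {x}
    frontier = [x]
    while frontier:
        v = frontier.pop()
        for w in rel.get(v, set()):
            if w not in result:
                result.add(w)
                frontier.append(w)
    return result


def conflicting_variable(current: str,
                         annihilated_past: Set[str],
                         rel: Dict[str, Set[str]],
                         level: Dict[str, int]) -> Optional[str]:
    """Deepest past variable that lies in the closure of `current` among
    the past variables reported as having annihilated domains."""
    candidates = [v for v in annihilated_past if v in closure(current, rel)]
    return max(candidates, key=lambda v: level[v]) if candidates else None


# Hypothetical path with predicates over (x1, x2), (x2, x3) and (x4, x5).
rel = related_sets([{"x1", "x2"}, {"x2", "x3"}, {"x4", "x5"}])
level = {"x1": 1, "x2": 2, "x4": 3}       # instantiation order of past variables
culprit = conflicting_variable("x3", {"x1", "x2", "x4"}, rel, level)
```

Although `x4` is the most recently instantiated variable, it lies in a different closure from `x3`, so the jump targets `x2` instead; this is the localization effect that closures are meant to provide.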

Hybrid backtracking algorithm
Since we combine FC and CBJ (adopting the closures of variables) into a hybrid backtracking algorithm, it is referred to as BFS-BB-Hybrid Backtracking (BFS-BB-HB) in this paper. Some notations in BFS-BB-HB are explained in Tab. 2. The path concerned is the input of BFS-BB-HB, including the constraints to be met ($R$), the set of input variables ($X$), and the domains corresponding to the variables ($D$). The test data (result) is initially null. First, the set of related variables and the closure are calculated for each variable, as shown by lines 2-10. While FV is not an empty set, which means that there are still uninstantiated variables, the following steps are taken. The variables in FV are sorted to determine the current variable $x^*$, and a value $V^*$ is selected from the domain of $x^*$ ($D^*$), as shown by lines 12 and 13; both steps have been elaborated in our previous work [25]. $S_{conflict}(x^*)$ is initialized as shown by line 14. Forward checking (line 15) is conducted to judge whether $V^*$ for $x^*$ will cause any inconsistency and to reduce the domains of the future variables. If forward checking succeeds, then $\langle x^*, V^*\rangle$ is added to result, and $x^*$ is removed from FV and put into PV, as shown by lines 27-29. Then the ordering of FV begins again. If forward checking fails, indicating that an inconsistency has been caused, then the conflicting variable must be determined. The past variables with annihilated domains are put into the set of the conflicting variables of $x^*$, and quicksort is used to find the deepest variable ($x_f$), which is the conflicting variable. In that case, CBJ goes directly to the level of $x_f$, as shown by lines 17-21. In some cases the conflict set of $x^*$ is empty; BT is then inevitable, and the variable on the level just above $x^*$ becomes the current variable (line 22). As shown by lines 23-25, the conflict information is used to select a value for the current variable, as introduced in our previous work [25]. Another round of forward checking then begins (line 26), and the process repeats until FV becomes empty. Finally, result is returned as the test data that makes the path feasible, as shown by line 30.
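To make the interplay of forward checking and conflict-directed backjumping concrete, the following is a compact sketch of the textbook FC-CBJ scheme on finite domains with binary constraints. It is not BFS-BB-HB itself: it uses a static variable order, tracks conflict sets of levels rather than closures, and omits the value selection heuristics of [25]. All names are hypothetical, and the n-queens encoding is used only as a small example.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence, Set, Union

Check = Callable[[int, int, int, int], bool]  # (var1, value1, var2, value2)


def fc_cbj(variables: List[int],
           domains: Dict[int, Sequence[int]],
           consistent: Check) -> Optional[Dict[int, int]]:
    """Forward checking with conflict-directed backjumping (static order)."""
    n = len(variables)
    assignment: Dict[int, int] = {}

    def search(level: int,
               doms: Dict[int, List[int]],
               culprits: Dict[int, FrozenSet[int]]) -> Union[bool, Set[int]]:
        # Returns True on success, otherwise the set of shallower levels
        # that are responsible for the dead end at this level.
        if level == n:
            return True
        x = variables[level]
        conflict: Set[int] = set(culprits[x])  # levels that pruned x's domain
        for value in doms[x]:
            assignment[x] = value
            new_doms, new_culprits = dict(doms), dict(culprits)
            wiped = None
            for y in variables[level + 1:]:    # forward check future variables
                kept = [w for w in doms[y] if consistent(x, value, y, w)]
                if len(kept) < len(doms[y]):
                    new_doms[y] = kept
                    new_culprits[y] = culprits[y] | {level}
                if not kept:
                    wiped = y                  # domain annihilated
                    break
            if wiped is not None:
                conflict |= culprits[wiped]    # earlier levels share the blame
                continue
            result = search(level + 1, new_doms, new_culprits)
            if result is True:
                return True
            if level not in result:            # this level is not involved:
                del assignment[x]              # jump over it
                return result
            conflict |= result - {level}
        assignment.pop(x, None)
        return conflict

    start = {v: list(domains[v]) for v in variables}
    outcome = search(0, start, {v: frozenset() for v in variables})
    return dict(assignment) if outcome is True else None


def queens_ok(r1: int, c1: int, r2: int, c2: int) -> bool:
    """Two queens in rows r1, r2 and columns c1, c2 do not attack each other."""
    return c1 != c2 and abs(c1 - c2) != abs(r1 - r2)


def solve_queens(n: int) -> Optional[Dict[int, int]]:
    rows = list(range(n))
    return fc_cbj(rows, {r: range(n) for r in rows}, queens_ok)
```

Calling `solve_queens(6)` returns a mapping from rows to columns, and `solve_queens(3)` returns `None` because no placement exists. When a deeper level reports a conflict set that does not contain the current level, the search skips the remaining values of the current variable, which is the backjump.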
During the BFS-BB-HB search, a variable may be in any of four sets (future, current, past, and conflicting), depending on the search operations taken on it, as shown in Fig. 4. At the very beginning of the search all the variables are in FV, whereas when the search ends all the variables are in PV. The transformation of variables is also illustrated in Fig. 4.

Experimental analyses and empirical evaluations
To test the effectiveness of BFS-BB-HB, we carried out a large number of experiments. The paths to be covered were provided by CTS. The algorithms were implemented in Java and run on the Java Runtime Environment (JRE). Since closure is an important factor in our backtracking strategy, the experiments in Section 4.1 evaluated the effectiveness of our method in terms of generation time and the number of backtracking operations for varying numbers of closures. In Section 4.2, the hybrid backtracking method FC-CBJ (adopted by BFS-BB-HB) was compared with the chronological backtracking method FC (adopted by BFS-BB), using some CSP problems as the test beds. BFS-BB-HB was used to test an engineering project in Section 4.3.

Testing different numbers of closures
First, we evaluated the influence of the closures of variables on the performance of test data generation. This was done by repeatedly running BFS-BB-HB on generated test programs. Using statement coverage, in each test the program had 10 input variables and $n$ ($n \in [1, 10]$) if statements, and there was only one path to be covered, which was made up entirely of true branches, i.e., all the expressions were in the same form as the corresponding predicates. Each expression contained all 10 variables, which could be in different closures. We tried to make each closure contain roughly the same number of variables: when there were 1, 2, and 5 closures, each closure contained the same number of variables, namely 10, 5, and 2, respectively; when there were 3 closures, the closures contained 3, 3, and 4 variables, respectively; and when there were 4 closures, they contained 2, 2, 3, and 3 variables. Taking 2 closures as an example, each if statement was an expression in the form of Eq. (3).
In Eq. (3), $a_{n1}, a_{n2}, \dots, a_{n10}$ ($n = 1, 2, \dots, 10$) were numbers generated by a random function, rel_op ∈ {>, ≥, <, ≤, =, ≠}, and const[n][2] was an array composed of constants generated by a random function. The randomly generated $a_{ni}$ ($i = 1, 2, \dots, 10$), const[n][1], and const[n][2] were selected to make the path feasible. Thus, the linear relation between the variables in the same closure was constructed in the strongest manner. Each program was tested 100 times, and the average time required to generate the test data and the average number of backtracking operations per test were recorded. The experiments were performed on 32-bit Windows 7 with a 3.8 GHz Pentium 4 and 4 GB of memory. The results of the comparison are shown in Fig. 5 and Fig. 6.
Fig. 5(a) shows that, for a fixed number of closures, the average generation time increased with the number of expressions, which was more obvious when there were 6-10 expressions in the programs under test (PUTs). The reason is that the complexity of the constraints increased with the number of expressions, and the search was basically backtrack-free when there were not too many constraints. Fig. 5(b) shows that, for the same number of expressions, the generation time decreased as the number of closures increased, because more closures meant fewer variables involved in each constraint, which reduced the complexity of the search. The same held for the relationship between the average number of backtracking operations and the number of closures, as shown in Fig. 6.
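One way to obtain randomly generated linear predicates that keep the path feasible is to fix a hidden witness assignment first and derive each constant from it. The sketch below illustrates that idea only; the coefficient range, the witness approach, and all names are assumptions, and the exact shape of the paper's Eq. (3) is not reproduced here.

```python
import operator
import random
from typing import Callable, Dict, List, Tuple

OPS: Dict[str, Callable[[int, int], bool]] = {
    ">": operator.gt, ">=": operator.ge, "<": operator.lt,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}
Predicate = Tuple[List[int], str, int]   # coefficients, rel_op, constant


def make_predicate(rng: random.Random, witness: List[int],
                   group: List[int]) -> Predicate:
    """Random linear predicate over the variables in `group` (one closure),
    with the constant chosen so that `witness` satisfies it."""
    coeffs = [rng.randint(-9, 9) for _ in group]
    lhs = sum(a * witness[i] for a, i in zip(coeffs, group))
    op = rng.choice(sorted(OPS))
    # Shift the constant where needed so the relation holds at the witness.
    offset = {">": -1, "<": 1, "!=": 1}.get(op, 0)
    return coeffs, op, lhs + offset


def holds(pred: Predicate, values: List[int], group: List[int]) -> bool:
    """Evaluate the predicate on concrete variable values."""
    coeffs, op, const = pred
    return OPS[op](sum(a * values[i] for a, i in zip(coeffs, group)), const)


rng = random.Random(0)
witness = [rng.randint(-100, 100) for _ in range(10)]
closures = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]      # two closures of 5 variables
# Each hypothetical if statement pairs one predicate per closure.
program = [[make_predicate(rng, witness, g) for g in closures]
           for _ in range(10)]
feasible = all(holds(p, witness, g)
               for stmt in program for p, g in zip(stmt, closures))
```

Because no predicate mixes variables from different groups, the two groups remain separate closures, and the witness guarantees that at least one input drives the path through every true branch.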

Comparison between BFS-BB-HB and BFS-BB
We also carried out comparison experiments between the method proposed in this paper (BFS-BB-HB) and the method proposed in [13] (BFS-BB). The experiments were performed on 32-bit Windows 7 with a 3.00 GHz Pentium 4 and 4 GB of memory. Some CSP problems from http://www.csplib.org/Problems/ were used as the test beds. Each step backward in the search tree was counted as one backtracking operation.
The first selected CSP problem was the n-queens problem. For each test bed ($n = 4, 5, \dots, 9$), the experiments were conducted 100 times, and the average number of backtracking operations and the average time consumption were adopted for comparison. The results are shown in Tab. 3. They were in accordance with the distribution of the solutions to the n-queens problem for $n$ from 4 to 9, as shown in Tab. 4, where it can be seen that 6-queen has fewer solutions than 5-queen. Since 1-queen is trivial, and there are no solutions for 2-queen and 3-queen, our experiments started with 4-queen. In all the tests BFS-BB-HB performed better than BFS-BB, as shown in bold, and our hybrid backtracking strategy functioned well for the n-queens problem.

Two other CSP problems were also selected: magic squares and sequences ($n = 4$) and magic hexagon. The experiments were carried out 100 times for each test bed, and the average number of backtracking operations, the average number of constraint checks, and the average time consumption were recorded for comparison. Tab. 5 presents the experimental results, which show that BFS-BB-HB outperformed BFS-BB in all cases. BFS-BB-HB required 12 % and 42 % of the backtracking operations of BFS-BB, respectively, as shown in bold in column 4; 24 % and 48 % of the constraint checks of BFS-BB, respectively, as shown in bold in column 7; and 21 % and 47 % of the time of BFS-BB, respectively, as shown in bold in column 10.
In short, BFS-BB-HB performed well for the selected CSP problems, especially in the improvement of backtracking efficiency.
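The remark on the distribution of n-queens solutions can be checked with a short enumeration. The function below is a plain exhaustive counter written for illustration; it is unrelated to the solvers compared in the experiments.

```python
from typing import FrozenSet


def count_queens(n: int) -> int:
    """Count all placements of n non-attacking queens on an n x n board."""
    def place(row: int, cols: FrozenSet[int],
              diag1: FrozenSet[int], diag2: FrozenSet[int]) -> int:
        if row == n:
            return 1
        total = 0
        for c in range(n):
            # Skip squares attacked along a column or either diagonal.
            if c in cols or (row - c) in diag1 or (row + c) in diag2:
                continue
            total += place(row + 1, cols | {c},
                           diag1 | {row - c}, diag2 | {row + c})
        return total

    return place(0, frozenset(), frozenset(), frozenset())


counts = {n: count_queens(n) for n in range(1, 10)}
# counts == {1: 1, 2: 0, 3: 0, 4: 2, 5: 10, 6: 4, 7: 40, 8: 92, 9: 352}
```

The counts confirm that 2-queen and 3-queen have no solutions and that 6-queen, with 4 solutions, has fewer than 5-queen, with 10.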

Testing a project in engineering
In this part, we conducted experiments using BFS-BB-HB to test an engineering project. The experiments adopted statement coverage and were performed on 32-bit Windows 7 with a 2.80 GHz Intel Pentium (R) G640 and 2 GB of memory. The PUTs were from aa200c, mentioned in Section 1, and the selected ones contained diverse data structures and types. The results are shown in Tab. 6. The constraint check was elaborated in Section 3, and it was used to evaluate the forward checking of BFS-BB-HB. Two conclusions can be drawn from Tab. 6. First, BFS-BB-HB performed better when there were no pointers in the PUTs, so more effort is needed to strengthen its capability of handling pointers. Second, since the number of variables in the expressions influences the search efficiency [24], the same number of constraint checks may lead to different time consumption. Generally speaking, BFS-BB-HB ran within acceptable time, but it still needs optimization.

Conclusion and future work
The need to automate the testing procedure has become increasingly urgent in recent years, and as a fundamental issue in software testing, path-wise test data generation is of particular significance. We put forward an intelligent constraint solver and elaborated the look-ahead methods in our previous research. In this paper, we studied in detail the look-back methods, which take effect when a dead end occurs. The proposed backtracking method hybridized forward checking and conflict-directed backjumping, and the closures of variables were introduced into the process of detecting conflicts. Tests on some CSP problems and some engineering benchmarks demonstrated the effectiveness of the hybrid backtracking method and its applicability in engineering.
Our future research will continue to increase the backtracking efficiency and to enhance the effectiveness of the test data generation method. Parallel computing among different closures will also be explored.

Figure 1 The conflict detecting process of interval arithmetic

Figure 2 An example of CBJ

Definition 3. If two variables $x_i$ and $x_j$ have relation $r_h \in R$ ($h \in [1, k]$), then $x_i$ and $x_j$ are directly related variables to each other. The set of the directly related variables of $x_i$ is denoted as $S_{rel}(x_i)$. In terms of test data generation, the variables in the same predicate are regarded as directly related variables to each other.

Definition 4. The set of the variables that can be derived from any relation to $x_i$, both directly and indirectly, forms the closure of $x_i$, denoted as $C(x_i)$, which is calculated iteratively by Eq. (2).

Figure 3 The program test

Figure 4 The transformation of the states of variables in the search process

Figure 5 The relationship between average generation time and the number of closures

Figure 6 The relationship between average number of backtracking and the number of closures

Table 1 The symbolic variable and closure corresponding to each variable in program test

Table 2 Some notations used in BFS-BB-HB

Table 3 The comparison result of testing the n-queens problem using BFS-BB-HB and BFS-BB

Table 4 The number of solutions for n-queens problem when n varies from 1 to 9

Table 5 The result of testing two CSP problems

Table 6 The result of testing aa200c by BFS-BB-HB