LARGE SCALE SOFTWARE TEST DATA GENERATION BASED ON COLLECTIVE CONSTRAINT AND WEIGHTED COMBINATION METHOD

Original scientific paper

Software reliability testing tests software with the purpose of verifying whether the software achieves its reliability requirements and of evaluating the software reliability level. Statistics-based software reliability testing generally includes three parts: building the usage model, generating test data, and testing. The construction of the software usage model should reflect the user's real use as far as possible. A huge number of test cases are required to satisfy the probability distribution of the actual usage situation; otherwise, the reliability test loses its original meaning. In this paper, we first propose a new method of structuring the software usage model based on modules and a constraint-based heuristic method. We then propose a method of test data generation that considers the combination and weight of the input data, which reduces a large number of possible combinations of input variables to a few representative ones and improves the practicability of the testing method. To verify the effectiveness of the proposed method, four groups of experiments are organized. The goodness of fit index (GFI) shows that the proposed method is closer to the actual software use; we also found that the proposed method has better coverage by using Java Pathfinder to analyse the internal code coverage of the four sets.


1 Introduction
Reliability is a key indicator for the safe functioning of modern technological systems [1], such as air traffic control, railway transportation, and medical devices. Reliability is defined here as the probability of failure-free operation of a software system for a specified period in a specified environment [2, 3]. Software reliability testing (SRT) tests software with the purpose of verifying whether the software achieves its reliability requirements and of evaluating the software reliability level, based on the operational profile, which yields the failure data from which the reliability of a software product is estimated in quantifiable terms [4, 5]. A huge number of test cases with lengthy execution periods are currently required to satisfy the probability distribution of the actual usage situation. These test cases lead to a long execution cycle of the SRT, a primary reason for the difficulties in applying SRT widely in engineering today.
The most commonly used approach to SRT is statistical testing based on the usage model: building a software usage model and generating test cases from the operational profile [6]. The Markov model is the most widely used model, and the traditional test method generates a series of operation sequences through the Markov model [7]. The construction of the software usage model should reflect the user's real use as far as possible, and a huge number of test cases are required to satisfy the probability distribution of the actual usage situation; otherwise, the reliability test loses its original meaning. When using the Markov model to generate testing data, the most common method generates the data randomly [8]. However, this approach does not take the interaction among different operations into account, so it generates redundant test data; meanwhile, different data may have different priorities according to their importance and frequency of use. As a result, we need to assign a weight to each type of input data, where a higher weight means a higher priority. Because of these problems, in this paper we first propose a new method of structuring the software usage model based on modules and a heuristic method. We then propose a method of test data generation that considers the combination and weight of the input data, which reduces a large number of possible combinations of input variables to a few representative ones and improves the practicability of the testing method. To verify the efficiency of our method, we perform four experiments. We compare the goodness of fit index (GFI) of our method with that of other methods, and we analyse the code coverage ratio of our method using the tool Java Pathfinder. The results show that our method reduces the redundancy of test data and improves testing efficiency while guaranteeing the coverage ratio of the test data.
Our work focuses on improving practicability by reflecting the user's real use as far as possible in both the modelling stage and the data generation stage. The main contributions of our work are: (1) we propose a new method of structuring the software usage model based on modules and a heuristic method, which is more suitable for complex large-scale software systems; (2) we combine three test data generation techniques: partitioning, combination, and random generation. We first traverse all possible paths in the usage model and calculate the weight of each of them. We assume each type of input data is a discontinuous finite parameter (element) set and assign a weight to every parameter. We then generate test data in consideration of the combination and weight of the input data, which reduces a large number of possible combinations of inputs; (3) to verify the effectiveness of the proposed method, four groups of experiments are organized, and the GFI shows that the proposed method is closer to the actual software use; (4) we also found that the proposed method has better internal code coverage.

2 Usage models
2.1 Structuring the usage model
The Markov usage model (UM) can describe a software usage scenario easily; its definition can be found in [7, 8, 9]. Researchers have proposed many methods of structuring the usage model, which can be summarized as follows [9, 10, 11]: Musa's method, based on expert experience, historical data, and a complex software model. Musa's method is only a guiding thought for structuring the usage model and lacks a specific implementation. The method of [9] is quite simple and cannot fit complex software. The methods of [10, 11] cannot reflect the calling relations and constraint relations among modules; in addition, as the number of modules grows, the complexity of these methods increases sharply. To remedy these insufficiencies, this article proposes a method of structuring the software usage model based on modules and a heuristic method. We need the following steps to structure a usage model for complex software: (1) structure the usage model under system (UMS); (2) structure the usage model under module (UMM) for every module; (3) finally, obtain a UM by combining the UMS with the UMMs via the module-calling states. In this paper, the UMS is a triple ⟨S, E, M⟩ and can be presented as a directed graph. As shown in Figure 1, there is a set of points that can be expressed as a set of states S = {s1, s2, ⋯, sn}; M ⊆ S is the set of module-calling states, M = {m1, m2, ⋯, mk}; E is a set of edges and can be expressed as a set of operations, E = {e1, e2, ⋯, el}, E ⊆ S × S. We add a new kind of state called "module-calling" states. Take m2 in Figure 1 as an example; this state stands for the SUT invoking module 2 via its interface.
The module usage model UMM is also a triple ⟨S, E, M⟩ and can be presented as a directed graph. Unlike the UMS, a UMM can have multiple initial states, which are determined by the module's interfaces (as shown in Figure 3). As shown in Figure 2, s13 is an initial state corresponding to an interface of module m1, and s25 and s26 are the initial states of module m2. When the UMS is in a module-calling state, it searches for the corresponding UMM by the name of that state and then enters the UMM via the correct interface.
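To make this construction concrete, the sketch below shows one possible in-memory representation of a UMS/UMM and of the library that resolves module-calling states; all class and field names are our own illustrative assumptions, not part of the method's specification.

import java.util.*;

// A minimal sketch of the usage-model structures described above.
class UsageModel {
    enum Kind { NORMAL, MODULE_CALL }          // module-calling states trigger a sub-model

    static class State {
        final String name;
        final Kind kind;
        final String calledModule;             // set only for MODULE_CALL states
        State(String name, Kind kind, String calledModule) {
            this.name = name; this.kind = kind; this.calledModule = calledModule;
        }
    }

    final Map<String, State> states = new LinkedHashMap<>();
    final List<String> initialStates = new ArrayList<>();   // a UMM may have several
    // edges.get(u) maps a successor state to its operation transition probability
    final Map<String, Map<String, Double>> edges = new HashMap<>();

    void addState(State s) { states.put(s.name, s); }
    void addEdge(String from, String to, double p) {
        edges.computeIfAbsent(from, k -> new HashMap<>()).put(to, p);
    }
}

// Combining UMS and UMMs: when a walk reaches a MODULE_CALL state, it continues
// in the UMM looked up by module name, entering via the interface-determined
// initial state.
class ModelLibrary {
    final Map<String, UsageModel> modules = new HashMap<>();
}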

2.2 Transition probability
It is difficult to obtain the transition probabilities among states of a UM directly from historical data or experts; however, it is easy to obtain a constraint description (linear or non-linear) of each operation from an expert [11]. We then calculate the operation transition probabilities according to the constraints. Experts can offer constraints as follows. Certain constraints: such as P(e1) = 0.5. Linear constraints: generally, constraints in linear form, including equality and inequality relations such as 0.3 ≤ P(e1) ≤ 0.8, P(e1) = 2P(e2), P(e1) + P(e3) = 0.6, P(e3) < 3P(e1).
The operation probabilities that belong to the same state should satisfy: Σi P(ei) = 1, 0 ≤ P(ei) ≤ 1.
According to the principle of maximum entropy, the larger the entropy of a random variable is, the more objective what it reflects will be. We can use the principle of maximum entropy to calculate the operation probabilities of a UM: under the given constraints, when the information entropy of the state transitions of the UM is maximal, the UM is closest to the actual usage of the SUT, and at that moment the operation transition probabilities are the ideal values we want to calculate. After we obtain the constraints, we calculate the operation transition probabilities from them by converting the problem into the following optimization problem: maximize H(P) subject to the expert constraints and Σi pi = 1, 0 ≤ pi ≤ 1.
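Written out (our reconstruction, assuming the standard Shannon entropy over the outgoing transition probabilities), the optimization problem reads:

\max_{P}\; H(P) = -\sum_{i=1}^{n} p_i \log p_i
\quad \text{s.t.} \quad \sum_{i=1}^{n} p_i = 1,\; 0 \le p_i \le 1,
\ \text{plus the expert constraints, e.g. } 0.3 \le p_1 \le 0.8,\; p_1 = 2p_2 .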
We use a genetic algorithm to solve this optimization problem and thereby obtain the operation transition probabilities implied by a single expert's constraints.
The operation transition probabilities calculated in this way are collected from a single expert. To make the result more objective, we need to synthesize the opinions of multiple experts. We adopt a method based on KL divergence to synthesize operation transition probabilities rooted in multiple experts [12]. KL divergence is the relative entropy; it is an asymmetric measure between two random variables, and the smaller the KL divergence is, the smaller the discrepancy between the two variables. The operation transition probabilities of a UM form a discrete random variable. For two discrete random variables X = {x1, x2, …, xn} and Y = {y1, y2, …, yn}, the KL divergence from X to Y is D(X‖Y) = Σi xi log(xi/yi), and the symmetric divergence between X and Y is D(X, Y) = D(X‖Y) + D(Y‖X). So, we can adopt D(X, Y) to measure the discrepancy between X and Y.
According to the constraints offered by the experts, we calculate a group of operation transition probabilities Pk = {p1, p2, …, pn} for each expert k; then we search for a transition probability vector P whose total KL divergence to all the Pk is the smallest. Such a vector not only synthesizes the opinions of multiple experts but also has the least discrepancy from all of them. In this way, we convert the problem of synthesizing multiple groups of operation transition probabilities into the following optimization problem: minimize Σk D(P, Pk) subject to Σi pi = 1, 0 ≤ pi ≤ 1.
We again use a genetic algorithm to solve this optimization problem, and we obtain the synthesized operation transition probabilities as a significant property of the UM.
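A minimal sketch of such a genetic algorithm is given below, assuming a penalty formulation for the expert constraints; the operator choices (tournament selection, arithmetic crossover, Gaussian mutation, renormalization onto the probability simplex) and the example constraint are our own illustrative assumptions, not the paper's exact configuration. The same loop serves both problems in this section by swapping the fitness function: entropy maximization, or minimization of the summed KL divergence to the experts' probability groups.

import java.util.*;
import java.util.function.ToDoubleFunction;

class ProbabilityGA {
    static final Random RND = new Random(42);

    // Shannon entropy of a probability vector.
    static double entropy(double[] p) {
        double h = 0;
        for (double v : p) if (v > 0) h -= v * Math.log(v);
        return h;
    }

    // Penalty for violated linear constraints, e.g. |p1 - 2*p2| for p1 = 2p2.
    // Concrete constraints would come from the experts; this one is an example.
    static double penalty(double[] p) {
        return Math.abs(p[0] - 2 * p[1]);
    }

    // Project a candidate back onto the probability simplex.
    static double[] normalize(double[] p) {
        double s = 0;
        for (double v : p) s += Math.max(v, 0);
        s = Math.max(s, 1e-12);
        double[] q = new double[p.length];
        for (int i = 0; i < p.length; i++) q[i] = Math.max(p[i], 0) / s;
        return q;
    }

    static double[] randomPoint(int n) {
        double[] p = new double[n];
        for (int i = 0; i < n; i++) p[i] = RND.nextDouble();
        return normalize(p);
    }

    static double[] tournament(double[][] pop, ToDoubleFunction<double[]> fit) {
        double[] a = pop[RND.nextInt(pop.length)], b = pop[RND.nextInt(pop.length)];
        return fit.applyAsDouble(a) >= fit.applyAsDouble(b) ? a : b;
    }

    static double[] solve(int n, int popSize, int generations,
                          ToDoubleFunction<double[]> fitness) {
        double[][] pop = new double[popSize][];
        for (int i = 0; i < popSize; i++) pop[i] = randomPoint(n);
        for (int g = 0; g < generations; g++) {
            double[][] next = new double[popSize][];
            for (int i = 0; i < popSize; i++) {
                double[] a = tournament(pop, fitness), b = tournament(pop, fitness);
                double[] child = new double[n];
                double t = RND.nextDouble();                 // arithmetic crossover
                for (int j = 0; j < n; j++) {
                    child[j] = t * a[j] + (1 - t) * b[j];
                    if (RND.nextDouble() < 0.1)              // Gaussian mutation
                        child[j] += 0.05 * RND.nextGaussian();
                }
                next[i] = normalize(child);
            }
            pop = next;
        }
        return Arrays.stream(pop)
                     .max(Comparator.comparingDouble(fitness::applyAsDouble)).get();
    }

    public static void main(String[] args) {
        // Fitness: entropy minus a weighted constraint penalty.
        double[] best = solve(4, 100, 300, p -> entropy(p) - 10 * penalty(p));
        System.out.println(Arrays.toString(best));
    }
}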

2.3 Types of input data and the interaction among operations
We define Q = {q1, q2, ⋯, ql} as the set of all operation sequences (all paths) in the Markov model. If there are m operations in one operation sequence qi, we denote qi = ⟨f1, f2, ⋯, fm⟩ and get the finite set F = {f1, f2, ⋯, fm}, where fi denotes one operation in the Markov model. For operation fi there are a[i] types of input data, so we get the finite set Ti = {1, 2, ⋯, a[i]}, 1 ≤ i ≤ m. We define the n-gram testing sequence as follows: C = (cj,i)K×m is a K × m matrix, of which the i-th column represents the operation fi of the operation sequence, and all the elements in the i-th column are from the collection Ti. Given a positive integer N, if C can guarantee that any N adjacent columns (say columns i, ⋯, i + N − 1) satisfy the condition that every N-dimensional combination of elements of Ti, ⋯, Ti+N−1 appears at least once, then we call C an N-dimensional coverage array, CA(K, m, N) for short. Each row of C represents one test datum for the operation sequence; apparently, K is the number of test data for the operation sequence.
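As a small illustration (our own example, not the paper's case study): for m = 3 operations with a[1] = a[2] = a[3] = 2 and N = 2, the following 4 × 3 matrix is a CA(4, 3, 2), since every pair of values appears at least once in each pair of adjacent columns:

C = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{pmatrix}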
Adjacency matrix: we obtain the adjacency matrix G from the Markov model, where Gu,v represents the transition probability from state u to state v. If there is no edge from u to v, we set Gu,v = 0. Weight of an operation sequence: for an operation sequence qi, i ∈ [1, l], the weight w(qi) is the probability of the i-th operation sequence, i.e., the product of the transition probabilities along its path.
We assume each type of input data is a discontinuous finite parameter (element) set. We assign a weight h(vi,j) ∈ [0, 1] to every parameter vi,j (the j-th value of collection Ti). The value of h(vi,j) equals the probability of vi,j (the probability that the input data of the i-th operation belongs to the j-th type).
The weight of a parameter combination: when we make an N-dimensional adjacent combination of N parameters, the weight of the combination is the product of the weights of its parameters, ∏k h(vk).
In this paper, we use a directed graph to represent a usage model. Figure 4 shows all the states, from an initial state to an ending state, during the running process of the vehicle software in a hand-held terminal application developed in Java. The edges in the graph represent the operations of users, and each edge is marked with a transition probability, which represents the probability that users execute the corresponding operation. In the actual operation of software, as shown in Figure 5, each step of the operation may have various types of input data. For example, classified by data type, input data can be integer, character, Boolean, and so on; classified by effect, input data can be valid or invalid. Meanwhile, there are often interactions between different operations. Considering the diversity of input data and the interaction among operations, we need to generate test data for each operation sequence. Moreover, in the practical application of software, there are often strong interactions between adjacent operations, while interactions between non-adjacent operations are weak or even absent. As a result, this paper focuses on a test data generation method that counts interactions between adjacent operations [13].
3 The algorithm of N-dimensional adjacent combination

3.1 Outline of the method
This method generates testing data and calculates the weight of the testing data according to the Markov usage model. First, we get the adjacency matrix G of the Markov model. Then we use Algorithm 1 to traverse all paths in the Markov model and calculate the probability of every path; as a result, we get all the possible operation sequences and their weights. For each qi, we use Algorithm 2 to make an N-dimensional adjacent combination of the m operations in qi. Then we obtain the coverage array C of qi. We get the proportion of each test datum according to the formula:
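A plausible form, consistent with the worked example in Section 4 (a sequence of weight 0.09375 receives 9375 of the 100 000 test data), is

\mathrm{prop}(q_i) = \frac{w(q_i)}{\sum_{j=1}^{l} w(q_j)},

with the test data inside q_i apportioned according to the weights of their parameter combinations.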

3.2 Path generation algorithm
Algorithm 1 traverses every possible path in the usage model and calculates the weight of each path. The input parameters of the function getPaths() are the starting node number, the ending node number, the adjacency matrix G, the current path, and the current weight. For an arbitrary node v in the matrix, there are three kinds of situations: (1) there is no edge between the starting node and node v; then we skip node v and look for the next node.
(2) If node v is the ending node, we have found a path, and we record the path and its weight.
(3) If node v is not the ending node and there is an edge between the starting node and node v, then node v is one of the nodes on the path, and we recursively call getPaths() with node v as the starting node.
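A minimal Java sketch of Algorithm 1 follows. It assumes the weight of a path is the product of its edge probabilities (consistent with the worked example in Section 4), so the recursion is started with weight 1 rather than the default of 0 listed in the algorithm box.

import java.util.*;

// Sketch of Algorithm 1: depth-first enumeration of all paths from the
// initial to the final state, accumulating the product of edge probabilities.
class PathGeneration {
    final double[][] G;            // adjacency matrix of the Markov model
    final boolean[] hasFlag;       // visited flags, as in the algorithm box
    final List<List<Integer>> paths = new ArrayList<>();
    final List<Double> weights = new ArrayList<>();

    PathGeneration(double[][] G) { this.G = G; this.hasFlag = new boolean[G.length]; }

    void getPaths(int start, int end, Deque<Integer> path, double weight) {
        hasFlag[start] = true;                 // mark the current node as visited
        path.addLast(start);
        for (int v = 0; v < G.length; v++) {
            if (G[start][v] == 0) continue;    // case (1): no edge, skip node v
            if (v == end) {                    // case (2): a complete path found
                List<Integer> p = new ArrayList<>(path);
                p.add(end);
                paths.add(p);
                weights.add(weight * G[start][v]);
            } else if (!hasFlag[v]) {          // case (3): recurse with v as start
                getPaths(v, end, path, weight * G[start][v]);
            }
        }
        path.removeLast();
        hasFlag[start] = false;                // backtrack
    }
}

A caller would invoke new PathGeneration(G).getPaths(0, G.length - 1, new ArrayDeque<>(), 1.0) for a model whose initial state has index 0 and whose final state is the last index.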

3.3 N-dimensional coverage algorithm
The output coverage array C of Algorithm 2 is a matrix with K rows and m columns, where m is a known constant. First, we calculate the scale of matrix C, obtaining the parameter K from the sizes of the parameter sets (formula (14)). Then we compute the values of the first N columns by making the N-dimensional combination of the first N parameters; this settles the first rows of the matrix. Based on this, we extend the matrix to the right column by column. For example, after the first N columns are settled, we compute the values of the (N + 1)-th column. For parameter vN+1, the number of times that each (N − 1)-dimensional combination of v2, v3, ⋯, vN occurs should be a[N + 1], but the actual number of times each such combination occurs is a[1]. If a[1] is less than a[N + 1], we extend the rows containing the combinations of v2, v3, ⋯, vN to a[N + 1]. Then we traverse each (N − 1)-dimensional combination of v2, v3, ⋯, vN to fill in the values of the (N + 1)-th column. Similarly, we obtain the values of all columns. Each row of matrix C represents one of the test data. At last, we calculate the weight and the proportion of each row.
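As a hedged illustration of the idea (a greedy simplification of our own, not the exact column-extension procedure of Algorithm 2), the sketch below adds random rows until every value combination is covered in each window of N adjacent columns:

import java.util.*;

// Greedy sketch: build a matrix whose rows are value assignments for the m
// operations such that, in every window of N adjacent columns, each
// combination of values occurs at least once.
class AdjacentCoverage {
    static List<int[]> generate(int[] domainSizes, int N) {
        int m = domainSizes.length;
        // All (window, tuple) pairs still to cover.
        Set<String> uncovered = new HashSet<>();
        for (int w = 0; w + N <= m; w++)
            enumerate(domainSizes, w, N, new int[N], 0, uncovered);
        List<int[]> rows = new ArrayList<>();
        Random rnd = new Random(7);
        while (!uncovered.isEmpty()) {
            int[] row = new int[m];
            for (int i = 0; i < m; i++) row[i] = rnd.nextInt(domainSizes[i]) + 1;
            // Keep the row only if it covers at least one new combination.
            boolean useful = false;
            for (int w = 0; w + N <= m; w++)
                if (uncovered.remove(key(row, w, N))) useful = true;
            if (useful) rows.add(row);
        }
        return rows;
    }

    // Enumerate all value tuples for the window starting at column w.
    static void enumerate(int[] d, int w, int N, int[] t, int i, Set<String> out) {
        if (i == N) { out.add(w + ":" + Arrays.toString(t)); return; }
        for (int v = 1; v <= d[w + i]; v++) { t[i] = v; enumerate(d, w, N, t, i + 1, out); }
    }

    static String key(int[] row, int w, int N) {
        return w + ":" + Arrays.toString(Arrays.copyOfRange(row, w, w + N));
    }
}

For the earlier illustration, generate(new int[]{2, 2, 2}, 2) returns a small set of rows covering all value pairs in both adjacent-column windows.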

Testing data evaluating
Different methods have different influences on a test, but the key evaluation criterion is whether the test data reflect the actual operation accurately. Therefore, we use the GFI to evaluate a method. We define An as the number of times test case n occurs in actual use and En as the number of times test case n is assigned by some method. We let χ²max be the maximum value of χ², corresponding to the poorest correlation with reality, while χ²min denotes the ideal condition (the best correlation with reality).
It is not difficult to find from formula (15) that the value of χ²min should be 0, so we can modify the GFI formula into GFI = 1 − χ²/χ²max. It is evident that GFI is a constant between 0 and 1, and a method is closer to reality when its GFI is closer to 1.
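Spelled out, assuming the standard chi-square goodness-of-fit statistic for formula (15):

\chi^2 = \sum_{n} \frac{(A_n - E_n)^2}{E_n}, \qquad
\mathrm{GFI} = \frac{\chi^2_{\max} - \chi^2}{\chi^2_{\max} - \chi^2_{\min}} = 1 - \frac{\chi^2}{\chi^2_{\max}} .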

4 Experiment
To verify the validity of the method proposed in this paper, we conduct experiments from the perspectives of GFI and code coverage ratio.

4.1 Analysis of GFI
We design four groups of experiments to verify that the method proposed in this paper has a better GFI.
Method 1: We only use Algorithm 1 to traverse the Markov model and obtain all operation sequences. We generate testing data for each operation sequence randomly, without considering the diversity of input data or the interaction among operations.
Method 2: We use Algorithm 1 and Algorithm 2 to generate testing data for each operation sequence. This method considers the diversity of input data and the interaction among operations, but it ignores the weight of each operation sequence and assigns testing data to each operation sequence evenly.
Method 3: On the basis of Method 2, this method takes the weight of each operation sequence into consideration and assigns testing data to each operation sequence according to the weight.
Method 4 (corresponding to the actual situation in Table 1): We use the actual usage statistics of a vehicle as the control group.
We take Figure 2 as an example: one of the testing sequences has the weight 1 × 0.75 × 0.25 × 0.5 × 1 = 0.09375. If we need to generate 100 000 test data, this operation sequence will need 9375 of them.
We evaluate the methods by calculating the GFI of the three methods. For the operations in Figure 2, there are 192 possible combinations of testing data. According to our analysis, when considering the interaction among three adjacent operations, only 32 combinations of testing data remain; we get the number 32 from formula (14). This means that 160 of the 192 combinations are invalid. From Table 1 we can see that Method 1 assigns many testing data to the 160 invalid combinations, which gives Method 1 the lowest efficiency. Compared with Method 1, Method 2 takes the interaction among three adjacent operations into consideration and assigns all testing data to the 32 valid combinations; as a result, it is more efficient than Method 1. However, it also has a disadvantage: it assigns testing data evenly, which makes number 9 and number 19 receive the same number of testing data, quite different from reality according to group 4. Method 3 corrects this defect by considering the weight of each operation sequence: it assigns more testing data to operation sequences with higher weight, such as number 19, and fewer testing data to those with lower weight, such as number 9. This makes Method 3 closer to reality. Let χ1², χ2², and χ3² represent the χ² of Method 1, Method 2, and Method 3, respectively. The maximum value is χ1², so we set χ²max equal to χ1², and the GFI of each method is shown in Table 1. GFI1 = 0 shows that Method 1 is almost independent of reality because it ignores the interaction among operation sequences; GFI2 = 0.6018 shows that Method 2 can partly reflect reality; GFI3 = 0.965 shows that Method 3 is the closest to reality. Method 3 is the optimal method and has the highest practical value.

4.2 Analysis of code coverage
The coverage ratio is a very important index in software testing. There are many kinds of coverage ratios, such as method coverage, branch coverage, and condition coverage. In this part, we use the tool Java Pathfinder [14] to analyse the code coverage from the perspectives of method coverage ratio and branch coverage ratio.
Java Pathfinder is used to find defects in Java programs, so it also needs the properties to check as input. It returns a report saying whether the properties hold and which verification artefacts were created for further analysis. We therefore need to generate testing data ourselves as input to the program under test, and we run the program on Java Pathfinder to find defects. To evaluate our method, we generate testing data by our method (Method 3) and obtain a report on the method coverage ratio and the branch coverage ratio from Java Pathfinder. As shown in Table 2, the method coverage ratio reaches 91.3 %, while the branch coverage ratio reaches 87.9 %. Using Method 3, we guarantee both the method coverage ratio and the branch coverage ratio with less testing data.
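For reference, a minimal sketch of driving Java Pathfinder programmatically is shown below; it assumes jpf-core's embedded API and its bundled CoverageAnalyzer listener, and the target class name and coverage options are illustrative assumptions to be checked against the installed JPF version.

import gov.nasa.jpf.Config;
import gov.nasa.jpf.JPF;

// Sketch: run the system under test inside JPF with coverage reporting.
public class RunUnderJpf {
    public static void main(String[] args) {
        Config conf = JPF.createConfig(new String[] {
            "+target=VehicleTerminalMain",          // hypothetical SUT entry class
            "+classpath=build/classes",
            "+listener=gov.nasa.jpf.listener.CoverageAnalyzer",
            "+coverage.show_methods=true"           // per-method coverage in report
        });
        JPF jpf = new JPF(conf);
        jpf.run();                                  // explores states, reports defects
    }
}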

5 Related work
5.1 The construction of the software usage model
Random testing selects test data uniformly at random from the input domain of the program.When the random selection is based on some operational profile, it is sometimes called statistical or operational testing and can be used to make reliability estimates [4,15].
Statistical testing can support software testing and be used to assess software reliability. The construction of the software usage model should reflect the user's real use as far as possible; a huge number of test cases are required to satisfy the probability distribution of the actual usage situation, otherwise the reliability test loses its original meaning. Researchers have proposed many methods of structuring the usage model, which can be summarized as follows: Musa's method [9], methods based on expert experience [11], and methods based on historical data and complex software models [16]. Musa's method is only a guiding thought for structuring the usage model and lacks a specific implementation. The method of [9] is quite simple and cannot be applied to complex software. The method of [11] cannot reflect the calling relations and constraint relations among modules; in addition, as the number of modules grows, its complexity increases sharply.
Different from the above methods, we propose a new method of structuring the software usage model based on modules and a heuristic method. It is difficult to obtain the transition probabilities among states of a UM directly from historical data or experts; however, it is easy to obtain a constraint description (linear or non-linear) of each operation from an expert. The opinions of multiple experts should be synthesized to make the result more objective; we synthesize the transition probabilities among multiple experts based on KL divergence [12].

5.2 Test case generation
Statistical testing generates test data by sampling from a probability distribution defined over the software's input domain. The distribution is chosen carefully so that it satisfies an adequacy criterion based on the testing objective, typically expressed in terms of functional or structural properties of the software. A huge number of test cases are currently required to satisfy the probability distribution of the operational profile and the actual usage situation. These test cases lead to a long execution cycle of the SRT, the primary reason for the difficulties in applying SRT widely. How to choose a few test values from a large data space is therefore important for SRT.
The most common method generates testing data randomly [15]. However, this approach does not take the interaction among different operations into account, so it generates redundant test data. In partition testing, the basic idea is to split the data space into equivalence classes and choose one representative from each class, with the hope that the elements of a class are equivalent in terms of their ability to detect failures [4]. Pairwise and N-way coverage criteria [17] are popular forms of data coverage criteria. Combination strategies are test case selection methods that identify test cases by combining values of the different test object input parameters based on some combinatorial strategy [13]. In combinatorial testing, the issue is to reduce a large number of possible combinations of input variables to a few representative ones.
Boundary analysis and domain analysis are widely accepted as fault detection heuristics and can be used as coverage criteria for test generation [18, 19]. For ordered data types, the partitioning of a range of values into equivalence classes is usually complemented by picking extra tests from the boundaries of the intervals [20, 21].
Test data can also be generated by sampling from a probability distribution chosen so that each element of the software structure is exercised with high probability. However, deriving a suitable distribution is difficult for all but the simplest programs. The work in [22] demonstrates that automated search is a practical method of finding near-optimal probability distributions for real-world programs, and that test sets generated from these distributions show superior efficiency in detecting faults.
Considering the above issues, our work focuses on improving the method by considering the interaction among different operations and the weight of the operation data. We propose a test data generation method combining partitioning (weighted parameters), combination, and random generation. We first traverse all possible paths in the usage model and calculate the weight of each of them. We assume each type of input data can be denoted by a discontinuous finite parameter set and assign a weight to every parameter. Our method reduces the large number of possible input combinations. We also found that the method proposed in this paper achieves better coverage, using Java Pathfinder to analyse the internal code coverage of the four sets.

5.3 Model-based testing
At present, there are many model-based testing tools for reliability estimation. JUMBL [23] is an academic model-based statistical testing tool; test inputs are generated by traversing the usage model while respecting the transition probabilities, and the test cases with the greatest probability are generated first. When using the Markov model to generate testing data, the most common method generates the data randomly [24, 25, 26]. However, those approaches consider neither the interaction among different operations nor the weight of the testing data.
In W. Zheng's work [27], the shortest search paths are found and redundant test sequences are reduced based on the natural law of ants foraging. Different from that work, we consider not only the test sequence problem but also the test data problem, and we give a modelling method for large-scale software. As described in this paper, Algorithm 1 traverses every possible path in the usage model and calculates the weight of each path.
In L. Fernandez-Sanz's work [28], the specifications are considered the logical starting point to generate a set of test cases that covers most of the functional testing needed to validate a software product. The research in [28] is a typical method for automatically generating a complete set of functional test cases from UML activity diagrams, together with a prioritization according to software risk information. Different from Fernandez-Sanz's work, our research focuses on software reliability testing: verifying whether the software achieves its reliability requirements and evaluating the software reliability level. We try to generate test cases that satisfy the probability distribution of the actual usage situation.

6 Conclusion and future work
In this paper, from the perspective of engineering application, we focus on improving practicability by reflecting the user's real use as far as possible in both the modelling stage and the data generation stage. We first propose a new method of structuring the software UM and calculating its transition probabilities based on modules and a heuristic method. In the data generation stage, we propose a method that takes the interaction among operations and the weight of operation sequences into account. However, our method also has disadvantages. It performs well when applied to software in which interaction exists only between adjacent operations, but its performance degrades when interaction is not limited to adjacent operations. This limits the applicability of our method, and we will improve it in future work.

Figure 1 A usage model under system

Figure 2 The usage model of module m1
Figure 3

Figure 4 Markov usage model of vehicle software
Figure 5

Algorithm 1
Path Generation
Input: the serial number of the initial state: start; the serial number of the final state: end; adjacency matrix G = {the adjacency matrix of the Markov model}; the initial path; the initial weight (default 0).
Output: K = {all paths from the initial state to the final state}; Weight = {weights of all operation sequences}
getPaths(start, end, G, path, weight)
  hasFlag[start] = true; // mark the current node as visited

Table 1
The GFI of three methods

Table 2
The coverage ratio of code