Estimation of Random Accuracy and its Use in Validation of Predictive Quality of Classification Models within Predictive Challenges

Abstract: Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating and calculating the accuracy of predictive model properties are analysed. Here we discuss two such cases that can often occur in the application of the model in predicting properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systemic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured value of the target variable, whose values are not known in advance. In these conditions, the correlation coefficient can be an overoptimistic measure of agreement of predicted values with the corresponding experimental values and can lead to a highly optimistic conclusion about the predictive ability of the model. Due to these shortcomings of the correlation coefficient, the use of standard error (root-mean-square error) of prediction is suggested as a better quality measure of predictive capabilities of a model. In the case of classification models, the use of the difference between the real accuracy and the most probable random accuracy of the model shows very good characteristics in ranking different models according to predictive quality, having at the same time an obvious interpretation.


INTRODUCTION
Computer modelling has been used intensively in many areas, including chemistry and life science. As a rule, the main goal of the modelling is to extract useful information or regularities in the form of functional relationship(s) between subsets of data, between one variable and a subset of variables, or just between pairs of variables. In chemistry and life science, the modelling of relationships between a property $y_i$ (i = 1, …, N) of N compounds represented by a set of M structure-based descriptors $x_{ij}$ is very often used in the analysis of different problems. Models of that kind are known under the acronym QSPR/QSAR (quantitative structure-property/activity relationship). [1] For validation of model quality, from the beginnings of modern QSPR models the correlation coefficient R and root-mean-square error S have usually been calculated from experimental values $y_i$ and values $\hat{y}_i$ obtained in fitting. [1] Also, the quality of correlation between topological (graph-theoretical) descriptors and the most important physico-chemical properties of a molecule [2] was analysed in this journal, and the ranges of correlation coefficients of QSPR models on a set of octane isomers were defined. [3] The quality of the model is first validated by internal validation procedures on the training set in the fit, or in the Leave-One-Out (LOO) or Leave-k-out (LkO) cross-validation (CV) procedures. [4] Within the CV procedure, k (LkO) compounds are omitted in each step from the total number of N compounds in the training set, and properties of the omitted compounds are estimated by the developed model. Such a procedure is repeated until the property value of each compound from the training set has been estimated just once. Later, it was recommended to define the eligibility of QSPR models. The main recommendation was based on the square of the correlation coefficient achieved in the LOO CV process, which must be higher than 0.5 for a high-quality (predictive) model.
[5] Although it was introduced in Ref. [5] after the analysis of only one type of model obtained by the k-nearest neighbour algorithm, and although that recommendation was not derived from a strict mathematical analysis but rather from simulations on selected data sets, it is often used and quoted in the scientific literature. However, the importance of testing a model's quality on an external data set has been pointed out by Tropsha et al. [6] Probably, the main reason why such a practice had not been introduced earlier was the lack of large enough sets of data at that time. Even earlier, other authors had noticed that CV is not always reliable in estimating the model's quality in predicting properties for new chemical compounds (i.e. for new, never seen examples, an external data set) that were not used in model training and optimisation (e.g. Refs. [7-10]). Thus, validations on external sets have been used in a comparative study between multivariate and neural network structure-property models, [7] in development of models for modelling viscosities of 361 organic compounds (240 in the training and 121 in the test set), [8] and in modelling secondary structure contents (alpha, beta and irregular) in a set of 475 soluble proteins (317 in the training and 158 in the test set). [9] To estimate or measure the quality of a model, in addition to the statistical parameters calculated in the fit and CV procedure, i.e. the correlation coefficient (R and Rcv) or the standard error of estimate (S and Scv) calculated between the estimated and experimental property values, corresponding parameters are introduced and calculated between experimental and predicted values on the test set (Rpred and Spred). Many other parameters have also been calculated and used for model validation in the field of QSAR/QSPR modelling. The parameters listed here are just the basic set calculated and used for estimation of the quality of (almost) all QSPR models.
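The leave-one-out procedure and the pair of statistics (S, Scv) described above can be sketched in a few lines. This is only an illustration under stated assumptions: a one-descriptor least-squares line stands in for a QSPR model, the root-mean-square error is taken with N in the denominator, and the values of `x` and `y` as well as all function names are hypothetical, not data or code from this study.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x for a single descriptor."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def rmse(y_exp, y_est):
    """Root-mean-square error between experimental and estimated values."""
    n = len(y_exp)
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(y_exp, y_est)) / n)

def loo_estimates(x, y):
    """Each compound is estimated once by a model fitted without it."""
    estimates = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        a, b = fit_line(x_train, y_train)
        estimates.append(a + b * x[i])
    return estimates

# Hypothetical descriptor and property values (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]

a, b = fit_line(x, y)
y_fit = [a + b * xi for xi in x]
y_cv = loo_estimates(x, y)

s_fit = rmse(y, y_fit)
s_cv = rmse(y, y_cv)
print(f"S(fit) = {s_fit:.4f}, S(LOO CV) = {s_cv:.4f}")
```

For a least-squares linear model the LOO error is never smaller than the fit error, which is why Scv is read as the more conservative of the two internal estimates.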
The aforementioned QSPR/QSAR methodology and model validation parameters and procedures were accepted for regulatory purposes for the evaluation and prediction of molecular properties by OECD countries. [10] Later, these recommendations were included in REACH, the EU document regulating the Registration, Evaluation, Authorisation and Restriction of chemical compounds in the European Union. [11,12] Today, the field of modelling in chemistry and biology is evolving due to the increasing amount of data for many compounds/structures and their measured properties, activities and interactions with other small molecules or macromolecules. These studies are very interesting and important to the wider community because they are linked to drug design and environmental research, two issues of global importance. Due to the rapid growth of the field and the availability of large databases, researchers from different areas such as scientific computing, machine learning or deep learning are also intensively involved in modelling. In order to accelerate the exchange of ideas and to estimate/summarise the predictive potential of available (constantly evolving) computer algorithms and procedures in modelling of different chemical and biological problems, predictive modelling challenges have been organised in predicting molecular properties and activities for a new (external) set of compounds (cases, instances). One such consortium (DREAM [13]) has long experience in the organisation of different challenges since 2006 and is an interdisciplinary team composed of researchers from biotechnological, pharmaceutical and technological companies.
[13,14] During challenges related to chemistry or biology (drug design), data sets containing both structural descriptors (attributes) and experimental activities for training and model development are first given to all participant groups. An additional test set is also given to competitors without experimental activities, which should be predicted by the developed model. Model quality is evaluated by an independent team of scientists, according to pre-defined statistical parameters. [17] In the case of classification problems with two classes A and B, F1score was used as the main statistical parameter for ranking the quality of models.
Also, we will analyse the suitability of the correlation coefficient for estimating the quality of models on an external (test) set by analysing the predictive potential of structure-solubility models on a test set containing 258 organic compounds. All these models were developed on a training data set of 1039 organic compounds. [18] Furthermore, the insensitivity of the correlation coefficient to a constant shift of predictions will be analysed on the training set in cross-validation and on the test set in prediction. Such a characteristic disqualifies the correlation coefficient from use in ranking modelling methods according to their predictive accuracy on an external test set. In prediction of classification properties (e.g. [22] It turns out that the parameter ΔQ2 has very good properties in estimating the quality of predictions done by different models. Namely, ΔQ2 ranks as better those models having a higher number of correct predictions of both classes (designated as 1 and 0) and, at the same time, more balanced total values of errors (under- and over-prediction, which are the two types of errors defined in analyses of accuracy of classification models, meaning the total number of cases when class 1 is predicted as 0, and the total number of cases when class 0 is predicted as class 1, respectively [19]).

METHODS
Mathematical equations for calculation of the statistical parameters whose characteristics and suitability for estimating predictive accuracy of models on external (test) sets are analysed in this study are given here. Additionally, the data sets used in simulations and in the comparative analysis are described.

Estimating the Accuracy of Prediction of Continuous Properties
For estimating the quality of models for prediction of continuous properties in predictive challenges, [15-17] the well-known Pearson's correlation coefficient R was used as the main parameter. It is calculated by Eq. (1):
$$R = \frac{\sum_{i=1}^{N}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2\,\sum_{i=1}^{N}(\hat{y}_i-\bar{\hat{y}})^2}} \tag{1}$$
Here, $y_i$ and $\hat{y}_i$ are experimental property values and values predicted by the model, respectively, $\bar{y}$ is the mean of experimental property values and, finally, $\bar{\hat{y}}$ is the mean of property values predicted by the model in prediction on an external test set.
Another parameter that has been regularly used for estimating the predictive accuracy of models is the standard error (root-mean-square error) of prediction, calculated by Eq. (2):
$$S = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2} \tag{2}$$
where $y_i$ and $\hat{y}_i$ are described below Eq. (1), and N is the total number of molecules (cases) in the test set. These parameters can also be calculated between experimental values and those estimated by the model on the training set in the fitting and CV procedures.
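A short sketch may help to see how the two parameters react to a constant shift of the predictions. It assumes that Eq. (1) is the usual Pearson product-moment coefficient and that Eq. (2) is the root-mean-square error with N in the denominator; the numerical values and the function names `pearson_r` and `rmse` are hypothetical illustrations.

```python
import math

def pearson_r(y_exp, y_pred):
    """Pearson correlation coefficient between experimental and predicted values."""
    n = len(y_exp)
    m_exp = sum(y_exp) / n
    m_pred = sum(y_pred) / n
    sxy = sum((a - m_exp) * (b - m_pred) for a, b in zip(y_exp, y_pred))
    sxx = sum((a - m_exp) ** 2 for a in y_exp)
    syy = sum((b - m_pred) ** 2 for b in y_pred)
    return sxy / math.sqrt(sxx * syy)

def rmse(y_exp, y_pred):
    """Standard error (root-mean-square error) of prediction."""
    n = len(y_exp)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(y_exp, y_pred)) / n)

# Hypothetical experimental and predicted values (illustration only).
y_exp = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.2, 1.9, 3.1, 3.8, 5.0]
# The same predictions with a constant over-prediction of 2 units.
y_shifted = [v + 2.0 for v in y_pred]

r_plain, r_shifted = pearson_r(y_exp, y_pred), pearson_r(y_exp, y_shifted)
s_plain, s_shifted = rmse(y_exp, y_pred), rmse(y_exp, y_shifted)
print(f"R: {r_plain:.4f} -> {r_shifted:.4f}")
print(f"S: {s_plain:.4f} -> {s_shifted:.4f}")
```

In this toy case R is unchanged by the shift, while S grows from about 0.14 to about 2.0, so only S registers the systematic error.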

Estimation of Accuracy of Two-State Classification Models
The F1score (Eq. (3)) has been used in predictive challenges [20-22] for estimation of predictive model accuracy and for ranking classification models according to its values. A higher value of F1score means that the model is more accurate in prediction.
$$F1_{\mathrm{score}} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{3}$$
This parameter is defined as the harmonic mean of precision and recall, and is primarily used for estimating the quality of models developed on (highly) imbalanced data sets. In Eq. (3), precision is defined by Eq. (4) and recall by Eq. (5):
$$\mathrm{precision} = \frac{p}{p+o} \tag{4}$$
$$\mathrm{recall} = \frac{p}{p+u} \tag{5}$$
where p = TP (true positive) is the total number of positive correct predictions of class A (observed class A is correctly predicted by the model to be class A), u = FN (false negative) is under-prediction of class A (experimental class A predicted to be class B) and o = FP (false positive) is over-prediction of class A (experimental class B predicted to be class A). By putting Eqs. (4) and (5) into Eq. (3) and after some simplification, F1score can be expressed using only p, u and o (Eq. (6)):
$$F1_{\mathrm{score}} = \frac{2p}{2p+u+o} \tag{6}$$
It is useful to compare this equation for F1score with a well-known and often used parameter named accuracy, classification accuracy, [19] or the percentage of all correct predictions [23] (Eq. (7)):
$$Q_2 = \frac{p+n}{p+n+u+o} \tag{7}$$
where n = TN (true negative) is the total number of negative-class cases correctly predicted as negative, and p, u and o have the same meaning as in Eqs. (4-6). These numbers (p, n, u, o) are elements of the contingency table [19] (also called the confusion matrix) that is defined for each classification problem.
Parameter Q2 has been used for balanced classification problems, i.e. those having a similar number of positive and negative cases. [17] It is interesting to note that one can obtain Eq. (6) from Eq. (7) just by putting n = p in Eq. (7). However, because F1score is primarily used on imbalanced data sets (where p << n, but not p = n), such an analogy made by this 'arbitrary' substitution indicates that the comparison of these two parameters is very complicated (or even impossible). Though the interpretation of Q2 is very simple, the interpretation of F1score (Eq. (6)) is neither straightforward nor comparable with the interpretation of Q2. According to Eq. (6), F1score does not depend explicitly on n: for a fixed set of p, u, and o values, F1score will have the same value for any n. This is a weakness of F1score, indicating its insensitivity to the size of the data set. If we re-write the denominator in Eq. (6) taking into account that N = p + n + o + u, and hence p + o + u = N - n (N is the total number of cases), then we get the following:
$$F1_{\mathrm{score}} = \frac{2p}{N+p-n} \tag{8}$$
Thus, Eqs. (7) and (8) show that both Q2 and F1score can be calculated from the same (three) numbers: p, n, and N.
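The relations among Eqs. (6)-(8) can be checked numerically. The sketch below assumes the standard forms F1score = 2p/(2p + u + o), Q2 = (p + n)/N and the rewritten form 2p/(N + p - n); the counts are hypothetical and the function names are introduced here only for illustration.

```python
def f1_score(p, u, o):
    """F1score from true positives, under-predictions and over-predictions."""
    return 2 * p / (2 * p + u + o)

def f1_from_totals(p, n, total):
    """The same F1score written with p, n and the total number of cases N."""
    return 2 * p / (total + p - n)

def q2(p, n, u, o):
    """Classification accuracy: the share of all correct predictions."""
    return (p + n) / (p + n + u + o)

# Hypothetical counts for the positive class and the two error types.
p, u, o = 40, 10, 20

# F1score stays fixed while the number of true negatives grows; Q2 does not.
for n in (50, 500, 5000):
    total = p + n + u + o
    print(n, round(f1_score(p, u, o), 4),
          round(f1_from_totals(p, n, total), 4),
          round(q2(p, n, u, o), 4))

# Substituting n = p into the accuracy formula reproduces the F1score value.
q2_with_n_equal_p = q2(p, p, u, o)
```

The loop prints the same F1score (about 0.727) in every row, while Q2 rises towards 1 as true negatives are added.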
In the case of binary (two-class) classification models, the correlation coefficient between the observed and predicted variables (Eq. (1)) can be expressed by the elements of the contingency table as:
$$M_{cc} = \frac{pn-uo}{\sqrt{(p+u)(p+o)(n+u)(n+o)}} \tag{9}$$
This well-known form of the correlation coefficient for estimating the correlation between the observed and the estimated (predicted) two-class target variable is named the Matthews correlation coefficient (Mcc). [24] Although it also has some limitations (for example, its value cannot be calculated if only one class is predicted or estimated), [25] Mcc is very often used for estimating the accuracy of the model, and as the quality parameter for ranking different models developed on the same data set. The equation for the standard error of estimation/prediction (Eq. (2)) of a binary classification model can also be expressed by the values of over- and under-estimation/prediction, o and u (Eq. (10)):
$$S = \sqrt{\frac{u+o}{N}} \tag{10}$$
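As a check on the statement that Mcc is Pearson's coefficient applied to a two-class variable, the sketch below expands a contingency table into 0/1 vectors and compares both routes. It assumes the usual contingency-table form of Mcc and that the binary standard error reduces to sqrt((u + o)/N); the counts and function names are hypothetical.

```python
import math

def mcc(p, n, u, o):
    """Matthews correlation coefficient; None when it is not defined."""
    denom = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    if denom == 0:
        return None  # e.g. only one class was predicted
    return (p * n - u * o) / denom

def binary_standard_error(p, n, u, o):
    """Root-mean-square error of 0/1 predictions: each error contributes 1."""
    return math.sqrt((u + o) / (p + n + u + o))

def pearson_r(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    k = len(a)
    ma, mb = sum(a) / k, sum(b) / k
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)

# Hypothetical contingency table (illustration only).
p, n, u, o = 40, 30, 10, 20

# Expand the table into paired observed/predicted class labels.
observed = [1] * p + [1] * u + [0] * o + [0] * n
predicted = [1] * p + [0] * u + [1] * o + [0] * n

mcc_table = mcc(p, n, u, o)
mcc_vectors = pearson_r(observed, predicted)
s_table = binary_standard_error(p, n, u, o)
s_vectors = math.sqrt(
    sum((b - a) ** 2 for a, b in zip(observed, predicted)) / len(observed)
)
print(f"Mcc = {mcc_table:.4f} (table), {mcc_vectors:.4f} (vectors)")
print(f"S   = {s_table:.4f} (table), {s_vectors:.4f} (vectors)")
```

Returning `None` for a zero denominator is a design choice of this sketch; it makes the undefined case explicit instead of raising an exception.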

Estimating the Random Correlation of a Two-State Classification Model
This value of Q2,rnd is always between the values corresponding to minimal accuracy, i.e. maximal disagreement, and maximal accuracy, i.e. maximal agreement. The maximal range of Q2,rnd values is between 0 and 1. Both Q2 and Q2,rnd can be expressed in percentages. Additionally, the difference (in %) between the real model accuracy Q2 obtained by a model and the corresponding most probable random accuracy Q2,rnd (Eq. (11)) can be simply calculated (Eq. (12)):
$$\Delta Q_2 = Q_2 - Q_{2,\mathrm{rnd}} \tag{12}$$
This value reaches its maximum of (ΔQ2)max = 50 % when two conditions hold together: (1) totally equal numbers of elements in both classes (50 : 50 %) in the data set, and (2) perfect model estimation or prediction (u = o = 0). Thus, ΔQ2 can be considered as a measure of the contribution of a model to the real accuracy of estimation or prediction over the most probable random accuracy level. In the analysis of the mutual quality of different classification models, the ΔQ2 parameter can serve for ranking the models. A higher value of ΔQ2 means that the model contributes a larger amount of useful information over the level of random accuracy, which is a clear interpretation.
DOI: 10.5562/cca3551, Croat. Chem. Acta 2019, 92 (3)
Normally, an appropriately optimised model (named balanced model in Ref. [19]) estimates (or predicts) the same numbers of states/classes as in the experimental structure (i.e. p + u = p + o and n + o = n + u). Then Q2,rnd from Eq. (11) becomes:
$$Q_{2,\mathrm{rnd}} = \frac{(p+u)^2+(n+o)^2}{N^2} \tag{13}$$
Equation (13) enables the estimation of the most probable random accuracy for a balanced model and, in that case, Q2,rnd can be calculated using only the experimental number of cases in the first class (p + u), because, for the second class, we have n + o = N - (p + u). Whenever the value of Q2,rnd calculated by Eq. (11) is different (i.e. higher) from the one calculated by Eq. (13), it is an indication of a deficiency in the model training process.
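The quantities of this section can be sketched as follows. Since Eq. (11) is not reproduced in the surrounding text, the code assumes the usual chance-agreement form, Q2,rnd = [(p + u)(p + o) + (n + o)(n + u)] / N^2, which agrees with the stated maximum (ΔQ2)max = 50 % and with the balanced-model condition above; this form, the counts and the function names are assumptions made for illustration.

```python
def q2(p, n, u, o):
    """Real accuracy: the share of all correct predictions."""
    return (p + n) / (p + n + u + o)

def q2_rnd(p, n, u, o):
    """Most probable random accuracy (assumed chance-agreement form)."""
    total = p + n + u + o
    return ((p + u) * (p + o) + (n + o) * (n + u)) / total ** 2

def q2_rnd_balanced(first_class, total):
    """Balanced-model special case, using only the observed first-class count."""
    return (first_class ** 2 + (total - first_class) ** 2) / total ** 2

def delta_q2_percent(p, n, u, o):
    """Accuracy gained over the random level, in percent."""
    return 100.0 * (q2(p, n, u, o) - q2_rnd(p, n, u, o))

# Perfect prediction on a 50 : 50 data set gives the maximal value.
best = delta_q2_percent(50, 50, 0, 0)

# Hypothetical balanced model (u == o): both random-accuracy forms agree.
p, n, u, o = 40, 40, 10, 10
general = q2_rnd(p, n, u, o)
balanced = q2_rnd_balanced(p + u, p + n + u + o)

# Predicting only the majority class adds nothing over the random level.
majority_only = delta_q2_percent(0, 9995, 5, 0)

print(best, general, balanced, majority_only)
```

The last case shows why ΔQ2 is informative on imbalanced data: an accuracy of 99.95 % obtained by always predicting the majority class corresponds to ΔQ2 = 0.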

Data Sets
Analysis of modelling and prediction of properties having continuous values was performed on aqueous solubility data of 1297 organic compounds. [26] The solubility data set is composed of two aqueous solubility databases, AQUASOL [27] and PHYSPROP, [28] and it was partitioned as was done by Liu and So. [18] Namely, the training set contains 1039 compounds, and the test set used for estimation of external prediction has the remaining 258 compounds (Table S1). The set of 123 descriptors used in the last stage for selection of the best models is also given in Table S1, and the analysis of statistical parameters in Table 1. More details on the developed models are given in Tables S2 and S3. The weakness of the correlation coefficient connected to the distribution of data is illustrated on a simulated data set having three pairs of variables, in each of which the first represents the experimental and the second the predicted variable (Table 2). For the analysis related to modelling of classification variables, we used three data sets dealing with a two-class problem. The first one is the data set used in the final phase of the Tumour prediction challenge [22] organised to develop algorithms and models for detection of somatic mutations from cancer genome sequences in order to understand the genetic basis of disease progression. This data set is taken from Cooper et al. (Additional file 9, Table S8 in Ref. [20]). It contains prediction results for 70 models in the final phase of the predictive challenge (IS3). Among them, prediction performances of the 15 top-scoring models are given in Table 3, and details of the remaining 55 models are included in Table S4. The data set contains 24687 prediction cases, among which 7903 (32 %) are of the positive and the remaining 16784 (68 %) of the negative class. Because this data set is imbalanced, F1score was used as the main scoring (method ranking) criterion.
The second classification data set contains six special cases of contingency tables of extremely imbalanced class distribution (5 : 9995) from the critical overview of evaluation metrics applicable for analysis of classification problems. [29] Additional data sets are composed of: (1) analogy with examples from literature [29-31] mentioned above in (1) and (2). To calculate and compare different validation parameters, one needs only the 2 × 2 contingency table for each model estimate/prediction, containing the p, n, o and u values. Some of the data sets are artificial ones containing p, n, u and o values selected in a specific way in order to check the values of the corresponding validation parameter in those special cases. The largest set of contingency tables is from modelling done within the final phase of the Tumour prediction challenge, [20,22] where many groups developed different classification models; however, the models are not described in sufficient detail. The comparative analysis of the usefulness and informativeness of the validation parameters presented in this paper on classification problems is completely independent of the algorithms/methods used for model development. For more details on computational methods used in modelling classification problems, interested readers can consult recent literature. [32]

RESULTS AND DISCUSSION
By using four data sets we will illustrate some important problems which can arise when commonly used statistical parameters like Pearson's correlation coefficient (R) are applied to the validation of prediction accuracy of continuous and two-class problems. Moreover, the limitations of two-class accuracy measures like Q2, F1score or Mcc (given by Eqs. (7-9), respectively), and the advantage of using the novel parameter ΔQ2 given by Eq. (12) in ranking models will be analysed.
We want to point out here the distinction between internal (fit and LOO CV) and external validation procedures.Consequently, we want to delineate these two procedures by using different more precise terms, in order to avoid possible confusion in tracking results, as well as to help in understanding the main matter and points of the presented research.Thus, the term 'estimate' in this study corresponds to internal validation procedure, i.e. to the calculation of property/activity values by a model on the training data set, which is used for model development and optimisation.However, the term 'prediction' is used in this study in case of pure prediction, i.e. when the model is used for calculation of property/activity values on an external data set, which is not used for model development and optimisation.

Problems in the Evaluation of Prediction Accuracy of Continuous Properties Using the Correlation Coefficient
Using the algorithm for selection of the best subset of descriptors into the Multivariate Linear Regression (MLR) models, [33] we selected the best QSPR models for modelling water solubility of organic compounds. The developed models given in Table 1 are based on molecular descriptors selected from 123 descriptors (which are pre-selected from the initial pool containing more than one thousand descriptors) calculated by the Dragon 5.4 program. [34] For a selected set of descriptors, model parameters are optimised using the MLR methodology by the least-squares fitting procedure, which ensures that the developed model will have the lowest standard error of estimate in fitting among all other possible linear models (which could be obtained by the application of other fitting procedures).
To save space, we give in the supplementary Table S1 of the manuscript details on the models from Table 1, like the model equation or details on molecular descriptors involved, because it is not the main subject of this study.The main statistical parameters (correlation coefficients and standard errors of estimate) for the best selected QSPR models are given in Table 1.These models are developed in fitting and internally validated by LOO CV procedure on 1039 compounds from the training, and they are also externally validated on 258 compounds from the test set.
The correlation coefficient given by Eq. (1) can be considered as a measure of linear agreement between two sets of paired data (two variables), and is not sensitive to a constant shift of the values of the variables considered. In modelling, values of the experimental variable $y_i$ are fixed, and only estimated/predicted values $\hat{y}_i$ can have a constant shift. Regularly, experimental and the corresponding values estimated by the fit and CV procedure on the training set can be in a stronger linear relationship because model parameters were optimised on these experimental data. However, experimental data whose numerical values are unknown to the modeller(s) and the corresponding values from the test set predicted by the model do not have to be in a significant linear relationship. If so, Pearson's correlation coefficient given by Eq. (1) is not a good or acceptable measure of agreement between experimental and predicted values.

Table 1 footnote: (a) N is the number of compounds in the data set; I is the number of descriptors in the model; S' is the standard error of estimate calculated (for each model) as the mean deviation of each error from the mean error value (Eq. (14)); S is the (normal) standard error of estimate calculated by Eq. (2); R is the correlation coefficient (Eq. (1)); c is the constant shift defined by Eq. (15). To be able to notice the variation of the corresponding statistical parameters, their values are given (as a rule) to the last two digits that differ.
The standard error of estimate or prediction in Eq. (2) is calculated using the difference (error, deviation) of each estimated/predicted value $\hat{y}_i$ from the corresponding experimental value $y_i$. However, if predicted values have some constant shift (c) in one or another direction, we can calculate a modified version of the standard error of estimate (S') by Eq. (14):
$$S' = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i-c)^2} \tag{14}$$
The constant shift c for each data set (in fitting and LOO CV on the training set, and in prediction on the test set) is simply obtained as the mean of all differences (errors) by Eq. (15):
$$c = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i) \tag{15}$$
Parameter S' (≤ S) is calculated to see what the value of the standard error would be in the case of an ideally optimised model, which introduces in prediction neither constant over-estimation (c > 0) nor under-estimation (c < 0). Moreover, by considering the differences S - S' one can see that they are larger by a factor of 10^3 going from the fit to LOO CV, and by a factor of 10^4 going from LOO CV to prediction (Table 1). The linear models having one to five descriptors from Table 1 were developed on a large training set of 1039 compounds and have (only) two to six optimised parameters, i.e., 1037 to 1033 degrees of freedom. Thus, these models are far from the over-fitting regime. Even with such models, we notice an increase in constant shifts (though they are very small) going from the fit to LOO CV on the training set, and to prediction on the test set of 258 molecules. Comparing the constant shift values of each model with the same number of descriptors, the increase from the fit procedure to LOO CV is by a factor of 10, and the increase from LOO CV to prediction is by a factor of 100. Since this can happen with such simple models with a very small number of optimised parameters, we can assume that a constant shift will also be present (being much larger) in the case of more complex and nonlinear models. Evidently, we cannot eliminate it, and hence the largest constant shift is in prediction, being more than 1000 times larger than in the fitting estimate. It is worth mentioning here that the LOO CV procedure done for the models from Table 1 is just a stability test performed to see the difference between the fit statistical parameters and the corresponding ones obtained by the LOO CV procedure. Namely, the model parameters are neither selected nor optimised by the use of LOO CV. This is not the case with robust methods based on machine learning, which are the prevalent methods used in prediction challenges, and which are comprehensively optimised in several cycles of LOO or LkO CV procedures.
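The relation between S, S' and c can be sketched numerically. The code assumes that the errors are taken as predicted minus experimental values, that c is their mean, and that S' is the root-mean-square of the errors after subtracting c; under these assumptions S^2 = S'^2 + c^2, which is why S' can never exceed S. The data values and function names are hypothetical.

```python
import math

def constant_shift(y_exp, y_pred):
    """Mean of all errors (predicted minus experimental values)."""
    n = len(y_exp)
    return sum(b - a for a, b in zip(y_exp, y_pred)) / n

def standard_error(y_exp, y_pred):
    """Standard error S: root-mean-square of the errors."""
    n = len(y_exp)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(y_exp, y_pred)) / n)

def shift_free_standard_error(y_exp, y_pred):
    """Modified standard error S': errors measured from their mean c."""
    n = len(y_exp)
    c = constant_shift(y_exp, y_pred)
    return math.sqrt(sum((b - a - c) ** 2 for a, b in zip(y_exp, y_pred)) / n)

# Hypothetical test-set values with a systematic over-prediction.
y_exp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
y_pred = [0.9, 1.2, 1.9, 2.3, 2.6, 3.5]

c = constant_shift(y_exp, y_pred)
s = standard_error(y_exp, y_pred)
s_prime = shift_free_standard_error(y_exp, y_pred)
print(f"c = {c:.4f}, S = {s:.4f}, S' = {s_prime:.4f}")
print(f"S^2 - (S'^2 + c^2) = {s**2 - (s_prime**2 + c**2):.2e}")
```

The difference S - S' therefore measures how much of the total error is due to the constant shift alone.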
Taking this into account, models in prediction on an external data set should be primarily evaluated and ranked by a parameter which is sensitive to the constant shift, such as the standard error of prediction, or maybe by another variant of the correlation coefficient named the concordance correlation coefficient. [35] Besides being insensitive to the constant shift of predicted relative to experimental values, Pearson's correlation coefficient is also highly sensitive to the distribution of data in the test set. The test set can be small, and its distribution can be (generally) skewed. In such a case, good prediction of only one or two cases located at the far edge of the distribution can cause a relatively large increase of the correlation coefficient. This is illustrated on the data given in Table 2 with three sets, among which the second and the third set are each larger by just one case. Correlation coefficients between experimental and predicted values from these three sets are -0.09, 0.56 and 0.77 for Prediction 1, 2 and 3, respectively. The correlation coefficients for Prediction 2 and 3 are high because the correlation coefficient given by Eq. (1) is highly sensitive to the distribution of data values, and insensitive to the constant shift. Obviously, the change of S values between these three predictions is much smaller than the change of R, indicating a greater stability of the standard error when applied to the calculation of prediction accuracy on an external set.
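The effect of a single case at the far edge of the distribution can be reproduced with a small synthetic example. The values below are invented for illustration and are not the data of Table 2; the function names and the assumption that S is the root-mean-square error with N in the denominator are likewise choices of this sketch.

```python
import math

def pearson_r(y_exp, y_pred):
    """Pearson correlation coefficient between experimental and predicted values."""
    n = len(y_exp)
    m_exp, m_pred = sum(y_exp) / n, sum(y_pred) / n
    sxy = sum((a - m_exp) * (b - m_pred) for a, b in zip(y_exp, y_pred))
    sxx = sum((a - m_exp) ** 2 for a in y_exp)
    syy = sum((b - m_pred) ** 2 for b in y_pred)
    return sxy / math.sqrt(sxx * syy)

def rmse(y_exp, y_pred):
    """Standard error (root-mean-square error) of prediction."""
    n = len(y_exp)
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(y_exp, y_pred)) / n)

# Synthetic cluster of weakly active cases with poor linear agreement.
exp_small = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
pred_small = [1.2, 0.9, 1.1, 1.0, 1.3, 1.0]

# The same set plus one highly active case that is predicted reasonably well.
exp_large = exp_small + [5.0]
pred_large = pred_small + [4.6]

r_small, r_large = pearson_r(exp_small, pred_small), pearson_r(exp_large, pred_large)
s_small, s_large = rmse(exp_small, pred_small), rmse(exp_large, pred_large)
print(f"R: {r_small:.2f} -> {r_large:.2f}")
print(f"S: {s_small:.3f} -> {s_large:.3f}")
```

One added case moves R from a negative value to nearly 1, while S changes by only a few hundredths, which mirrors the behaviour described for the three predictions above.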
It is known that the application of least-squares optimisation is not optimal in such a case, and does not give the optimal result. [36] Namely, another method, based on the minimisation of absolute deviation (L1-norm) introduced in 1757 by R. Bošković, [37] seems to be a more convenient solution for data sets with outliers, which can appear quite often in prediction on new external data sets. [38] There are many problems in chemistry or life sciences having a skewed distribution, i.e. in which many compounds in the data set are inactive or only weakly active, and only a few of them have high or very high activity. One example of such a distribution is related to the activity of 100 polyphenols measured by two assays. [39,40] All antioxidant activity values of polyphenolic compounds determined by the first assay are in the range 0 - 11.6. Among them, the 51 least active compounds have activity values < 1.0, and 30 values are < 0.1. The reason for this is that only polyphenols having one or more catecholic OH groups can have high antioxidant activity.
In the fitting procedure, it is possible to optimise the final model to be ( ) .
However, it will not be the case in the cross-validation procedure or in prediction on an external set of data having an arbitrary distribution.This problem will be much larger in the case of nonlinear models, which are much more dependent on the distribution of the training data set due to the stronger minimisation of the total model error by the introduction of additional nonlinear terms and, consequently, additional optimised coefficients in the model.In the CV procedure, only a slight perturbation of model is randomly introduced in each step by omitting some small portion of data, or only one case (compound) in each step in LOO CV.The rest of the data samples are then used for calculation of model coefficients which differ from the coefficients of model obtained on the complete training data set in fitting procedure.

Prediction on Data Sets Containing Two-Class Classification Properties
In the analysis of predictive quality of two-class problems we used the data from the paper describing results of the Tumour prediction challenge. [22]Originally, all submitted models in the stage 3 were ranked according to higher value of F1score.The values of F1score, Q2, Mcc and ΔQ2 are calculated from the contingency table values of 70 submitted models.The top 15 models are given in Table 3, and details of the remaining 55 models are included in Table S4.The code-names of models as indicated in the Tumour prediction challenge [22] are given in the second column of Table 3.
The first three parameters in Table 3 are just the values of pure accuracy parameters, and they can be related only to their maximal (or minimal) value. These three parameters do not consider the accuracy that can be obtained just by random guessing. Random (guessing) accuracy is higher if the data set is more monotonous, i.e. more imbalanced. In contrast, parameter ΔQ2 takes into account the level of random accuracy, and estimates the real model contribution to accuracy above that level. In the last four columns, the ranks of models according to each of the four parameters are given. The results in Table 3 are sorted by the values of the ΔQ2 parameter.
The ten best-ranked models according to each of the four parameters are within the 15 top-ranked models according to ΔQ2. Additionally, the 20 best-ranked models according to each of the four parameters are within the 21 top-ranked models. The mean absolute differences of ranks according to F1score, Q2 and Mcc from the ΔQ2 rank on the top 15 models are (respectively) 4.3, 5.9 and 6.3, and for the complete list of 70 models they are 2.9, 4.1 and 4.0 (Table S4).
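A mean absolute rank difference of the kind quoted above can be computed as sketched below. The sketch assumes that tied values share the mean of their positions (which is how identical ranks such as 4, 4, 4 or 6.5, 6.5 arise); the score lists are hypothetical and do not come from Table 3 or Table S4, and the function names are introduced here for illustration.

```python
def average_ranks(scores):
    """Rank 1 goes to the highest score; tied scores share their mean position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend the group while the next score equals the current one.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mean_abs_rank_difference(scores_a, scores_b):
    """Mean absolute difference between the ranks given by two parameters."""
    ranks_a = average_ranks(scores_a)
    ranks_b = average_ranks(scores_b)
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b)) / len(ranks_a)

# Hypothetical values of two quality parameters for six models.
param_a = [0.90, 0.80, 0.70, 0.70, 0.70, 0.60]
param_b = [0.55, 0.60, 0.40, 0.45, 0.50, 0.30]

ranks_a = average_ranks(param_a)
ranks_b = average_ranks(param_b)
mad = mean_abs_rank_difference(param_a, param_b)
print(ranks_a, ranks_b, mad)
```

A value of zero would mean that the two parameters order the models identically.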
Table 3 footnote: (b) Rank 4 appears three times because the models at ranks 3-5 have the identical value of Q2, and therefore the identical rank. Analogously, rank 6.5 appears twice, because models 6 and 7 have the identical rank.
Table 4 footnotes: (a) See the footnote of Table 3 for details and explanations. (b) The ratios of true negative to true positive (n/p) and of false positive to false negative (o/u) values, calculated for each model from the left part of this table.
The top three models according to ΔQ2 (X2463247, X2478107, and X2453885) with ranks 1, 2 and 3 have, respectively, ranks 5, 10 and 6 by F1score, 10, 5 and 6 by Q2, and 10, 5 and 15 by the Mcc values. To gain deeper insight into the differences between models ranked as the best ones according to these four statistical parameters, we give in Table 4 the values of the elements of the contingency matrices for the best 3-5 models. Also, the ratio of negative to positive correct predictions (n/p) and the ratio of over-prediction to under-prediction (o/u) for each model are given in the last two columns of Table 4.
One can see that the ratios n/p are similar for all top models, but the lowest values of the ratio o/u are obtained for the models ranked as the best ones according to ΔQ2, being in the range 2.0-3.2.
Six examples of extremely imbalanced data sets in Table 5 were constructed and suggested for testing the suitability of statistical parameters in estimating the quality of classification models. [29] The authors suggested that the correct order (ranking) of quality of the models should be a6 → a5 → a4 → a3 → a2 → a1. Three parameters from Table 5 give such an order, whereas Q2 is not sensitive to the variation of the elements of the contingency tables (p, n, u, and o) corresponding to these models, giving an accuracy of 99.95% for all models.
However, the difference in p, n, u, and o between neighbouring models is just 1 or -1, and going from the first to the sixth model the p, u, and o values change gradually in the same direction (from 0 to 5, or vice versa). Accordingly, we can say that two neighbouring models in the sequence given in Table 5 are the closest neighbours according to their p, n, u, and o values. If so, a better quality parameter should have equidistant values (although this is not an imperative property) for the models in the sequence a6 → a5 → a4 → a3 → a2 → a1. Only the ΔQ2 parameter gives such values, the difference being 0.02 between any two neighbouring models.
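The equidistance can be checked numerically with a minimal sketch. Two things here are assumptions rather than values taken from the paper: the form of the random accuracy used for Eq. (12), Q2,rnd = [(p+u)(p+o) + (n+u)(n+o)]/N², and the counts for a1-a6, which are reconstructed from the description above (N = 10000, p, u and o changing by one between neighbours, Q2 = 99.95 %) and not copied from Table 5.

```python
# Sketch: Q2, random accuracy and Delta-Q2 for six highly imbalanced examples.
# ASSUMPTION: random accuracy has the form
#   Q2_rnd = [(p+u)(p+o) + (n+u)(n+o)] / N^2
# ASSUMPTION: the counts below are reconstructed from the text, not from Table 5.

def q2(p, n, u, o):
    """Fraction of correct (positive and negative) predictions."""
    return (p + n) / (p + n + u + o)

def q2_rnd(p, n, u, o):
    """Assumed form of the most probable random accuracy."""
    total = p + n + u + o
    return ((p + u) * (p + o) + (n + u) * (n + o)) / total**2

def delta_q2(p, n, u, o):
    """Accuracy gained by the model above the random level."""
    return q2(p, n, u, o) - q2_rnd(p, n, u, o)

# Hypothetical counts: 5 actual positives out of N = 10000.
# p = TP, n = TN, u = FN (under-prediction), o = FP (over-prediction).
examples = {f"a{k + 1}": dict(p=k, n=9995 - k, u=5 - k, o=k) for k in range(6)}

for name, t in examples.items():
    print(f"{name}: Q2 = {100 * q2(**t):.2f} %, "
          f"dQ2 = {100 * delta_q2(**t):.4f} %")
```

Under these assumptions Q2 is the same for all six tables, while ΔQ2 grows by a constant step of about 0.02 percentage points from a1 to a6.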
Neither F1score, suggested as the most convenient parameter for estimating the quality of models on imbalanced sets in Refs. [20-22], nor Mcc, suggested as the better one in Refs. [30,31], gives equidistant values for neighbouring models in Table 5. Moreover, Mcc is not defined for model a1, which could be an important shortcoming of this parameter.
Table 6 gives more examples of contingency table values corresponding to imbalanced models. We analyse these examples to test the adequacy of the parameters in estimating the model quality, as well as in ranking the models based on the values of these quality parameters. It is evident that Q2 shows a larger redundancy, because it counts only the sum of correct (positive and negative) predictions and gives the same value whenever p + n is constant. Again, F1score and Mcc are not defined for some specific contingency table values, whereas Q2 and ΔQ2 are defined for all analysed models in Table 6. The values of F1score for models C1-C6 show relatively large deviations, because this parameter is highly sensitive to relatively small changes of p (i.e. of the class having the smaller number of elements, the minority class).
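The redundancy of Q2 and the undefined cases of F1score and Mcc can be illustrated with a short sketch. The three contingency tables below are hypothetical and are not the models C1-C14 of Table 6. F1score is written in its precision/recall form and Mcc in its usual contingency-table form, which is an assumption about the definitions given in the footnote of Table 3.

```python
import math

def q2(p, n, u, o):
    """Accuracy: fraction of correct (positive and negative) predictions."""
    return (p + n) / (p + n + u + o)

def f1_score(p, n, u, o):
    """F1 from precision and recall; None where either of them is 0/0."""
    if p + o == 0 or p + u == 0:
        return None  # precision or recall is undefined
    precision, recall = p / (p + o), p / (p + u)
    if precision + recall == 0:
        return 0.0  # no true positives: common convention F1 = 0
    return 2 * precision * recall / (precision + recall)

def mcc(p, n, u, o):
    """Matthews correlation coefficient; None when the denominator is zero."""
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    return None if denom == 0 else (p * n - u * o) / denom

# Hypothetical tables (NOT from Table 6), N = 100 and p + n = 95 in each.
tables = {
    "t1": dict(p=3, n=92, u=2, o=3),
    "t2": dict(p=1, n=94, u=4, o=1),
    "t3": dict(p=0, n=95, u=5, o=0),  # no compound predicted as positive
}

for name, t in tables.items():
    print(name, q2(**t), f1_score(**t), mcc(**t))
```

All three tables give the same Q2 because p + n is constant, whereas F1score and Mcc differ between t1 and t2 and are undefined for t3.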
Models C7 and C8 illustrate that F1score is not an adequate measure for the more populated class (majority class), and two separate variants of this parameter should be calculated for the two classes. The comparison of models C11 and C12 shows drastic differences in F1score and Mcc caused only by small differences in the p (from 1 to 0) and o (from 0 to 1) values, and similar conclusions can be drawn from the analysis of models C13 and C14. These results indicate a relatively large sensitivity of F1score and Mcc to the distribution of data. All examples of models given in Tables 5 and 6 by their contingency table values are developed on imbalanced data having a non-symmetric (skewed) distribution. A similar conclusion was obtained for the correlation coefficient and illustrated by the artificial continuous data given in Table 2. Such a result was to be expected because Mcc is just the correlation coefficient written in a form appropriate for calculation from the contingency table values.
We showed here several examples in which the use of the correlation coefficient is not justified. However, R can be a good and reliable measure in analyses of y-variables (i.e. properties or activities) which have a symmetric distribution of data values or, in the case of classification variables, an approximately equal number of elements in both classes. The correlation coefficient is a standard and useful validation parameter in the analysis of the accuracy of models in fitting and in cross-validation. R is also a useful accuracy measure in cases in which there is no constant shift between experimental and estimated/predicted values of y-variables, and when one only wants the model to predict the correct order of values of the y-variable, e.g. to rank correctly a set of molecules according to predicted properties or activities.
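The two weaknesses of R discussed in this paper, its insensitivity to a constant shift and its dependence on a single distant value, can be reproduced with a small sketch. All numbers below are made up for illustration and are not the data of Table 2 or Table S1.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rmse(x, y):
    """Root-mean-square error between experimental and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# (1) Constant shift: R is unchanged, the standard error is not.
y_exp = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical experimental values
y_pred = [1.2, 1.9, 3.3, 3.8, 5.1]           # hypothetical predictions
y_shift = [v + 2.0 for v in y_pred]          # same predictions, shifted by 2.0
print(pearson_r(y_exp, y_pred), rmse(y_exp, y_pred))
print(pearson_r(y_exp, y_shift), rmse(y_exp, y_shift))

# (2) One distant value: an uncorrelated cluster plus a single far point.
x_cluster = [1.0, 1.2, 0.9, 1.1, 1.0]
y_cluster = [1.1, 0.9, 1.0, 1.2, 0.8]
print(pearson_r(x_cluster, y_cluster))
print(pearson_r(x_cluster + [10.0], y_cluster + [10.0]))
```

In the first case R is identical for the original and the shifted predictions while the root-mean-square error increases by almost the size of the shift. In the second case adding one distant point raises R from about zero to close to one.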
The values of the parameter ΔQ2 are very small for the models from Tables 5 and 6, which were developed on imbalanced data sets. Since ΔQ2 is the difference between the real classification accuracy Q2 and the corresponding random accuracy Q2,rnd given by Eq. (12), a very high value of Q2,rnd is to be expected for highly imbalanced data sets. Because the maximal value of Q2 is 100 %, ΔQ2 will be very low if the Q2,rnd value for a largely imbalanced set is higher than 95 or 99 %. According to the presented results, we strongly suggest the use of the parameter ΔQ2 in the analysis of the quality of models, together with other useful and appropriate statistical parameters.
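The smallness of ΔQ2 on imbalanced sets can be made explicit with a short derivation. It is a sketch that assumes Eq. (12) has the form Q2,rnd = [(p+u)(p+o) + (n+u)(n+o)]/N²; the values are written as fractions and should be multiplied by 100 to obtain percentages.

```latex
\begin{align*}
Q_2 &= \frac{p+n}{N}, \qquad
Q_{2,\mathrm{rnd}} = \frac{(p+u)(p+o) + (n+u)(n+o)}{N^2}, \qquad
N = p+n+u+o,\\
\Delta Q_2 &= Q_2 - Q_{2,\mathrm{rnd}}
  = \frac{(p+n)(p+n+u+o) - (p+u)(p+o) - (n+u)(n+o)}{N^2}
  = \frac{2\,(pn - uo)}{N^2}.
\end{align*}
```

Under this assumption, even a perfect model (u = o = 0) on a set with 5 positives and 9995 negatives reaches only ΔQ2 = 2 · 5 · 9995 / 10000² ≈ 0.001, i.e. about 0.1 %, whereas a perfect model on a balanced set (p = n = N/2) reaches 0.5, i.e. 50 %. The same expression shows why ΔQ2 changes linearly with the elements of the contingency table when the other elements are kept fixed.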
All results related to classification models presented here correspond to external validation, i.e. they are related to predictions made on external test sets. External validation can be problematic and can give over-optimistic or under-optimistic results and performance parameters, especially if the test set is small. [41,42] In such a case, the model evaluation parameters can be highly sensitive to the partition of data into the training and test sets, and to the distribution of values of variables or descriptors. However, the results of the analysis of the usefulness of different validation parameters presented in this study are based only on the elements of contingency tables and, consequently, do not depend on the size of the data sets or on the significance of the calculated parameters. The analysis of the significance of the parameters compared in this study is not a primary issue of this paper, although it is known from basic statistics that each statistical parameter is more significant if it is calculated from a larger set of cases. In any case, we used here several artificial data sets for which the elements of contingency tables (p, n, u and o) were either defined by us or taken from the literature (Tables 5 and 6), and their size was selected arbitrarily. Moreover, the continuous real data set (Table S1) is large enough, having 1039 and 258 cases (compounds) in the training and test sets, respectively. Additionally, the real classification data set is very large, having 24687 cases in the test set (Tables 3 and S4), and the corresponding models submitted within the challenge were developed on a training set containing an even larger number of cases. [13,20,22]

CONCLUSION
The validation of models submitted within predictive challenges is not an easy and straightforward task. In several challenges concerned with the prediction of biological properties of small molecules and macromolecules, the evaluation criteria (in cases when continuous predictions are expected) are based on the correlation coefficient R applied to the test set. However, although the correlation coefficient is very useful in the comparative analysis of the quality of models, its use for ranking models by their prediction of activity/property values on the test set could be misleading, because R is not sensitive to a possible constant shift. Moreover, the test set can have an arbitrary and even highly skewed distribution of the experimental activity/property data that are to be predicted. Consequently, the predictions of such data by the models could also have a largely skewed distribution. We have shown that the correlation coefficient depends strongly on the distribution of data and can have a very high value just because of the presence of a single activity/property value that is far from the mean of the rest of the data values. In cases of binary classification problems, [20][21][22] the Matthews correlation coefficient Mcc and F1score were used for ranking models. In our study, a novel accuracy parameter ΔQ2, estimating the real contribution of a model over the most probable random accuracy, was tested in the evaluation of model predictive quality. We presented here the results of the analysis and ranking of classification models/methods based on four most often used accuracy parameters. These results (obtained/predicted ranks) are compared with the ones obtained by the parameter ΔQ2. The corresponding values of the contingency table parameters were also analysed. It turns out that the best models ranked according to the parameter ΔQ2 have more true positives (p = TP) of the minority class, and more balanced numbers of false positives (o = FP) and false negatives (u = FN), than the corresponding top models ranked according to the values of Q2, Mcc or F1score. Thus, ΔQ2 favours models having more balanced (symmetric) prediction errors over the imbalanced classification models. The symmetry of false positive and false negative prediction errors (o and u) is also suggested in the literature as a good characteristic of a validation parameter (e.g. by Baldi et al. in Ref. [43]). Additionally, ΔQ2 is defined for any set of values of the contingency table, and its values are (linearly) proportional to the changes of the values of the contingency table.
The presented analyses support the use of the standard error of prediction (or the mean absolute error of prediction) for ranking models developed on continuous data and for evaluating their quality. The presented results related to the estimation of the quality of classification models also strongly support the inclusion of the parameter ΔQ2 in the standard set of validation parameters for ranking models within predictive challenges. Namely, the parameter ΔQ2 estimates exactly what should be the main aim of the improvement of models, i.e. the increase of model accuracy (as much as possible) over the most probable random accuracy.

The corresponding range of o/u for the top three models ranked by F1score is 4.4-9.8 (Table 4, the last column of part B), and the values of o/u for the top models ranked by Q2 (part C) and Mcc (part D) are in a similar range. The analysis of the ratios n/p and o/u reveals that ranking by ΔQ2 favours the models with closer values of o and u, which at the same time have the ratio o/u closer to the ratio n/p. It seems reasonable to regard as better the model with closer values of the ratios n/p and o/u.
(1) Four examples of contingency table values from Tables 4 and 5 (special cases) in Ref. [29] having a highly imbalanced class distribution (5 : 95); (2) Two imbalanced examples (95 : 5 and 94 : 6) of contingency table values given and analysed in Ref. [30] and also in Ref. [31] in order to illustrate the drawbacks of F1score compared to Mcc; and (3) Eight examples constructed in this study in close

Table 1. Basic statistical accuracy parameters of structure-solubility QSPR models based on Multivariate Linear Regression (MLR) having 1-5 most significant descriptors.(a)
, Q2, Mcc, and ΔQ2 based on their predictions on the test set (IS3).(a)

Table 5. Comparison of accuracy parameters calculated from values of contingency tables as suggested in Ref. [29] (N = 10000).(a)
(a) Examples a1-a6 are from Ref. [29]. Values of Q2 and ΔQ2 are given in (%). See footnote of Table 3 for the definition of F1score, Q2, Mcc, and ΔQ2.