QSPR Models for Prediction of Aqueous Solubility: Exploring the Potency of Randić-type Indices

The development of QSPR models to predict aqueous solubility (logS) is presented. A structurally diverse set of over 1600 compounds with experimentally determined solubility values (AqSolDB database) is used for building the data-driven models based on multiple linear regression (MLR) and artificial neural network (ANN) methods to predict aqueous solubility. Molecular structures are encoded by numerous structural descriptors, including the connectivity index developed by Randić in 1975, and many later derived variations. To evaluate the potency of Randić-like descriptors in the structure-property relationship, we developed models based on two sets of descriptors, first using only Randićlike descriptors calculated with Dragon, and second using 17 commonly applied descriptors available in the AqSolDB database. All models were validated with external prediction sets, with the RMSE ranging from 0.8 to 1.1. Interestingly, the RMSE of predicted LogS values of models based only on the Randić-like descriptors were in average just 0.1 larger than the models with 17 descriptors preselected as suitable for modelling logS.


INTRODUCTION
QUEOUS solubility is an important physical property describing a complex process of the interaction of a solute with water. It is of a special importance in the pharmaceutical industry, as the drug discovery relays upon solubility data to help improve drug delivery systems. In our previous work we found a potential autolysin E (AtlE) inhibitor, [1] which has to be subjected to a structure optimization procedure due to its low solubility. There are several models available to predict aqueous solubility (logS) based generally on two main approaches, either calculation of solution free energy using physics-based theory alone, or using machine learning/quantitative structure−property relationship (QSPR) models.
Theoretical calculations by using molecular simulation of aqueous solubility are extremely demanding. Physicsbased solubility computations equilibrate the free energy of a molecule in the crystal lattice to the solvation energy of a molecule in saturated aqueous solution. Crystal lattice free energy is experimentally observed as free energy of sublimation, while free energy for transfer of the molecule from the gas phase to the saturated solution is solvation free energy. Both values can be calculated by molecular dynamics simulation in conjunction with one of the methods for free energy calculation, such as thermodynamic integration, thermodynamic perturbation, or metadynamics. Solvation free energy calculation [2,3] is specially demanding since in the case of solubility simulation it depends on the solute concentration. Therefore, solvation free energy value should be determined for several values of solute concentration. Obviously, a huge computational effort for one species and first principle approach is definitively not practical for purposes like drug design where hundreds of drug candidates should be screened. For much simpler tasks like octanol/water partition calculations where solvent reaction field methods should work, only few solvation models reproduce experimental values e.g. SMD (solvation model density) of Truhlar and Cramer works, while PCMpolarizable continuum model of Tomasi does not work. [2] Often the pragmatic approach of the machine or deep learning outperforms the theoretical calculations, particularly after having more and more data available. It has been reported by J. L. McDonagh et al. [3] that direct theoretical calculation did not give accurate results, while machine learning was able to predict the logS with a root mean squared error (RMSE) of ∼1.1 log units in a 10-fold cross-validation for 100 drug-like molecules. The group of Schneider performed a review of the drug discovery with explainable artificial intelligence. [4] They concluded that deep learning bears promise for drug discovery, but there is a demand for 'explainable' deep learning methods to address the need for a new narrative of the machine language of the molecular sciences. QSPRs are widely used in silico methods to predict chemical properties for untested chemicals, since they present time and costeffective approach, which is in most cases sufficient alternative method to experimental testing. For building and validation of models, the appropriate statistical algorithms and the data matrix that includes numerical values of chemical structures and empirical values of property are needed. In the literature and on web platforms, QSPR models for aqueous solubility based on various sets of molecules are published. [3,[5][6][7][8][9][10][11][12][13][14][15][16] Models are developed using various number of compounds and linear or non-linear methodologies (e.g.: multiple linear regression, partial least squares, ordinary least squares, multivariate adaptive regression splines, hierarchical clustering, group contribution, nearest neighbour, support vectors, random forest, or artificial neural networks). By utilizing information derived only from SMILES strings, the available models make predictions of aqueous solubilities (logS) in simple and straightforward procedures, which do not require molecular geometry optimizations. Models generate predictions including various molecular properties from molecular descriptors (Chemistry Development Kit (CDK), PaDEL, RDKit, Dragon, alvaDesc, SASA) to molecular structures signatures (distance graph based signatures (GBS), MACCS keys, etc.). Most of the models include logP parameter for prediction [3,5,7,9,12,[17][18][19] , which is also a core variable in General Solubility Equation (GSE) [20] While GSE is method of choice only when melting point is estimated, the other models, in which variables are calculated from molecular structure, can be used without limitations.
On SwissADME platform, [13] three aqueous solubility models are available: ESOL, [5] Ali [12] and SILICOS-IT. ESOL model by Delaney [5] was developed from a set of 2874 compounds using multiple linear regression and nine molecular properties like logP, molecular weight, proportion of heavy atoms in aromatic systems, and number of rotatable bonds. The model has good performance = =  [12] While model SILICOS-IT [21] is based on fragmental method corrected by molecular weight and having = 2 TR 0.75.

R
Models of McElroy and Jurs [14] were generated with 399 heteroatomcontaining organic compounds by using multiple linear regression (MLR) and computational neural network. The best results were obtained with non-linear CNN models = TR (RMSE 0.6, = V RMSE 1.5; subscripts TR and V referring to training and validation sets, respectively). The models available on VEGA and EPA platforms are based on 5020 compounds from EPI Suite database. VEGA water solubility model v.1.0.0 [7] [6] Another available prediction model is on pkCSM platform [11] generated with 1708 compounds and graph based sig- The model from admetSAR 2.0 is also based on the same set of compounds = 2 ( 0.81). R [9,10] On Alvascience platform model based on ordinary least squares, 8825 compounds and five alvaDesc descriptors is presented and its predictions are in high correlation (> 0.9) with ESOL model.
The molecular connectivity index developed by Milan Randić [22] was shown to correlate almost perfectly with the boiling points of alkane isomers having two to seven carbon atoms. Hall et al. [23] demonstrated its relationship to water solubility and boiling point. After 25 years, in 2001, Randić published a comprehensive review of the developments of the connectivity indices as molecular descriptors in multiple linear regression analysis structureproperty-activity studies. [24] The review is focused on the elaboration of higher order connectivity indices and the valence connectivity indices. The discussion has shed light on further development in chemical graph theory, novel directions in mathematical characterization of chemical, biochemical, and biological systems, all stimulated by the connectivity index. Connectivity-based molecular descriptors were later applied in many QSPR models, including those for prediction of aqueous solubility [25][26][27][28][29] A few years ago, Gutman et al. [30] depicted an interesting connection between the degree-based information content of a (molecular) graph and Randić index. However, detailed inspection of the correlation studies revealed that I(G) (degree-based information content), converse to R(G) (Randić index), carries information only on degree distribution in graphs and not on their mutual relationship, which results in the insensitivity of vertex-degree-based information of a graph on subtle structural differences among graphs. This illustrates additionally the advantage of the Randić index and its application potential in chemistry. An interesting use of connectivity indices is presented for the estimation of stability constants of metal-complexes. [31,32] The aim of this study was to evaluate the potency of Randić-like descriptors in the structure-property relationship regarding aqueous solubility. In particular, our goal is to develop the QSPR (Quantitative Structure-Property Relationship) models to predict aqueous solubility (logS) of the potential (AtlE) inhibitors in order to optimize the initial chemical structure of poorly soluble hit compounds that were obtained in previous work. [1] Therefore, we developed and compared models based on two sets of descriptors, first using only Randić-like descriptors calculated with Dragon, and second using 17 commonly applied descriptors, as described in the literature, and available in the AqSolDB database. [33]

Dataset
The dataset for modeling included 1674 compounds. The solubility data were obtained from AqSolDB database. [33] In this database logS values were collected from different sources. [5,20,[34][35][36][37][38][39][40][41] AqSolDB consists of over 9900 unique compounds, which are coded with SMILES strings. For our modelling, we have chosen only compounds from the most reliable groups G3 and G5, i.e. groups composed of compounds with logS values found more than once in merged dataset and having reliability label of standard deviation < 0.5. In this way, 1818 compounds were obtained that were further reduced because of limitations of calculation of molecular descriptors (MD) in Dragon software. [42] This led to 1674 compounds in our dataset. Compounds were classified in four classes according to the thresholds reported in AqSolDB: insoluble compounds (logS > −4), moderately soluble compounds (log S −4 to −2), soluble compounds (logS −2 to 0) and highly soluble compounds (logS > 0) [33] The classes are labelled as I, L, S and H (see Supplementary file Figure Table S1 (Supplementary Material). The experimental values available as standardized logS units in AqSolDB [33] were obtained from aqueous solubility assays that followed the OECD guidelines for testing of chemicals. Further on, 1674 compounds were divided into three datasets: 1004 compounds in the training set TR (60 %), 335 in the test set TE (20 %) and 335 in the validation set V (20 %). The split of compounds into datasets was based on mapping and visualization of compounds on top-map with CPANNatNIC software [43] . Several Kohonen maps of different network sizes (19 × 19, 20 × 20, 21 × 21) were tested for mapping the compounds according to their structural similarity and logS values. After the statistical analysis and inspection of occupied/empty neurons, the optimal neural network for the splitting purpose was of size 20 × 20 (Table S2). The most similar compounds that were located on the same neuron were then split in training, test and validation datasets. See the methodology of splitting of data using Kohonen ANN in the paper of Minovski et al. [46] and Refs. [36,37] ibid. The Kohonen map of the optimal network is visualized in Figure S1.

Molecular Descriptors
Two set of molecular descriptors were used in development of our QSPR predictive models. The first set, so called AqSol set, was composed of 17 MDs topological and physico-chemical 2D descriptors that are published in AqSolDB. [33] Originally, they were calculated with RDKit software [44] and are listed in Table S8 (Supplementary Material). The second set, so called Dragon set, was generated by Dragon 7.0 software. [42] Molecular structures in the format of SMILE strings were an input and the program calculated over 3000 molecular descriptors for each molecule. Therefore, the reduction of the number of initially calculated descriptors was crucial for QSPR modelling. We focused on Randić connectivity index and other Randić-like descriptors. In this way we ended up with 94 Randić descriptors (Table S3). Prior to be used in building of QSPR models, both set of descriptors were normalised to zero mean and unit standard deviation for each descriptor.

Generation of Models
The inputs for model building were various m-dimensional vectors representing the chemical structure; m being the number of selected MDs (independent variables), and the target (property, i.e. dependent variable) corresponding to logS of compounds. For model fitting, the linear and nonlinear regression approaches were used. Firstly, we used the supervised learning algorithm of counter-propagation artificial neural network (CP-ANN) and in-house software (CP-ANNatNIC). [43] During the model optimisation, the size of CP-ANN (number of neurons), number of epochs, minimal and maximal coefficients used for correction of CPANN weights, and selection of descriptors were simultaneously optimised. Secondly, the multiple linear regression models were developed with Qsarins software (DiSTA, Varese, Italy, www.qsar.it) [45] In optimization the genetic algorithm (GA) was used for selection of the influential descriptors and improvement of predictive ability and robustness of the models, which is available in Qsarins and combined with the CPANN in-house program. All models were externally validated and had defined applicability domain (AD). The AD was applied to evaluate the reliability of model predictions within the established chemical space limits. Two approaches were used for AD evaluation, the cumulative distributions of Euclidean distances to central neurons (MEDS) for CP-ANN models [46] and the leverage values for MLR models (hat values). [45] At the end, the models with the best performance parameters were selected and are presented in this paper as models NN-AqSol (NN-A), Q-AqSol (Q-A), NN-Dragon (NN-D) and Qsarins-Dragon (Q-D). These optimized regression models can be further used for prediction of logS value of any chemical of interest, considering the limitations specified with the models, such as electro-neutrality or structural applicability domain.

Statistical Evaluation of Models
First it has to be stressed that the problem of overfitting can always be an issue using ANN method. The methodology applied in this work relies on division of the data into training, test and validation set of compounds. Here, we have to take care that the compounds in the validation set, which serves for the final validation of the models, is separated from the rest of compounds at very beginning, prior to any data curation. Then the test set is selected as described, and it serves for internal intermediate testing during the model development and optimization. This standard procedure has proven to be the most robust against potential overfitting.
The criteria used for selection of the best regression models were several validation parameters, which are RMSE, R 2 , Q 2 F3, CCC. [47][48][49] The models with the highest quality indicators of the validation set were selected. The consensus approach was also applied for the NN-A and NN-Q models (consNN) and for all four regression models NN-A, Q-A, NN-D and Q-D (cons4) by using the weighted average response. The calculation of the final predicted value of logS was performed following the Equation (1); M-number of models, yk-response estimated by the k-th model, hk-leverage [50]

Splitting of the Data
Initially, we followed recommended methodology for building QSAR models [51] and performed precise splitting of the initial data to avoid inconsistent results. The rate of the division of compounds into the training set (TR), test set (TS), and validation set (V) sets was done according to the optimal distribution of 1674 compounds described with 17 AqSol molecular descriptors on the Kohonen top map. In Kohonen neural network, molecular descriptors were mapped according to similarity and consequently few of them have place on the same neurons. A network with optimal distribution of compounds had the following parameters: 20 × 20 neuron grid, 100 learning epochs, RMSE = 0.397 and R = 0.918, 0.5 maximal learning rate, 0.01 minimal learning rate, non-toroidal NN boundary conditions, and triangular correction function of the neighborhood (Table S2). Splitting of 1674 compounds was performed following the optimal rate (60 % training set / 20 % test set/ 20 % validation set) to cover as much information as possible. In models using Dragon descriptors, the number of compounds was reduce to 1665, since for nine compounds some descriptors were not calculated. The validation set (n = 335, 334 for Dragon models) was the same for all developed models. The training set for MLR models was composed of 1339 compounds (1331 for Dragon models), merged TR and TS sets, while in CP-ANN models the training and test sets were composed of 1004 (996 for Dragon models) and 335 compounds, respectively ( Table 1).

Distribution of logS
Analysis of the logS distribution reveals that the compounds have solubility values in the range between −12.1 and 1.5. Figure 1 shows the distribution of solubility values in solubility classes according to AqSolDB classification. [33] Our rates of compounds distribution according to aqueous solubility classes are comparable with rates of compounds as available in AqSolDB.

Development of Predictive Models
The logS values of 1674 compounds selected from AqSolDB and algorithms of multiple linear regression (MLR) or counter-propagation artificial neural networks (CP-ANN) with genetic algorithm (GA) were applied for development of regression models for predicting solubility in water. During optimization process hundreds of models were generated, but only the best four models were selected and represented in this study (models NN-A, Q-A, NN-D, Q-D) for two sets of descriptors (A: AqSol 17 descriptors and D: Dragon Randić-like 94 descriptors) and two modelling methods (NN for neural networks and Q for MLR). The QSARINS and CP-ANNatNIC software are well known and frequently used tools for linear and non-linear QSAR models. [43,45] The best models were chosen by using Root Mean Square Error for training set (RMSE TR ) and validation set (RMSE V ) as the optimization value criteria. In building and optimization process of CP-ANN models numerous GA runs by changing different parameters like number of neurons and learning epochs, learning rate, minimal and maximal coefficients used for correction of CP-ANN weights and number of descriptors were performed. The final selection of the best models was based on the optimal values of performance indexes ( Table 1, Table S13). The best four regression models are presented in Table 1. The performance statistics for eight logS predictive QSPR models obtained by Qsarins or CP-ANN tools are given in Table 2. The best results were obtained with NN-A model (RMSE TR = 0.69, RMSE V = 0.76), which has a good ability to predict aqueous solubility. Next reliable model NN-D has also good performance (RMSE TR = 0.81, RMSE V = 0.96). The linear Q-A model has values for RMSE TR = 1.38, RMSE V = 1.12, while another non-linear prediction model Q-D have parameters RMSE TR = 1.22 and RMSE V = 1.24. If several models are developed the consensus approach could be applied according to the OECD guidance. Consensus models cons4 and aver4 are including predictions of four chosen models, but do not show better results than single models NN-A and NN-D. On the other hand, the consensus models consNN and averNN, which includes predictions from both CP-ANN models (NN-A and NN-D), are the best according to validation parameters. The consensus approach in consNN has decreased the RMSE TR to 0.59, if compared with single models NN-A (RMSE TR = 0.69) and NN-D (RMSE TR = 0.81). The newly developed models have broad applicability domain and cover wide chemical space. Almost all compounds are in AD of our models. Therefore, models are robust and predictions are reliable, based on compounds from similar structural domain.

Selection of Influential Descriptors
The descriptors selected in the NN-A and Q-A models are shown in  Table S5 are correlated with original Randić connectivity index (X1) and some other connectivity indices X1v, X0sol, X3sol, X5sol, RDCHI, X1Kup, X1Per, X1MulPer, and 2D matrix-based descriptor Chi_H2 in model NN-D. We also observed high correlation with walk and path counts descriptor CID,  connectivity index X1sol, and 2D matrix-based descriptors Chi_B(e), Chi_B(p) in model Q-D. In Table S5, we can also see, that five Dragon descriptors (X0Av, ChiA_Dt, Chi_B(p), ChiA_B(p), VR2_B(i)) are represented in both models, NN-D and Q-D.
The correlation coefficients (CC) of aqueous solubility predictions among our prediction models, experimental logS and predictions generated with six publically available models are represented in correlation matrix (Table S14). We can observe high correlation (CC > 0.9) for models NN-A, NN-D, aver4, consNN, averNN and VEGA with experimental logS values. High correlation (CC > 0.8) was also observed among predictions from NN-A, NN-D, cons4, aver4, consNN and averNN with predictions in VEGA and pkCSM models, while predictions of other three public models (ESOL, Ali and SILICOS-IT) are less correlated with our models and experimental logS, but still in reasonable high rate (CC > 0.7).
To summarize, our trial was to select a really large set of reliable solubility data of structurally diverse compounds, from different data sets, all collected in AqSol database. So compiled data (chemical diversity) combined with CP-ANN method, which is able to automatically organize smaller clusters (subsets) of compounds from which the prediction are performed, give better results than any of available models compared in the discussion, with large applicability domain.
Software together with the model of Aqueous solubility, is available from the authors upon request.
Software written in Java is also freely available, the user can download it from: https://www.ki.si/en/departments/d01theory-department/laboratory-for-cheminformatics/software/ (SOM tool developed within LIFE+ project LIFE PROSIL).

CONCLUSIONS
In this work, linear and non-linear QSPR models were constructed to predict solubility in water. A dataset of 1674 chemicals (splitting in 60 % training set, 20 % test set and 20 % validation set) and their experimentally measured aqueous solubility values (logS) was used for model development. Dragon and AqSol molecular descriptors, and MLR and CP-ANN algorithms were implemented in modelling process. Multiple calculations were performed in order to obtain the optimal model (cons NN with = 2 0.92, R RMSE TR = 0.59). Non-linear models were shown to give better results and predict the water solubility of chemicals more accurately than the linear ones. An interesting conclusion can be drawn from the comparison of models based on descriptors derived from the connectivity index (Randić-like indices) on one side, and on AqSol descriptors (recognized as most suitable descriptors for logS modelling) on the other side. The RMSE of the models based on the Randić-like descriptors only, were in average just 0.1 log units larger than the models with AqSol descriptors, which demonstrates a huge potential of connectivity index in capturing molecular structural properties that may correlate with physico-chemical as well as biochemical properties of compounds. In drug design, the solubility is one of the key properties that have to be considered. For drug design purposes it would be worth mentioning protonation states that depend on solution pH and solute pKa values. The latter is also concentration dependent. Besides, protonation states may be different in crystal than in the solution. The methodology applied in this work, however, cannot explicitly consider such effects. Nevertheless, some underlying information about the effects mentioned above is present in the solubility data used for training and developing data-driven models. Once the database of tested chemicals is large enough and covers an extensive area of chemical structure space, the models would become more reliable, also regarding the protonation state, but the interpretation of it definitively remains as a challenge for future.
Based on the results obtained in this work we are confident that these newly developed models could be a valuable guidance for design and optimization of more soluble compounds in the future. Since models are robust and reliable, we hope they will be very useful in our further drug development of autolysin E inhibitors.