Chemometric versus Random Forest Predictors of Ionic Liquid Toxicity

Introduction
Interest in ionic liquids (ILs) stems from their unique solvent properties and their potential for process "self-containment". Their application in chemical processes and biotransformations opens the possibility of clean manufacturing ("green technology"). Besides their solvent and extraction functions, ILs also exhibit synergistic effects with catalysts (enzymes), yielding higher productivity. In theory, the number of possible ILs is limitless, and they cover a very broad range of physical and chemical properties. Research on ILs has become one of the most active areas of applied research in novel catalytic synthesis, biofuel production from agricultural wastes, integration of chemical and enzyme reactors with separation processes, polymerization, nanotechnology, enzyme catalysis, composite preparation and renewable resource utilization [1][2][3]. Especially interesting is the use of microreactors for ionic liquid synthesis, and possibly as production systems for integrated biotransformations and product separation [4]. However, questions about the ecotoxicity and degradability of ILs have recently been raised. The analysis of their versatile structure can be formally viewed as a combinatorial problem that is well suited to computational treatment. The object of this work is to apply computer modeling, by chemometric methodology and a decision tree algorithm, to predict a continuous variable, the toxicity concentration EC50, and the toxicity level classification, based on the choice of cation and anion structures and their chemical compositions. Predictions of IL properties are based on data published in the literature and on the internet-accessible NIST and MERCK databases of physical properties and cytotoxicity [5][6][7].
The main objective of this work is to infer the rules and patterns implicitly contained in a set of chemical structures and molecular descriptors. A supervised learning algorithm is applied, with target sets for continuous and classification properties, to reveal relationships between molecular descriptors.

Abstract
The objective of this work was a comparative analysis of standard chemometric models and decision tree models for predicting the biological impact of ionic liquids (ILs) for various combinations of cations and anions. The models are based on molecular descriptors for combinations of the following cations: imidazolium, pyridinium, quinolinium, ammonium and phosphonium; and anions: BF4, Cl, PF6, Br, CFNOS, NCN2, C6F18PBF4 and C6F18P. The derived data matrix is decomposed by singular value decomposition of the cation and anion matrices into the first ten components of each, which account for 99.5 % of the corresponding total variance. Biological impact data, i.e. molecular-level toxicity, are based on experimental acetylcholinesterase inhibition data provided in the MERCK Ionic Liquids Biological Effects Database. The following models were applied: principal component regression (PCR), partial least squares (PLS), and decision tree models. Model performance was compared by ten-fold cross-validation. The following Pearson R² values were obtained: PCR 0.62, PLS 0.64, and random forest of decision trees (RFDT) 0.992. The decision tree models significantly outperformed the chemometric models, both for numerical prediction of EC50 concentrations and for classification of ILs into four levels of toxicity.
Key words: ionic liquids, toxicity, chemometrics, decision tree

Experimental
The chemical formula of each ion is recorded in SMILES and MOL format and evaluated for the corresponding molecular descriptors [8]. Hence, each combination of ions for a specific IL is represented by 2x797 data points. Since the numerical values of the molecular descriptors span several orders of magnitude, each descriptor is autoscaled using its sample average and the corresponding standard deviation. The descriptors of the selected cations are transformed in this way, and the descriptors of the selected anions are transformed accordingly. The resulting matrices of autoscaled descriptors are analyzed for their mutual interrelationships. For the cation and anion data matrices, an average Pearson correlation of R² = 0.4 is obtained, which is significant considering the large number of samples (ions). Because of the high collinearity between the various molecular descriptors, the data matrices are decomposed into a series of partial components by singular value decomposition of the corresponding anion (X_A) and cation (X_C) covariance matrices, i.e. by solving the corresponding eigenvalue problems. The decomposed matrices, P_A and P_C, are defined by the corresponding eigenvectors v_A and v_C, and the contributions of the individual partial decompositions are evaluated by the ratios of the squares of the corresponding eigenvalues λ_Ai and λ_Ci to the number of descriptors M.
Based on the preselected level of 99.5 % of the total variance, the first ten eigenvectors (K = 10) are chosen for each data set.
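The autoscaling and decomposition steps described above can be illustrated with a short Python sketch using NumPy. The function names, the synthetic data, and the choice to take the SVD of the autoscaled matrix directly (instead of solving the eigenvalue problem of the covariance matrix explicitly) are assumptions made for illustration only; this is not the code used in the original work.

```python
import numpy as np

def autoscale(X):
    """Center each descriptor (column) and scale it to unit sample standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0.0, 1.0, std)  # constant descriptors: avoid division by zero
    return (X - mean) / std, mean, std

def principal_components(Z, variance_level=0.995):
    """Keep the fewest components whose cumulative variance reaches variance_level.

    Returns the scores (N x K), the loadings (M x K) and the variance ratios (K,).
    """
    # Squared singular values of Z are proportional to the eigenvalues of Z^T Z.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    ratios = s**2 / np.sum(s**2)
    k = int(np.searchsorted(np.cumsum(ratios), variance_level)) + 1
    k = min(k, len(s))
    loadings = Vt[:k].T
    scores = Z @ loadings
    return scores, loadings, ratios[:k]

# Hypothetical stand-in for a descriptor matrix: 50 ions, 12 strongly collinear descriptors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
X = latent @ rng.normal(size=(3, 12)) + 0.01 * rng.normal(size=(50, 12))

Z, mean, std = autoscale(X)
scores, loadings, ratios = principal_components(Z)
print("components kept:", scores.shape[1])
print("variance retained:", ratios.sum())
```

In the study itself the same procedure would be applied separately to the cation and anion descriptor matrices, with the retained scores serving as model inputs.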

Results and discussion
The chemometric and decision tree models are compared for regression and prediction of the EC50 concentration and for toxicity classification, using the experimental acetylcholinesterase inhibition data provided in the MERCK Ionic Liquids Biological Effects Database [7]. The model inputs are the molecular descriptor projections. The chemometric models are linear and are applied here for their expected robustness and improved prediction compared with classical least squares multivariate models [9][10][11][12]. The first tested model is principal component regression (PCR), given by Eq. (7).
The statistical evaluation and analysis of the model parameters are performed with algorithms provided by the open source R software [16] and STATISTICA [17]. Ten-fold cross-validation is applied within the training set of samples, together with validation on a data set that had not been used during the modelling phase. The quality of the PCR model for prediction of the EC50 concentration is relatively poor, with R² = 0.62, as presented in Fig. 1.
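A minimal sketch of principal component regression evaluated by ten-fold cross-validation is given here in Python with NumPy. It is an illustration under stated assumptions, not the R or STATISTICA procedure used in the study: the helper names, the synthetic data, and the use of the squared Pearson correlation between measured and cross-validated predicted values as the quality measure are all hypothetical choices.

```python
import numpy as np

def fit_pcr(X, y, n_components):
    """Autoscale X, project onto the leading principal components, regress y on the scores."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std = np.where(std == 0.0, 1.0, std)
    Z = (X - mean) / std
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    P = Vt[:n_components].T
    T = np.column_stack([np.ones(len(Z)), Z @ P])  # intercept column plus scores
    beta, *_ = np.linalg.lstsq(T, y, rcond=None)
    return {"mean": mean, "std": std, "P": P, "beta": beta}

def predict_pcr(model, X):
    """Apply the training-set scaling and loadings to new samples."""
    Z = (X - model["mean"]) / model["std"]
    T = np.column_stack([np.ones(len(Z)), Z @ model["P"]])
    return T @ model["beta"]

def cross_validated_r2(X, y, n_components, n_folds=10, seed=0):
    """Squared Pearson correlation between y and its out-of-fold predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    y_hat = np.empty(len(y), dtype=float)
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        model = fit_pcr(X[train], y[train], n_components)
        y_hat[test_idx] = predict_pcr(model, X[test_idx])
    return np.corrcoef(y, y_hat)[0, 1] ** 2

# Hypothetical data with an exactly linear structure, so PCR is expected to fit well here.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 3))
X_demo = latent @ rng.normal(size=(3, 12)) + 0.01 * rng.normal(size=(100, 12))
y_demo = latent @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
r2_demo = cross_validated_r2(X_demo, y_demo, n_components=3)
print("cross-validated R2:", r2_demo)
```

Scaling and loadings are recomputed inside each fold so that the held-out samples never influence the model they are predicted with.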
The second tested model is partial least squares (PLS), which aims to improve the predictions by separate decomposition of the input and output data sets (Eqs. 8-9).
The predictive model is built by regression between the inner projections T and U. The PLS model predictions on the test data improved slightly, yielding R² = 0.64, as presented in Fig. 2.
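For orientation, the separate decomposition of the input and output data and the inner regression described above are commonly written in the following textbook form. This is a sketch of the standard PLS notation, not a reconstruction of the paper's own equations: T, U and Q follow the list of symbols, while the input loading matrix P, the residual matrices E, F and H, and the inner regression matrix B are assumed here.

```latex
\begin{aligned}
X &= T P^{\mathsf{T}} + E, \\
Y &= U Q^{\mathsf{T}} + F, \\
U &= T B + H .
\end{aligned}
```

In this form a prediction for new samples follows from projecting the inputs onto T and propagating through the inner relation and the output loadings Q.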
The relatively poor predictions of EC50 obtained by the chemometric models are in contrast to the good predictions reported in the literature for some physical properties of ILs, for example viscosity [12][13][14]. A possible explanation is the high dispersion of the experimental EC50 data involved in the measurement of the biological effect of ILs.

Fig. 1 - Comparison between the test samples for measured ln(EC50) concentrations and the principal component regression model predictions
In order to relax the modelling assumptions of continuity and linearity between molecular descriptors and biological effects, decision tree (DT) and random forest (RF) models are applied [11][15][18]. These are nonparametric models and are not based on assumed functional relationships between the input and output data. A decision tree model is built by a supervised procedure of stepwise classification of the input data, in which binary splits produce subsets with "improved", i.e. more significant, information content (information gain). The splits are obtained by minimisation of the Gini index or the pattern entropy. The resulting models are not given in closed mathematical form but as a set of logical statements, which can easily be represented graphically as a tree of stepwise decisions. When a DT model is used for regression, the numerical range of the output data is approximated by pseudo-classes for an assumed precision of the regression predictions. Here the Breiman and Cutler algorithm [15], available in the R software system together with tree plotting, is applied [16][17][18].
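The binary split by minimisation of the Gini index can be sketched for a single descriptor as follows. This is a plain-Python illustration, not the Breiman and Cutler implementation; the function names, the midpoint threshold rule and the example values are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (threshold, impurity) minimising the size-weighted Gini impurity
    of the two subsets obtained by splitting one descriptor at a threshold.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_impurity = None, gini(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a threshold
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = (lo + hi) / 2.0, impurity
    return best_threshold, best_impurity

# Hypothetical descriptor scores with two toxicity classes.
scores_demo = [0.1, 0.4, 0.5, 1.2, 1.5, 1.9]
classes_demo = ["L", "L", "L", "H", "H", "H"]
threshold_demo, impurity_demo = best_binary_split(scores_demo, classes_demo)
print(threshold_demo, impurity_demo)
```

A full tree repeats this search over all descriptors at every node and recurses into the two subsets until a stopping rule is met.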

Single decision tree prediction models tend to be biased, but modelling can be improved by repeatedly re-initialising collections of trees with a randomised split algorithm, producing a random forest. The prediction of a random forest is obtained by aggregating the individual trees, with each response weighted according to the cross-validation of the individual tree.
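The randomisation and aggregation idea can be sketched in plain Python as a small forest of regression trees grown on bootstrap samples, with a random subset of descriptors examined at each split. This is a simplified illustration: it uses an unweighted average of the trees instead of the cross-validation weighting mentioned above, and the function names, tree depth, leaf size, forest size and synthetic data are all assumptions.

```python
import random

def avg(ys):
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = avg(ys)
    return sum((y - m) ** 2 for y in ys)

def build_tree(rows, ys, rng, n_try, depth=0, max_depth=4, min_leaf=2):
    """Grow a regression tree; each node examines n_try randomly chosen features."""
    if depth >= max_depth or len(ys) < 2 * min_leaf:
        return avg(ys)  # leaf: mean response
    best = None
    for j in rng.sample(range(len(rows[0])), n_try):
        for t in sorted({r[j] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[j] <= t]
            right = [i for i, r in enumerate(rows) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, j, t, left, right)
    if best is None:
        return avg(ys)
    _, j, t, left, right = best
    grow = lambda idx: build_tree([rows[i] for i in idx], [ys[i] for i in idx],
                                  rng, n_try, depth + 1, max_depth, min_leaf)
    return (j, t, grow(left), grow(right))

def predict_tree(tree, row):
    """Follow the split rules down to a leaf value."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def fit_forest(rows, ys, n_trees=50, n_try=1, seed=0):
    """Grow each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [ys[i] for i in idx], rng, n_try))
    return forest

def predict_forest(forest, row):
    """Unweighted average of the individual tree predictions."""
    return avg([predict_tree(tree, row) for tree in forest])

# Hypothetical data: the response steps from about 1 to about 3 at feature 0 = 0.5.
demo_rng = random.Random(1)
rows_demo = [[demo_rng.random(), demo_rng.random()] for _ in range(60)]
ys_demo = [(1.0 if r[0] < 0.5 else 3.0) + demo_rng.gauss(0.0, 0.1) for r in rows_demo]
forest_demo = fit_forest(rows_demo, ys_demo, n_trees=50, n_try=1, seed=0)
low_demo = predict_forest(forest_demo, [0.1, 0.5])
high_demo = predict_forest(forest_demo, [0.9, 0.5])
print(low_demo, high_demo)
```

Averaging over many randomised trees smooths out the dependence of any single tree on its particular sample and split choices, which is the motivation for the aggregation step.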

The results are presented in Figs. 3-4. Prediction of ln(EC50) by the random forest model is greatly improved, with Pearson correlation R² = 0.992. An individual decision tree for the classification of IL toxicity is depicted in Fig. 5. For acetylcholinesterase, the following classes were adopted here, according to the MERCK classification [7]: low (L), EC50 < 10 μmol L⁻¹; medium (M), EC50 ∈ [10 - 100 μmol L⁻¹]; high (H), EC50 ∈ [100 - 1000 μmol L⁻¹]; and very high (VH), EC50 > 1000 μmol L⁻¹. The use of uncorrelated principal components of the molecular descriptor sets has resulted in a simple and transparent model.

Fig. 2 - Comparison between the test samples for measured ln(EC50) concentrations and the partial least squares model predictions
Fig. 3 - Comparison between the test samples for measured ln(EC50) concentrations and the random forest model predictions

Conclusions
Chemometric and decision tree models of IL toxicity based on molecular descriptors are applied. The toxicity criterion is based on EC50 concentrations for the inhibition of acetylcholinesterase. In view of the very large number of molecular descriptors, their collinearity was investigated, and a significant average correlation of R² ≈ 0.4 was found. In order to simplify the models and make them robust, the matrices of cation and anion descriptors are projected onto the corresponding spaces of the first ten eigenvectors, which retain about 99.5 % of the variance (data dispersion content).
Application of the chemometric models, principal component regression and partial least squares, resulted in limited quality of prediction on the test sets, with R² of 0.62 and 0.64. However, application of the decision tree and random forest models significantly improved the quality of prediction, with R² = 0.992. Randomisation and aggregation of a large population (500 trees) resulted in a model with low overfitting and unbiased estimates (apart from possible bias in molecule selection). Owing to the orthogonalisation of the training patterns, the derived decision trees are simple and transparent. A practical application of the derived models is their potential use as part of a feedback loop for the inverse design of new ILs for specific (tailored) needs of new process technology.

ACKNOWLEDGMENT
This work was supported by the Ministry of Science, Education and Sports of the Republic of Croatia, project 058-1252086-0589

List of symbols
P - principal components by SVD
PCR - principal component regression
PLS - partial least squares
Q - output projection PLS matrix
RFDT - random forest decision trees
SVD - singular value decomposition
T - projection PLS matrix
U - projection PLS matrix
U - input projection PLS matrix
v - eigenvectors
X - matrix of input data
Y - vector of output data
β - vector of model parameters
λ - eigenvalues
σ - standard deviation

References