An Improved Prediction Model for Zinc-Binding Sites in Proteins Based on Bayesian Method

Zinc is the second most abundant metal ion in organisms, and zinc-binding proteins perform important biological functions. However, few studies have integrated the existing tools for predicting zinc-binding sites in proteins. To fill this gap, this paper combines three well-known prediction tools into an improved model, IBayes_Zinc, which exploits the ability of the Bayesian method to handle incomplete or partially missing data. Specifically, the prediction scores of three existing sequence-based prediction tools were taken as attributes, and the missing values were padded, forming an integrated classification tool. Then, the probabilities that a sample is positive or negative were computed, and the sample was assigned to the class with the higher probability. Experiments were conducted on a non-redundant training dataset and an independent testing dataset. The results show that our method surpassed the other three methods by nearly 5–13% in Matthews correlation coefficient (MCC) and outperformed them in the sum of recall and precision. The research findings promote the detection of zinc-binding sites in protein sequences and the identification of metalloprotein functions.


INTRODUCTION
Life science research in the post-genomic era has shown that the interaction between proteins and metal ions is indispensable to life activities [1,2]. The products of this interaction, known as metalloproteins, have important biological functions, and research into the interaction can greatly promote disease treatment and drug development [3,4]. In the Protein Data Bank (PDB) [5], more than one third of proteins bind metal ions. Among these ions, zinc is the second most abundant in organisms, behind only iron. The binding of the zinc ion to proteins facilitates catalytic activity and stabilizes the protein structure [6]. As a result, the zinc-binding sites in proteins have become a research hotspot in computational biology.
Traditionally, the zinc-binding sites in proteins are identified by biochemical and biophysical experiments, using mass spectrometry (MS) [7,8], high-throughput X-ray absorption spectroscopy (HT-XAS) [9], and isothermal titration calorimetry (ITC) [10]. However, these experimental methods are too costly and time-consuming to detect all zinc-binding sites at the proteome level. To overcome these limitations, machine learning algorithms have been applied to predict the zinc-binding sites, including the Bayesian algorithm, decision tree, support vector machine (SVM), random forest, neural network (NN), regression, and hidden Markov model. These methods can identify the zinc-binding sites effectively and accurately.
The remainder of this paper is organized as follows: Section 2 reviews the existing studies in the related fields; Section 3 presents the materials and methods; Section 4 discusses the results of the experiments; Section 5 puts forward the research conclusions.

RELATED WORK
The prediction precision of zinc-binding sites varies with the feature vectors, which are derived from protein attributes and fed to the prediction algorithms. Some prediction methods for zinc-binding sites rely on features extracted from protein sequences. For example, Passerini et al. [11] combined the SVM and a bidirectional recurrent NN into a two-stage prediction algorithm, which achieved 73% precision and 61% recall in predicting His and Cys sites. Subsequently, they developed ZincFinder [12], an SVM-based automated tool to identify zinc-binding sites. Shu et al. [13] designed ZincPred based on the SVM and homology-based methods, and predicted the four residue types Cys, His, Glu and Asp (CHED) at 75% precision and 50% recall with this tool. Chen [14] integrated the SVM, cluster-based and template-based predictors into ZincExplorer, which identified CHED at 86% precision and 70% recall.
Some prediction methods have been proposed based on structural attributes. For instance, Zhao et al. [15] presented 3D template-based metal site prediction (TEMSP), which has a sensitivity of 60% and a selectivity of 80%. He et al. [16] created the mFASD method to recognize varied metal-binding sites, and proved its excellent performance.
There are some other computational prediction tools [17-21] that can determine the binding sites easily. In addition, some prediction tools have been proposed based on multiple types of protein features to improve the prediction precision [22,23]. However, feature optimization has a limited effect on prediction precision, because the features of a protein are fixed.
To solve this problem, several machine learning algorithms have been integrated with the existing prediction tools. Considering the wide availability of sequence-based information on existing proteins and the restrictions on the acquisition of structural data, this study focuses on three well-known prediction tools based on sequence features [12-14], with the aim of improving prediction precision and robustness. Reference [24] applies linear regression to integrate the three tools and achieves good prediction results, but the integrated method merely fills a special value into the missing attribute values. Later, Reference [25] establishes a prediction model for zinc-binding sites based on the Bayesian method, which is insensitive to missing data, requires few estimation parameters, and offers high classification efficiency. In this paper, an improved prediction model for zinc-binding sites is developed based on the Bayesian algorithm to better fill and process the missing attribute values, drawing on the ability of Bayesian decision theory to handle incomplete or partially missing data.

Dataset
The dataset proposed by Zhao et al. [15] (Zhao_dataset) was taken as the training dataset. This is a unified open standard dataset popular among researchers. To prevent over-fitting, redundant protein sequences were removed, leaving 392 protein chains. The retained chains contain 2,023 zinc-binding sites and 14,493 non-zinc-binding sites.
To verify the precision and robustness of our method, the zinc-binding protein complexes submitted to the PDB in recent years were adopted as the testing dataset. To remove unreliable, repetitive, short and non-binding chains, only structures determined by X-ray diffraction were considered, the upper limit of structure resolution was set to 2.5 Å, and the peptide chains with homology higher than 70% and sequence redundancy less than 20% were eliminated. Finally, the testing dataset (Final_dataset) includes 213 randomly selected protein chains, containing 1,017 zinc-binding sites and 10,148 non-zinc-binding sites.

Improved Prediction Model Based on Bayesian Method
First, each residue of every protein chain was scored by ZincExplorer, ZincFinder, and ZincPred, and the results were recorded as a vector of three fractional values $x_i$ ($i = 1, 2, 3$), with $x_i \in [0, 1]$. In Reference [24], some sites in the protein sequence have no predictive score. Considering the extremely small ratio of actual binding sites, the predictive scores of these sites were set to 0, i.e. these sites were considered non-binding by default. Assuming that the attributes are mutually independent, the three scores were used as the attribute variables of the Bayesian classifier, whose advantages were fully utilized to tackle the missing values of some attributes.
The class variable cutoff was defined to reflect whether a site is zinc-binding or not. Then, the classification model IBayes_Zinc was constructed based on the independence of attributes and class variable. The model framework is shown in Fig. 1.
If some sites in a protein sequence have no predictive score, then some attribute values must be missing. In the case that all attribute values are missing, these data should not be used to compute probability. In fact, the probability computation depends on the number of attribute values appearing in the training samples, rather than the total number of the training samples. In the case that some attribute values are missing, the missing values should be padded. On this basis, a Bayesian model can be constructed to train and predict the binding sites.
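The handling of missing scores described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pad_scores`, the use of `None` to mark a missing score, and the fill value of 0 (the non-binding default mentioned earlier) are all choices made for the example.

```python
from typing import List, Optional, Sequence

def pad_scores(rows: Sequence[Sequence[Optional[float]]],
               fill: float = 0.0) -> List[List[float]]:
    """Drop residues with no score from any tool; pad partial gaps with `fill`.

    Each row holds the scores of one residue from the three tools,
    with None standing for a missing score (an assumed encoding).
    """
    padded = []
    for row in rows:
        if all(score is None for score in row):
            # All attributes missing: excluded from probability computation.
            continue
        padded.append([fill if score is None else score for score in row])
    return padded

# Hypothetical scores for three residues.
rows = [[0.9, None, 0.7], [None, None, None], [0.1, 0.2, None]]
print(pad_scores(rows))  # [[0.9, 0.0, 0.7], [0.1, 0.2, 0.0]]
```

Residues dropped here do not contribute to any count, which matches the remark that the probability computation depends on the attribute values actually observed rather than on the total number of training samples.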
Let $x = (x_1, x_2, x_3)$ be a sample feature vector containing three fractional feature values, where, for each feature $j \in [1, 3]$, the sample with feature $j$ is judged either negative or positive, and cutoff is the class variable.
Let $\Omega$ be the total sample space and $N$ be the total number of samples. Obviously, $N$ is the sum of the number of positive samples $N_1$ and the number of negative samples $N_2$, as well as the sum of the number of samples predicted to be positive $n_1$ and the number of samples predicted to be negative $n_0$. Then, $n_1$ can be divided into the number of correctly predicted positive samples $n_{11}$ and the number of falsely predicted positive samples $n_{10}$, while $n_0$ can be divided into the number of correctly predicted negative samples $n_{00}$ and the number of falsely predicted negative samples $n_{01}$.
It is assumed that event $A_1$ indicates that samples are positive, event $A_0$ indicates that samples are negative, and $A_1 \cup A_0 = \Omega$; event $B^{(j)}_1$ indicates that the samples with feature $j$ ($j = 1, 2, 3$) are predicted as positive, and event $B^{(j)}_0$ indicates that the samples with feature $j$ are predicted as negative; event $A_1 B^{(j)}_1$ represents that the samples are positive, and the prediction is true; event $A_1 B^{(j)}_0$ represents that the samples are positive, but the prediction is false; event $A_0 B^{(j)}_1$ represents that the samples are negative, and the prediction is false; event $A_0 B^{(j)}_0$ represents that the samples are negative, and the prediction is true.
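As an illustration of how these joint events could be tallied for a single tool, the sketch below counts the four outcomes at a given cutoff. It assumes that a score greater than or equal to the cutoff means "predicted positive" (the direction of the comparison is an assumption), and the function name `count_events` and the dictionary keys are hypothetical labels chosen to mirror the counts defined above.

```python
from typing import Dict, Sequence

def count_events(scores: Sequence[float], labels: Sequence[int],
                 cutoff: float) -> Dict[str, int]:
    """Tally the joint events for one tool at one cutoff.

    n11: positive sample, predicted positive (A1 B1)
    n01: positive sample, predicted negative (A1 B0)
    n10: negative sample, predicted positive (A0 B1)
    n00: negative sample, predicted negative (A0 B0)
    """
    counts = {"n11": 0, "n10": 0, "n01": 0, "n00": 0}
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff  # assumed decision direction
        if label == 1:
            counts["n11" if predicted_positive else "n01"] += 1
        else:
            counts["n10" if predicted_positive else "n00"] += 1
    return counts

# Hypothetical scores and true labels for four residues.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
counts = count_events(scores, labels, cutoff=0.5)
print(counts)  # {'n11': 1, 'n10': 1, 'n01': 1, 'n00': 1}
```

Dividing such counts by the number of positive or negative samples gives frequency estimates of the conditional probabilities of each prediction event given the true class.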

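Algorithm 1.1 (closing steps)
Else
    Output it is a non-binding site.
End if
End
In Algorithm 1.1, the decision is made based on the probability for each sample. If $P_1 \ge P_0$, where $P_1$ and $P_0$ denote the computed probabilities that the sample is positive and negative, respectively, the sample is determined as positive; otherwise, the sample is determined as negative.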
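The decision rule can be illustrated with a generic naive Bayes sketch. Since only the closing steps of Algorithm 1.1 are available here, this is not a reconstruction of the authors' algorithm: it simply multiplies a class prior by per-tool conditional probabilities under the independence assumption and compares the two products. The function names and all probability values are hypothetical.

```python
from typing import Sequence, Tuple

def class_scores(votes: Sequence[int], prior_pos: float,
                 p_pred_pos_given_pos: Sequence[float],
                 p_pred_pos_given_neg: Sequence[float]) -> Tuple[float, float]:
    """Return unnormalized probabilities (positive, negative) for one residue.

    votes[j] is 1 if tool j predicts the residue as binding, else 0.
    """
    p_pos = prior_pos
    p_neg = 1.0 - prior_pos
    for vote, given_pos, given_neg in zip(votes, p_pred_pos_given_pos,
                                          p_pred_pos_given_neg):
        # Independence assumption: multiply the per-tool likelihoods.
        p_pos *= given_pos if vote else 1.0 - given_pos
        p_neg *= given_neg if vote else 1.0 - given_neg
    return p_pos, p_neg

def is_binding(votes: Sequence[int], prior_pos: float,
               p_pred_pos_given_pos: Sequence[float],
               p_pred_pos_given_neg: Sequence[float]) -> bool:
    """Assign the class with the higher probability; ties go to positive."""
    p_pos, p_neg = class_scores(votes, prior_pos, p_pred_pos_given_pos,
                                p_pred_pos_given_neg)
    return p_pos >= p_neg

# Made-up values for illustration only.
given_pos = [0.8, 0.7, 0.6]    # P(tool j predicts positive | site is positive)
given_neg = [0.05, 0.1, 0.02]  # P(tool j predicts positive | site is negative)
print(is_binding([1, 1, 0], 0.1, given_pos, given_neg))  # True
print(is_binding([0, 0, 0], 0.1, given_pos, given_neg))  # False
```

Because the two products share the same normalizing constant, comparing them directly is equivalent to comparing the posterior probabilities.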

Evaluation Criteria
The following indices were selected to evaluate the effect of the proposed model:
$$\text{Recall} = \frac{n_{11}}{N_1}, \qquad \text{Precision} = \frac{n_{11}}{n_1},$$
$$\text{MCC} = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{(n_{11} + n_{10})(n_{11} + n_{01})(n_{00} + n_{10})(n_{00} + n_{01})}},$$
where recall is a quantity descriptor equaling the proportion of the number of correctly predicted positive samples to the number of positive samples; precision is a quality descriptor equaling the proportion of the number of correctly predicted positive samples to the number of samples predicted to be positive; MCC $\in [-1, 1]$ (Matthews correlation coefficient) is an indicator of the correlation between the predicted and the actual classification (if MCC = 1, the prediction is completely correct; if MCC = 0, the prediction is no better than random; if MCC = −1, the prediction is completely contradictory).
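The three indices can be computed directly from the four counts of correct and false predictions. The sketch below is a minimal illustration; the function names are hypothetical, the example counts are made up, and returning 0 when a denominator is zero is a convention assumed here rather than something stated in the paper.

```python
from math import sqrt

def _ratio(numerator: float, denominator: float) -> float:
    """Safe division; returns 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0

def recall(n11: int, n01: int) -> float:
    """Correctly predicted positives over all positive samples."""
    return _ratio(n11, n11 + n01)

def precision(n11: int, n10: int) -> float:
    """Correctly predicted positives over all samples predicted positive."""
    return _ratio(n11, n11 + n10)

def mcc(n11: int, n10: int, n01: int, n00: int) -> float:
    """Matthews correlation coefficient from the four counts."""
    denominator = sqrt((n11 + n10) * (n11 + n01) * (n00 + n10) * (n00 + n01))
    return _ratio(n11 * n00 - n10 * n01, denominator)

# Made-up counts: 50 true positives, 10 false positives,
# 20 false negatives, 920 true negatives.
print(round(recall(50, 20), 3))     # 0.714
print(round(precision(50, 10), 3))  # 0.833
print(mcc(10, 0, 0, 10))            # 1.0 for a perfect prediction
```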

RESULTS AND DISCUSSION
Performance Analysis
Based on the Zhao_dataset, the tools ZincExplorer, ZincFinder, ZincPred and IBayes_Zinc were tested and evaluated separately. Across the whole cutoff interval, the MCC and recall of IBayes_Zinc were higher than those of the other three methods. As for precision, IBayes_Zinc exceeded ZincExplorer and ZincPred, but stayed slightly below ZincFinder.
To further compare these tools, the sum of precision and recall was calculated. As shown in Table 1, IBayes_Zinc was still superior to the three competing methods. Thus, IBayes_Zinc outperformed the other three methods in overall performance. When recall was 70% (the general level for actual recognition of zinc-binding sites), the precision of IBayes_Zinc was 2-15% higher than that of the other three methods. To sum up, IBayes_Zinc is more accurate than the other methods in predicting zinc-binding sites.

Performance Analysis on an Independent Testing Dataset
The precision and robustness of the four tools were further compared on Final_dataset, an independent testing dataset. The results show that IBayes_Zinc achieved a higher MCC than the other three methods across the cutoff interval, except for cutoffs between 0.2 and 0.28. Since such a small cutoff is not usually adopted to judge an instance, the exception does not affect the overall comparison of these methods.
The mean MCC of IBayes_Zinc was 0.5369 over the whole interval, about 13%, 5%, and 9% higher than that of ZincExplorer, ZincFinder, and ZincPred, respectively. The performance index curves are shown in Fig. 2. Statistical analysis of the other performance indices (Tab. 2) shows that the mean precision of IBayes_Zinc reached 75.67%, nearly 11% and 8% higher than that of ZincExplorer and ZincPred, respectively, and about 6% lower than that of ZincFinder. The mean recall of IBayes_Zinc was 45.6%, nearly 12% and 3% higher than that of ZincFinder and ZincPred, respectively, and 4% lower than that of ZincExplorer. IBayes_Zinc also outperformed the other methods in the sum of precision and recall. Overall, IBayes_Zinc is superior to the other three methods.
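A mean MCC over a cutoff interval, as reported above, can be obtained by sweeping the cutoff and averaging the per-cutoff values. The sketch below shows one way to do this; the cutoff grid, the toy scores, and the function names are assumptions for illustration, and the paper's actual interval and step size are not reproduced here.

```python
from math import sqrt
from typing import Sequence

def mcc_at_cutoff(scores: Sequence[float], labels: Sequence[int],
                  cutoff: float) -> float:
    """MCC of the rule `score >= cutoff` against the true labels."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= cutoff
        if predicted_positive and label:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0 when any marginal count is empty.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0

def mean_mcc(scores: Sequence[float], labels: Sequence[int],
             cutoffs: Sequence[float]) -> float:
    """Average the MCC over a grid of cutoffs."""
    values = [mcc_at_cutoff(scores, labels, c) for c in cutoffs]
    return sum(values) / len(values)

# Toy data and a hypothetical grid of cutoffs 0.4, 0.5, 0.6, 0.7.
scores = [0.95, 0.8, 0.3, 0.1, 0.05]
labels = [1, 1, 0, 0, 0]
cutoffs = [c / 100 for c in range(40, 80, 10)]
print(mean_mcc(scores, labels, cutoffs))  # 1.0
```

Averaging over the interval rewards a tool that stays accurate across many thresholds rather than one tuned to a single cutoff.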

Case Study
Se-Met-Ampd derivative is a protein complex (RCSB: 2Y28) with a crystal structure that contains three protein chains A, B, and C. In this paper, 2Y28 (Chain A), denoted as 2Y28_A, is selected as the visual instance to verify the predictive power of IBayes_Zinc. Fig. 3 displays the visualization results of zinc-binding sites identified by four different tools, namely, IBayes_Zinc, ZincExplorer, ZincFinder and ZincPred. In 2Y28_A, there are three residues, HIS/34, HIS/154, and ASP/164 (marked in blue), binding to zinc ions (marked as magenta balls). IBayes_Zinc correctly identified all three zinc-binding sites. ZincExplorer also predicted the three sites, but produced a false positive at CYS/108 (marked in red). ZincFinder only identified two sites, HIS/154 and ASP/164, failing to find HIS/34. ZincPred predicted the three sites, but produced a false positive at HIS/96 (marked in red).
Tab. 3 compares the prediction results of the four tools on zinc-binding sites in 2Y28_A. Note: Y means the residue is zinc-binding; N means the residue is non-zinc-binding; None means the absence of a falsely predicted positive site.

CONCLUSION
The proteins binding to zinc ions have important biological functions, making it critical to identify the zinc-binding sites in proteins. Considering the availability of protein sequence information, this paper develops a robust and precise prediction model (IBayes_Zinc) for zinc-binding sites based on the Bayesian method, which can effectively handle incomplete or missing data. Specifically, the prediction scores of three existing sequence-based prediction tools were adopted, and the missing values were padded, forming an integrated classification tool. Then, the probabilities that a sample is positive or negative were computed, and the sample was assigned to the class with the higher probability. Experiments were conducted on a non-redundant training dataset and an independent testing dataset. The results show that our method surpassed the other three methods by nearly 5–13% in MCC and outperformed them in the sum of recall and precision.