Distributed Representation of Protein Sequence Based on Multi-Alignment Results

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. In this paper, we propose a distributed representation model of protein sequences that incorporates evolutionary information by introducing multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. Then a k-mer amino acid "biology corpus" is extracted from the alignment results, which are rich in evolutionary information. Using this "biology corpus", distributed embedding vectors of k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer amino acid embedding vectors with those derived from BLOSUM62; their Pearson coefficient is 0.937, showing a strong correlation. We then applied the obtained amino acid embedding representation to protein secondary structure recognition and solubility prediction. In both experiments, our alignment-based amino acid distributed representation outperforms the representation derived directly from protein sequences. Moreover, compared with existing up-to-date algorithms, our method obtains better or comparable results while using only our amino acid distributed vectors as features.


INTRODUCTION
Proteins are polypeptide chains composed of amino acids and are among the most important components of living organisms. Proteins are involved in almost all processes within organisms, including catalysing metabolic reactions, DNA replication, responding to stimuli, providing structure to cells and organisms, and transporting molecules from one location to another [1]. The biological function of a protein depends on its tertiary structure, which is generally believed to be determined by its primary structure, i.e., its amino acid sequence [2]. However, experimental techniques for protein structure determination, such as X-ray crystallography and NMR, are extremely expensive and time consuming, resulting in a gap between the number of known protein sequences and the number of determined protein structures. Therefore, sequence analysis of proteins is of great significance.
The first problem of sequence analysis is how to represent the sequence. Traditional protein sequence representation methods mainly include orthogonal coding and profile methods. Orthogonal coding, also known as one-hot representation, represents each amino acid by a 21-bit binary vector in which only one bit is '1', and the position of that bit corresponds to the amino acid type. It is natural and simple, but it cannot capture the relationships between different amino acids. The Position-Specific Scoring Matrix (PSSM) is a well-established and widely used profile method that contains evolutionary information. However, it is computationally expensive because multi-alignment is needed for each protein sequence. With the development of deep learning, distributed representation methods such as word2vec [3] and GloVe [4] have achieved great success in the Natural Language Processing field, where they have received great attention and been widely used [5,6]. Naturally, the protein sequence has been treated as a biological "sentence", and "amino acid vectors" obtained by word2vec have been used to represent protein sequences for further protein studies [7,8]. However, evolutionary information is not included in these existing methods, although it is very important for protein analysis. To address the abovementioned problems, this paper develops an amino acid distributed representation model that incorporates evolutionary information by introducing multi-alignment results.
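As a minimal illustration of the orthogonal (one-hot) coding described above, the following Python sketch maps each residue to a 21-dimensional binary vector. The alphabet ordering and the use of "X" as the 21st symbol are assumptions made for this example; the text does not specify them.

```python
# Illustrative alphabet: the 20 standard amino acids plus one extra symbol.
# The ordering and the choice of "X" as the 21st symbol are assumptions.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(sequence):
    """Return one 21-dimensional binary vector per residue."""
    vectors = []
    for aa in sequence:
        vec = [0] * len(ALPHABET)
        vec[INDEX.get(aa, INDEX["X"])] = 1  # unknown residues map to "X"
        vectors.append(vec)
    return vectors

encoded = one_hot("MKV")
```

Any two distinct residues have a dot product of zero under this coding, which is why it cannot express similarity between amino acids.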
To test the effectiveness of the proposed method, we applied it to protein secondary structure recognition and solubility prediction. Numerical results show that our method outperforms the existing representation methods.

METHODS
For protein sequence analysis, we propose a new embedding representation method based on the distributed embedding method word2vec, which introduces evolutionary information by using multi-alignment results to construct the "biology corpus". In this section, the data processing and the method are described in detail.

Data Retrieval and Pre-Processing
To construct the protein dataset, we downloaded 160,000 non-redundant protein sequences from NCBI's FTP site. After that, multi-alignment of each protein sequence was performed using BLASTp [9], which can be accessed at https://blast.ncbi.nlm.nih.gov/Blast.cgi. We set the parameters of BLASTp to output only the 20 most similar protein sequences for each protein sequence. Then, for each sequence in the dataset, a multi-alignment profile was obtained (an example is shown in Fig. 1). To keep the profiles representative, those with fewer than 10 sequences were discarded, and finally 40,968 multiple-sequence alignment profiles remained.

Word2vec Distributed Representation Method

Word2vec Introduction
The term "word embedding" was originally introduced by Bengio et al. in 2003 [10]. Unlike earlier vector space models, a word embedding is trained in a neural language model together with the model's parameters. However, it was not until word2vec was proposed by Google in 2013 that word embedding models attracted widespread attention from researchers and developed rapidly. For example, a year later Stanford developed GloVe [4], Facebook proposed FastText in 2016 [11], and Peters et al. introduced ELMo in 2018 [12]. These models have been widely used in various natural language processing problems and have become standard word representation methods. In this paper, we chose word2vec as the amino acid embedding model, so it is described in detail below.
Word2vec is a shallow two-layer neural network that produces word embeddings for better word representation. It is based on the hypothesis that a word can be described by its context, or vice versa, which leads to two network frameworks: CBOW and Skip-gram. In CBOW, the inputs are the embedding vectors of the context words and the output is the central word, while Skip-gram is the reverse. Taking Skip-gram as an example, suppose we have a sentence $\{w_1, w_2, \dots, w_{i-1}, w_i, w_{i+1}, \dots, w_n\}$, where $w_i$ is the i-th word in the sentence. Then for each word, say $w_i$, a sample can be extracted by setting it as the central word and its neighbors within a window of size t as its context, namely $(w_i;\, w_{i-t}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+t})$ (see Fig. 2a).
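The extraction of Skip-gram samples can be sketched in Python as follows. This is an illustration rather than the word2vec implementation itself; the function name is hypothetical, and truncating the window at the sentence boundaries is an assumption.

```python
def skipgram_samples(sentence, t):
    """Return (central_word, context_words) for every position in the sentence."""
    samples = []
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - t):i]    # up to t words before the center
        right = sentence[i + 1:i + 1 + t]   # up to t words after the center
        samples.append((center, left + right))
    return samples

samples = skipgram_samples(["w1", "w2", "w3", "w4", "w5"], t=1)
```

With t = 1, the sample for "w3" pairs it with the context ["w2", "w4"], while the first and last words have only one neighbor each.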
The samples described above, each consisting of a central word paired with its context words, can be considered "positive samples", and all pairs that do not match in this way are "negative samples". Obviously, the numbers of "positive samples" and "negative samples" are severely unbalanced. Word2vec offers two techniques to address this problem, namely hierarchical softmax and negative sampling [13,14]. In negative sampling, only a small number of "negative" samples are randomly selected to train the model instead of the whole "negative" sample set. In hierarchical softmax, the output layer uses a Huffman tree to represent the whole vocabulary, with each word corresponding to a leaf node of the tree and the path from the root to that leaf representing the word (see Fig. 2b). In this paper, the Skip-gram framework and the hierarchical softmax method are used; for details about word2vec, see [3].
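To make the Huffman tree used by hierarchical softmax more concrete, the sketch below builds root-to-leaf paths for a tiny vocabulary in Python. It only illustrates the tree construction, not the training of the output layer; the function name and the word frequencies are invented for the example.

```python
import heapq
from itertools import count

def huffman_codes(frequencies):
    """Map each word to its root-to-leaf path (a string of 0/1 branch choices)."""
    tie = count()  # tie-breaker so heapq never has to compare the dict payloads
    heap = [(freq, next(tie), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical frequencies, for illustration only.
codes = huffman_codes({"A": 50, "L": 30, "C": 15, "W": 5})
```

Frequent words end up close to the root and therefore have short paths, so evaluating them requires fewer binary decisions than a full softmax over the vocabulary.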

K-mer Amino Acids Content Corpus Construction
Unlike the existing method that directly uses protein amino acid sequences as the "biology sentence" [15], we construct the "biology corpus" from the multi-alignment results, which include rich evolutionary information. We consider only the 20 standard types of amino acids in this paper. As the alignment results include the "indel" symbol, the generalized amino acids have 21 types. Then for k-mer amino acids, the size of the vocabulary is $21^k$. Tab. 1 is an example of multi-alignment results. The target protein sequence is $A_1^{(0)}, A_2^{(0)}, A_3^{(0)}, A_4^{(0)}$, the first obtained similar protein sequence is $A_1^{(1)}, A_2^{(1)}, A_3^{(1)}, A_4^{(1)}$, and so on.
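The vocabulary size can be checked with a short Python sketch that enumerates all k-mers over the 21 generalized amino acid symbols. Using "-" for the indel symbol and this particular ordering of the alphabet are assumptions of the example.

```python
from itertools import product

# 20 standard amino acids plus "-" for the indel symbol (symbol choice assumed).
GENERALIZED_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"

def kmer_vocabulary(k):
    """Enumerate every possible k-mer over the 21 generalized symbols."""
    return ["".join(p) for p in product(GENERALIZED_ALPHABET, repeat=k)]

vocab_sizes = {k: len(kmer_vocabulary(k)) for k in (1, 2, 3)}
```

For k = 1, 2, and 3 this gives 21, 441, and 9261 possible "words" respectively, so the vocabulary grows quickly with k.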
In the alignments, the rich evolutionary information is contained in the columns of the table. Naturally, we use the columns of the alignments to construct "biology sentences". However, the aligned sequences are highly homologous and many columns are nearly identical. Therefore, we filter the alignment results by removing those columns whose "information entropy" is too high. For column i, the "information entropy" is defined by:
$$H_i = -\sum_{j=1}^{21} p_j^{(i)} \log p_j^{(i)},$$
where $p_j^{(i)}$ is the frequency of the j-th type of amino acid in column i.
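The per-column entropy can be sketched in a few lines of Python. This is only an illustration: it assumes the standard Shannon entropy with a base-2 logarithm, represents the alignment as equal-length strings with "-" for indels, and uses a made-up toy alignment.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (base 2, an assumption) of the symbols in one column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

# Toy alignment, one row per aligned sequence (illustrative data only).
alignment = ["MKVA", "MKIA", "MRLA", "M-VA"]
columns = ["".join(col) for col in zip(*alignment)]
entropies = [column_entropy(col) for col in columns]
```

A fully conserved column such as "MMMM" has entropy 0, and the entropy grows as the column becomes more variable.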
As we want to obtain embeddings for the k-mer amino acid vocabulary, the "biology sentences" should consist of k-mer amino acids. So we define the "information entropy" of k consecutive columns in terms of $H_i$, the "information entropy" of the i-th column within the k consecutive columns. If the "information entropy" of the k consecutive columns is less than the threshold, samples are generated as follows: randomly select a k-mer amino acid as the target "word" from the n similar sequences in these k columns, and randomly select 2t other k-mers as its "context". This process can be repeated several times. Finally, a "biology corpus" for k-mer amino acids is derived from the multi-alignment profiles.
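The sampling procedure for one alignment profile can be sketched in Python as below. Several details are assumptions made only for this illustration: the entropy of a k-column window is taken as the mean of its column entropies, windows slide one column at a time, the logarithm is base 2, and the function names, threshold value, and toy alignment are hypothetical.

```python
import math
import random
from collections import Counter

def column_entropy(column):
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def window_samples(alignment, k, t, threshold, repeats=1, rng=None):
    """Return (target_kmer, context_kmers) pairs from low-entropy k-column windows."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    samples = []
    for start in range(len(alignment[0]) - k + 1):
        cols = ["".join(row[start + j] for row in alignment) for j in range(k)]
        # Assumption: combine the k column entropies by averaging.
        if sum(column_entropy(c) for c in cols) / k >= threshold:
            continue
        kmers = [row[start:start + k] for row in alignment]
        if len(kmers) < 2 * t + 1:
            continue  # not enough aligned sequences for a full context
        for _ in range(repeats):
            picked = rng.sample(kmers, 2 * t + 1)  # one target plus 2t context k-mers
            samples.append((picked[0], picked[1:]))
    return samples

alignment = ["MKVA", "MKIA", "MRLA", "M-VA", "MKVA"]
corpus = window_samples(alignment, k=2, t=1, threshold=1.0)
```

Each pair holds a target k-mer and 2t context k-mers drawn from other rows of the same window, which is the form a Skip-gram trainer consumes.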

FRAMEWORK OF ALIGNMENT BASED PROTEIN SEQUENCE DISTRIBUTED REPRESENTATION
The framework of the proposed method is shown in Fig. 3. We first download the protein data from NCBI, then use the BLAST program to analyze each protein's similarity to the other downloaded sequences. By removing sequences with high redundancy, we obtain a non-redundant protein sequence dataset. Next, we perform multi-alignment for each protein sequence and construct the "biology corpus" of k-mer amino acids from the alignment results. Then we use the word2vec method to train distributed embedding vectors of k-mer amino acids. In the paper, CBOW framework and hierarchical softmax method are used in word2vec.

RESULTS
First, we compare our 1-mer amino acid embedding vectors with BLOSUM62 to check the plausibility of the proposed embedding method. Then two experiments are carried out to verify the performance of our method, namely secondary structure recognition and solubility prediction from protein sequence. Several widely used existing methods are selected for comparison.

Comparison of 1-mer Amino Acids Distributed Representation with BLOSUM62
A common way to measure the similarity between different types of amino acids is the so-called "substitution matrix", of which PAM and BLOSUM are the most popular [16]. BLOSUM was originally introduced by Steven Henikoff et al. in 1992 [17]. Its main idea is to compare the likelihood of amino acid substitutions observed in homologous protein sequences with the likelihood expected from background frequencies. For example, BLOSUM62 is a score matrix built from aligned blocks of homologous protein sequences in which sequences sharing at least 62% identity are clustered together.
To verify the plausibility of our amino acid embedding method, we compared the result of our 1-mer amino acid distributed embedding representation with BLOSUM62. We calculate the similarity of each pair of amino acids as the dot product of their 1-mer embedding vectors, and the resulting similarity matrix has the same shape as BLOSUM62 (see Fig. 4). Both symmetric matrices are then flattened into vectors of dimension 210, i.e., the 20 × 21 / 2 unordered amino acid pairs including the diagonal. Fig. 5 shows the comparison of these two vectors after normalization. Their Pearson coefficient is 0.937, which indicates a strong correlation and supports the validity of our amino acid distributed embedding vector model.
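The comparison can be sketched in plain Python as follows. The embedding values below are toy placeholders, not the trained vectors, and the function names are hypothetical; the sketch only shows how a dot-product similarity matrix is flattened over unordered pairs and how the Pearson coefficient is computed.

```python
import math

def similarity_vector(embeddings, order):
    """Dot-product similarity for every unordered pair, including self-pairs."""
    values = []
    for i, a in enumerate(order):
        for b in order[i:]:
            values.append(sum(x * y for x, y in zip(embeddings[a], embeddings[b])))
    return values

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy 2-dimensional embeddings for three amino acids (placeholders only).
toy = {"A": [1.0, 0.2], "R": [-0.3, 1.1], "N": [0.1, 0.9]}
sims = similarity_vector(toy, ["A", "R", "N"])
```

Applied to all 20 amino acids, `similarity_vector` returns 210 values, which can then be passed to `pearson` together with the corresponding upper-triangle entries of BLOSUM62.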

Performance on Protein Secondary Structure Recognition

Model Introduction
Long Short-Term Memory (LSTM) is a very successful algorithm in natural language processing that can effectively capture long-range dependencies. Naturally, LSTM has been extended to amino acid sequence analysis to take full advantage of correlations between distant amino acids [18,19]. Therefore, a variant of LSTM, the Bi-directional LSTM, is used as the recognition model for protein secondary structure, as well as the protein solubility prediction model in the next subsection. The model takes our distributed amino acid vectors as input and outputs the protein secondary structure types or the protein solubility class accordingly.
The framework of the model is shown in Fig. 6. The input of the model is the k-mer amino acid sequence of the protein (for clarity, the 1-mer case is shown in Fig. 6). By looking up the pretrained k-mer amino acid distributed embedding vectors, each k-mer of the input is represented by a low-dimensional dense embedding vector, which is then fed into the Bi-directional LSTM model. The Bi-directional LSTM is followed by a fully connected layer and a final softmax layer that outputs the protein secondary structure type. ReLU (Rectified Linear Unit) is used as the activation function of the hidden layers in the model.

Experiment Setup
We select the Cull PDB data set used in [20] as the training and validation data, and the CB513 data set as the test data. Cull PDB contains 6128 non-homologous protein sequences after filtering. This data set was generated by the PISCES Cull PDB server, which is commonly used in protein structure prediction. To filter the raw data, we required a resolution better than 2.5 Å, a sequence identity of less than 30%, and a sequence length between 50 and 700. Moreover, to avoid bias in the training data, sequences in Cull PDB with more than 25% identity to those in CB513 were removed. Finally, 5534 sequences of Cull PDB remained, 5278 of which are used as training data and the other 256 as validation data.
The secondary structure is classified into 8 types, which can be computed from the PDB files by the DSSP program [21]. The input length is set to 700, the maximum sequence length; proteins shorter than 700 residues are padded with zeros to fill the input length.
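The zero padding can be illustrated with a short Python sketch. Padding on the right and the function name are assumptions of this example; the text only states that shorter proteins are padded with zeros up to length 700.

```python
MAX_LEN = 700  # fixed input length stated in the text

def pad_sequence(vectors, dim, max_len=MAX_LEN):
    """Right-pad a list of per-residue vectors with zero vectors up to max_len."""
    if len(vectors) > max_len:
        raise ValueError("sequence longer than the fixed input length")
    return vectors + [[0.0] * dim for _ in range(max_len - len(vectors))]

# A toy protein of 3 residues with 4-dimensional embeddings (placeholder values).
protein = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
padded = pad_sequence(protein, dim=4)
```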
In the experiment, our 1-mer and 3-mer distributed vectors are each used as the input of the Bi-directional LSTM model. For comparison, distributed embedding vectors derived directly from protein sequences are also used as the input of the same model. Moreover, existing popular algorithms are also applied to this data set, including SSpro8, RaptorX-SS8, SC-GSN, and LSTM large. RaptorX-SS8 uses a conditional neural field model [22], SC-GSN uses a convolutional generative stochastic network [23], and LSTM large uses a Bi-directional LSTM structure [24]. All of these methods use many other features, such as the position-specific scoring matrix (PSSM).

Comparison Results
Comparison results of protein secondary structure recognition on the CB513 dataset are listed in Tab. 2. It can be seen that our proposed methods are competitive with or superior to the other five methods in the table, especially the 1-mer based method. Compared with the result of the distributed embedding vectors derived directly from protein sequences ("Amino acid embedded based on sequential sequence" in the table), the 1-mer and 3-mer based methods improve by 8.35% and 5.98% respectively, which shows that our multi-alignment based distributed embedding method is more effective than the one derived from protein sequences. Compared with the other 4 existing methods, the 1-mer based method is better than all of them, while the 3-mer based method is slightly lower than LSTM large. Considering that our methods use only one type of feature, while the other existing methods use many other types of features (for example, RaptorX-SS8 uses PSSM features combined with the physicochemical properties of proteins), our distributed vectors perform well. Comparing our two sets of vectors, the 1-mer based method is superior to the 3-mer based one. As there are many "indels" in the alignment results, the frequencies of some types of 3-mer amino acids are very low, and some types do not appear at all. In our experiment, a total of 7,785,773 3-mer amino acids were produced, while 985 types were not covered. This is possibly the reason why the 1-mer based method beats the 3-mer based one.

Performance on Protein Solubility Prediction

Model Introduction
In this experiment, the model is almost the same as in the previous experiment. The only difference is that the output softmax layer has 2 nodes, indicating the soluble and insoluble classes, whereas in the previous experiment it has 8 nodes, indicating the 8 secondary structure types.

Experiment Setup
The data set used in this experiment is derived from the SOLP which is used in [25]. It contains a total of 8704 soluble proteins and 8704 insoluble proteins.
Similar to above, our produced 1-mer and 3-mer distributed vectors are respectively set as the input of Bidirectional LSTM model, and the distributed embedding vectors directly derived from protein sequence are also set as the input of the same model for comparison. Moreover, two existing up-to-date methods, namely PROSO [26] and SOL-Pro [25] are selected for further comparison. PROSO is a two-layered structure of logistic regression classifiers, while SOL-Pro is a two-stage support vector machine (SVM) architecture.

Comparison Results of Protein Solubility Prediction
Comparison results of protein solubility prediction are listed in Tab. 3. Compared with the result of the distributed embedding vectors derived directly from protein sequences, the 1-mer and 3-mer based methods improve by 6.08% and 3.76% respectively, showing that our multi-alignment based distributed embedding method is also superior to the one derived from protein sequences in this task. Compared with the other 2 existing methods, our methods are far better than PROSO, while they are slightly worse than SOL-Pro. However, SOL-Pro relies on a large number of sequence-based calculation results and prediction results, whereas our method obtains similar results using only one kind of feature. This implies the effectiveness of our proposed amino acid embedding method. The comparison between the 1-mer based method and the 3-mer based one is similar to that in the previous experiment.

CONCLUSION
In this paper, we proposed a distributed representation model of k-mer amino acid sequences based on multiple sequence alignment results. By constructing the "biology corpus" from the alignment profiles, we trained distributed representation vectors with the word2vec method, thus integrating evolutionary information into the embedding. Experimental results show that our proposed model performs better than the one derived directly from protein sequences. Using only this one type of feature, we obtain results similar to or even better than those of existing popular methods in both the protein secondary structure recognition and the protein solubility prediction experiments. This implies that our proposed model is sound and effective.