Technical gazette, Vol. 27 No. 4, 2020.
Original scientific paper
https://doi.org/10.17559/TV-20200417091724
Distributed Representation of Protein Sequence Based on Multi-Alignment Results
Siqi Wang
orcid.org/0000-0002-3591-0763
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Liu He
orcid.org/0000-0003-3749-3315
Intelligent Connected Vehicle Development Institute of China FAW Group Corporation, Changchun, China 130000
Shi Cheng
orcid.org/0000-0002-0527-8226
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Xiaohu Shi
orcid.org/0000-0002-5115-8137
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012; Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China 519041
Abstract
Protein sequence representation is a key problem in protein studies, especially for sequence-based models. In this paper, we propose a distributed representation model of protein sequences that incorporates evolutionary information by introducing multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. A k-mer amino acid "biology corpus" is then extracted from the alignment results, which are enriched in evolutionary information. Using this corpus, distributed embedding vectors of k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer amino acid embedding vectors with those derived from BLOSUM62; their Pearson correlation coefficient is 0.937, indicating a strong correlation. We then applied the obtained amino acid embedding representation to protein secondary structure recognition and solubility prediction. In both experiments, the proposed alignment-based amino acid distributed representation outperforms the representation derived directly from protein sequences. Moreover, compared with recent existing algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
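The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (kmer_sentences, pairwise_distances, pearson), the removal of gap characters, and the non-overlapping k-mer splitting with one "sentence" per reading offset are all assumptions made for illustration. The word2vec training step is omitted; the sketch only shows how aligned sequences might be turned into k-mer "sentences" and how embedding-derived pair distances could be correlated with reference distances.

```python
from itertools import combinations
from math import dist, sqrt


def kmer_sentences(aligned_sequence, k):
    """Turn one aligned sequence into k 'sentences' of non-overlapping k-mers.

    Assumption: gap characters ('-') are dropped before splitting, and one
    sentence is produced for each reading offset 0..k-1.
    """
    sequence = aligned_sequence.replace("-", "")
    return [
        [sequence[i:i + k] for i in range(offset, len(sequence) - k + 1, k)]
        for offset in range(k)
    ]


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of keys, in sorted key order."""
    keys = sorted(vectors)
    return [dist(vectors[a], vectors[b]) for a, b in combinations(keys, 2)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy usage: the vectors below are placeholders, not trained embeddings.
corpus = kmer_sentences("MKV-LAA", k=3)
toy_embeddings = {"A": [0.0, 0.0], "C": [3.0, 4.0], "D": [0.0, 0.0]}
toy_distances = pairwise_distances(toy_embeddings)
```

In a fuller version, the sentences would be passed to a word2vec trainer, and the list returned by pairwise_distances for the 20 amino acids would be compared via pearson against distances derived from a substitution matrix such as BLOSUM62.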
Keywords
distributed representation; embedding; protein sequence; word2vec
Hrčak ID: 242327
Publication date: 15.8.2020.