Original scientific paper

https://doi.org/10.17559/TV-20200417091724

Distributed Representation of Protein Sequence Based on Multi-Alignment Results

Siqi Wang (ORCID: orcid.org/0000-0002-3591-0763); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Liu He (ORCID: orcid.org/0000-0003-3749-3315); Intelligent Connected Vehicle Development Institute of China FAW Group Corporation, Changchun, China 130000
Shi Cheng (ORCID: orcid.org/0000-0002-0527-8226); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Xiaohu Shi (ORCID: orcid.org/0000-0002-5115-8137); Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012; Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China 519041


page 1237-1243


Abstract

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. In this paper, we propose a distributed representation model of protein sequences that incorporates evolutionary information by introducing multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. A k-mer amino acid "biology corpus" is then extracted from the alignment results, which are enriched in evolutionary information. Using this corpus, distributed embedding vectors for k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer embedding vectors with those derived from BLOSUM62; their Pearson correlation coefficient is 0.937, indicating a strong correlation. We then applied the resulting amino acid embedding representation to protein secondary structure recognition and solubility prediction. In both experiments, the alignment-based amino acid distributed representation outperforms the one derived directly from protein sequences. Moreover, compared with existing state-of-the-art algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
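
As a rough illustration of two steps named in the abstract, the Python sketch below turns an aligned sequence into k-mer "sentences" suitable for a word2vec-style trainer and computes a Pearson coefficient between two lists of pairwise distances. The abstract does not specify how the corpus is extracted, so the details here are assumptions: gap characters are dropped, and k-mers are taken as non-overlapping words in k shifted reading frames. The function names kmer_sentences and pearson are hypothetical and do not come from the paper.

```python
from math import sqrt


def kmer_sentences(sequence, k, gap="-"):
    """Turn one aligned sequence into up to k sentences of k-mer words.

    Assumption (not from the paper): gaps are removed, then the sequence
    is read as non-overlapping k-mers starting at offsets 0 .. k-1.
    """
    residues = sequence.replace(gap, "")
    sentences = []
    for offset in range(k):
        words = [
            residues[i:i + k]
            for i in range(offset, len(residues) - k + 1, k)
        ]
        if words:  # skip frames too short to hold a full k-mer
            sentences.append(words)
    return sentences


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy aligned row with one gap; real input would come from the alignments.
corpus = kmer_sentences("MKV-LAT", k=3)
```

The token lists in corpus could then be passed to any word2vec implementation. For the BLOSUM62 comparison, xs and ys would hold, for each amino acid pair, the distance between the two 1-mer embedding vectors and the corresponding matrix-derived distance; how those distances are defined is left open here.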

Keywords

distributed representation; embedding; protein sequence; word2vec

Hrčak ID:

242327

URI

https://hrcak.srce.hr/242327

Publication date:

15.8.2020.
