Distributed Representation of Protein Sequence Based on Multi-Alignment Results

Wang, Siqi; He, Liu; Cheng, Shi; Shi, Xiaohu

doi:10.17559/TV-20200417091724

Technical gazette, Vol. 27 No. 4, 2020.

Original scientific paper

https://doi.org/10.17559/TV-20200417091724

Siqi Wang orcid.org/0000-0002-3591-0763 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Liu He orcid.org/0000-0003-3749-3315 ; Intelligent Connected Vehicle Development Institute of China FAW Group Corporation, Changchun, China 130000
Shi Cheng orcid.org/0000-0002-0527-8226 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Xiaohu Shi orcid.org/0000-0002-5115-8137 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012; Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China 519041

page 1237-1243

cite

APA 6th Edition

Wang, S., He, L., Cheng, S. & Shi, X. (2020). Distributed Representation of Protein Sequence Based on Multi-Alignment Results. Tehnički vjesnik, 27 (4), 1237-1243. https://doi.org/10.17559/TV-20200417091724

Abstract

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. In this paper, we propose a distributed representation model of protein sequences that incorporates evolutionary information by introducing multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. A k-mer amino acid "biology corpus", enriched with evolutionary information, is then extracted from the alignment results. Using this corpus, distributed embedding vectors for k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer embedding vectors with those derived from BLOSUM62; the Pearson correlation coefficient between the two is 0.937, indicating a strong correlation. We then applied the resulting amino acid embedding representation to protein secondary structure recognition and solubility prediction. In both experiments, the alignment-based amino acid distributed representation outperforms the one derived directly from protein sequences. Moreover, compared with recent existing algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
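
The corpus-building and correlation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how k-mers are cut from the alignment, so the sketch assumes that each aligned row is stripped of gap characters and split into overlapping k-mers, one "sentence" per row. The names `kmer_sentence`, `build_corpus`, and `pearson` are hypothetical, and the word2vec training itself, which would normally be delegated to an existing library, is left out.

```python
from math import sqrt

# Assumption: '-' and '.' mark gaps in an aligned row.
GAP_CHARS = "-."


def kmer_sentence(aligned_row, k):
    """Strip gaps from one aligned row and return its overlapping k-mers."""
    residues = "".join(ch for ch in aligned_row.upper() if ch not in GAP_CHARS)
    return [residues[i:i + k] for i in range(len(residues) - k + 1)]


def build_corpus(alignment_rows, k):
    """Build one k-mer 'sentence' per aligned row; rows shorter than k are skipped."""
    corpus = []
    for row in alignment_rows:
        sentence = kmer_sentence(row, k)
        if sentence:
            corpus.append(sentence)
    return corpus


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Toy alignment rows, invented purely for illustration.
    toy_alignment = ["MK-TAY", "MKLTA-", "M--TAY"]
    for sentence in build_corpus(toy_alignment, k=2):
        print(" ".join(sentence))
```

In such a setup, the list returned by `build_corpus` would serve as the training sentences for a word2vec model, and `pearson` would be applied to two parallel lists of amino acid pair distances, one computed from the embedding vectors and one from a substitution matrix such as BLOSUM62.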

Keywords

distributed representation; embedding; protein sequence; word2vec

Hrčak ID: 242327

URI: https://hrcak.srce.hr/242327

Publication date: 15.8.2020.
