Distributed Representation of Protein Sequence Based on Multi-Alignment Results

Wang, Siqi; He, Liu; Cheng, Shi; Shi, Xiaohu

doi:10.17559/TV-20200417091724

Tehnički vjesnik, Vol. 27 No. 4, 2020.

Original scientific paper

https://doi.org/10.17559/TV-20200417091724

Siqi Wang orcid.org/0000-0002-3591-0763 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Liu He orcid.org/0000-0003-3749-3315 ; Intelligent Connected Vehicle Development Institute of China FAW Group Corporation, Changchun, China 130000
Shi Cheng orcid.org/0000-0002-0527-8226 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012
Xiaohu Shi orcid.org/0000-0002-5115-8137 ; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China 130012; College of Computer Science and Technology, Jilin University, Changchun, China 130012; Zhuhai Laboratory of Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, China 519041

Full text: English, PDF, 1,243 KB

pp. 1237-1243

Cite

APA 6th Edition

Wang, S., He, L., Cheng, S., & Shi, X. (2020). Distributed Representation of Protein Sequence Based on Multi-Alignment Results. Tehnički vjesnik, 27 (4), 1237-1243. https://doi.org/10.17559/TV-20200417091724

MLA 8th Edition

Wang, Siqi, et al. "Distributed Representation of Protein Sequence Based on Multi-Alignment Results." Tehnički vjesnik, vol. 27, no. 4, 2020, pp. 1237-1243. https://doi.org/10.17559/TV-20200417091724. Accessed 24 July 2026.

Chicago 17th Edition

Wang, Siqi, Liu He, Shi Cheng, and Xiaohu Shi. "Distributed Representation of Protein Sequence Based on Multi-Alignment Results." Tehnički vjesnik 27, no. 4 (2020): 1237-1243. https://doi.org/10.17559/TV-20200417091724

Harvard

Wang, S., et al. (2020). 'Distributed Representation of Protein Sequence Based on Multi-Alignment Results', Tehnički vjesnik, 27(4), pp. 1237-1243. https://doi.org/10.17559/TV-20200417091724

Vancouver

Wang S, He L, Cheng S, Shi X. Distributed Representation of Protein Sequence Based on Multi-Alignment Results. Tehnički vjesnik [Internet]. 2020 [cited 2026 Jul 24];27(4):1237-1243. https://doi.org/10.17559/TV-20200417091724

IEEE

S. Wang, L. He, S. Cheng and X. Shi, "Distributed Representation of Protein Sequence Based on Multi-Alignment Results", Tehnički vjesnik, vol. 27, no. 4, pp. 1237-1243, 2020. [Online]. https://doi.org/10.17559/TV-20200417091724

Abstract

Protein sequence representation is a key problem in protein studies, especially for sequence-based models. This paper proposes a distributed representation model of protein sequences that incorporates evolutionary information through multi-alignment results. First, we construct a non-redundant protein dataset and perform multi-alignment for each protein. A k-mer amino acid "biology corpus" is then extracted from the alignment results, which are enriched with evolutionary information. Using this corpus, distributed embedding vectors for k-mer amino acids are trained with the word2vec method. We compared the amino acid pair distances derived from our 1-mer embedding vectors with those derived from BLOSUM62; their Pearson correlation coefficient is 0.937, indicating a strong correlation. We then applied the resulting amino acid embedding representation to protein secondary structure recognition and solubility prediction. In both experiments, the alignment-based amino acid distributed representation outperforms the one derived directly from protein sequences. Moreover, compared with existing state-of-the-art algorithms, our method achieves better or comparable results while using only our amino acid distributed vectors as features.
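As a rough illustration of two steps named in the abstract, the sketch below builds k-mer "sentences" from an aligned sequence and computes the Pearson correlation between two sets of pairwise distances. It is not the authors' implementation. The abstract does not say how gaps are handled, how k-mers are framed, which distance is used, or how BLOSUM62 scores are turned into distances, so the choices here (dropping '-' characters, splitting into k non-overlapping reading frames, Euclidean distance, a caller-supplied reference distance table) are assumptions, and all function names are hypothetical. The word2vec training step itself is omitted.

```python
from itertools import combinations
from math import sqrt


def kmer_sentences(aligned_seq, k):
    """Turn one aligned sequence into up to k lists of non-overlapping k-mers,
    one list per reading-frame offset.

    Assumption: gap characters ('-') are removed before splitting.
    """
    residues = aligned_seq.replace("-", "").upper()
    sentences = []
    for offset in range(k):
        words = [residues[i:i + k]
                 for i in range(offset, len(residues) - k + 1, k)]
        if words:
            sentences.append(words)
    return sentences


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def euclidean(u, v):
    """Euclidean distance between two vectors (an assumed distance measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_correlation(embeddings, reference_distance):
    """Correlate embedding distances with a reference distance table.

    embeddings: dict mapping amino acid letter -> vector (list of floats).
    reference_distance: dict mapping a sorted (a, b) pair -> distance,
        e.g. one derived from a substitution matrix by some chosen rule.
    """
    pairs = list(combinations(sorted(embeddings), 2))
    emb_d = [euclidean(embeddings[a], embeddings[b]) for a, b in pairs]
    ref_d = [reference_distance[(a, b)] for a, b in pairs]
    return pearson(emb_d, ref_d)
```

The lists returned by `kmer_sentences` over all aligned sequences could serve as the training corpus for any word2vec implementation, and the trained 1-mer vectors could then be passed to `distance_correlation` together with a distance table built from BLOSUM62.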

Keywords

distributed representation; embedding; protein sequence; word2vec

Hrčak ID:

242327

URI

https://hrcak.srce.hr/242327

Publication date:

15 August 2020
