Unsupervised Text Topic-Related Gene Extraction for Large Unbalanced Datasets

Jing-Ming, Li; Jing-Tao, Sun; Wen-Han, Huang; Qiu-Yu, Zhang; Zhen-Zhou, Tian; Ning, Lu

doi:10.17559/TV-20191111095139

Tehnički vjesnik, Vol. 27 No. 3, 2020.

Original scientific paper

https://doi.org/10.17559/TV-20191111095139


Li Jing-Ming ; School of Management Science and Engineering, Anhui University of Finance and Economics, 962 Caoshan Road, 233030, Bengbu, Anhui Province, China
Sun Jing-Tao ; School of Computer Science and Technology, Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi'an University of Posts and Telecommunications, 710021, Xi’an, Shaanxi Province, China
Huang Wen-Han ; School of Mathematics and Computer Science, Shaanxi University of Technology, 723001, Hanzhong, Shaanxi, China
Zhang Qiu-Yu ; School of Computer and Communication, Lanzhou University of Technology, 36 Pengjiaping Road, Qilihe District, Lanzhou, Gansu, China
Tian Zhen-Zhou ; School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, 710021, Xi’an, Shaanxi Province, China
Lu Ning ; School of Computer Science and Technology, Xi'an University of Posts and Telecommunications, 710021, Xi’an, Shaanxi Province, China

Full text: English, PDF, 769 KB

pp. 842-852


Abstract

Traditional unsupervised feature extraction algorithms commonly assume that the clusters in a dataset are balanced. However, when topic keywords are extracted from a large, unlabelled, unbalanced text dataset, feature selection is guided by the similarities calculated among features. As a result, the selected features cannot truly reflect the information in the original dataset, which degrades the performance of subsequent classifiers. To solve this problem, this paper proposes a new method for extracting unsupervised text topic-related genes. First, a sample cluster group is obtained by factor analysis and a density peak algorithm, and the dataset is labelled on that basis. Then, to account for the influence of the unbalanced distribution of sample clusters on feature selection, a CHI statistical matrix feature selection method that combines average local density with information entropy is used to strengthen the features of low-density small-sample clusters. Finally, a related gene extraction method based on exploring high-order relevance in multidimensional statistical data is described; it uses independent component analysis to enhance the generalisability of the selected features. In this way, unsupervised text topic-related genes can be extracted from large unbalanced datasets. Experimental results suggest that the proposed method outperforms existing methods in extracting text subject terms from low-density small-sample clusters, and has higher prematurity and feature dimension-reduction ability.
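As an illustration of the feature selection step, the sketch below computes the standard CHI (chi-square) statistic between a term and a cluster from a 2x2 contingency table. It is a minimal example under stated assumptions, not the authors' method: the weighting by average local density and information entropy described in the abstract is not reproduced, and the function names `chi_square` and `term_cluster_chi`, along with the toy documents, are hypothetical.

```python
def chi_square(a, b, c, d):
    """Standard CHI statistic for a term/cluster 2x2 contingency table.

    a: documents in the cluster that contain the term
    b: documents outside the cluster that contain the term
    c: documents in the cluster that lack the term
    d: documents outside the cluster that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no association.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def term_cluster_chi(docs, labels, term, cluster):
    """Count the contingency table from labelled documents, then score it.

    docs:   iterable of token sets, one per document (hypothetical format)
    labels: cluster label of each document, e.g. assigned by a clustering step
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cluster = label == cluster
        if has_term and in_cluster:
            a += 1
        elif has_term:
            b += 1
        elif in_cluster:
            c += 1
        else:
            d += 1
    return chi_square(a, b, c, d)


# Toy data: two clusters of two documents each (illustrative only).
docs = [
    {"gene", "cell"},
    {"gene", "dna"},
    {"stock", "market"},
    {"market", "price"},
]
labels = [0, 0, 1, 1]

print(term_cluster_chi(docs, labels, "gene", 0))
print(term_cluster_chi(docs, labels, "cell", 0))
```

In this toy example, "gene" appears in every document of cluster 0 and nowhere else, so it scores higher than "cell", which appears in only one of the two documents. With unbalanced clusters, terms from small clusters tend to receive low scores under the unweighted statistic, which is the problem the abstract's density- and entropy-based weighting is meant to address.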

Keywords

CHI statistical selection method; density peaks; factor analysis; information entropy; independent component analysis; text feature gene extraction

Hrčak ID:

239093

URI

https://hrcak.srce.hr/239093

Publication date:

14 June 2020
