Tehnički vjesnik, Vol. 32 No. 1, 2025.
Original scientific paper
https://doi.org/10.17559/TV-20240111001262
Efficient Decomposition Method for Similar Text Data in Large Corpora
Yun He *
College of Foreign Languages and International Education, Quzhou University, No. 78, Jiuhuabei Road, Kecheng District, Quzhou City, Zhejiang Province, Quzhou, 324000, China
* Corresponding author.
Abstract
Current methods for decomposing similar data have several problems: their decomposition results are inconsistent with the actual text quantity, the increase in sensitive data is not significant, and the mean absolute error and normalized root mean square error are high. To address these problems, a similar data decomposition method for large-scale real text corpora is proposed. The similar data set of the text corpus is divided into a plurality of minority sub-clusters, the probability distribution of these sub-clusters in the data set is determined, and the data in the data set are sampled. On the basis of a data ontology structure mapping model and a text big data analysis model, tag semantics are generated to realize similar data decomposition of the text corpus. The experimental results show that the method alleviates the category imbalance of the original data set and decomposes the text accurately: the decomposition results are basically consistent with the actual text quantity, the mean absolute error and normalized root mean square error are lower, and the similar data decomposition ability is better.
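The two error measures named in the abstract can be illustrated with a minimal sketch. The abstract does not state how the root mean square error is normalized, so dividing by the range of the actual values is an assumption here, as are the function names and the sample text counts, which are purely hypothetical and not taken from the paper.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all categories."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def normalized_rmse(actual, predicted):
    """RMSE divided by the range of the actual values.

    Range normalization is an assumption; other conventions divide
    by the mean or the standard deviation of the actual values.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("actual values have zero range")
    return math.sqrt(mse) / value_range


# Hypothetical text counts per category: actual vs. decomposition result.
actual_counts = [120, 80, 40]
decomposed_counts = [118, 83, 39]

mae = mean_absolute_error(actual_counts, decomposed_counts)
nrmse = normalized_rmse(actual_counts, decomposed_counts)
print(f"MAE = {mae:.3f}, NRMSE = {nrmse:.4f}")
```

Lower values of both measures indicate that the decomposed text counts lie closer to the actual counts.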
Keywords
corpus; data decomposition; real text; similar data
Hrčak ID:
325980
Publication date:
31.12.2024.