
Original scientific paper

https://doi.org/10.17559/TV-20240111001262

Efficient Decomposition Method for Similar Text Data in Large Corpora

Yun He ; College of Foreign Languages and International Education, Quzhou University, No. 78, Jiuhuabei Road, Kecheng District, Quzhou City, Zhejiang Province, Quzhou, 324000, China *

* Corresponding author.


Full text: English PDF, 330 KB

pages 288-295



Abstract

Current methods for decomposing similar data have three problems: their decomposition results are inconsistent with the actual text quantity, the increase in sensitive data is not significant, and the mean absolute error and normalized root mean square error are high. To address these problems, a similar data decomposition method for large-scale real text corpora is proposed. The data are divided into a plurality of minority sub-clusters, the probability distribution of the minority sub-clusters in the similar data set of the text corpus is determined, and the data in the similar data set are sampled. On the basis of a data ontology structure mapping model and a text big data analysis model, tag semantics are generated to realize similar data decomposition of the text corpus. The experimental results show that the method alleviates the category imbalance of the original data set and decomposes the text accurately: the decomposition results are basically consistent with the actual text quantity, and the mean absolute error and normalized root mean square error are lower, indicating a better similar data decomposition ability.
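The abstract does not give the formulas behind the sampling step or the error measures, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes that sub-cluster labels are already available, that a sub-cluster's probability is simply its share of the samples, and that the normalized root mean square error is the RMSE divided by the range of the actual values. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def subcluster_probabilities(labels):
    """Empirical probability of each sub-cluster: its share of all samples.
    (Assumption: the paper may define this distribution differently.)"""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def sample_by_subcluster(texts, labels, n, seed=0):
    """Draw n texts: choose a sub-cluster according to its probability,
    then choose one text uniformly from that sub-cluster."""
    rng = random.Random(seed)
    probs = subcluster_probabilities(labels)
    members = {}
    for text, label in zip(texts, labels):
        members.setdefault(label, []).append(text)
    names = list(probs)
    weights = [probs[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=n)
    return [rng.choice(members[name]) for name in chosen]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def nrmse(actual, predicted):
    """RMSE divided by the range of the actual values.
    (Assumption: normalizing by the mean is another common convention.)"""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse) / (max(actual) - min(actual))
```

For example, `mean_absolute_error([10, 20, 30], [12, 18, 33])` compares actual text counts per category with the counts produced by a decomposition; lower values of both measures mean the decomposition is closer to the actual text quantity.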

Keywords

corpus; data decomposition; real text; similar data

Hrčak ID:

325980

URI

https://hrcak.srce.hr/325980

Publication date:

31.12.2024.
