Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data

Lee, Woon-Kyo; Kim, Ja-Hee

doi:10.17559/TV-20231214001207

Technical gazette, Vol. 31 No. 6, 2024.

Original scientific paper

https://doi.org/10.17559/TV-20231214001207

Woon-Kyo Lee ; Seoul National University of Science & Technology Graduate school of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea
Ja-Hee Kim ; Seoul National University of Science & Technology Graduate school of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea *

* Corresponding author.

Full text: English, PDF, 2.723 Kb

pages 1845-1858

cite

APA 6th Edition

Lee, W. & Kim, J. (2024). Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data. Tehnički vjesnik, 31 (6), 1845-1858. https://doi.org/10.17559/TV-20231214001207

MLA 8th Edition

Lee, Woon-Kyo and Ja-Hee Kim. "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data." Tehnički vjesnik, vol. 31, no. 6, 2024, pp. 1845-1858. https://doi.org/10.17559/TV-20231214001207. Accessed 8 Jan. 2025.

Chicago 17th Edition

Lee, Woon-Kyo and Ja-Hee Kim. "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data." Tehnički vjesnik 31, no. 6 (2024): 1845-1858. https://doi.org/10.17559/TV-20231214001207

Harvard

Lee, W., and Kim, J. (2024). 'Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data', Tehnički vjesnik, 31(6), pp. 1845-1858. https://doi.org/10.17559/TV-20231214001207

Vancouver

Lee W, Kim J. Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data. Tehnički vjesnik [Internet]. 2024 [cited 2025 January 08];31(6):1845-1858. https://doi.org/10.17559/TV-20231214001207

IEEE

W. Lee and J. Kim, "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data", Tehnički vjesnik, vol.31, no. 6, pp. 1845-1858, 2024. [Online]. https://doi.org/10.17559/TV-20231214001207

Abstract

Abbreviation ambiguity poses significant challenges when searching academic literature. This study evaluated the accuracy of clustering algorithms on imbalanced datasets with varying ratios of the target group. A corpus of 1052 papers was used for the study of abbreviations. The "MSA" dataset was clustered using TF-IDF, cosine similarity, and k-means. Clustering performance declined as the target-group ratio deviated from balanced thresholds. A re-clustering method was introduced that selectively excludes non-target clusters. Re-clustering improved accuracy and F1 scores in most scenarios and was particularly stable with higher cluster counts. In comparisons, re-clustering outperformed k-means and self-adaptive methods. The study highlights issues stemming from data imbalance and presents an effective strategy for enhancing abbreviation search efficiency.
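
The pipeline named in the abstract (TF-IDF weighting, cosine similarity, k-means, then a second clustering pass after non-target clusters are excluded) can be illustrated with a minimal sketch. It is not the authors' implementation. The smoothed IDF formula, the farthest-first seeding, the `is_non_target` callback, and the five toy documents (including the two senses of "MSA" they use) are all assumptions made for illustration; the abstract does not state how non-target clusters are identified.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Unit-length TF-IDF vectors (term -> weight) for whitespace-tokenised docs.
    Uses one common smoothed IDF variant: ln((1 + n) / (1 + df)) + 1."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def kmeans(vectors, k, iters=20):
    """k-means on unit vectors using cosine similarity (spherical k-means).
    Seeding is deterministic farthest-first, an assumption of this sketch."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to the seeds chosen so far.
        centroids.append(min(vectors,
                             key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(iters):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # an empty cluster keeps its previous centroid
                norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
                centroids[c] = {t: w / norm for t, w in total.items()}
    return labels

def recluster(docs, k, is_non_target):
    """Cluster once, drop every cluster flagged by the hypothetical
    is_non_target callback, then cluster the remaining documents again."""
    labels = kmeans(tfidf_vectors(docs), k)
    flagged = {c for c in set(labels)
               if is_non_target([d for d, l in zip(docs, labels) if l == c])}
    kept = [i for i, l in enumerate(labels) if l not in flagged]
    if not kept:
        return [], []
    kept_docs = [docs[i] for i in kept]
    return kept, kmeans(tfidf_vectors(kept_docs), min(k, len(kept_docs)))

# Toy data (invented): three documents use one sense of "MSA", two use another.
docs = [
    "msa sequence alignment protein",
    "msa sequence alignment genome",
    "msa alignment protein genome",
    "msa census metropolitan area",
    "msa metropolitan area population",
]
# Hypothetical rule: a cluster is non-target if no member mentions "alignment".
kept, new_labels = recluster(
    docs, 2, lambda members: not any("alignment" in d.split() for d in members))
print(kept)  # [0, 1, 2]
```

In this sketch the exclusion rule is a keyword check only to keep the example self-contained; in practice the decision could equally come from a labelled sample or manual review of each cluster.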

Keywords

imbalanced data, k-means algorithm, re-clustering, word sense disambiguation

Hrčak ID:

321905

URI

https://hrcak.srce.hr/321905

Publication date:

31.10.2024.
