Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data

Lee, Woon-Kyo; Kim, Ja-Hee

doi:10.17559/TV-20231214001207

Tehnički vjesnik, Vol. 31 No. 6, 2024.

Original scientific paper

https://doi.org/10.17559/TV-20231214001207

Woon-Kyo Lee ; Seoul National University of Science & Technology Graduate school of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea
Ja-Hee Kim ; Seoul National University of Science & Technology Graduate school of Public Policy and Information Technology, 232 Gongneung-ro, Nowon-gu, Seoul, Korea *

* Corresponding author.

Full text: English pdf 2.723 Kb

pp. 1845-1858

Cite

APA 6th Edition

Lee, W. & Kim, J. (2024). Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data. Tehnički vjesnik, 31 (6), 1845-1858. https://doi.org/10.17559/TV-20231214001207

MLA 8th Edition

Lee, Woon-Kyo, and Ja-Hee Kim. "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data." Tehnički vjesnik, vol. 31, no. 6, 2024, pp. 1845-1858. https://doi.org/10.17559/TV-20231214001207. Accessed 13.01.2025.

Chicago 17th Edition

Lee, Woon-Kyo, and Ja-Hee Kim. "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data." Tehnički vjesnik 31, no. 6 (2024): 1845-1858. https://doi.org/10.17559/TV-20231214001207

Harvard

Lee, W., and Kim, J. (2024). 'Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data', Tehnički vjesnik, 31(6), pp. 1845-1858. https://doi.org/10.17559/TV-20231214001207

Vancouver

Lee W, Kim J. Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data. Tehnički vjesnik [Internet]. 2024 [accessed 13.01.2025.];31(6):1845-1858. https://doi.org/10.17559/TV-20231214001207

IEEE

W. Lee and J. Kim, "Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data", Tehnički vjesnik, vol. 31, no. 6, pp. 1845-1858, 2024. [Online]. https://doi.org/10.17559/TV-20231214001207

Abstract

Abbreviation ambiguity poses significant challenges when searching academic literature. This study evaluated the accuracy of clustering algorithms on imbalanced datasets with varying target-group ratios. A corpus of 1052 papers, the "MSA" dataset, was assembled for the study of abbreviations and clustered using TF-IDF, cosine similarity, and k-means. Clustering performance declined as the target-group ratio moved away from balance. A re-clustering method was introduced that selectively excludes non-target clusters. Re-clustering improved accuracy and F1 scores in most scenarios and was particularly stable with higher cluster counts. In comparisons, re-clustering outperformed k-means and self-adaptive methods. The study highlights issues stemming from data imbalance and presents an effective strategy for enhancing abbreviation search efficiency.
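
The pipeline named in the abstract (TF-IDF vectors, cosine similarity, k-means, then a second clustering pass after excluding non-target clusters) might be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how non-target clusters are identified, so the query vector, the similarity threshold, the smoothed IDF formula, the farthest-first seeding, and the toy documents (including the three expansions of "MSA") are all assumptions introduced here.

```python
import math
from collections import Counter

def tfidf(docs):
    """L2-normalized TF-IDF vectors (dicts) for tokenized documents.
    Uses a smoothed IDF, ln((1+n)/(1+df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        v = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
             for t, c in Counter(d).items()}
        norm = math.sqrt(sum(x * x for x in v.values()))
        vecs.append({t: x / norm for t, x in v.items()})
    return vecs

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(x * b.get(t, 0.0) for t, x in a.items())

def centroid(vecs):
    """Unit-length mean direction of a non-empty list of vectors."""
    total = Counter()
    for v in vecs:
        for t, x in v.items():
            total[t] += x
    norm = math.sqrt(sum(x * x for x in total.values()))
    return {t: x / norm for t, x in total.items()}

def kmeans(vecs, k, iters=20):
    """k-means on cosine similarity with farthest-first seeding
    (the seeding rule is an assumption chosen for determinism)."""
    centers = [vecs[0]]
    while len(centers) < k:
        centers.append(min(vecs, key=lambda v: max(cosine(v, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new = [max(range(k), key=lambda j: cosine(v, centers[j])) for v in vecs]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [v for v, l in zip(vecs, labels) if l == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

def recluster(vecs, k, query, threshold=0.1):
    """Hypothetical re-clustering: drop clusters whose centroid is less
    similar to `query` than `threshold`, then cluster the survivors again."""
    labels, centers = kmeans(vecs, k)
    kept = {j for j in range(k) if cosine(query, centers[j]) >= threshold}
    idx = [i for i, l in enumerate(labels) if l in kept]
    if not idx:
        return [], []
    sub_labels, _ = kmeans([vecs[i] for i in idx], min(k, len(idx)))
    return idx, sub_labels

# Toy, imbalanced corpus: 4 bioinformatics, 2 quality-control, 2 medical documents.
DOCS = [s.split() for s in [
    "msa sequence alignment protein genome",
    "msa alignment sequence protein phylogeny",
    "msa sequence alignment genome algorithm",
    "msa alignment protein sequence algorithm",
    "msa measurement gauge repeatability quality",
    "msa gauge measurement quality manufacturing",
    "msa atrophy patient neurological disorder",
    "msa patient atrophy disorder clinical",
]]
vecs = tfidf(DOCS)
labels, _ = kmeans(vecs, 3)
# Assumed target sense: the medical one, described by two query terms.
query = {"atrophy": 1 / math.sqrt(2), "patient": 1 / math.sqrt(2)}
kept, sub_labels = recluster(vecs, 3, query)
print(labels, kept, sub_labels)
```

In this sketch the second pass sees only documents from clusters that resemble the query, so the target group is no longer a small minority of what is being clustered. A real corpus would need a principled way to pick the threshold and the number of clusters.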

Keywords

imbalanced data, k-means algorithm, re-clustering, word sense disambiguation

Hrčak ID:

321905

URI

https://hrcak.srce.hr/321905

Publication date:

31.10.2024.
