Original scientific paper

https://doi.org/10.19279/TVZ.PD.2024-13-1-01

KEY FEATURES OF OPEN DATA SETS IN SENTIMENT ANALYSIS OF TWITTER POSTS

Gaurish Thakkar (ORCID: orcid.org/0000-0002-8119-5078); Filozofski fakultet Sveučilišta u Zagrebu, Ivana Lučića 3, 10000 Zagreb, Croatia *

* Corresponding author.


Full text: Croatian, PDF, 413 Kb

pp. 1-14

Abstract

Open-source datasets are fundamental to the advancement of sentiment analysis models, yet their practical utility is often hampered by a lack of standardisation and comprehensive documentation. This paper provides a critical review of the open dataset landscape for Twitter sentiment analysis, examining 48 papers that introduce datasets in 30 different languages. We analyse key elements, including naming conventions, labelling schemes, data distribution methods, and the inclusion of essential metadata such as tweet IDs. Our findings reveal significant inconsistencies that create challenges for reproducibility and the comparative evaluation of models. We identify a critical need for standard practices in dataset creation and dissemination. Based on this analysis, we offer concrete recommendations to enhance the scientific value, discoverability, and long-term usability of open datasets for the research community.

Keywords

sentiment analysis; natural language processing; Twitter; sentiment datasets; multilingual

Hrčak ID:

341614

URI

https://hrcak.srce.hr/341614

Publication date:

30 August 2025

Data in other languages: Croatian
