Politehnika i dizajn, Vol. 13 No. 1, 2025.
Original scientific paper
https://doi.org/10.19279/TVZ.PD.2024-13-1-01
KEY FEATURES OF OPEN DATA SETS IN SENTIMENT ANALYSIS OF TWITTER POSTS
Gaurish Thakkar
orcid.org/0000-0002-8119-5078
Filozofski fakultet Sveučilišta u Zagrebu (Faculty of Humanities and Social Sciences, University of Zagreb), Ivana Lučića 3, 10000 Zagreb, Croatia *
* Corresponding author.
Abstract
Open-source datasets are fundamental to the advancement of sentiment analysis models, yet their practical utility is often hampered by a lack of standardisation and comprehensive documentation. This paper provides a critical review of the open dataset landscape for Twitter sentiment analysis, examining 48 papers that introduce datasets in 30 different languages. We analyse key elements, including naming conventions, labelling schemes, data distribution methods, and the inclusion of essential metadata such as tweet IDs. Our findings reveal significant inconsistencies that create challenges for reproducibility and the comparative evaluation of models. We identify a critical need for standard practices in dataset creation and dissemination. Based on this analysis, we offer concrete recommendations to enhance the scientific value, discoverability, and long-term usability of open datasets for the research community.
Keywords
sentiment analysis; natural language processing; Twitter; sentiment datasets; multilingual
Hrčak ID: 341614
Publication date: 30 August 2025