Original scientific paper

https://doi.org/10.19279/TVZ.PD.2024-13-1-01

KEY FEATURES OF OPEN DATA SETS IN SENTIMENT ANALYSIS OF TWITTER POSTS

Gaurish Thakkar (ORCID: orcid.org/0000-0002-8119-5078); University of Zagreb, Faculty of Humanities and Social Sciences, Ivana Lučića 3, 10000 Zagreb, Croatia *

* Corresponding author.


Full text: Croatian (PDF, 413 KB)

Pages: 1-14


Abstract

Open-source datasets are fundamental to the advancement of sentiment analysis models, yet their practical utility is often hampered by a lack of standardisation and comprehensive documentation. This paper provides a critical review of the open dataset landscape for Twitter sentiment analysis, examining 48 papers that introduce datasets in 30 different languages. We analyse key elements, including naming conventions, labelling schemes, data distribution methods, and the inclusion of essential metadata such as tweet IDs. Our findings reveal significant inconsistencies that create challenges for reproducibility and the comparative evaluation of models. We identify a critical need for standard practices in dataset creation and dissemination. Based on this analysis, we offer concrete recommendations to enhance the scientific value, discoverability, and long-term usability of open datasets for the research community.

Keywords

sentiment analysis; natural language processing; Twitter; sentiment datasets; multilingual

Hrčak ID:

341614

URI

https://hrcak.srce.hr/341614

Publication date:

30.8.2025.
