Original scientific paper
https://doi.org/10.24138/jcomss-2025-0131
Can Public Code Smells Datasets Be Trusted?
Ruchin Gupta
; Jaypee Institute of Information Technology, Noida, India
*
Jitendra Kumar Seth
; KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
Anupama Sharma
; Ajay Kumar Garg Engineering College, Ghaziabad, India
Abhishek Goyal
; KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
* Corresponding author.
Abstract
Code smells signal potential issues in a codebase and indicate technical debt. Early detection is crucial for maintaining code quality. Researchers often rely on public datasets to automate and enhance smell detection, but the trustworthiness of these datasets is frequently assumed rather than verified. While the datasets are valuable for developing detection tools, key questions arise: Can they be fully trusted? Are the labels accurate? Do they reflect real-world software development? Recent studies reveal inconsistencies, biases, and misclassifications, raising concerns about their reliability. This paper examines the integrity of two widely used sets of public code smell datasets, referred to as Group A and Group B, by assessing their internal consistency and their alignment with established facts. Through this investigation, we aim to determine whether these datasets can be used with confidence in research and practical applications, or whether their inherent issues undermine the validity of the results they produce. Group A datasets are smaller, balanced, and factually aligned but lack industry relevance, while Group B datasets deviate from known facts. The study acknowledges differences between academia and industry, viewing divergence as a reflection of real-world variability rather than a flaw, and emphasizes the need for rigorous validation of public datasets to ensure reliable research outcomes.
Keywords
Code Smell; code smell datasets; validation
Hrčak ID:
341492
Publication date:
31.12.2025.