Original scientific paper
https://doi.org/10.24138/jcomss-2025-0133
Leveraging XLM-RoBERTa with CNN and BiLSTM for Hinglish Toxicity Detection
Nikita Singhal
orcid.org/0000-0001-7457-9041
; Army Institute of Technology, Pune, Maharashtra, 411015, India
*
Avadhesh Yadav
; Army Institute of Technology, Pune, Maharashtra, 411015, India
Ankush Ankush
; Army Institute of Technology, Pune, Maharashtra, 411015, India
Giriraj Singh
; Army Institute of Technology, Pune, Maharashtra, 411015, India
Ronak Kumar
; Army Institute of Technology, Pune, Maharashtra, 411015, India
* Corresponding author.
Abstract
Toxicity in online communication, particularly in code-mixed languages like Hinglish, is a growing concern across social media platforms. Hinglish, a blend of Hindi and English, is widely used in informal online conversations, which makes it difficult for traditional toxicity detection models to identify harmful content accurately. The problem is compounded by the limited availability of resources and models trained specifically on Hinglish. This work presents the XLM-RoBERTa-CNN-BiLSTM (XCB) model, a novel architecture for toxicity detection in Hinglish on various social media platforms. The XCB model is compared with the state-of-the-art (SOTA) models mBERT, XLM-RoBERTa (XLM-R), and Indic-BERT on three publicly available datasets: Constraint, Facebook, and HASOC. It achieved macro F1 scores of 0.81, 0.73, and 0.82 and inference times of 0.24 s, 0.48 s, and 0.22 s on the Constraint, Facebook, and HASOC datasets, respectively. XCB not only outperforms existing romanized Hinglish models but also matches the macro F1 scores of existing SOTA multilingual models while requiring only half the training time. Its inference times are also much lower than those of the existing SOTA models, making it a more efficient candidate for large-scale real-time toxicity detection in Hinglish.
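The macro F1 score reported in the abstract is the unweighted mean of the per-class F1 scores, so a minority class such as toxic posts counts as much as the majority class. The following is a minimal sketch of that computation in plain Python. The function name `macro_f1`, the label strings, and the toy predictions are illustrative assumptions and are not taken from the paper or its datasets.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, made-up labels (assumption): 2 toxic and 4 non-toxic posts.
y_true = ["toxic", "toxic", "non-toxic", "non-toxic", "non-toxic", "non-toxic"]
y_pred = ["toxic", "non-toxic", "non-toxic", "non-toxic", "non-toxic", "toxic"]
print(macro_f1(y_true, y_pred))
```

In this toy case the toxic class has an F1 of 0.5 and the non-toxic class an F1 of 0.75, so the macro average is 0.625 even though 4 of the 6 predictions are correct.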
Keywords
Toxicity Detection; Hinglish; Code-Mixed Language; XCB Model; Real-Time Moderation; Multi-Lingual Models; Efficiency
Hrčak ID:
336751
Publication date:
20.10.2025.