Efficient Pre-processing for Text Classification: Unified Framework for Hinglish Short Text


Rajshree Singh

Abstract

Hindi-English code-mixed (Hinglish) short texts pose unique challenges for automatic text classification, especially sarcasm detection. Such texts exhibit code-mixing, non-standard spelling variation, extreme class imbalance, and a scarcity of labeled corpora. This paper proposes a unified pre-processing framework addressing these issues through three integrated modules: (1) a TF-IDF based feature balancing layer to counter skewed class distributions, (2) a spelling normalization method leveraging character- and word-level n-grams to handle noisy Hinglish orthography, and (3) a hybrid data augmentation approach combining Easy Data Augmentation (EDA), synonym replacement, and back-translation. We evaluate the impact of each module on sarcasm classification performance. Experimental results on a Hinglish sarcasm dataset (5,250 tweets, ~9.6% sarcastic) show that the unified framework significantly improves F1-scores and overall accuracy. The balanced TF-IDF feature layer raises minority-class recall, character–word n-gram normalization reduces spelling-induced errors, and augmented data improves generalization in the low-resource setting. Our best model achieves ~95% F1, outperforming prior benchmarks (78.4% accuracy) by a large margin. This efficient pre-processing pipeline demonstrates that tackling code-mixing noise, class imbalance, and data paucity in tandem yields state-of-the-art sarcasm detection on Hinglish short texts.
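The three modules described above can be sketched in simplified form. This is an illustrative, self-contained approximation, not the paper's implementation: the function names, the trigram-Jaccard matcher, the inverse-class-frequency weighting, and the toy Hinglish lexicon are all assumptions made for demonstration.

```python
# Hypothetical sketch of the three pre-processing modules from the abstract.
# All names and the toy examples are illustrative assumptions.
import math
import random
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, for matching noisy spellings."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def normalize_spelling(word, canonical_vocab, threshold=0.5):
    """Module 2 (sketch): map a non-standard Hinglish spelling to its closest
    canonical form by Jaccard similarity over character trigrams."""
    best, best_sim = word, threshold
    grams = char_ngrams(word)
    for canon in canonical_vocab:
        cg = char_ngrams(canon)
        sim = len(grams & cg) / len(grams | cg)
        if sim > best_sim:
            best, best_sim = canon, sim
    return best


def balanced_tfidf(docs, labels):
    """Module 1 (sketch): TF-IDF vectors scaled by an inverse-class-frequency
    weight, so minority-class (sarcastic) examples are not drowned out."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    class_freq = Counter(labels)
    vectors = []
    for doc, y in zip(docs, labels):
        tokens = doc.split()
        tf = Counter(tokens)
        weight = n / (len(class_freq) * class_freq[y])  # rarer class => larger weight
        vectors.append({t: weight * (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def eda_synonym_replace(tokens, synonyms, p=0.3, seed=0):
    """Part of module 3 (sketch): EDA-style synonym replacement, swapping each
    token for a synonym with probability p (back-translation omitted here)."""
    rng = random.Random(seed)
    return [rng.choice(synonyms[t]) if t in synonyms and rng.random() < p else t
            for t in tokens]
```

For example, `normalize_spelling("yaaar", {"yaar", "nahi"})` maps the elongated spelling back to `"yaar"`, and `eda_synonym_replace` with a small Hinglish synonym table yields extra minority-class training variants.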
