Efficient Pre-processing for Text Classification: Unified Framework for Hinglish Short Text


Rajshree Singh

Abstract

Hindi-English code-mixed (Hinglish) short texts pose unique challenges for automatic text classification, especially sarcasm detection. Such texts exhibit code-mixing, non-standard spelling variation, extreme class imbalance, and a scarcity of labeled corpora. This paper proposes a unified pre-processing framework addressing these issues through three integrated modules: (1) a TF-IDF based feature balancing layer to counter skewed class distributions, (2) a spelling normalization method leveraging character- and word-level n-grams to handle noisy Hinglish orthography, and (3) a hybrid data augmentation approach combining Easy Data Augmentation (EDA), synonym replacement, and back-translation. We evaluate the impact of each module on sarcasm classification performance. Experimental results on a Hinglish sarcasm dataset (5,250 tweets, ~9.6% sarcastic) show that the unified framework significantly improves F1-scores and overall accuracy. The balanced TF-IDF feature layer raises minority-class recall, character–word n-gram normalization reduces spelling-induced errors, and augmented data improves generalization in the low-resource setting. Our best model achieves ~95% F1, outperforming prior benchmarks (78.4% accuracy) by a large margin. This efficient pre-processing pipeline demonstrates that tackling code-mixing noise, class imbalance, and data paucity in tandem yields state-of-the-art sarcasm detection on Hinglish short texts.
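The three modules described above can be sketched in simplified form. This is an illustrative, self-contained approximation, not the paper's implementation: the function names, the trigram-Jaccard matcher, the inverse-class-frequency weighting, and the toy Hinglish lexicon are all assumptions made for demonstration.

```python
# Hypothetical sketch of the three pre-processing modules from the abstract.
# All names and the toy examples are illustrative assumptions.
import math
import random
from collections import Counter


def char_ngrams(word, n=3):
    """Character n-grams with boundary markers, for matching noisy spellings."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def normalize_spelling(word, canonical_vocab, threshold=0.5):
    """Module 2 (sketch): map a non-standard Hinglish spelling to its closest
    canonical form by Jaccard similarity over character trigrams."""
    best, best_sim = word, threshold
    grams = char_ngrams(word)
    for canon in canonical_vocab:
        cg = char_ngrams(canon)
        sim = len(grams & cg) / len(grams | cg)
        if sim > best_sim:
            best, best_sim = canon, sim
    return best


def balanced_tfidf(docs, labels):
    """Module 1 (sketch): TF-IDF vectors scaled by an inverse-class-frequency
    weight, so minority-class (sarcastic) examples are not drowned out."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))
    class_freq = Counter(labels)
    vectors = []
    for doc, y in zip(docs, labels):
        tokens = doc.split()
        tf = Counter(tokens)
        weight = n / (len(class_freq) * class_freq[y])  # rarer class => larger weight
        vectors.append({t: weight * (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def eda_synonym_replace(tokens, synonyms, p=0.3, seed=0):
    """Part of module 3 (sketch): EDA-style synonym replacement, swapping each
    token for a synonym with probability p (back-translation omitted here)."""
    rng = random.Random(seed)
    return [rng.choice(synonyms[t]) if t in synonyms and rng.random() < p else t
            for t in tokens]
```

For example, `normalize_spelling("yaaar", {"yaar", "nahi"})` maps the elongated spelling back to `"yaar"`, and `eda_synonym_replace` with a small Hinglish synonym table yields extra minority-class training variants.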
