Text Classification Performance Analysis through Machine Learning and Deep Learning on the Low-Resourced Balochi Language
DOI:
https://doi.org/10.63163/jpehss.v4i1.1260

Keywords:
Text Classification, Machine Learning, Deep Learning, XLM-RoBERTa, Balochi Language, NLP, Low-Resource Languages

Abstract
Text classification is a crucial task in Natural Language Processing (NLP), whose goal is to assign text to pre-defined classes automatically. Low-resource languages still receive little attention in NLP due to the scarcity of publicly available annotated datasets and computational resources. Likewise, Balochi, a low-resource language with a 2,500-year history and cultural significance, has seen little development of NLP applications. This study implements a text classification task for Balochi and compares machine learning, deep learning, and Transformer-based models. An unlabelled Balochi dataset of approximately 5.5k sentences was collected, and various pre-processing techniques, including tokenization, stop-word removal, and text normalization, were applied. The experimental results show that, among the machine learning models, the SGD classifier achieved the highest accuracy at 98.83%, and among the deep learning models, the BiLSTM achieved the highest accuracy at 98%. The Transformer-based model, a pre-trained XLM-RoBERTa, performed best overall, achieving 99% accuracy on the Balochi classification task. These findings provide a foundation for future multilingual pre-trained models for low-resource languages and a step toward consistent Balochi language models for NLP applications.
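The abstract does not include implementation details, so as a rough illustration of the classical-ML baseline it describes (an SGD classifier trained on pre-processed text), a minimal scikit-learn sketch might look like the following. The toy sentences and category labels are invented placeholders, not the paper's actual ~5.5k-sentence Balochi corpus, and the feature choice (TF-IDF) is an assumption, since the paper's vectorization scheme is not stated here:

```python
# Minimal sketch of a TF-IDF + SGD text-classification baseline (assumed
# setup; the paper's actual features and hyperparameters are not given here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Placeholder data standing in for the annotated Balochi sentences.
texts = ["sample sentence one", "another sample text", "third example here", "more example text"]
labels = ["news", "sports", "news", "sports"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),            # bag-of-words features with TF-IDF weighting
    ("clf", SGDClassifier(random_state=0)),  # linear classifier trained with stochastic gradient descent
])
pipe.fit(texts, labels)
print(pipe.predict(["sample sentence"]))
```

In practice the pipeline would be preceded by the pre-processing steps the paper names (tokenization, stop-word removal, normalization) adapted to Balochi, and evaluated with a held-out test split rather than on the training data.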