Text Classification Performance Analysis through Machine Learning and Deep Learning on the Low-Resourced Balochi Language

Authors

  • Muhammad Ameen Chhajro, Department of Software Engineering, Sindh Maressatul University, Karachi. Email: ameen.chhajro@smiu.edu.pk
  • Sohaib Raheem, Sindh Maressatul University, Karachi. Email: sohaibraheem27@gmail.com
  • Farhan Bashir Shaikh, Department of Computer Science, The University of Larkano. Email: Farhan@uolrk.edu.pk
  • Muhammad Hibatullah Channa, Department of Computer Sciences & Related Studies, Hyderabad Institute for Technology & Management Sciences, Hyderabad. Email: hibatullah@hitms.edu.pk
  • Fakhira Tabassum, Department of Computer Sciences, NUML University, Hyderabad Campus. Email: fakhira.tabassum@numl.edu.pk
  • Adnan Jahangir Panhwar, Department of Computer Science, University of Sindh, Jamshoro. Email: adnanpanhwar1@gmail.com

DOI:

https://doi.org/10.63163/jpehss.v4i1.1260

Keywords:

Text Classification, Machine Learning, Deep Learning, XLM-RoBERTa, Balochi Language, NLP, Low-Resource Languages

Abstract

Text classification, the automatic assignment of text to predefined classes, is a core task in Natural Language Processing (NLP). Low-resource languages still receive little attention in NLP because publicly available annotated datasets and computational resources are scarce. Balochi, a low-resource language with a 2,500-year history and considerable cultural significance, is one such case: few NLP applications have been developed for it. This study implements a text classification task in Balochi and compares machine learning, deep learning, and Transformer-based models. An unlabelled Balochi dataset of approximately 5,500 sentences was collected, and pre-processing techniques including tokenization, stop-word removal, and text normalization were applied. Among the machine learning models, the SGD classifier achieved the highest accuracy, 98.83%. Among the deep learning models, the BiLSTM achieved the highest accuracy, 98%. The pre-trained Transformer-based model, XLM-RoBERTa, performed best overall, reaching 99% accuracy on the Balochi classification task. These findings provide a foundation for future multilingual pre-trained models for low-resource languages and support the development of consistent Balochi language models for NLP applications.
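
The abstract names three pre-processing steps (text normalization, tokenization, and stop-word removal) without specifying how they were implemented. The following is only a minimal sketch of how such a pipeline might look in Python, not the authors' method. It assumes Arabic-script input, that normalization means Unicode NFKC plus removal of combining diacritics and the tatweel character, and that tokens are runs of word characters. The function names and the stop-word list are hypothetical; a real Balochi stop-word list would have to be supplied by the user.

```python
import re
import unicodedata
from typing import Iterable, List

TATWEEL = "\u0640"  # Arabic elongation character (assumed to be noise here)


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, strip combining marks and tatweel,
    then collapse runs of whitespace into single spaces."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split text into runs of word characters (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text)


def preprocess(text: str, stop_words: Iterable[str]) -> List[str]:
    """Normalize, tokenize, and drop tokens found in the given stop-word list."""
    stops = set(stop_words)
    return [tok for tok in tokenize(normalize(text)) if tok not in stops]


if __name__ == "__main__":
    # Placeholder stop word for illustration only; not a real Balochi list.
    print(preprocess("  alpha   the  beta ", ["the"]))
```

The resulting token lists could then be passed to any feature extractor and classifier; the stop-word list is kept as a parameter so that a language-specific list can be swapped in without changing the code.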

Published

2026-03-31

How to Cite

Text Classification Performance Analysis through Machine Learning and Deep Learning on the Low-Resourced Balochi Language. (2026). Physical Education, Health and Social Sciences, 4(1), 61-77. https://doi.org/10.63163/jpehss.v4i1.1260
