Text Classification Performance Analysis through Machine Learning and Deep Learning on the Low-Resourced Balochi Language
DOI:
https://doi.org/10.63163/jpehss.v4i1.1260

Keywords:
Text Classification, Machine Learning, Deep Learning, XLM-RoBERTa, Balochi Language, NLP, Low-Resource Languages

Abstract
Text classification is a crucial task in Natural Language Processing (NLP), whose goal is to assign text to pre-defined classes automatically. Low-resource languages still receive little attention in NLP due to the scarcity of publicly available annotated datasets and computational resources. Likewise, Balochi, a low-resource language with a 2,500-year history and cultural significance, has seen little development of NLP applications. This study implements a text classification task for Balochi and compares machine learning, deep learning, and Transformer-based models. An unlabelled Balochi dataset of approximately 5.5k sentences was collected, and various pre-processing techniques, including tokenization, stop-word removal, and text normalization, were applied. The experimental results show that, among the machine learning models, the SGD classifier achieved the highest accuracy at 98.83%, and among the deep learning models, the BiLSTM achieved the highest accuracy at 98%. The Transformer-based model, a pre-trained XLM-RoBERTa, performed best overall, achieving 99% accuracy on the Balochi classification task. These findings provide a foundation for future multilingual pre-trained models for low-resource languages and a step toward consistent Balochi language models for NLP applications.
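The abstract does not include implementation details, so as a rough illustration of the classical-ML baseline it describes (an SGD classifier trained on pre-processed text), a minimal scikit-learn sketch might look like the following. The toy sentences and category labels are invented placeholders, not the paper's actual ~5.5k-sentence Balochi corpus, and the feature choice (TF-IDF) is an assumption, since the paper's vectorization scheme is not stated here:

```python
# Minimal sketch of a TF-IDF + SGD text-classification baseline (assumed
# setup; the paper's actual features and hyperparameters are not given here).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Placeholder data standing in for the annotated Balochi sentences.
texts = ["sample sentence one", "another sample text", "third example here", "more example text"]
labels = ["news", "sports", "news", "sports"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),            # bag-of-words features with TF-IDF weighting
    ("clf", SGDClassifier(random_state=0)),  # linear classifier trained with stochastic gradient descent
])
pipe.fit(texts, labels)
print(pipe.predict(["sample sentence"]))
```

In practice the pipeline would be preceded by the pre-processing steps the paper names (tokenization, stop-word removal, normalization) adapted to Balochi, and evaluated with a held-out test split rather than on the training data.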