Advancing Arabic Scientific Text Analysis: Evaluating Machine Learning Models for Named Entity Recognition

Document Type: Original Research Papers

Authors

1 Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt

2 Computer Science Department, Faculty of Computers and Artificial Intelligence, Benha University, Benha, Egypt

Abstract

Named entity recognition in Arabic text, particularly in the scientific and medical domains, is challenging because of the language's rich morphology, the scarcity of resources, and dialectal diversity. This study evaluates Conditional Random Fields (CRF), Support Vector Machines (SVM), and Stochastic Gradient Descent (SGD) models for named entity recognition in Arabic scientific texts. The models were applied to a self-collected dataset of Arabic thesis abstracts. The named entities identified in the dataset include proteins, DNA, RNA, cell types, and cell lines. The comparative analysis reveals significant performance differences among the models, with hybrid approaches showing promising results. SGD, SVM, and CRF achieved F1-scores of 0.96, 0.91, and 0.80, respectively, indicating that all three models are effective for this task and that SGD performs best. The research contributes to Arabic natural language processing by highlighting the strengths of each model and guiding the future selection and development of named entity recognition models.
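The abstract reports F1-scores but does not say how they were computed. As a minimal sketch only, the following shows one common way to score named entity recognition output, assuming BIO-style tags (`B-`, `I-`, `O`) and exact-span matching. The tagging scheme, the function names `extract_entities` and `f1_score`, and the toy tag sequences are illustrative assumptions, not details taken from the study.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f1_score(gold_tags, pred_tags):
    """Entity-level F1: a prediction counts only if span and type both match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one of two gold entities is recovered exactly.
gold = ["B-protein", "I-protein", "O", "B-DNA", "O"]
pred = ["B-protein", "I-protein", "O", "O", "B-cell_type"]
score = f1_score(gold, pred)
```

In this toy case precision and recall are both 1/2, so the F1-score is 0.5. A token-level or per-class averaged F1 would give different numbers, so figures such as 0.96 are only comparable across models when the same scoring convention is used.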
