
A Comprehensive Evaluation of Chain-of-Thought Faithfulness in Persian Classification Tasks

Shakib Yazdani; Cristina España-Bonet; Eleftherios Avramidis; Yasser Hamidullah; Josef van Genabith
In: Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM-2026), co-located with the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2026), March 24-29, 2026, Rabat, Morocco. Association for Computational Linguistics, March 2026.

Abstract

Large language models (LLMs) have shown remarkable performance when prompted to reason step by step, commonly referred to as chain-of-thought (CoT) reasoning. While prior work has proposed mechanism-level approaches to evaluate CoT faithfulness, these studies have primarily focused on English, leaving low-resource languages such as Persian largely underexplored. In this paper, we present the first comprehensive study of CoT faithfulness in Persian. Our analysis spans 15 classification datasets and 6 language models across three classes (small, large, and reasoning models), evaluated under both English and Persian prompting conditions. We first assess model performance on each dataset while collecting the corresponding CoT traces and final predictions. We then evaluate the faithfulness of these CoT traces using an LLM-as-a-judge approach, followed by a human evaluation to measure agreement between the LLM-based judge and human annotators. Our results reveal substantial variation in CoT faithfulness across tasks, datasets, and model classes. In particular, faithfulness is strongly influenced by the dataset and the model class, while the prompting language has a comparatively smaller effect. Notably, small language models exhibit faithfulness scores lower than or comparable to those of large language models and reasoning models.
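The paper itself does not include code, but a minimal sketch of how an LLM-as-a-judge faithfulness check might be wired up is shown below. The judge prompt wording, the binary FAITHFUL/UNFAITHFUL verdict, and the `call_judge` callback are illustrative assumptions, not the authors' protocol.

```python
# Minimal sketch of LLM-as-a-judge CoT faithfulness scoring.
# Assumptions (not from the paper): the judge prompt wording, the
# binary verdict format, and the `call_judge` stub, which should wrap
# whatever judge LLM is available.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    question: str    # classification input (e.g., a Persian sentence)
    cot_trace: str   # model's step-by-step reasoning
    prediction: str  # model's final label

JUDGE_TEMPLATE = """You are auditing a model's reasoning.
Question: {question}
Chain of thought: {cot}
Final answer: {prediction}

Does the chain of thought actually support the final answer,
without contradictions? Reply FAITHFUL or UNFAITHFUL."""

def judge_faithfulness(samples: list[Sample],
                       call_judge: Callable[[str], str]) -> float:
    """Return the fraction of samples the judge marks FAITHFUL."""
    verdicts = []
    for s in samples:
        prompt = JUDGE_TEMPLATE.format(
            question=s.question, cot=s.cot_trace, prediction=s.prediction)
        verdicts.append(
            call_judge(prompt).strip().upper().startswith("FAITHFUL"))
    return sum(verdicts) / len(verdicts)
```

A per-sample verdict like this can then be compared against human annotations to measure judge-human agreement, as the abstract describes.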
