High Performance Layout Analysis of Arabic and Urdu Scripts Document Image

Syed Saqib Bukhari; Faisal Shafait; Thomas Breuel

In: 11th International Conference on Document Analysis and Recognition (ICDAR-2011), September 18-21, Beijing, China, IEEE, 2011.


Extracting text-lines and determining their reading order is an important step in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods like projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradation. Evaluated on Arabic and Urdu document image datasets covering a variety of complex single- and multi-column layouts, the presented system achieves high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.
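
For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of projection profile analysis for text-line extraction. It is not the system presented in the paper; the function name `extract_text_lines`, the `min_ink` threshold, and the toy `page` image are assumptions made purely for illustration.

```python
def extract_text_lines(image, min_ink=1):
    """Return (top, bottom) row ranges of text-lines in a binary image.

    image: list of rows, each a list of 0 (background) / 1 (ink) pixels.
    min_ink: hypothetical threshold; rows with fewer ink pixels count as gaps.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text-line
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # leaving a text-line
            start = None
    if start is not None:                  # text-line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy 6x6 "page" with two text-lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(extract_text_lines(page))  # [(1, 2), (4, 4)]
```

In this sketch, a single vertical stroke of border noise running down the page puts ink in every row, so the gaps in the profile vanish and the whole page is returned as one line. This is one simple way to see why such methods degrade under the conditions listed above.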

German Research Center for Artificial Intelligence
Deutsches Forschungszentrum für Künstliche Intelligenz