Publication

Layout Analysis of Arabic Script Documents

Syed Saqib Bukhari, Faisal Shafait, Thomas Breuel

In: Volker Märgner, Haikal El Abed (eds.). Guide to OCR for Arabic Scripts. Pages 35-53, Springer, 2012.

Abstract

Layout analysis—extraction of text lines from a document image and identification of their reading order—is an important step in converting the document into a searchable electronic representation. Projection methods are typically employed for extraction of text lines in Arabic script documents. Although projection methods achieve good accuracy on clean, skew-free documents, their performance drops under challenging situations (border noise, skew, complex layouts, etc.). This chapter presents a layout analysis system for extracting text lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian, etc.) and different styles (Naskh, Nastaliq, etc.). The presented system is based on a suitable combination of different well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations.

German Research Center for Artificial Intelligence