OCR-Free Table of Contents Detection in Urdu Books
Adnan Ul Hasan; Syed Saqib Bukhari; Faisal Shafait; Thomas Breuel
In: IAPR International Workshop on Document Analysis Systems (DAS-12), 10th, March 27-29, Gold Coast, Queensland, Australia, IEEE, 3/2012.
A Table of Contents (ToC) is an integral part of multi-page documents such as books and magazines. Most existing techniques rely on textual similarity to detect ToC pages automatically. However, such techniques cannot be applied where OCR technology is not available, which is the case for historical documents and for many modern Nabataean (Arabic) and Indic scripts. It is therefore necessary to develop tools for navigating such documents without OCR. This paper reports a preliminary effort to address this challenge. The proposed algorithm has been applied to find ToC pages in Urdu books, achieving an overall initial accuracy of 88%.