High Performance Layout Analysis of Medieval European Document Images

Syed Saqib Bukhari, Ashutosh Gupta, Anil Kumar Tiwari, Andreas Dengel

In: The 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM-2018), January 16-18, Funchal, Portugal. Scitepress, 2018.


Layout analysis, which mainly comprises binarization and page segmentation, is one of the most important performance-determining steps of an OCR system for complex medieval document images, which contain noise, distortions and irregular layouts. In this paper, we present high performance page segmentation techniques for medieval European document images, which include a novel main-body and side-notes segregation and an improved version of OCRopus-based text line extraction. To complete the high performance layout analysis pipeline, we also present the application of percentile-based binarization (Afzal et al., 2014) and multiresolution-morphology-based text and non-text segmentation (Bukhari et al., 2011) to historical document images. The presented layout analysis techniques are applied to a collection of 15th-century Latin document images, achieving more than 90% accuracy for each of the segmentation techniques.
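
To illustrate the general idea behind percentile-based binarization, the following is a minimal sketch in plain Python. It is not the method of Afzal et al. (2014) and not the OCRopus implementation: it assumes a simplified, global variant in which a low and a high percentile of the gray values serve as estimates of ink and background, and the function names (`percentile`, `binarize`) and the parameters `low`, `high` and `threshold` are hypothetical choices made for this example only.

```python
def percentile(values, p):
    """Return the p-th percentile (0..100) of a non-empty list of numbers.

    Uses a simple nearest-rank rule on the sorted values.
    """
    ordered = sorted(values)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[rank]


def binarize(image, low=5, high=90, threshold=0.5):
    """Binarize a grayscale image given as a list of rows (0 = black, 255 = white).

    Assumption for this sketch: the `low` percentile approximates the ink
    level and the `high` percentile approximates the page background.
    Gray values are rescaled to [0, 1] between those two levels, and pixels
    darker than `threshold` are marked as ink (1); all others become 0.
    """
    pixels = [value for row in image for value in row]
    dark = percentile(pixels, low)
    bright = percentile(pixels, high)
    span = max(bright - dark, 1)  # guard against division by zero on flat images

    result = []
    for row in image:
        out_row = []
        for value in row:
            # Normalize and clamp to [0, 1] before thresholding.
            normalized = min(max((value - dark) / span, 0.0), 1.0)
            out_row.append(1 if normalized < threshold else 0)
        result.append(out_row)
    return result


# A tiny 3x3 page: three dark "ink" pixels on a bright background.
page = [
    [200, 210, 40],
    [205, 50, 215],
    [220, 210, 45],
]
mask = binarize(page)
```

Estimating the two levels from percentiles rather than from the minimum and maximum makes the rescaling less sensitive to isolated outlier pixels, which is one plausible reason such an approach suits noisy historical scans. A practical system would typically estimate the background locally rather than once for the whole page.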

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence