A High-Performance Document Image Layout Analysis for Invoices

Mohammad Mohsin Reza, Md Ajraf Rakib, Syed Saqib Bukhari, Andreas Dengel

In: 13th IAPR International Workshop on Document Analysis Systems (DAS-2018), April 24-27, 2018, Vienna, Austria.


Layout analysis is an important step in the OCR pipeline, and extracting searchable full text from scanned images is currently an area of intensive research. Invoices differ in nature from pages of books, magazines, loan documents, and similar material, since they contain tables, headers, footers, large white spaces, currency, item names, item amounts, and logos. Standard layout analysis proves ineffective on invoices. In this paper we propose an advanced layout analysis for invoices that integrates the following steps into the standard layout analysis: (i) removal of table cell lines, (ii) reassignment of the page frame, (iii) merging of blocks, and (iv) cropping of full text lines from blocks. Additionally, we integrated the proposed layout analysis for invoices into the anyOCR system, which was mainly developed for standard document layouts. In the performance evaluation, we compare our advanced layout analysis pipeline with the standard anyOCR~\cite{Kadi} pipeline and with a commercial OCR system such as ABBYY. Our advanced layout analysis achieved better OCR accuracy than both of these systems.
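To make step (i) concrete, here is a minimal sketch of one common way to remove table ruling lines from a binarized page: clear every horizontal or vertical run of ink pixels that is longer than any text stroke could plausibly be. The abstract does not say how the paper removes table cell lines, so this run-length approach, the function name `remove_long_runs`, and the `min_run` threshold are all assumptions made for illustration and should not be read as the authors' method.

```python
def remove_long_runs(image, min_run):
    """Return a copy of a binary image (1 = ink, 0 = background) in which
    every horizontal or vertical ink run of at least `min_run` pixels is
    cleared. Hypothetical helper for illustration, not the paper's method.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    to_clear = [[False] * width for _ in range(height)]

    # Mark long horizontal runs, scanning each row left to right.
    for y in range(height):
        x = 0
        while x < width:
            if image[y][x]:
                start = x
                while x < width and image[y][x]:
                    x += 1
                if x - start >= min_run:
                    for k in range(start, x):
                        to_clear[y][k] = True
            else:
                x += 1

    # Mark long vertical runs, scanning each column top to bottom.
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x]:
                start = y
                while y < height and image[y][x]:
                    y += 1
                if y - start >= min_run:
                    for k in range(start, y):
                        to_clear[k][x] = True
            else:
                y += 1

    # Build the output without modifying the input image.
    return [
        [0 if to_clear[y][x] else image[y][x] for x in range(width)]
        for y in range(height)
    ]
```

In this sketch, short runs such as character strokes are left untouched, while pixels where a character touches a ruling line are removed together with the line. A practical system would need to choose `min_run` relative to the scan resolution and tolerate skewed or broken lines, which this simplified version does not attempt.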

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence