Publication

Classification and Information Extraction for Complex and Nested Tabular Structures in Images

Amir Moris Monir Riad, Christian Sporer, Syed Saqib Bukhari, Andreas Dengel

In: The 14th IAPR International Conference on Document Analysis and Recognition. International Conference on Document Analysis and Recognition (ICDAR-2017), November 13-15, Kyoto, Japan. IEEE, 2017.

Abstract

Understanding technical documents, such as manuals, is one of the most important steps in the automatic reporting and/or troubleshooting of defects. Most of the relevant information in these documents is presented in tabular form. Solutions exist for extracting tabular structures from text, but extracting tabular information from images, and in particular from complex and nested tables, remains a major challenge. This paper proposes classification and information extraction methods for complex tabular structures in document images. The methods are hybrid approaches that use both the image layout and the OCRed text. On a real-world dataset of technical documents from a German railway company (Deutsche Bahn AG), the proposed methods outperform other state-of-the-art approaches. As a result, the proposed approaches won the competition held by Deutsche Bahn AG in 2016 against other participating research groups and companies.

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence