An Investigative Analysis of Different LSTM Libraries for Supervised and Unsupervised Architectures of OCR Training

Syed Saqib Bukhari, Cannannore Nidhi Narayana Kamath, Sumam Francis, Andreas Dengel

In: 16th International Conference on Frontiers in Handwriting Recognition (ICFHR-2018), August 5-8, Niagara Falls, United States. IEEE, 2018.


Optical Character Recognition (OCR) converts images of text into machine-encoded, editable text. Despite wide research advances in OCR, the recognition capability of OCR systems on unseen or degraded historical documents is still questionable. Degradations such as torn pages, ink spread, and blurring are major challenges, especially in old paper documents. Most such degraded documents lack a generalized and reliable OCR system, mainly because ground-truth data is unavailable and OCR systems generalize poorly. Moreover, manually transcribing the documents is a cumbersome task that also requires language-specific expertise. This paper presents a feasibility study of different OCR architectures, together with different preprocessing stages, for reliable OCR on such challenging documents. To this end, we evaluate various OCR settings on a dataset of highly degraded historical German typewriter documents. We investigate key aspects of OCR training: the impact of different LSTM libraries, the use of grayscale or binarized training data, and the training data size. In addition, we analyze, on a small dataset, the effect of using completely manually transcribed data versus semi-corrected ground-truth data in the anyOCR architecture for unsupervised OCR training. In comparison with other OCR systems, the anyOCR framework showed promising results as an efficient OCR system. The factors analyzed provide a feasible strategy for approaching and evaluating highly challenging historical documents.

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence