A Tesseract-Based OCR Framework For Historical Documents Lacking Ground-Truth Text

Brennan Nunamaker, Syed Saqib Bukhari, Damian Borth, Andreas Dengel

In: IEEE International Conference on Image Processing (ICIP 2016), September 25-28, 2016, Phoenix, Arizona, USA. IEEE, 2016.


Computationally transcribing historical document images to digital text often requires an initial, labor-intensive recording of ground truth by language experts to provide the OCR system with training text. This paper presents a framework for the automatic generation of training data, given only labeled character images and a digital font, thus removing the need for manually generated text. In contrast to sample images paired with their real text as ground truth, our approach is based on the random, rule-based generation of “meaningless” text in an image file and a corresponding ground-truth text file. Furthermore, we experimentally demonstrate a correlation between the mutual similarity of the character sample images in a subset and the resulting model's classification performance. This allows us to identify upper- and lower-bound performance subsets for model generation using only the sample images themselves. We show that using more training samples does not unequivocally improve model performance, allowing us to focus on the case of one sample per character during training. Training a Tesseract model only with the sample that maximizes a dissimilarity metric for each character with respect to all other mean character sample images yields a character recognition error of ca. 15% on our custom benchmark of 15th-century Latin documents, compared to ca. 27% for a model trained in the traditional Tesseract style on a synthetically generated training image from (manually marked) real text.
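The selection strategy described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dissimilarity metric (here mean squared pixel difference), the `select_training_samples` function name, and the dict-based data layout are all assumptions for the sake of the example.

```python
import numpy as np


def select_training_samples(samples):
    """For each character, pick the one sample image that is most dissimilar
    to the mean sample images of all *other* characters.

    samples: dict mapping character -> list of equally sized float arrays.
    Dissimilarity here is assumed to be the mean squared pixel difference,
    averaged over all other characters' mean images.
    """
    # Mean sample image per character.
    means = {c: np.mean(np.stack(imgs), axis=0) for c, imgs in samples.items()}

    chosen = {}
    for c, imgs in samples.items():
        other_means = [m for k, m in means.items() if k != c]

        def dissimilarity(img):
            # Average MSE between this sample and every other character's mean.
            return np.mean([np.mean((img - m) ** 2) for m in other_means])

        # Keep the sample maximizing dissimilarity to the other characters.
        chosen[c] = max(imgs, key=dissimilarity)
    return chosen
```

The selected one-sample-per-character set would then be rendered, together with randomly generated "meaningless" text, into the training image and ground-truth file pair that Tesseract's training pipeline consumes.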

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence