Language independent thresholding optimization using a Gaussian mixture modelling of the character shapes

Yves Rangoni; Joost van Beusekom; Thomas Breuel
In: MOCR '09: Proceedings of the International Workshop on Multilingual OCR. International Workshop on Performance Evaluation Issues in MultiLingual OCR (MOCR-09), July 25, Barcelona, Spain, Pages 1-9, ISBN 978-1-60558-698-4, ACM, 2009.


One of the first steps in a digitization process is the binarization of the document image. The major subsequent steps, such as layout analysis, line extraction, and text recognition, assume a black-and-white image as input. Several thresholding methods have been proposed for document images, but few of them take the behaviour of the text recognizer into account, and they often rely on parameters that depend on the class of documents. In a large-scale process, neither relying on empirical assumptions nor tuning parameters manually is feasible. In this paper, we introduce a statistical model of a binarization suited to a character recognizer. The model is a mixture of Gaussians that gives the prior probability that a binarization yields the most suitable transcription afterwards. The training is done at the character level and tuned specifically for the recognizer. The optimization consists of finding the binarization that produces the best character shapes according to the model. As opposed to existing methods, the optimization is goal-directed and is not tied to subjective visual criteria. We demonstrate the effectiveness of this approach, called Gaussian Mixture Token Thresholding, on a subset of the Google 1000 Books dataset containing old documents, where we achieve an improvement of more than 10 points compared to a regular binarization.
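
The idea summarized in the abstract, scoring each candidate binarization by how well its character shapes fit a Gaussian mixture and keeping the best one, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the shape features, the training procedure, or the search strategy. The three-value descriptor in `shape_features`, the global threshold candidates, the diagonal covariances, and all function names are hypothetical, and the mixture parameters are assumed to have been fitted beforehand on character shapes that the recognizer handles well.

```python
import math

def log_gmm(x, weights, means, variances):
    """Log-density of x under a diagonal-covariance Gaussian mixture."""
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))

def components(binary):
    """4-connected components of ink pixels (value 1), as lists of (row, col)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] and not seen[r][c]:
                stack, comp = [(r, c)], []
                seen[r][c] = True
                while stack:
                    y, x = stack.pop()
                    comp.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

def shape_features(comp):
    """Hypothetical descriptor: bounding-box width, height, and ink density."""
    rows = [p[0] for p in comp]
    cols = [p[1] for p in comp]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return (float(width), float(height), len(comp) / (width * height))

def score_threshold(gray, t, gmm):
    """Mean log-likelihood of the component shapes obtained at threshold t."""
    binary = [[1 if px < t else 0 for px in row] for row in gray]
    comps = components(binary)
    if not comps:
        return float("-inf")  # no ink at all: never a suitable binarization
    return sum(log_gmm(shape_features(c), *gmm) for c in comps) / len(comps)

def best_threshold(gray, candidates, gmm):
    """Pick the candidate threshold whose shapes the mixture finds most likely."""
    return max(candidates, key=lambda t: score_threshold(gray, t, gmm))
```

Here `gray` is a list of rows of grey levels and `gmm` is a `(weights, means, variances)` tuple. The point of the design is that the selection criterion is the likelihood of the resulting shapes under a model trained for the recognizer, rather than a visual quality measure of the binarized image.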
