Publication

The hOCR Microformat for OCR Workflow and Results

Thomas Breuel
In: Proceedings of the Ninth International Conference on Document Analysis and Recognition. International Conference on Document Analysis and Recognition (ICDAR-07), September 23-26, Curitiba, Brazil, Pages 1063-1067, Vol. 2, ISBN 1520-5363, 0-7695-2822-8, IEEE Computer Society, Washington, DC, 2007.

Abstract

Large-scale scanning and document conversion efforts have led to a renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format is defined as a microformat on top of the HTML and CSS standards; it can therefore represent a wide range of linguistic and typographic phenomena with already well-defined, widely understood markup, and it can be processed using widely available and well-known tools. The format builds on a new, multi-level abstraction of OCR results that combines logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for supporting existing workflows and for developing future model-based OCR engines.
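
As an illustration of what "a microformat on top of HTML" that is processable with widely available tools can look like in practice, the following is a minimal sketch, not the paper's reference implementation. It assumes the conventions of the published hOCR specification, which this abstract does not spell out: a line is a `span` with class `ocr_line`, and its bounding box is stored in the `title` attribute as `bbox x0 y0 x1 y1`. The sample fragment, `parse_bbox`, `LineExtractor`, and `extract_lines` are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment written for this example (not taken from the paper).
SAMPLE_HOCR = """
<div class="ocr_page" title="bbox 0 0 800 1200">
  <p class="ocr_par">
    <span class="ocr_line" title="bbox 50 100 750 140">Large-scale scanning</span>
    <span class="ocr_line" title="bbox 50 150 700 190; baseline 0 -5">and document conversion</span>
  </p>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title such as 'bbox 1 2 3 4; ...', or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineExtractor(HTMLParser):
    """Collects (bbox, text) for every span whose class list contains ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._depth = 0      # span nesting depth inside the current line
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth > 0:
            # A nested span (for example a word-level element) inside a line.
            self._depth += 1
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "ocr_line" in classes:
            self._depth = 1
            self._bbox = parse_bbox(attr_map.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append((self._bbox, "".join(self._text).strip()))


def extract_lines(hocr):
    """Parse an hOCR string and return a list of (bbox, text) pairs."""
    parser = LineExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    for bbox, text in extract_lines(SAMPLE_HOCR):
        print(bbox, text)
```

Because the markup is ordinary HTML, the sketch needs only the Python standard library; an XPath or CSS-selector tool would serve equally well.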