The hOCR Microformat for OCR Workflow and Results

Thomas Breuel

In: Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR-07), September 23-26, Curitiba, Brazil, Pages 1063-1067, Vol. 2, ISSN 1520-5363, ISBN 0-7695-2822-8, IEEE Computer Society, Washington, DC, 2007.


Large-scale scanning and document conversion efforts have led to renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a new OCR system and ground truth data release. The format is defined as a microformat on top of the HTML and CSS standards; it can therefore represent a wide range of linguistic and typographic phenomena with well-defined, widely understood markup, and it can be processed with widely available, familiar tools. The format rests on a new multi-level abstraction of OCR results that combines logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for supporting existing workflows and for developing future model-based OCR engines.
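To illustrate the claim that an HTML-based microformat can be processed with widely available tools, here is a minimal sketch that reads word text and bounding boxes using only the Python standard library. The abstract itself does not list any markup names: the class names `ocr_page`, `ocr_line`, and `ocrx_word` and the `bbox` property in the `title` attribute are taken from the published hOCR specification, and the sample fragment, `parse_bbox`, `WordCollector`, and `extract_words` are hypothetical names introduced here for illustration. The sketch also assumes that word elements contain no nested tags.

```python
from html.parser import HTMLParser

# Hypothetical hOCR-style fragment. Class names and the "bbox" title
# property follow the hOCR specification; the content is invented.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 2480 3508">
  <span class="ocr_line" title="bbox 100 200 900 250">
    <span class="ocrx_word" title="bbox 100 200 300 250">Hello</span>
    <span class="ocrx_word" title="bbox 320 200 520 250">world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a semicolon-separated title string, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs for elements carrying the ocrx_word class."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = None  # None while the parser is outside a word element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._bbox = parse_bbox(attr.get("title") or "")
            self._text = ""

    def handle_data(self, data):
        if self._text is not None:
            self._text += data

    def handle_endtag(self, tag):
        # Assumption: words are leaf elements, so the first end tag closes the word.
        if self._text is not None:
            self.words.append((self._text.strip(), self._bbox))
            self._text = None


def extract_words(hocr):
    """Parse an hOCR-style string and return a list of (text, bbox) tuples."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

Because the markup is ordinary HTML, the same fragment also opens in a web browser or any HTML-aware toolkit; a production reader would likely use a full HTML or XML library rather than this simplified parser.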


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence