Project | OCRopus

OCRopus document analysis and OCR system

The Image Understanding and Pattern Recognition (IUPR) research group at the DFKI is developing the next generation system for Google's massive book scanning project. The system will incorporate state-of-the-art pattern recognition, statistical natural language processing, and image processing methods. The resulting system will be deployed by Google in the field, and it will also be released in open source form as a system for both desktop OCR and high volume conversion efforts.

The Google Library Project consists of two separate sub-projects:

Optical Character Recognition for Books

Several efforts to digitize large numbers of books and other long documents are underway at Google and other companies and institutions.

To make the content of those books searchable and accessible to other kinds of analysis and use, it needs to be transformed into machine-readable text (character segmentation and recognition), and the layout and markup of the pages need to be analyzed from the geometric arrangement of text, images, and whitespace on each page. Collectively, these two operations are carried out by Optical Character Recognition (OCR) systems. OCR appears to be a mature field, with many decades of research by numerous research groups invested in it. However, current commercial OCR systems still have a number of limitations for practical applications. These limitations include:

  • Even though their error rates are low compared to other kinds of pattern recognition problems, current commercial OCR systems still have error rates high enough that ensuring correct recognition of even clean scans of books requires proofreading.
  • While OCR error rates are low on average, across all books and all fonts, specific scans, books, and fonts may have high error rates.
  • Usually, these systems are targeted to specific scripts or languages.
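The gap between average and per-book error rates can be made concrete with a small sketch. It assumes the common character error rate metric (edit distance divided by reference length); the project description does not name a metric, and the function names and sample strings below are hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def error_rates_by_book(
    books: Dict[str, List[Tuple[str, str]]]
) -> Tuple[Dict[str, float], float]:
    """Return per-book error rates and the rate pooled over all books.

    `books` maps a (hypothetical) book name to (reference, ocr_output) pairs.
    """
    per_book = {}
    total_errors = 0
    total_chars = 0
    for name, pages in books.items():
        errors = sum(edit_distance(ref, hyp) for ref, hyp in pages)
        chars = sum(len(ref) for ref, _ in pages)
        per_book[name] = errors / chars
        total_errors += errors
        total_chars += chars
    return per_book, total_errors / total_chars


if __name__ == "__main__":
    # Made-up sample data: one clean book and one with a difficult font.
    sample = {
        "clean": [("a" * 99 + "b", "a" * 100)],
        "difficult_font": [("abcd", "xbcx")],
    }
    per_book, pooled = error_rates_by_book(sample)
    print(per_book, pooled)
```

In this made-up sample the pooled rate stays below 3% because the clean book dominates the character count, while the difficult book has half of its characters wrong.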

The goal of this project is to address these shortcomings of current OCR systems with an OCR system specifically targeted at books. Our approach will combine special properties of the book OCR problem with recent advances in pattern recognition methods in order to achieve substantial improvements in OCR performance.

Trainable and Adaptive Layout Analysis

Document layout---the location and relative arrangement of text and images on a page---contains crucial semantic information about a document. The same string in the document might represent a page number or a section number, a section heading or a footnote, an original text or a quotation. Layout can also indicate that items are arranged in a list or a table. Document layout analysis attempts to recover layout information from representations of a document that do not contain such information explicitly and/or completely.

Currently available layout analysis systems (either research systems or components of commercial OCR systems) handle a variety of simple formats fairly reliably: single and double column memoranda, simple business letters, technical reports, and simple book layouts (e.g., fiction). Many other layouts cannot be handled reliably yet, however; examples are many newspaper and magazine layouts, many textbooks and scientific publications, and documents like catalogs.

Difficulties in analyzing those layouts arise from a variety of sources: non-Manhattan layouts, non-rectangular outlines, complex relationships between spacing and text style for indicating function, complex backgrounds, deliberate overlaps of elements, the presence of formulas and graphics, the use of artistic fonts and font effects, and many other visual effects.
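To illustrate why simple column layouts are tractable while non-Manhattan layouts are not, here is a minimal sketch of a vertical projection profile, a classic textbook technique for finding column gutters. It is not presented as the method used in this project; the function names, the `min_gap` parameter, and the tiny hand-made bitmaps are all assumptions for illustration.

```python
from typing import List, Tuple


def parse_page(rows: List[str]) -> List[List[int]]:
    """Turn strings of '0'/'1' into a binary page (1 = ink, 0 = background)."""
    return [[int(c) for c in row] for row in rows]


def find_column_gaps(page: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Find interior runs of ink-free pixel columns at least `min_gap` wide.

    Returns half-open (start, end) x-ranges. Runs touching the left or right
    page edge are treated as margins and ignored.
    """
    if not page:
        return []
    width = len(page[0])
    # Vertical projection profile: amount of ink in each pixel column.
    profile = [sum(row[x] for row in page) for x in range(width)]

    gaps = []
    start = None
    for x, ink in enumerate(profile):
        if ink == 0:
            if start is None:
                start = x
        elif start is not None:
            if start > 0 and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    # A run still open here touches the right edge, so it is a margin.
    return gaps


if __name__ == "__main__":
    # Two rectangular text columns separated by a straight gutter.
    two_column = parse_page(["1110001111",
                             "1110001111",
                             "1110001111"])
    print(find_column_gaps(two_column))

    # The same gutter, but slanted: no pixel column is completely empty.
    slanted = parse_page(["1100111111",
                          "1110011111",
                          "1111001111"])
    print(find_column_gaps(slanted))
```

The straight gutter is found as a single gap, while the slanted one yields no gap at all even though a human reader sees two columns. Difficulties of this kind are one reason fixed geometric rules fall short on the layouts listed above.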

This project will develop document layout analysis techniques that can analyze the layouts found in newspapers and magazines in order to improve summarization and indexing of both scanned and electronic documents and to improve the user experience for on-line viewing and browsing. The system is intended to be trainable, adaptable, and retargetable by non-programmers and non-expert users.

Sponsors

Google Inc.