Publication

High Performance Document Layout Analysis

Thomas Breuel
In: Symposium on Document Image Understanding Technology, Greenbelt, Maryland. Symposium on Document Image Understanding Technology, University of Maryland, 2003.

Abstract

In this paper, I summarize research in document layout analysis carried out over the last few years in our laboratory. Correct document layout analysis is a key step in document capture conversions into electronic formats, optical character recognition (OCR), information retrieval from scanned documents, appearance-based document retrieval, and reformatting of documents for on-screen display. We have developed a number of novel geometric algorithms and statistical methods. Layout analysis systems built from these algorithms are applicable to a wide variety of languages and layouts, and have proven to be robust to the presence of noise and spurious features in a page image. The system itself consists of reusable and independent software modules that can be reconfigured to be adapted to different languages and applications. Currently, we are using them for electronic book and document capture applications. If there is commercial or government demand, we are interested in adapting these tools to information retrieval and intelligence applications. (This paper is a compilation of results, figures, and descriptions from a number of previously published papers. Please see the individual sections for attributions and references.)