Clustering and Classification of Document Structure - A Machine Learning Approach

Andreas Dengel, Frank Dubiel

In: Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR-95), August 14-15, Montreal, QC, Canada, Pages 587-591, Vol. 2, ISBN 0-8186-7128-9, IEEE Computer Society, Washington, DC, USA, 1995.


We describe a system that learns how the logical structure of documents is presented, using business letters as the example domain. Given a set of instances, the system clusters them into structural concepts and induces a concept hierarchy. This hierarchy then serves as the basis for classifying future input. The paper introduces the individual learning steps, describes how the resulting concept hierarchy is applied to logical labeling, and reports on the results.

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence