Evaluating OCR and Non-OCR Text Representations for Learning Document Classifiers

Markus Junker, Rainer Hoch

In: Proceedings of the 4th International Conference on Document Analysis and Recognition (ICDAR-97), August 18-20, Ulm, Germany, pages 1060-1066, ISBN 0-8186-7898-4, IEEE Computer Society, Washington, DC, USA, 1997.


In the literature, many feature types and learning algorithms have been proposed for document classification. However, an extensive and systematic evaluation of the various approaches has not yet been carried out. To investigate different text representations for document classification, we have developed a tool that transforms documents into feature-value representations suitable for standard learning algorithms. In this paper we investigate seven document representations for German texts based on n-grams and single words. We compare their effectiveness in classifying OCR texts and the corresponding correct ASCII texts in two domains: business letters and abstracts of technical reports. Our results indicate that n-grams are an attractive technique that is competitive even with techniques relying on morphological analysis. This holds for OCR texts as well as for correct ASCII texts.
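
As a rough illustration of why character n-grams can tolerate OCR errors better than whole words, the following minimal sketch builds two feature-value representations of the same token and compares them. It is not the tool described in the paper: the function names, the choice of trigrams, the `_` word-boundary padding, and the sample OCR error are all assumptions made for this example, and the abstract does not specify the seven representations actually evaluated.

```python
from collections import Counter
import re


def word_features(text):
    """Bag of lowercased single words (feature -> count)."""
    return Counter(re.findall(r"\w+", text.lower()))


def ngram_features(text, n=3):
    """Bag of character n-grams taken inside each word.

    Each word is padded with '_' on both sides (an assumption of this
    sketch) so that word beginnings and endings form their own n-grams.
    """
    counts = Counter()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def overlap(a, b):
    """Number of feature occurrences shared by two bags."""
    return sum((a & b).values())


clean = "Rechnung"   # correct ASCII text
ocr = "Rechnnng"     # hypothetical OCR error: 'u' recognized as 'n'

shared_words = overlap(word_features(clean), word_features(ocr))
shared_trigrams = overlap(ngram_features(clean), ngram_features(ocr))
print(shared_words, shared_trigrams)
```

With a single misrecognized character, the word-based representations share no feature at all, while the trigram representations still share most of theirs. Either bag can be passed on as a feature-value vector to a standard learning algorithm.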

Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence