Analysing layout information: searching PDF documents for pictures

Brigitte Mathiak, Andreas Kupfer, Richard Münch, Claudia Täubner, Silke Eckstein

Abstract

Roughly 5 percent of the web’s content consists of PDF documents, yet most web mining applications either ignore them or apply a standard conversion to ASCII. We believe that PDF documents contain valuable information (most scientific publications, for example, are stored as PDF) and should therefore be analysed more thoroughly, using not only the text but also the layout, including the pictures. In this paper we address some of the issues involved in this approach and present an example application that helps scientists search for specific pictures in a corpus of scientific PDF documents.
