Supporting Early Contextualization of Textual Content in Digital Documents on the Web

Bahaa Eldesouky; Menna Bakry; Heiko Maus; Andreas Dengel
In: Proceedings of the 13th International Conference on Document Analysis and Recognition. International Conference on Document Analysis and Recognition (ICDAR-15), 13th, August 23-26, Nancy, France, 2015.


The World Wide Web is arguably the most important source of digital documents today. These documents consist mainly of unstructured and semi-structured data and offer a wealth of information to the DAR (Document Analysis and Recognition) community. Contextualization plays an important role in understanding the content of those documents. In this paper, we present an approach to early contextualization of textual data in HTML documents. It combines automatic and semi-automatic annotation of named entities with user interaction to support contextualization of document content as early as the authoring stage of the document life cycle. We also present the results of an online experimental evaluation involving 120 human test subjects. The results show that our approach produces semantically annotated versions of unstructured textual content that contain reliable contextual information, thus facilitating later document analysis stages.