The Aware Toolbox for the Detection of Law Infringements on Web Pages

Asif Shahab, Thomas Kieninger, Andreas Dengel

In: Document Recognition and Retrieval XVII. SPIE Conference on Document Recognition and Retrieval (DRR-10), located at Electronic Imaging - Program Track "Image Processing", January 19-21, San Jose, California, United States, 1/2010.


In the Aware project, we aim to develop an automatic assistant for detecting law infringements on web pages. The project is motivated by the fact that many web page authors at some point infringe copyright or other laws, mostly without being aware of it, and are increasingly confronted with costly legal warnings. Because the legal environment is constantly changing, an important requirement of Aware is that the domain knowledge can be defined and maintained by numerous legal experts working remotely, without further assistance from computer scientists. Consequently, the software platform was chosen to be a web-based generic toolbox that can be configured to suit individual analysis experts, definitions of the analysis flow, information gathering, and report generation. The report generated by the system summarizes all critical elements of a given web page and provides case-specific hints to the page author, and thus forms a new type of service. For the analysis subsystems, Aware mainly builds on existing state-of-the-art technologies, whose usability has been evaluated for each intended task. To control the heterogeneous analysis components and to gather the information, a lightweight scripting shell has been developed. This paper describes the analysis technologies, which range from text-based information extraction through optical character recognition and phonetic fuzzy string matching to a set of image analysis and retrieval tools, as well as the scripting language used to define the analysis flow.


Deutsches Forschungszentrum für Künstliche Intelligenz
German Research Center for Artificial Intelligence