Publication

Information Extraction from HTML Documents by Structural Matching

Thomas Breuel

In: Second International Workshop on Web Document Analysis. International Workshop on Web Document Analysis (WDA-2003), located at ICDAR 2003, August 3, Edinburgh, 2003.

Abstract

Structured information, such as database contents, stock quotes, financial data, timetables, citations, and customer records, is widely available on the Internet in HTML format. While the presentation of such information in HTML format is convenient for human users, it is not particularly convenient for automatic processing because it contains a large amount of irrelevant information. Furthermore, the semantic meaning of different parts of an HTML document may be encoded in ways that do not correspond in a simple way to a structured representation of the data. A number of approaches have been taken to this problem in the past. One is to attempt to get content providers to present their information in a more structured format. In that approach, the information itself is transmitted in XML format from the server, together with formatting instructions, and the browser combines the structured information and the formatting instructions into a human-readable document. However, this approach is not yet widespread because only a fraction of the browsers in use are able to perform the rendering. Furthermore, it requires content providers to make considerable changes to their software systems, in most cases for no obvious benefit. An alternative approach to recovering the information is the manual creation of "screen scraping" software. To create such software, programmers examine the structure of the HTML representing data on a particular web site and hand-code instructions for extracting information from different parts of that structure. This requires both a significant amount of manual work and considerable programming expertise.
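To make the hand-coded "screen scraping" approach concrete, the following is a minimal sketch in Python using only the standard library. It is not taken from the paper: the sample page, the class name QuoteScraper, and the function scrape_quotes are hypothetical, and the extraction rules assume a table in which each data row holds a symbol cell followed by a price cell.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; the markup and values are invented for illustration.
SAMPLE_HTML = """
<html><body>
<h1>Quotes</h1>
<table id="quotes">
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td><b>ABC</b></td><td>12.50</td></tr>
  <tr><td><b>XYZ</b></td><td>7.25</td></tr>
</table>
<p>Footer text</p>
</body></html>
"""


class QuoteScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (headings, footer, <th> cells) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows without <td> cells, such as the header row
                self.rows.append(self._row)
            self._row = None


def scrape_quotes(html):
    """Return (symbol, price) pairs; assumes exactly two cells per data row."""
    parser = QuoteScraper()
    parser.feed(html)
    parser.close()
    return [(symbol, float(price)) for symbol, price in parser.rows]


quotes = scrape_quotes(SAMPLE_HTML)
print(quotes)
```

Every rule in this sketch (which tags delimit a record, which cell holds which field, which markup to discard) is tied to the layout of one page, which illustrates why the abstract describes this approach as requiring manual work and programming expertise for each web site.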

InfExtractHTMLDoc.pdf (pdf, 645 KB)