XPath-Wrapper Induction by generating tree traversal patterns

Tobias Anton

Abstract

We introduce a wrapper induction algorithm for extracting information from tree-structured documents such as HTML or XML. It derives XPath-compatible extraction rules from a set of annotated example documents. The approach builds a minimally generalized tree traversal pattern and augments it with conditions. A variant of the algorithm selects a subset of these conditions so that (a) the pattern is consistent with the training data, (b) the pattern’s document coverage is minimized, and (c) conditions that match structures preceding the target nodes are preferred. We discuss the robustness of the rules induced by this selection strategy and illustrate how these rules exhibit knowledge of the target concept.
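
To make the idea of a minimally generalized traversal pattern concrete, the following is a minimal sketch and not the algorithm of this paper. It assumes that all annotated target nodes lie at the same depth, and it only relaxes tags and positional indexes along the root-to-target path; the function names `steps_to` and `generalize` and the two sample documents are hypothetical. The condition augmentation and condition selection described above are not modeled.

```python
import xml.etree.ElementTree as ET


def steps_to(root, target):
    """Return the root-to-target path as a list of (tag, index) steps.

    index is the 1-based position among siblings with the same tag,
    which matches the meaning of tag[n] in XPath.
    """
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append((node.tag, same_tag.index(node) + 1))
        node = parent
    return list(reversed(steps))


def generalize(paths):
    """Keep a tag or index only where every example path agrees.

    Assumption of this sketch: all paths have the same length.
    """
    if len({len(p) for p in paths}) != 1:
        raise ValueError("this sketch only handles targets at equal depth")
    pattern = []
    for column in zip(*paths):
        tags = {tag for tag, _ in column}
        indexes = {idx for _, idx in column}
        if len(tags) > 1:
            pattern.append("*")                # tags differ: any element
        elif len(indexes) > 1:
            pattern.append(tags.pop())         # positions differ: drop index
        else:
            pattern.append(f"{tags.pop()}[{indexes.pop()}]")
    return "./" + "/".join(pattern)


# Two hypothetical annotated examples; the <span> is the target node.
doc_a = ET.fromstring(
    "<html><body><div><p>ad</p></div>"
    "<div><span>19.99</span></div></body></html>"
)
doc_b = ET.fromstring(
    "<html><body><div>ad</div><div>nav</div>"
    "<div><span>24.50</span></div></body></html>"
)
paths = [steps_to(doc, doc.find(".//span")) for doc in (doc_a, doc_b)]
xpath = generalize(paths)
extracted = [[e.text for e in doc.findall(xpath)] for doc in (doc_a, doc_b)]
```

In this sketch the rule is evaluated relative to the document's root element. If a non-target `div` also contained a `span`, the relaxed rule would match it as well; restoring consistency with the training data in such cases is the role of the additional conditions.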
