Learning HyQL in One Day

HyQL is an SQL-like WWW query language that supports flexible selection of document parts as well as navigation through the WWW by following hyperlinks. Its expressiveness allows document portions to be addressed in a variety of ways. The documents themselves are represented as parse tree structures in a canonical form, which means that some obvious faults in documents are repaired, optional start or end-tags are added. The basic operation model of HyQL is that all selection and filtering operations work on sets of trees which either represent whole documents or document parts, resp. HyQL programs are considered as scripts which are evaluated by an appropriate interpreter.


Overview






Simple Access to Documents


We start with the most simple HyQL script which is just able to fetch the whole content of a specific document: select content from document d in http://www.yahoo.de/

Here, d is a variable bound to the documents content. This content is represented as the result of a parsing process performed on the documents actual data delivered by the respective web server. The consequence is that on one hand we transform all documents into a canonical (tree) structure and on the other hand we repair obvious faults often contained in the documents (these are faults on the HTML level).


Alternatives:

select root,descendant(1) from select content from document d in http://www.yahoo.de select root,descendant(1,html) from select content from document d in http://www.yahoo.de select root,descendant(1,html) from document d in http://www.yahoo.de Assuming that the page delivered by accessing www.yahoo.de contains a full html structure all three scripts above return the same result. Details of how they perform on the representation of the document are given in the following sections.

A document is often fetched by sending a CGI-request to a server, that is what your browser usually does when you submit a form contained in a specific page. The next example demonstrates this behaviour:

select content from document d such that document d fill formget http://www.traxxx.de/cgi/hotel/start with buchstabe = "Kleinblittersdorf", Suchen.x = "0", Suchen.y = "0"

This HyQL script simulates the submission of a form element using the GET-method; using the keyword formpost instead of formget would simulate a POST-request.


What HyQL does in the background

Whenever the evaluation of an HyQL script is concerned with a document access the HyQL interpreter follows some specific rules:


Accessing Parts of Documents TOP


The basis for the selection of document parts is the XML linking and referencing concept operating on a labeled ordered HTML parse tree. Let T be such a parse tree, E a tag in T and P the path from the root of T to E.
There are many ways to characterize a specific element E in T: by specifying constraints on P, the content or other properties of E, by specifying a context of E and E's relative position to this context, etc.
Collections (sets of elements of a specific type, specific structure or content, or with specific attributes) play a key role in the specification of a position. There are collections of subtrees where the root is a specific tag, has specific attributes, the attributes are set to specific values, or the leaves match specific patterns. All collections can be accessed as a complete list, or indexed forwards or backwards. In addition to the single-element and all-collections, selections of intervals are also possible: fetching a prefix or suffix of an all-collection, or specifying two single border elements and automatically all elements between them.


Specifying the basic elements of a collection:

Given a current location in a document's parse tree, we consider the following methods of searching for subtrees:
Building collections and searching for elements:

Let's consider a simple example of a search expression (as part of a complete query expression):

root,descendant(1,table) It specifies that the search starts at the root of the current document tree. Then we consider the collection of all table elements in the document which build subtrees which can be accessed from the current position, i.e. descendants which are tables. In that collection we take the first (1) member.

Collections can incrementally be refined by simply combining search expressions:

root,descendant(1,body)(1,table) It specifies that we first search for the first body element and then we search for the first table element which occurs in that context.

We can also combine different types of collections, e.g.:

root,descendant(1,table)(2,tr)child(1) Here, we first search in the descendant collection to find the second row of the first table. Then we search in the child collection of this row and take the first element regardless of its tag name.

Besides specifying the instance and type of an element to be searched for we can further constrain the search for an element by the specification of (some of) its HTML attributes. For example, the search expression

root,descendant(1,font,{ size = "3" face = "arial" }) searches for first occurring font element which is specialized by the attributes size with value 3 and face with value arial, respectively.

There are different ways to enumerate elements of a collection

descendant(1,table) the first table
descendant(1,*) the first element
descendant(1) the first element
descendant(-2,table) the second last table
descendant(all,table) all tables
descendant(all,table,{ border = "1" }) all tables with a border of width 1


Multiple selections:

It is possible to build sequences of search expressions in order to select a sequence of individual objects from the same resource; this is usually used if the desired selection is neither dense nor identical with some all-collection.

root,descendant(1,table)(2,tr) root,descendant(1,table)(3,tr) This expression specifies, that the second and third row of the respective table will be selected.


Alternatives:


In the example above we use an absolute adressing for both search expressions. But we could do this selection also in the sense of "select the second row of the respective table and then the next immediately following row". In order to do that, HyQL provides the keyword here instead of root. It specifies that the search starts at the root of the last element in the result list of the search expression which has been evaluated just before, and not at the root of the overall resource element. The alternative sequence is given as root,descendant(1,table)(2,tr) here,next(1,tr)

Selecting spans of elements:

In order to be able to select a specific dense region of a document HyQL provides a mechanism to select a whole span of elements. Such a span is characterized by the specification of its left and right borders in form of search expressions connected by a span operator. We distinguish three different variants how a span of elements can be selected from a specific document. We explain them using the following sample document tree:

Consider the following search expression:

root,descendant(1,table)(1,tr) .. here,next(1,tr)
It is able to select the first two rows of the first table in the respective document. The result is given as:

A further example using the ".." operator is provided as:

root,descendant(1,table)(1,tr) .. here,following(-2,td) It selects the following subtrees as the result of its evaluation:

The second variant of a span collection extends that from above in a way that it tries to find HTML-block elements which just cover the selected elements in a span. It is in a certain sense the display-oriented variant of a span selection. Consider the following search expression:

root,descendant(1,table)(1,td) ..block here,following(-1,tr) It basically selects td elements in the first table starting with the first one and goes on gathering them until the last tr element is covered. Now, it searches for block-building elements in order to wrap subsets of these selected elements by more abstract ones. In our example, we wrap all td elements by the block element table:

The last variant of a span selection extends the first one by trying to wrap the selected elements in the most abstract way, regardless of the fact that the wrapping elements are block-building or not. The expression

root,descendant(1,b) ..max here,following(-1,tr) selects the b element and all elements contained in the table element and wraps all these by the div element:


Modification of results:

HyQL provides some functions in order to be able to transform the result of evaluating a search expression. These are operations which either work on the HTML structure or on strings. We will discuss the different forms of adapted search expression:


Following Links TOP


HyQL provides a mechanism to specify qualification conditions on a document to be fetched as a resource for information extraction.
The schema is given as:

select ... from document d such that qualification_conditions(d) [ where ... ] The set of qualification conditions on a document is defined on basis of the following categories:
The formulation of assignments for form submission

An HTML form submission generally involves a set of attribute-value pairs which are transformed into a specific format for the purpose of sending them to a specific server. HyQL allows different ways of specifying these attribute-value pairs.

  1. location = "Paris"
    This specifies that an HTML input element whith a name of location has set its value to Paris. If the respective form contains a select element with a name of location then the value of the respective option element whose content matches with the string Paris is taken as the corresponding value for the attribute location.
  2. location.value = "Paris"
    This handles the case of directly choosing a value for a select element with a name of location.
  3. location =$ INFOVARIABLE
    This extends the cases above in a way that the value to be assigned is not given by a fixed string, but available as the current value of the variable INFOVARIABLE.
  4. INFOVARIABLE $= "Paris"
    That specifies the other variant where the name of the respective attribute is given the current value of a variable (e.g. INFOVARIABLE evaluates to the value location. Both variants can also be combined.


Accessing Multiple Documents TOP


HyQL provides the ability to bind more than one document to a single document variable. That can be initiated either by queries of the kind

select ... from forall document d in ... or select ... from forall document d such that ... The interpretation of the access to multiple documents is done in a weak way, i.e. the HyQL interpreter softens the constraint of fetching a specific set of documents at once to the case of fetching as much as possible documents of a specific set which are accessible w.r.t. a specific amount of available time.


Specification of qualification conditions TOP


HyQL provides a query schema of the form select ... from ... where ... in order to express qualification conditions on the elements to be selected. The usual way to do that is that we define an information variable in the select part of the query and specify a set of conditions on this variable in the where part.
In order to express qualification conditions on resources which are documents (you have seen some examples above), we used the such that construction.

Comparison on string level:


A sample HyQL-script able to deal with a request like "Select only those rows from a table which contain the string 'Destination'" could be given as:

select info t := root,descendant(1,table)(all,tr) from document d such that ... where t match "*Destination*" If this query can successfuly be evaluated the info variable t is bound to a set of tr elements which satisfy the qualification condition; the query returns this set as its result of evaluation. An application of a string comparison on an tr element (or any other HTML-element) means that we first built a string representation of the element which in fact is the concatenation of the strings contained in the leaves considering thetr element in its tree structure. This means, that we only take the browser-based textual representation into account. The only special character we currently support in strings is the *-character with its usual wildcard meaning.

Other operations supported:


Comparison on structure level:

A sample HyQL-script able to deal with a request like "Select only those rows from a table which contain explicitly at least seven columns" could be given as:

select info t := root,descendant(1,table)(all,tr) from document d such that ... where root,child(7,td) applicable to t If this query can successfuly be evaluated the info variable t is bound to a set of tr elements which satisfy the qualification condition that a specific search expression was applicable to each of them.


Combined comparisons:

A sample HyQL-script able to deal with a request like "Select only those rows from a table which contain at least five columns and in its first column element some text containing the keyword 'Departed'" could be given as:

select info t := root,descendant(1,table)(all,tr) from document d such that ... where root,child(5,td)previous(-1) applies to t matches "*Departed*" This query combines a structural filter, the relevant row must contain at least five columns, with a content-based filter, namely that the first column of this row must also match a specific keyword. The search expression we used navigates first to the fifth td element and from that back to the first one (the last in the previous collection).
The variant ... applies to ... not matches ... is also supported.
The specification of the content-based filter can also be done using an info variable.


Index variables:

We will explain the concept of an index variable using an example. Consider the following script consisting of two queries: { let info t := root,descendant(1,table)(1,tr)child(#1,td) from document d such that ... where t match "Date" } { select root,child(#1,td) from select root,descendant(1,table)(1,tr)next(all) from document d } The first query's task is to find that column of a table whose first line contains the keyword Date and to store the columns position for future reference. The index variable #1 is bound to that respective position (index variable symbols always start with the character # followed by a digit); it works here in the sense of a marker for a specific column.
The second query will now select the marked column from the table but starting with the second row (that are all which follow the first one considering the child collection of table rows).
Once an index variable has been bound all further occurences refer to its value. (Remark: The document variable d has been assigned a value during the evaluation of the first query, so its value can be reused in the second query.)


Selecting elements in virtual collections TOP


Consider the situation that a page contains a table which has the following structure:

It contains six rows and four columns and the relevant elements which should be selected are the table cells in colors green and red, respectively. A way to handle this problem is to consider the table as a collection of pairs of rows and then for each pair we select the first cell of its first component and the third cell of its second component. HyQL provides a way to build the "virtual" collection of (in this case) pairs of rows.
This is done using the query schema select sequence ... from ... Our problem from above can then be handled in a way like: { let info t := root,descendant(1,table) from document d in http://... } { select sequence valid_div root,descendant(1,tr) here,next(1,tr) from t } The first query is responsible for providing the context which covers all elements of the new collection, in our example it is just the whole table. The second query builds the pairs of rows. The evaluation starts by selecting the first tr element and from that location additionally the next one. Then, the HyQL interpreter searches for the next ability to evaluate the sequence of search expressions again. It iterates on doing this as long as the search for the respective subcontext is successful. What internally happens is in fact an iterative rewriting of the second query as long as the execution of the current subquery is successful. Considering our example this looks like: { let info t := root,descendant(1,table) from document d in http://... } { select valid_div root,descendant(1,tr) here,next(1,tr) from t } { select valid_div here,following(1,tr) here,next(1,tr) from t } { select valid_div here,following(1,tr) here,next(1,tr) from t } The rewriting pattern root,descendant --> here,following is currently the only pattern which is supported.

There exists a restriction for the successful application of the iterative query schema:

Now, our selection problem of gathering the green and red cells is solved by the three-query script given as: { let info t := root,descendant(1,table) from document d in http://... } { let sequence info v := valid_div root,descendant(1,tr) here,next(1,tr) from t } { select root,child(1,tr)(1,td) root,child(2)(3,td) from v } An alternative (more compact) way to select the relevant cells form the table is given as: { let info t := root,descendant(1,table) from document d in http://... } { select sequence root,descendant(1,tr)(1,td) here,following(1,tr)(3,td) from t } Here, the virtual collection is only built implicitly, i.e. the pairs of rows don't really exist.


Selecting Elements in a Specific Context TOP


Consider an information extraction task of the kind described by the following abstract sample:
"There is a specific document d. That document contains a specific image which is part of a table and characterizes this table as the one which contains the first part of information to be selected, e.g. the text following a header line. The second part of information follows in the flow of the document below the particular table, e.g. a set of header lines."

and and an associated sample script:

{ let info t := root,descendant(1,table) from document d in http://... where root,descendant(1,td)(1,img,{ src = "big.gif" }) applicable to t } { select root,descendant(1,h2)(1,p) from t } { select root,following(all,h2) from t in context document d }

Now, the last query expression from the script above takes as a primary resource the object bound to the variable t; it tries to access all h2-tags which follow this object. However, these elements are located per definition outside the scope of the object. In order to get the desired results the scope of the object can be expanded using the keyword in context in addition to the respective secondary resource (in our example the whole document d). Secondary resources can also be given as objects bound to an info variable.

The use of secondary resources applies also to the specification of qualification conditions for the case of structural and combined filters.


Useful Features TOP



Assignment to a list of info variables

The concept of the assignment of the result of a search expression to an info variable can be extended to the case of a multiple assignment. Consider for example the case that the first and the last row of a table should be selected for a further processing in successive subqueries:

select infolist r1 r2 := root,descendant(1,table)(1,tr) root,descendant(1,table)(-1,tr) from document d ...

Here, the keyword infolist is followed by a set of variables and a respective set of search expressions.


Resources for search expressions

A resource for the evaluation of search expressions can also be specified by the elements bound to an info variable in the sense of:

{ let info t := root,descendant(1,table) from document d ... } ... { select root,descendant(-1,tr) from t } If a set of elements is bound to a resource variable then the evaluation strategy is that the search expression is sequentially applied on each of the elements yielding a set of results.
Annotation of documents

The usual behavior of the HyQL-interpreter is that it annotates all documents after they have been fetched. It does it in a way that all textual entities in an HTML-document are transformed into sets of span-elements where each span-element encapsulates either a special character, a number, or simply a word. For example, this means that a paragraph of text is now accessible as a list of individual words and punctuation marks. A textstring like "Hagen-Nord" is transformed into

<span pan="[1, 3, 2]" class="wordPAN">Hagen</span> <span pan="[1, 3, 3]" class="extraPAN">-</span> <span pan="[1, 3, 4]" class="wordPAN">Nord </span>

The attribute class identifies the syntactic category and the attribute pan contains an internal identifier representing the position of the respective element in the parsed document tree as used by the HyQL-interpreter.

If the annotation of documents is not necessary then the directive

{ no annotation } used as a subquery of an HyQL-script disables the annotating process. The annotation of individual resources can then be enabled by specifying a resource using the function annotated. The interpreter supports queries of the schemata: { select ... from annotated document d } { select ... from annotated INFOVARIABLE }


Restrictions TOP