A document is often fetched by sending a CGI request to a server; that is what your browser
usually does when you submit a form contained in a page. The next example
demonstrates this behaviour:
Whenever the evaluation of an HyQL script involves a document access, the
HyQL interpreter follows some specific rules:
|
Specifying the basic elements of a collection:
|
Given a current location in a document's parse tree, we consider the following methods of searching for subtrees:
- descendant : elements appearing within the context of the current location
- child : child elements of the location source
- ancestor : elements in whose content the location source is found
- previous : preceding sibling elements of the location source
- next : following sibling elements of the location source
- preceding : elements which appear before the location source
- following : elements which appear after the location source
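The seven search methods can be pictured with a small Python sketch built on the standard xml.etree.ElementTree module. It is only an illustration and not HyQL's implementation: the helper name axes and the sample document are hypothetical. Two points are assumptions: previous is ordered nearest sibling first, and preceding and following leave out ancestors and descendants, respectively.

```python
import xml.etree.ElementTree as ET


def axes(root, here):
    """Return the seven collections for node `here` inside `root` (illustrative only)."""
    parent = {child: p for p in root.iter() for child in p}
    order = list(root.iter())                      # document (pre-order) order
    descendants = [e for e in here.iter() if e is not here]
    ancestors = []
    node = here
    while node in parent:                          # walk up to the root
        node = parent[node]
        ancestors.append(node)
    siblings = list(parent[here]) if here in parent else [here]
    k = siblings.index(here)
    i = order.index(here)
    return {
        "descendant": descendants,
        "child": list(here),
        "ancestor": ancestors,
        "previous": siblings[:k][::-1],            # assumption: nearest sibling first
        "next": siblings[k + 1:],
        # assumption: ancestors and descendants are not counted in these two
        "preceding": [e for e in order[:i] if e not in ancestors],
        "following": [e for e in order[i + 1:] if e not in descendants],
    }


# Made-up sample document; the current location is the first table row.
doc = ET.fromstring(
    "<html><body><p/><table><tr><td/><td/></tr><tr><td/></tr></table><h2/></body></html>"
)
first_row = doc.find("body/table/tr")
tags = {name: [e.tag for e in nodes] for name, nodes in axes(doc, first_row).items()}
```

Here tags maps each method name to the tag names it finds, which makes it easy to compare the collections for one and the same location.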
|
Building collections and searching for elements:
|
Let's consider a simple example of a search expression (as part of a complete query expression):
root,descendant(1,table)
It specifies that the search starts at the root of the current document tree. We then consider the
collection of all table elements which can be reached from the current position, i.e. the
descendants which are tables. In that collection we take the first (1) member.
Collections can incrementally be refined by simply combining search expressions:
root,descendant(1,body)(1,table)
It specifies that we first search for the first body element and then for the first
table element which occurs in that context.
We can also combine different types of collections, e.g.:
root,descendant(1,table)(2,tr)child(1)
Here, we first search in the descendant collection to find the second row of the first table.
Then we search in the child collection of this row and take the first element regardless of its tag name.
Besides specifying the instance and type of an element to be searched for we can further
constrain the search for an element by the specification of (some of) its HTML
attributes. For example, the search expression
root,descendant(1,font,{ size = "3" face = "arial" })
searches for the first occurring font element which is specialized by the attributes
size with value 3 and face with value arial, respectively.
|
There are different ways to enumerate elements of a collection
|
| descendant(1,table) | the first table |
| descendant(1,*) | the first element |
| descendant(1) | the first element |
| descendant(-2,table) | the second-to-last table |
| descendant(all,table) | all tables |
| descendant(all,table,{ border = "1" }) | all tables with a border of width 1 |
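The enumeration variants can be mimicked with a short Python sketch. It is a hedged illustration, not the interpreter's code: the function name enumerate_collection and the representation of elements as (tag, attributes) pairs are hypothetical, and returning an empty result for out-of-range positions is an assumption.

```python
def enumerate_collection(collection, which=1, tag="*", attrs=None):
    """Pick members of `collection`, a list of (tag, attributes) pairs.

    which: positive = count from the front (1-based), negative = count
    from the back, "all" = every matching member.
    """
    attrs = attrs or {}
    matching = [
        (t, a) for (t, a) in collection
        if tag in ("*", t) and all(a.get(k) == v for k, v in attrs.items())
    ]
    if which == "all":
        return matching
    if which == 0:
        return []                      # position 0 is not meaningful here
    index = which - 1 if which > 0 else which
    # Assumption: an out-of-range position yields an empty result.
    if -len(matching) <= index < len(matching):
        return [matching[index]]
    return []


# Made-up collection in document order.
collection = [
    ("p", {}),
    ("table", {"border": "1"}),
    ("table", {}),
    ("table", {"border": "1", "id": "last"}),
]
first_table = enumerate_collection(collection, 1, "table")
first_element = enumerate_collection(collection, 1)
second_last_table = enumerate_collection(collection, -2, "table")
bordered = enumerate_collection(collection, "all", "table", {"border": "1"})
```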
|
Selecting spans of elements:
|
In order to be able to select a specific dense region of a document HyQL provides a
mechanism to select a whole span of elements. Such a span is characterized by the
specification of its left and right borders in form of search expressions connected
by a span operator.
We distinguish three variants of selecting a span of elements from a
specific document. We explain them using the following sample document tree:
Consider the following search expression:
root,descendant(1,table)(1,tr) ..
here,next(1,tr)
It is able to select the first two rows of the first table
in the respective document. The result is given as:
A further example using the ".." operator is provided as:
root,descendant(1,table)(1,tr) ..
here,following(-2,td)
It selects the following subtrees as the result of its evaluation:
The second variant of a span collection extends that from above in a way that it
tries to find HTML-block elements which just cover the selected elements in a
span. It is in a certain sense the display-oriented variant of a span
selection. Consider the following search expression:
root,descendant(1,table)(1,td) ..block
here,following(-1,tr)
It basically selects td elements in the first table, starting with the first one, and goes on
gathering them until the last tr element is covered. It then searches for
block-building elements in order to wrap subsets of the selected elements in more
abstract ones. In our example, all td elements are wrapped by the block element table:
The last variant of a span selection extends the first one by trying to wrap the
selected elements in the most abstract way, regardless of the fact that the wrapping
elements are block-building or not.
The expression
root,descendant(1,b) ..max
here,following(-1,tr)
selects the b element and all elements contained in the table element and wraps all
of these in the div element:
HyQL provides some functions for transforming the result of
evaluating a search expression. These are operations which work either on the HTML
structure or on strings. We will discuss the different forms of adapted search
expressions:
-
select valid_html root,descendant(1,table)
from document d in http://www.yahoo.de
The basic search expression contained in this query selects the first table of the
respective document. The function valid_html guarantees that the structure returned is a
full site-specific HTML structure. This means that the table element above will be wrapped
by a
<html> <head> ... </head><body> ...
structure.
If the relevant document source d contains a base specification, it is integrated into the
wrapping HTML structure, so that relative links contained in the selected
data (the table element) can be resolved
appropriately. (If the target system using the result of the query is a Web browser,
it is then able, for example, to display images referenced by relative links.)
-
select valid_table root,descendant(1,table)(2,tr) .. here,next(-1,tr)
from document d in http://www.yahoo.de
The basic search expression contained in this query selects all but the first row of
the relevant table. Using the function valid_table
the set of rows can be reorganized into a new table element (again in order to be displayed
appropriately).
select valid_div root,descendant(1,table)(2,tr)(all,td)
from document d in http://www.yahoo.de
The function valid_div reorganizes the selected
elements by wrapping them into a new container (the general div element). This wrapping transforms a set of subtrees
into a new singleton tree, which can be important if the query result is further processed as
a resource in other queries.
select valid_div_set valid_div root,descendant(1,table)(2,tr)(all,td)
valid_div root,descendant(2,table)(2,tr)(all,td)
from document d in http://www.yahoo.de
The function valid_div_set reorganizes the results of
multiple search expressions by wrapping them into a new div container. The result of the sample script is of structure
<div> <div> ..... </div>
<div> ..... </div>
</div>
-
select fill form root,descendant(1,form) with q = "SAP"
from document d in http://www.altavista.de
The function fill form ... with ... first evaluates a
search expression which must provide a form element as its result. Then, this element
is adapted in a way that it represents the form value setting specified by the given
assignments (the right-hand side of an assignment must be double-quoted; individual
assignments are separated by blanks).
-
select adapt root,descendant(1,font) by size = "1"
from document d in http://...
The function adapt ... by ... is able to change
values of attributes of a tag element and returns the adapted element. The sample
script above changes the size attribute of the first
font element found.
-
text root,descendant(1,table)(all,td)
The function text returns the concatenation
of the leaf words of the selected elements. For the sample
expression this means that all the strings wrapped by the individual td elements are joined together without any additional
separator.
-
textnl root,descendant(1,table)(all,td)
This variant of the function text integrates a
newline character as a delimiter between the individual substrings.
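The difference between text and textnl can be sketched in Python as follows. This is an assumption-laden illustration: each selected element is modelled as a plain list of its leaf words, and the newline is assumed to separate the strings of the individual elements (not the individual words).

```python
def text(elements):
    """Concatenate the leaf words of all selected elements without a separator."""
    return "".join("".join(words) for words in elements)


def textnl(elements):
    """Like text, but with a newline between the strings of the individual elements."""
    return "\n".join("".join(words) for words in elements)


# Made-up leaf words of two td elements.
cells = [["Hagen", "-", "Nord"], ["12", ":", "30"]]
joined = text(cells)
joined_nl = textnl(cells)
```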
-
root,descendant(1,table)(all,tr)(2,td)child(alltextnl,span)
There is a special variant of the enumeration of elements contained in a
collection. The enumeration type alltextnl works as
follows. First, it performs like type all
resulting in a set of selected elements. Then it applies locally the function text on this set and terminates the resulting string
with a newline character. The evaluation of the sample search
expression above results in a set of strings associated with the rows of the
respective table. Each of these strings is the newline-terminated concatenation of the leaf words
contained in span elements wrapped by the second
data cell of each row.
-
This variant of alltextnl works in a way that the
individual string of each leaf word is terminated by a newline character.
-
buildstring "$index.html" root,descendant(1,a)href
The function buildstring is a two-place
function. The first argument is a string pattern where the character $ is used as a variable to be instantiated by a string
resulting from the evaluation of the second argument (in the sample expression above
the value of the href attribute of the first
anchor found in the respective document is then concatenated with the string index.html). This is naturally extended to the case
where the pattern contains more than one $ placeholder. A valid evaluation demands that the
number of placeholders equals the number of strings to be introduced. Thereby, the set of
strings is the result of evaluating either one or more than one search expression. In
the latter case, the union of the individual result sets is relevant.
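The placeholder rule of buildstring can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the interpreter's code: the strings produced by the search expressions are passed in as a ready-made list, and a mismatch between placeholders and strings is reported with a ValueError (the document only says that a valid evaluation demands equal numbers).

```python
def buildstring(pattern, values):
    """Replace each $ in `pattern` by the next string from `values`."""
    parts = pattern.split("$")
    if len(parts) - 1 != len(values):
        raise ValueError("number of $ placeholders and number of strings differ")
    result = [parts[0]]
    for value, rest in zip(values, parts[1:]):
        result.append(value)
        result.append(rest)
    return "".join(result)


# Made-up href value standing in for root,descendant(1,a)href.
single = buildstring("$index.html", ["http://www.yahoo.de/"])
double = buildstring("$/$.html", ["docs", "intro"])
```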
-
substring "http://$/*" root,descendant(1,a)href
The function substring is also a two-place
function with arguments given as a string pattern and a search expression,
respectively.
Its task is to extract a specific substring from the string resulting from the
evaluation of the search expression. The string pattern must contain the special
character $ which defines the location of the
substring to be extracted. The character * has its
usual regular expression meaning.
Now, the sample search expression above will return the name of the server used in
the respective anchor.
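One way to picture the behaviour of substring is to translate the pattern into a regular expression of Python's re module. This is only a sketch: the document does not specify HyQL's matching strategy, so treating $ as a shortest possible capture, * as "any characters", and requiring the whole string to match are all assumptions.

```python
import re


def substring(pattern, value):
    """Extract the part of `value` located at the $ in `pattern` (None if no match)."""
    if pattern.count("$") != 1:
        raise ValueError("the pattern must contain exactly one $")
    regex = "".join(
        "(.*?)" if ch == "$" else ".*" if ch == "*" else re.escape(ch)
        for ch in pattern
    )
    match = re.fullmatch(regex, value)
    return match.group(1) if match else None


# Made-up href value standing in for root,descendant(1,a)href.
server = substring("http://$/*", "http://wetter.yahoo.de/land/deutschland.html")
```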
HyQL provides a mechanism to specify qualification conditions on a document to be
fetched as a resource for information extraction.
The schema is given as:
select ...
from document d
such that qualification_conditions(d)
[ where ... ]
The set of qualification conditions on a document is defined on the basis of the
following categories:
-
|
The specification of an URL |
document d in URL1 URL2 ...
The respective document must be specified by one of the given urls. If no further
restrictions occur then the urls are processed in a left-to-right order and
processing stops as soon as one url has been fetched. Instead of specifying the urls
explicitly there is also the possibility to use the current value already assigned to an info
variable.
document d in INFOVARIABLE
That kind of document specification is also possible for the simple kind of document
access in the sense of
select ...
from document d in INFOVARIABLE
Both kinds of specification demand that the current script contains a sub query whose
evaluation has already assigned some value to the variable INFOVARIABLE.
-
|
The specification of attributes of a document
|
We currently support the attributes url, content, title
d.url = "http://www.yahoo.de"
d.url = QUERY
d.title match STRING
d.title nomatch STRING
d.content match STRING
d.content nomatch STRING
The url of the requested document can be given as a constant or as the result of the
evaluation of a complete query expression. The title and the content (i.e. that part
of a document which is visible in a browser) of a document can be constrained to match
or not to match a specific string expression. It is a simple regular
expression which may contain the wildcard character *. A specification using only title
or content requirements is not sufficient in HyQL to fetch the respective document.
-
|
Regular link path expressions |
A subgraph of the WWW to be explored and searched for a specific document
can be specified by the use of regular path expressions.
We classify links as local (on the same server
as the source), strict local (the link has the url of its parent as a proper prefix), global, or empty (source equals target). Regular expressions built
upon these categories can describe a more or less restrictive set of paths between
the source and the potential target document.
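The four link categories can be sketched in Python with the standard urllib.parse module. The function name classify_link is hypothetical, and several details are assumptions: the most specific category is returned, the prefix test is a plain string comparison, and "same server" is decided by comparing the host part of the two urls.

```python
from urllib.parse import urlsplit


def classify_link(source, target):
    """Return the most specific category of the link source -> target."""
    if source == target:
        return "empty"
    if target.startswith(source):                  # source is a proper prefix
        return "strict local"
    if urlsplit(source).netloc == urlsplit(target).netloc:
        return "local"
    return "global"


source = "http://wetter.yahoo.de/land/"
categories = [
    classify_link(source, "http://wetter.yahoo.de/land/"),
    classify_link(source, "http://wetter.yahoo.de/land/deutschland.html"),
    classify_link(source, "http://wetter.yahoo.de/index.html"),
    classify_link(source, "http://www.altavista.de/"),
]
```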
Complex expressions are built using the
following primitives:
- =>: a global link of length one
- ->: a local link of length one
- :>: a strict local link of length one
- =>*: a global link of arbitrary length
- ->*: a local link of arbitrary length
- :>*: a strict local link of arbitrary length
- |: separation of alternative expressions
and the schema:
document d0 PATH_EXPRESSION document d
Path expressions can only be evaluated when the source of a path can be constrained
to a concrete URL taking all constraints of the respective qualification
context into account. Consider the following examples:
select content from document d1
such that document d in http://wetter.yahoo.de/land/deutschland.html
document d ->-> document d1
This query specifies the selection of the content of a document
d1 which is reachable from a document d by following local links of path length two, i.e. we take one
intermediate document into account.
select content from document d1
such that document d in http://wetter.yahoo.de/land/deutschland.html
document d -> document d1
d1.url = select root,descendant(3,a)href
from document d
We specify a local link from document d to d1, but the URL of document d1
should be given by the href attribute of the third anchor
we found in the source document d. The local link
constraint works in different ways depending on the current value of the href attribute selected. If it defines a complete URL then the
constraint checks whether the link is a local one. Otherwise, if only a partial URL
is defined the constraint automatically extends it to a complete one.
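How a partial URL is extended is not spelled out here. A plausible reading, shown as a Python sketch, is the standard resolution of a relative reference against the url of the source document, as performed by urllib.parse.urljoin. The function name resolve_href and the sample href values are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_href(source_url, href):
    """Extend a partial href to a complete url relative to the source document."""
    return urljoin(source_url, href)


source_url = "http://wetter.yahoo.de/land/deutschland.html"
relative = resolve_href(source_url, "berlin.html")
rooted = resolve_href(source_url, "/index.html")
complete = resolve_href(source_url, "http://www.yahoo.de/")
```

A complete href is returned unchanged, so the same call can be used for both cases before the local link check is applied.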
-
|
The specification of HTML form actions |
HyQL provides some features for simulating the submission of a form:
document d submit form INFOVARIABLE
Such a constraint causes a url access in the same way a browser performs one when it
submits a form. Here, we demand that an appropriate form element has already been
bound to an information variable. The following example clarifies the setting and
use of variables:
{ select info f := fill form root,descendant(1,form) with q = "Cebit99"
from document d in http://www.altavista.de/ },
{ select content
from document d2
such that document d2 submit form f }
Besides setting and using variables, this example introduces another new
concept, namely that of multiple queries. In order to execute more than one query expression they are simply joined, but each single one should be set in curly braces. (Note: the parser currently needs the spaces between the braces and the script code.)
The first query contains a variable assignment which specifies that an information (info) variable f is assigned the value which results from finding the first form element in document d and adapting it by setting its input field named q to the value "Cebit99".
The second query specifies a document access by just submitting this adapted form.
Here, we simulate the usual behavior of
- going to the altavista homepage
- entering a search term
- pressing the search button
- getting answers to our search query
The document returned by evaluating the HyQL script above would contain both the adapted form and the search results. Since that's not what we usually want,
there is the option to suppress the visible output of a query: we have to replace the
first occurrence of the keyword select by the keyword let.
document d fill form f with ASSIGNMENTS
We assume that the information variable f has been bound
to a form element. Before this form is transformed into a url request we apply the
specified value assignments to it.
document d fill formget URL with ASSIGNMENTS
document d fill formpost URL with ASSIGNMENTS
These two variants simulate a form submission without the need of having access to a
concrete form element. Instead of this we specify the action url, the method
of how parameters of the action are submitted (get or post), and the set of
attribute/value pairs to be transmitted.
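What the formget and formpost variants amount to on the HTTP level can be sketched with Python's standard urllib modules. This is an illustration only: the function name build_form_request and the action url are hypothetical, the encoding is the usual application/x-www-form-urlencoded format, and the request object is merely built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, method, assignments):
    """Build (but do not send) the request a form submission would produce."""
    data = urlencode(assignments)
    if method == "get":
        separator = "&" if "?" in action_url else "?"
        return Request(action_url + separator + data, method="GET")
    if method == "post":
        return Request(
            action_url,
            data=data.encode("ascii"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    raise ValueError("method must be 'get' or 'post'")


# Hypothetical action url; the assignment mirrors q = "Cebit99" from the example above.
get_request = build_form_request("http://www.example.org/search", "get", {"q": "Cebit99"})
post_request = build_form_request("http://www.example.org/search", "post", {"q": "Cebit99"})
```

With get the pairs become part of the url, with post they travel in the request body, which is the only difference between the two variants in this sketch.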
|
The formulation of assignments for form submission
|
An HTML form submission generally involves a set of attribute-value pairs which are
transformed into a specific format for the purpose of sending them to a specific
server. HyQL allows different ways of specifying these attribute-value pairs.
- location = "Paris"
This specifies that the HTML input element with the name
location has its value set to
Paris. If the respective form contains a select element with the name
location, then the value of
the option element whose content matches
the string Paris is taken as the value for
the attribute location.
- location.value = "Paris"
This handles the case of directly choosing a value for a select element with a name of
location.
- location =$ INFOVARIABLE
This extends the cases above in a way that the value to be assigned is not given by a
fixed string, but available as the current value of the variable INFOVARIABLE.
- INFOVARIABLE $= "Paris"
This specifies the other variant, where the name of the respective attribute is given
by the current value of a variable (e.g. INFOVARIABLE
evaluates to the value location). Both variants can also
be combined.
|
Accessing Multiple Documents
|
HyQL provides the ability to bind more than one document to a single document
variable. That can be initiated either by queries of the kind
select ...
from forall document d in ...
or
select ...
from forall document d such that ...
The access to multiple documents is interpreted in a weak way,
i.e. the HyQL interpreter relaxes the constraint of fetching a specific set of
documents at once to that of fetching as many documents of the
set as are accessible within a specific amount of available time.
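The weak interpretation can be pictured as a loop with a time budget. The following Python sketch is an assumption about the general idea, not a description of the interpreter: the function name fetch_available is hypothetical, the fetch operation is passed in as a parameter, and documents that cannot be fetched are simply skipped.

```python
import time


def fetch_available(urls, fetch, budget_seconds):
    """Fetch as many documents as possible before the time budget is used up."""
    deadline = time.monotonic() + budget_seconds
    documents = []
    for url in urls:
        if time.monotonic() >= deadline:
            break                                  # budget exhausted, stop early
        try:
            documents.append(fetch(url))
        except OSError:
            continue                               # inaccessible document, skip it
    return documents


def fake_fetch(url):
    """Stand-in for a real download so the sketch runs without network access."""
    if "broken" in url:
        raise OSError("not reachable")
    return "<html>" + url + "</html>"


urls = ["http://a.example/", "http://broken.example/", "http://b.example/"]
fetched = fetch_available(urls, fake_fetch, budget_seconds=60)
nothing = fetch_available(urls, fake_fetch, budget_seconds=0)
```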
|
Specification of qualification conditions
|
HyQL provides a query schema of the form
select ... from ... where ...
in order to express qualification conditions on the elements to be selected. The usual way to do that is to define an
information variable in the select part of the query and to specify a set of conditions on this variable
in the where part.
In order to express qualification conditions on resources which are documents (you
have seen some examples above), we used the such that construction.
|
Comparison on string level:
|
A sample HyQL-script able to deal with a request like "Select only those rows from a
table which contain the string 'Destination'" could be given as:
select info t := root,descendant(1,table)(all,tr)
from document d such that ...
where t match "*Destination*"
If this query can successfully be evaluated, the info variable t
is bound to the set of tr elements which satisfy the qualification condition; the query returns this set as its result.
Applying a string comparison to a tr element (or any other HTML element) means that we
first build a string representation of the element, which is the concatenation of the strings contained in the leaves
of the tr element's tree structure. This means that we only take the browser-based textual representation into account.
The only special character we currently support in strings is the * character
with its usual wildcard meaning.
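The wildcard comparison can be sketched in Python as follows. The function name matches is hypothetical, and two points are assumptions because the document does not state them: the comparison is case-sensitive, and the pattern has to cover the whole string representation of the element.

```python
import re


def matches(element_text, pattern):
    """True if `element_text` matches `pattern`, where * stands for any characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, element_text, flags=re.DOTALL) is not None


# Made-up string representations of three table rows.
rows = ["Flight Destination Gate", "LH 400 Paris A1", "Destination unknown"]
selected = [row for row in rows if matches(row, "*Destination*")]
rejected = [row for row in rows if not matches(row, "*Destination*")]
```

The second list corresponds to the complementary test, i.e. the rows for which the pattern does not match.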
Other operations supported:
- nomatch, in order to select only those elements whose textual representation doesn't match the given string.
select ...
from ...
where t nomatch "*Destination*"
-
the logical connectives and and or in order to connect qualification conditions appropriately.
-
There also exist variants where the matching string is replaced by a variable
according to the schema QUALIFVARIABLE operation
INFOVARIABLE. Thereby, the variable INFOVARIABLE
is evaluated before the condition can be tested. If INFOVARIABLE was e.g. assigned a value 'Destination' then the condition to be tested will be transcoded
by default to
QUALIFVARIABLE operation "Destination*".
|
Comparison on structure level:
|
A sample HyQL-script able to deal with a request like "Select only those rows from a
table which contain explicitly at least seven columns" could be given as:
select info t := root,descendant(1,table)(all,tr)
from document d such that ...
where root,child(7,td) applicable to t
If this query can successfully be evaluated, the info variable t
is bound to the set of tr elements which satisfy
the qualification condition that a specific search expression was applicable to each
of them.
A sample HyQL-script able to deal with a request like "Select only those rows from a table which contain at least five columns and, in their first
column element, some text containing the keyword 'Departed'" could be given as:
select info t := root,descendant(1,table)(all,tr)
from document d such that ...
where root,child(5,td)previous(-1) applies to t matches "*Departed*"
This query combines a structural filter (the relevant row must contain at least five
columns) with a content-based filter (the first column of this row must
also match a specific keyword). The search expression we used navigates first to the
fifth td element and from there back to the first one (the
last in the previous collection).
The variant ... applies to ... not matches ... is also
supported.
The specification of the content-based filter can also be done using an info variable.
We will explain the concept of an index variable using an example. Consider the
following script consisting of two queries:
{ let info t := root,descendant(1,table)(1,tr)child(#1,td)
from document d such that ...
where t match "Date" }
{ select root,child(#1,td)
from select root,descendant(1,table)(1,tr)next(all)
from document d
}
The first query's task is to find the column of a table whose first line contains
the keyword Date and to store the column's position for
future reference. The index variable #1 is bound to that
position (index variable symbols always start with the character
# followed by a digit); it works here as a marker
for a specific column.
The second query will now select the marked column from the table, but starting with
the second row (i.e. all rows which follow the first one in the
collection of table rows).
Once an index variable has been bound, all further
occurrences refer to its value. (Remark: The document variable
d has been assigned a value during the evaluation of the first
query, so its value can be reused in the second query.)
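The role of an index variable can be pictured with a small Python sketch in which a table is a list of rows and each row is a list of cell strings. The function names and the table content are made up for illustration; the sketch only mirrors the two steps of the script (bind the position, then reuse it) and says nothing about how the interpreter stores the binding.

```python
def bind_index(header_row, keyword):
    """Return the 1-based position of the first cell equal to `keyword`, else None."""
    for position, cell in enumerate(header_row, start=1):
        if cell == keyword:
            return position
    return None


def select_column(rows, position):
    """Select the cell at `position` (1-based) from every row that is long enough."""
    return [row[position - 1] for row in rows if len(row) >= position]


# Made-up table: the first row is the header line.
table = [
    ["Flight", "Date", "Gate"],
    ["LH 400", "12.03.", "A1"],
    ["BA 901", "13.03.", "B7"],
]
index_1 = bind_index(table[0], "Date")            # plays the role of #1
dates = select_column(table[1:], index_1)         # the rows after the first one
```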
|
Selecting elements in virtual collections
|
Consider the situation that a page contains a table which has the following structure:
It contains six rows and four columns and the relevant elements which should be
selected are the table cells in colors green and red, respectively. A way to handle
this problem is to consider the table as a collection of pairs of rows and then for each
pair we select the first cell of its first component and the third cell of its second
component. HyQL provides a way to build the "virtual" collection of (in this case)
pairs of rows.
This is done using the query schema
select sequence ...
from ...
Our problem from above can then be handled in a way like:
{ let info t := root,descendant(1,table)
from document d in http://... }
{ select sequence valid_div root,descendant(1,tr) here,next(1,tr)
from t }
The first query is responsible for providing the context which covers all elements of
the new collection; in our example it is just the whole table. The second query
builds the pairs of rows. The evaluation starts by selecting the first
tr element and, from that location, additionally the
next one. Then the HyQL interpreter searches for the next
opportunity to evaluate the sequence of search expressions again. It repeats
this as long as the search for the respective subcontext is successful. What
happens internally is in fact an iterative rewriting of the second query as long as
the execution of the current subquery is successful. For our example this
looks like:
{ let info t := root,descendant(1,table)
from document d in http://... }
{ select valid_div root,descendant(1,tr) here,next(1,tr)
from t }
{ select valid_div here,following(1,tr) here,next(1,tr)
from t }
{ select valid_div here,following(1,tr) here,next(1,tr)
from t }
The rewriting pattern
root,descendant --> here,following is
currently the only pattern which is supported.
There exists a restriction for the successful application of the iterative query
schema:
- All search expressions except for the first one used by the specification of the selection to be performed
in one iteration step must be given as relative search expressions (in the example
above: here,next(1,tr)).
Now, our selection problem of gathering the green and red cells is solved by the
three-query script given as:
{ let info t := root,descendant(1,table)
from document d in http://... }
{ let sequence info v := valid_div root,descendant(1,tr) here,next(1,tr)
from t }
{ select root,child(1,tr)(1,td) root,child(2)(3,td)
from v }
An alternative (more compact) way to select the relevant cells from the table is
given as:
{ let info t := root,descendant(1,table)
from document d in http://... }
{ select sequence root,descendant(1,tr)(1,td) here,following(1,tr)(3,td)
from t }
Here, the virtual collection is only built implicitly, i.e. the pairs of rows don't really
exist.
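The effect of the sequence schema on the six-row table can be imitated in Python. This is a simplified sketch, not the rewriting mechanism itself: a table is modelled as a list of rows of cell labels, all names are hypothetical, and each iteration is assumed to continue after the last row it consumed, which matches the three rewriting steps shown above.

```python
def row_pairs(rows):
    """Build the virtual collection of consecutive, non-overlapping pairs of rows."""
    pairs = []
    position = 0
    while position + 1 < len(rows):                # stop when no full pair is left
        pairs.append((rows[position], rows[position + 1]))
        position += 2
    return pairs


def select_cells(rows):
    """Per pair: first cell of the first row and third cell of the second row."""
    return [(first[0], second[2]) for first, second in row_pairs(rows)]


# Made-up table with six rows and four columns, cells labelled by position.
table = [["r%dc%d" % (r, c) for c in range(1, 5)] for r in range(1, 7)]
cells = select_cells(table)
```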
|
Selecting Elements in a Specific Context
|
Consider an information extraction task of the kind described by the following
abstract sample:
|
"There is a specific document d. That document contains a specific image which is
part of a table and
characterizes this table as the one which contains the first part of
information to be
selected, e.g. the text following a header line. The second part of information follows in the flow of the
document below the particular table, e.g. a set of header lines."
|
and an associated sample script:
{ let info t := root,descendant(1,table)
from document d in http://...
where root,descendant(1,td)(1,img,{ src = "big.gif" }) applicable to t }
{ select root,descendant(1,h2)(1,p) from t }
{ select root,following(all,h2)
from t in context document d }
Now, the last query expression from the script above takes as its primary resource the object
bound to the variable t; it tries to access all h2 tags which follow this object. However, these elements are
by definition located outside the scope of the object. In order to get the desired
results, the scope of the object can be expanded using the keyword in context
together with the respective secondary resource (in our example the whole document d).
Secondary resources can also be given as objects bound to an info variable.
The use of secondary resources applies also to the specification of qualification conditions
for the case of structural and combined filters.
|
Assignment to a list of info variables
|
The concept of the assignment of the result of a search expression to an info variable can
be extended to the case of a multiple assignment. Consider for example the case that the
first and the last row of a table should be selected for a further processing in
successive subqueries:
select infolist r1 r2 := root,descendant(1,table)(1,tr) root,descendant(1,table)(-1,tr)
from document d ...
Here, the keyword infolist is followed by a set of variables
and a respective set of search expressions.
|
Resources for search expressions
|
A resource for the evaluation of search expressions can also be specified by the elements
bound to an info variable in the sense of:
{ let info t := root,descendant(1,table)
from document d ... }
...
{ select root,descendant(-1,tr)
from t }
If a set of elements is bound to a resource variable then the evaluation strategy is that
the search expression is sequentially applied on each of the elements yielding a set
of results.
The usual behavior of the HyQL-interpreter is that it annotates all documents after they
have been fetched. It does it in a way that all textual entities in an HTML-document are
transformed into sets of span-elements where each span-element encapsulates either a special character, a number, or
simply a word. For example, this means that a paragraph of text is now accessible as a
list of individual words and punctuation marks. A textstring like "Hagen-Nord" is
transformed into
<span pan="[1, 3, 2]" class="wordPAN">Hagen</span>
<span pan="[1, 3, 3]" class="extraPAN">-</span>
<span pan="[1, 3, 4]" class="wordPAN">Nord </span>
The attribute class identifies the syntactic category and the
attribute pan contains an internal identifier representing the
position of the respective element in the parsed document tree as used by the
HyQL-interpreter.
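The splitting step of the annotation can be approximated with a regular expression in Python. This sketch covers only the segmentation into words, numbers and special characters. The category labels word, number and extra are simplified stand-ins (the sample above shows the class values wordPAN and extraPAN, while the class used for numbers is not shown), and the pan identifiers are not modelled at all.

```python
import re

# A run of letters, a run of digits, or one single special character.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")


def annotate(text):
    """Split `text` into (category, token) pairs, ignoring whitespace."""
    result = []
    for token in TOKEN.findall(text):
        if token.isdigit():
            category = "number"
        elif token.isalpha():
            category = "word"
        else:
            category = "extra"
        result.append((category, token))
    return result


station = annotate("Hagen-Nord")
platform = annotate("Gleis 12.")
```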
If the annotation of documents is not necessary then the directive
{ no annotation }
used as a subquery of an HyQL-script disables the annotating process.
The annotation of individual resources can then be enabled by specifying a resource using
the function annotated. The interpreter supports queries of
the schemata:
{ select ... from annotated document d }
{ select ... from annotated INFOVARIABLE }
- The assignment to a list of info variables is not supported for the case of
building virtual collections