Next: 18 Objects Up: 17 Assumption Grammars Previous: 17.1 Applying arbitrary methods

17.2 Handling regular expressions with AGs

The following "real-life" natural language tokenizer takes only a few lines of BinProlog. It is built on top of a generic regular expression recognizer that combines higher-order programming (call/N) with AGs.

% NL tokenizer - converts NL strings of chars to lists of words

chars_words(Cs,Ws):-dcg_def([32|Cs]),words(Ws),!,dcg_val([]).

% a sequence of words
words(Ws):-star(word,Ws),spaces.

% a token which is punctuation or a sequence of letters
word(W):-spaces,(plus(letter,Xs);one(punct,Xs)),!,name(W,Xs).

% 0 or more space characters
spaces:-star(space,_).

% recognizers
space(X):- #X, is_space(X). % recognizes space

letter(X):- #X, is_an(X).   % recognizes alpha-numerics

punct(X):- #X, is_punct(X).  % recognizes punctuation

% regexp tools with  AGs and call/N
one(F,[X]):- call(F,X). % recognizes one X of type F

star(F,[X|Xs]):- call(F,X),!,star(F,Xs). % recognizes 0 or more
star(_,[]).

plus(F,[X|Xs]):- call(F,X),star(F,Xs).   % recognizes 1 or more

The interface predicate chars_words/2 initializes the input list with dcg_def/1 and uses dcg_val/1 to check that it is empty at the end. Apart from that, higher-order programming (in this case call/N) is applied as usual. Parsing is done by combining basic recognizers (space/1, letter/1, punct/1), each of which consumes one input character with #/1 and succeeds only if that character is of the specified type. Words are combined into sequences by reusing the same regular expression operators (star/2 in words/1).
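To make the roles of dcg_def/1, #/1 and dcg_val/1 concrete, here is a minimal sketch that emulates them outside BinProlog. It is not how BinProlog implements AGs. It assumes SWI-Prolog, keeps the hidden input list in a backtrackable global variable under the hypothetical name ag_input, and substitutes code_type/2 for the BinProlog tests is_space/1, is_an/1 and is_punct/1, whose exact character classes may differ. The tokenizer clauses are repeated from the listing above so that the file loads on its own.

```prolog
% Sketch only: emulates BinProlog's AG primitives in SWI-Prolog.
% ag_input is a hypothetical name for the hidden input list.
:- op(200, fy, #).

dcg_def(Xs) :- b_setval(ag_input, Xs).   % install the hidden input list
dcg_val(Xs) :- b_getval(ag_input, Xs).   % inspect what is left of it

% Consume one item; b_setval/2 is undone on backtracking, so a
% recognizer that fails after #X leaves the input unchanged.
#(X) :- b_getval(ag_input, [X|Xs]), b_setval(ag_input, Xs).

% Assumed stand-ins for BinProlog's character tests.
is_space(C) :- code_type(C, space).
is_an(C)    :- code_type(C, alnum).
is_punct(C) :- code_type(C, punct).

% Tokenizer and regexp tools, as in the listing above.
chars_words(Cs,Ws):-dcg_def([32|Cs]),words(Ws),!,dcg_val([]).

words(Ws):-star(word,Ws),spaces.

word(W):-spaces,(plus(letter,Xs);one(punct,Xs)),!,name(W,Xs).

spaces:-star(space,_).

space(X):- #X, is_space(X).
letter(X):- #X, is_an(X).
punct(X):- #X, is_punct(X).

one(F,[X]):- call(F,X).

star(F,[X|Xs]):- call(F,X),!,star(F,Xs).
star(_,[]).

plus(F,[X|Xs]):- call(F,X),star(F,Xs).
```

Under these assumptions, the query `atom_codes('Hello, world!', Cs), chars_words(Cs, Ws)` is expected to bind Ws to `['Hello', ',', world, '!']`. The input is built with atom_codes/2 because recent SWI-Prolog versions do not read double-quoted text as a list of character codes by default.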

Note that similar code based on conventional, translation-based DCGs or EDCGs would have to use phrase in rather intricate ways, and would not benefit from an efficiently implemented call/N (as available in BinProlog 5.75).






Paul Tarau
Thu Apr 3 10:26:39 AST 1997