In this section we will overview some existing corpus
annotation schemes for both morphosyntax and syntax. We will consider
them insofar as they have something of interest to say about typical
problems encountered in dialogue annotation in connection with
the following typology of phenomena:
This typology says nothing about whether the phenomena
considered are classifiable as disfluent material or should rather
be taken as germane linguistic phenomena characteristic of speech
and not of writing. The classificatory perspective entertained
here lays emphasis on the impact that the listed phenomena are
likely to have on issues of annotation: e.g. if they would simply
require introduction of an extra part of speech category, or if
they are rather bound to have repercussions on syntactic parsing
and segmentation issues in general.
Note that, in some cases, the same phenomenon can
be treated under two different headings: interactional markers,
for example, pose both a problem of categorial classification
(how should they be labelled?) and an issue of segmentation, when
they happen to be multi-word units (e.g., is 'I see' in its interactional
usage to be treated as a single morphosyntactic unit, or should
it rather be treated as a complex syntactic constituent?). Clearly,
the two perspectives interact to a large extent.
Not all the annotation schemes overviewed here have
explicitly addressed every problem in our list. Most of them
have simply developed practices which can usefully be extended
to dialogue annotation proper with a view to the treatment
of such phenomena. For example, we will mention here the Eagles 1996
recommendations on both morphosyntax and syntax annotation, although
they were initially intended to deal with written material only.
As pointed out in Leech et al. 1998, they can in fact be taken
as a useful starting point for dialogue annotation too, with the
proviso that a certain amount of customization be carried out.
Hopefully, this should pave the way to the ultimate integration
of practices in the scientific communities of NLP and speech.
Coding book:
Information about the purpose and domain of the CHAT
system as well as instructions for use are described in MacWhinney
(1994).
Number of annotators:
The CHAT system is a widespread standard system for
the transcription and coding of child language in many European
and non-European languages. Approximately 60 groups of researchers
around the world are currently actively involved in new data collection
and transcription using the CHAT system. As a consequence of its
widespread use, it is impossible to calculate the exact number
of annotators.
Number of annotated dialogues:
A huge number of dialogues have been, and are still being, annotated
with the CHAT coding scheme. This number exceeds the number of
dialogues in the database, as many projects concerning child language
make use of CHAT without contributing to the overall CHILDES database.
The internationally recognized CHILDES database (http://sunger2.uia.ac.be/childes/database.html)
includes transcripts from over forty major projects in English
and additional data from 19 other languages. The additional languages
are Brazilian Portuguese, Chinese (Mandarin), Chinese (Cantonese),
Danish, Dutch, French, German, Greek, Hebrew, Hungarian, Italian,
Japanese, Mambila, Polish, Russian, Spanish, Swedish, Tamil, Turkish,
and Ukrainian. The total size of the database is now approximately
160 million characters (160 MB). Full documentation about the
database can be found at http://sunger2.uia.ac.be/childes/database.pdf.
Evaluations of scheme:
As a result of its worldwide use, CHAT is continuously
evaluated and updated to meet the needs of different languages
and different users. We are not aware of statistical/quantitative
evaluations of its reliability.
Underlying task:
Since CHAT was first created as a tool for the study of language acquisition, the data collected mainly consist of parent-child or child-child spontaneous conversations, and of task-oriented dialogues in play and story-telling situations.
Some of the data coded by CHAT also include second
language learners and adults recovering from aphasic disorders.
List of phenomena annotated:
See below.
Examples:
See below.
Mark-up language:
CHAT's own format.
Existence of annotation tools:
The CHILDES system contains several separate, yet integrated, programs which are clustered around two major tools. The first tool is a full-fledged and ASCII-oriented editor (CED, Childes EDitor), specifically designed to facilitate the editing of CHAT files and to check the accuracy of transcriptions. CED also allows the user to link a full digitized audio recording of the interaction directly to the transcript. This is the system called "sonic CHAT". The CED editor is currently being extended to facilitate its use with videotapes. The plan is to make available a floating window in the shape of a VCR controller that can be used to rewind the videotape and to enter time stamps from the videotape into the CHAT file. An alternative way of analyzing video is to record from tape onto QuickTime movies and to link these digitized movies to the transcript.
The second tool, actually a collection of smaller
tools, is a set of computer programs called CLAN (Child Language
ANalysis) which serves different analysis purposes. The full system
is presented in detail in MacWhinney (1991) and illustrated through
practical examples in Sokolov and Snow (1994).
Usability:
CHAT-encoded databases have been set up as a result
of nearly a hundred major research projects in 20 languages. New
databases are continuously being set up worldwide.
Contact person:
Brian MacWhinney (macw@cmu.edu)
CHAT makes provision for two physically and in part also conceptually distinct ways of encoding morphological information in a corpus: i) morpheme splitting on the 'main line', that is the line of orthographic transcription, ii) morphological categorization on the 'morphology line', that is a separate tier of encoding specifically devised for containing morphological information.
In order to indicate the ways that words on the main line are composed from morphemes, CHAT uses the symbols -, +, #, ~, &, and 0: they are all used as concatenative operators and accordingly placed between two consecutive morphemes. These same six symbols are also used for parallel purposes on the morphology line, where these symbols form a part of a more extensive system.
Morphemicization on the main line is intended mostly for initial morphemic analysis or general quantitative characterization of morphological development. For more thorough analyses the morphology line is strongly recommended, especially for languages other than English.
The basic scheme for coding of words on the morphology
line is:
'part-of-speech' |
'pre-clitic' ~
'prefix' #
'stem'
= 'English translation'
& 'fusional suffix'
- 'suffix'
~ 'post-clitic'
where the gloss between quotes indicates the content and position of corresponding encoded information relative to the symbol/operator. For example, part-of-speech information precedes '|', while fusional suffix follows '&'. Furthermore the delimiter '+' is used between words in a compound (see infra).
The order of elements after the | symbol is intended to correspond to the linear order of morphemes within the word, as shown by the following example:
'sing-s' v|sing-3s
There are no spaces between any of these elements. The English translation of the stem is not a part of the morphology, but is included here for convenience in retrieval and data entry. The morphological status of the affix is identified by the type of delimiter.
In particular, '&' is used to signal that the affix is not realized in its usual phonological shape. For example, the form "men" cannot be broken down into a part corresponding to the stem "man" and a part corresponding to the plural marker "s", hence it is coded as n|man&PL. Similarly, the past forms of irregular verbs may undergo ablaut processes, e.g. "came", which is coded v|come&PAST, or they may undergo no phonological change at all, e.g. "hit", which is coded v|hit&PAST. Sometimes there may be several codes indicated with the & after the stem. For example, the form "was" is coded v|be&PAST&13s.
Clitics are marked by a tilde, as in v|parl=speak&IMP:2S~pro|DAT:MASC:SG for Italian "parlagli" and pro|it~v|be&3s for English "it's." Note that part of speech coding is repeated for clitics. Both clitics and contracted elements are coded with the tilde. The use of the tilde for contracted elements extends to forms like "sul" in Italian, "ins" in German, or "rajta" in Hungarian in which prepositions are merged with articles or pronouns.
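To make the delimiter conventions concrete, the following sketch (our own illustration, not part of CHAT or the CLAN tools; the function name is ours) splits a morphology-line code into per-clitic analyses. Prefixes marked with '#' are left attached to the stem for simplicity:

```python
import re

def parse_mor_word(code):
    """Split a CHAT morphology-line code into per-clitic analyses.

    Each '~'-separated element carries its own part-of-speech tag
    before '|', since POS coding is repeated for clitics.
    """
    analyses = []
    for element in code.split("~"):
        pos, _, rest = element.partition("|")
        # The stem runs up to the first suffix delimiter:
        # '&' marks a fusional suffix, '-' a plain suffix.
        m = re.match(r"([^&-]+)((?:[&-][^&-]+)*)$", rest)
        stem, tail = m.group(1), m.group(2)
        # '=' introduces an English gloss of the stem (e.g. parl=speak).
        stem, _, gloss = stem.partition("=")
        suffixes = re.findall(r"([&-])([^&-]+)", tail)
        analyses.append({
            "pos": pos,
            "stem": stem,
            "gloss": gloss or None,
            "fusional": [s for d, s in suffixes if d == "&"],
            "suffixes": [s for d, s in suffixes if d == "-"],
        })
    return analyses
```

Applied to the examples above, v|sing-3s yields stem sing with suffix 3s, n|man&PL yields fusional code PL, and pro|it~v|be&3s yields two analyses, one per clitic.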
The category 'communicator' is used in CHAT for interactive and communicative forms which fulfill a variety of functions in speech and conversation. Many of these are formulaic expressions such as hello, good+morning, good+bye, please, thank+you. Also included in this category are words used to express emotion, as well as imitative and onomatopoeic forms, such as ah, aw, boom, boom-boom, icky, wow, yuck, yummy.
Pauses are treated in CHAT on the prosodic annotation
tier. Pauses that are marked only by silence are coded on the
main line with the symbol #. The number of # symbols represents
the length of the pause. Alternatively, a word can be added after
the symbol # to estimate the pause length, as in #long.
Example:
*SAR: I don't # know -.
*SAR: #long what do you ### think -?
CHAT also allows coding of the exact length of pauses,
with minutes, seconds, and parts of seconds following the #.
Example:
*SAR: I don't #0_5 know -.
*SAR: #1:13_41 what do you #2 think -?
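A small helper can convert such timed-pause codes to seconds. Note that reading ':' as separating minutes from seconds and '_' as separating seconds from their decimal part is our inference from the examples, not an official CHAT definition:

```python
def pause_seconds(code):
    """Convert a CHAT timed-pause code to seconds.

    Assumes ':' separates minutes from seconds and '_' separates
    seconds from their decimal part -- a reading inferred from the
    examples above, not an official CHAT definition.
    """
    body = code.lstrip("#")
    minutes = 0
    if ":" in body:
        mins, body = body.split(":")
        minutes = int(mins)
    if "_" in body:
        secs, frac = body.split("_")
        seconds = int(secs) + int(frac) / 10 ** len(frac)
    else:
        seconds = int(body)
    return minutes * 60 + seconds
```

Under this reading, #2 is a two-second pause and #1:13_41 lasts one minute and 13.41 seconds.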
When an item on the main line is incorrect in either
phonological or semantic terms it is marked by a following '[*]'.
The coding of that item on the morphology line should be based
on its target, as given in the 'error line'. If there is no clear
target, the form should be represented with 'xxx', as in the following
example:
*PAT: the catty [*] was on a eaber [*].
%mor: det|the *n|kitty v|be&PAST prep|on
det|a *n|xxx.
%err: catty = kitty $BLE $=cat,kitty ; eaber =
[?]
In this example the symbol '*' on the morphology line indicates the presence of an incorrect usage, in this case due to blending two different words into one. The detailed analysis of this error should be conducted on the 'error line'. Errors involving segmentation issues (such as omission of a syntactically obligatory unit etc.) will be treated in the following section.
A non-standard or incorrect usage can be encoded directly on the main line by following it with the standard replacement form in square brackets, as in gonna [: going to]. The material on the %mor line corresponds to the replacing material in the square brackets, not to the material being replaced. For example, if the main line has gonna [: going to], the %mor line will code going to.
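Since the %mor line codes the replacing material, a preprocessing step that resolves the bracketed replacements on the main line can be sketched as follows (a minimal regex approach of our own; it assumes the replaced form is a single token, as in the gonna example):

```python
import re

def normalized_main_line(line):
    """Resolve 'form [: standard]' replacements on a main line,
    keeping the replacing material (what %mor codes) and dropping
    the non-standard form."""
    return re.sub(r"\S+ \[: ([^\]]+)\]", r"\1", line)
```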
Some special characters are intended to give information
about, for example, babbling, child-invented forms, dialect forms,
family-specific forms, filled pauses, interjections, neologisms,
phrasal repetitions, or other general special forms, according
to the following conventions. Note that recording of these phenomena
is not made at the coding level, but at the transcription level.
Letters | Categories | Example | Meaning | Coded Example |
@b | babbling | abame | - | abame@b |
@c | child-invented form | Gumma | sticky | gumma@c |
@d | dialect form | Younz | you | younz@d |
@f | family-specific form | Bunko | broken | bunko@f |
@fp | filled pause | Huh | - | huh@fp |
@i | interjection, interactional | Uhhuh | - | uhhuh@i |
@l | letter | B | letter b | b@l |
@n | neologism | Breaked | broke | breaked@n |
@o | onomatopoeia | woof woof | dog barking | woof@o |
@p | phonol. consistent forms | Aga | - | aga@p |
@pr | phrasal repetition | its a, its a | - | its+a@pr |
@s | second-language form | Istenem | my God | istenem@s |
@sl | sign language | apple sign | apple | apple@sl |
@ | general special form | Gongga | - | gongga@ |
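A hypothetical lookup of the @-markers can be built directly from the table above (marker strings and category names are transcribed from the table; the function name is ours):

```python
# Marker-to-category mapping transcribed from the table above.
SPECIAL_FORM_MARKERS = {
    "b": "babbling",
    "c": "child-invented form",
    "d": "dialect form",
    "f": "family-specific form",
    "fp": "filled pause",
    "i": "interjection, interactional",
    "l": "letter",
    "n": "neologism",
    "o": "onomatopoeia",
    "p": "phonologically consistent form",
    "pr": "phrasal repetition",
    "s": "second-language form",
    "sl": "sign language",
    "": "general special form",
}

def special_form(token):
    """Return (form, category) for a token carrying an @-marker,
    or None for ordinary words."""
    if "@" not in token:
        return None
    form, _, marker = token.rpartition("@")
    return form, SPECIAL_FORM_MARKERS.get(marker, "unknown marker")
```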
Those compounds that are usually written as one word, such as "birthday" or "rainbow," should not be segmented. Those compounds that are generally separated by a hyphen in English orthography are separated by a + symbol in CHAT transcription (e.g., "jack-in-the-box" should be transcribed as "jack+in+the+box"). Rote forms to be counted as a single morpheme may also be joined with a + symbol (e.g., all+right).
Multi-word expressions which are concatenated through a '+' are assigned a unique part-of-speech tag at the level of morphosyntax. For example, the following idiomatic phrases can be coded: qn|a+lot+of, adv|all+of+a+sudden, adv|at+last, co|for+sure, adv:int|kind+of, adv|once+and+for+all, adv|once+upon+a+time, adv|so+far, and qn|lots+of.
The symbol *0 is used in CHAT to indicate omission
(recall that the symbol * is used to indicate incorrect usage),
as in the following examples:
*CHI: dog is eat.
%mor: *0det|the n|dog v:aux|be&PRES v|eat-*0PROG.
*PAT: the dog was eaten [*] the bone.
%mor: det|the n|dog v:aux|be&PAST&3S v|eat-*PERF det|the n|bone.
%err: eaten = eating $MOR $SUB
Here is an example of coding on the morphology line
that indicates how the omission of an auxiliary is coded:
*BIL: he going.
%mor: pro|he *0v|be&3S v|go-prog.
Note that the missing auxiliary is not coded on the main line, because this information is available on the morphology line. If a noun is omitted, there is no need to also code a missing article. Similarly, if a verb is omitted, there is no need to also code a missing auxiliary.
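The *0 and * conventions can be harvested mechanically from a %mor line. The sketch below (our own, not a CLAN program) separates omission codes from error codes:

```python
def omissions_and_errors(mor_line):
    """Collect '*0' omission codes and '*' error codes from a
    %mor line.  Hyphen-attached suffix codes (e.g. v|eat-*0PROG)
    are split so that suffix-level omissions are found too."""
    omissions, errors = [], []
    for token in mor_line.rstrip(".").split():
        for piece in token.replace("-", " ").split():
            if piece.startswith("*0"):
                omissions.append(piece[2:])
            elif piece.startswith("*"):
                errors.append(piece[1:])
    return omissions, errors
```

On the auxiliary-omission example above, the %mor line for "he going." yields the single omission code v|be&3S.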
The CHAT system for error coding has the following
features:
1. it indicates what the speaker actually said, or the erroneous form
2. it indicates that what the speaker actually said was an error
3. it allows the transcriber to indicate the target form
4. it facilitates retrieval, both toward target forms and actually produced forms
5. it allows the analyst to indicate theoretically interesting aspects of the error by delineating the source of the error, the processes involved, and the type of the error in theoretical terms (on the error line)
In CHAT, the syntactic role of each word can be notated
before its part-of-speech on the morphology line. To capture
syntactic groupings, provision is made for coding syntactic structure
on the syntactic line. Clauses are enclosed in angle brackets
and their type is indicated in square brackets, as in the following
example:
*CHI: if I don't get all the cookies you promised to give me, I'll cry.
%syn: <C S X V M M D < S V < R V I > [CP] > [RC] > [CC] < S V > [MC].
In this notation, each word plays some syntactic role. The rules for achieving one-to-one correspondence to words on the main line apply to the syntactic line also. Higher order syntactic groupings are indicated by the bracket notation. The particular syntactic codes used in this example come from the following list. This list is not complete, particularly for languages other than English.
A | Adverbial Adjunct | V | Verb |
C | Conjunction | X | Auxiliary |
D | Direct Object | AP | Appositive Phrase |
I | Indirect Object | CC | Coordinate Clause |
M | Modifier | CP | Complement |
P | Preposition | MC | Main Clause |
R | Relativizer/Inf | PP | Prepositional Phrase |
S | Subject | RC | Relative Clause |
An incomplete, but not interrupted, utterance is
marked with the "trailing off" symbol '+...' on the
main line.
Example:
*SAR: smells good enough for +...
*SAR: what is that?
If the speaker does not really get a chance to trail off before being interrupted by another speaker, then the interruption marker '+/.' is used instead. If the utterance being trailed off is a question, then the symbol '+..?' is used.
The symbol '+,' can be used at the beginning of a
main tier line to mark the completion of an utterance after an
interruption. It is complementary to the trailing off symbol.
Example:
*CHI: so after the tower +...
*EXP: yeah.
*CHI: +, I go straight ahead.
Completion by another speaker is marked with '++'.
This symbol can be used at the beginning of a main tier line to
mark "latching", that is, the completion of another speaker's
utterance. It is likewise complementary to the trailing off symbol.
Example:
*HEL: if Bill had known +...
*WIN: ++ he would have come.
Retracing without correction (simple repetition)
[/] takes place when speakers repeat words or whole phrases without
change. The retraced material is put in angle brackets.
Example:
*BET: <I wanted> [/] I wanted to invite Margie.
Several repetitions of the same word can be indicated
in the following way:
*HAR: It's(/4) like # a um # dog.
Retracing with correction [//] takes place when a speaker starts to say something, stops, repeats the basic phrase, and changes the syntax while maintaining the same idea. Usually the correction moves closer to the standard form, but sometimes it moves away from it. The retraced material is put in angle brackets.
Example:
*BET: <I wanted> [//] uh I thought I wanted
to invite Margie.
Retracing with Reformulation [///] takes place when
retracings involve full and complete reformulations of the message
without any specific corrections.
Example:
*BET: all of my friends had [///] uh we had decided
to go home for lunch.
Unclear Retracing Type is marked by [/?].
CHAT distinguishes a false start without retracing, [/-], from false starts with correction. False starts with no retracing are dealt with in the following section. The symbols [/] and [//] are used when a false start is followed by a complete repetition or by a partial repetition with correction.
If the speaker terminates an incomplete utterance and starts off on a totally new tangent, this can be coded with the [/-] symbol:
*BET: <I wanted> [/-] uh when is Margie coming?
Note that this coding is not in contrast with the coding of incomplete utterances (either trailed off or interrupted); the choice between them depends solely on what a coder decides to count as an utterance.
The CHRISTINE corpus is to spoken dialogue what SUSANNE was to written corpora: a carefully annotated collection of real spoken material, restricted to British English.
The CHRISTINE project uses the structural annotation scheme defined for the SUSANNE Corpus (probably the most detailed scheme of its kind yet produced). The definition of the SUSANNE scheme can be found in G. Sampson's book, "English for the Computer" (see Sampson, 1995). The EAGLES group asked for a copy of this book when it was in proof, and its contents (Chapter 6 in particular, which deals with extending annotation to spoken material) played a significant part in their decisions (see Section D.3 in this report for further details). In the CHRISTINE project, the annotation rules of Chapter 6 are being redefined on the basis of experience in actually applying them to sizeable quantities of spontaneous spoken English. G. Sampson (personal communication) reports that in most respects what is being done only adds to already existing rules rather than changing them. The additional annotation rules are not yet in a form fit for circulation.
The CHRISTINE project is due to be completed at the end of 1999. There may be a few months' "polishing" after that, but then or soon afterwards the annotated corpus will be made freely available to all comers, in the same way that the SUSANNE Corpus already is.
Some documentation available at:
http://iris1.let.kun.nl/TSpublic/tosca/index.html
Coding book:
documentation available at
http://www.ilc.pi.cnr.it/EAGLES96/annotate/annotate.html
Number of annotators:
not applicable
Number of annotated dialogues:
not applicable
Evaluations of scheme:
indirect evaluation through instantiation in many
different projects (see usability)
Underlying task:
standard development
List of phenomena annotated:
list of relevant phenomena provided below
Examples:
list of relevant examples provided below
Mark-up language:
not applicable
Existence of annotation tools:
EAGLES-conformant annotation tools developed in other
projects
Usability:
schemes adopted in Multext, Sparkle, Parole
EAGLES is the ancestor of a family of standardization efforts for corpus annotation. It is therefore worth looking into the EAGLES methodology in some detail, as this will also offer a key to understanding the design and development of other EAGLES-related annotation schemes.
EAGLES provides a list of morphosyntactic (major) categories.
1. | N [noun] | 2. | V [verb] | 3. | AJ [adjective] |
4. | PD [pronoun/determiner] | 5. | AT [article] | 6. | AV [adverb] |
7. | AP [adposition] | 8. | C [conjunction] | 9. | NU [numeral] |
10. | I [interjection] | 11. | U [unique/unassigned] | 12. | R [residual] |
13. | PU [punctuation] |
They represent the most general and obligatory level of morphosyntactic annotation, in the sense that any set of morphosyntactic tags is expected to convey at least information about morphosyntactic categories.
The set of Eagles category tags is not formally consistent, in that it does not provide a minimal set of mutually exclusive morphosyntactic classes. See, for example, the umbrella-category PD, including both determiners and pronouns, and its coexistence with the overlapping category AT for articles. Accordingly there is no general expectation that the mapping between the EAGLES category tags and a language specific instantiation of it should be one-to-one.
Morphosyntactic categories can be further specified by means of appropriate morphosyntactic features (such as gender, number, and case), expressed as supplementary tags. The combination of a category tag with its morphosyntactic feature specification yields complex tags of considerable length and granularity. As an illustration, we provide below the feature matrix for the category verb.
(i) | Person: | 1. First | 2. Second | 3. Third |
(ii) | Gender: | 1. Masculine | 2. Feminine | 3. Neuter |
(iii) | Number: | 1. Singular | 2. Plural |
(iv) | Finiteness: | 1. Finite | 2. Non-finite |
(v) | Verbform/Mood: | 1. Indicative | 2. Subjunctive | 3. Imperative | 4. Conditional | 5. Infinitive | 6. Participle | 7. Gerund | 8. Supine |
(vi) | Tense: | 1. Present | 2. Imperfect | 3. Future | 4. Past |
(vii) | Voice: | 1. Active | 2. Passive |
(viii) | Status: | 1. Main | 2. Auxiliary |
Examples of use of this matrix are provided for what is called the "Intermediate Tag Set", a specific instantiation of a subset of the list of categories above: a 3rd person, singular, finite, indicative, past tense, active, main, non-phrasal, non-reflexive verb is represented as V3011141101200.
Wherever an attribute is inapplicable to a given word in a given tagset, the value 0 fills that attribute's place in the string of digits. When the 0s occur in final position, without any non-zero digits following, they can be dropped.
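A small decoder illustrates how such positional tags unpack against the verb feature matrix above. This is our own sketch: only the eight attributes listed in the matrix are handled, and the extra trailing digits in the Intermediate Tag Set (e.g. the phrasal and reflexive attributes) are ignored:

```python
# Attribute names and value labels from the verb feature matrix above.
VERB_ATTRS = [
    ("person", {1: "first", 2: "second", 3: "third"}),
    ("gender", {1: "masculine", 2: "feminine", 3: "neuter"}),
    ("number", {1: "singular", 2: "plural"}),
    ("finiteness", {1: "finite", 2: "non-finite"}),
    ("mood", {1: "indicative", 2: "subjunctive", 3: "imperative",
              4: "conditional", 5: "infinitive", 6: "participle",
              7: "gerund", 8: "supine"}),
    ("tense", {1: "present", 2: "imperfect", 3: "future", 4: "past"}),
    ("voice", {1: "active", 2: "passive"}),
    ("status", {1: "main", 2: "auxiliary"}),
]

def decode_verb_tag(tag):
    """Expand a positional verb tag such as 'V3011141' into
    attribute/value pairs.  Dropped trailing zeros are restored;
    a 0 digit means the attribute is inapplicable (value None)."""
    digits = tag[1:].ljust(len(VERB_ATTRS), "0")
    out = {}
    for (name, values), d in zip(VERB_ATTRS, digits):
        out[name] = values.get(int(d))  # None when inapplicable
    return out
```

Decoding V3011141101200 recovers the example above: third person, singular, finite, indicative, past, active, main, with gender inapplicable.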
Eagles makes provision for disjunctive specification of morphosyntactic categories in cases of i) genuine systematic ambiguity in a given language (e.g. present indicative and present subjunctive forms in English, or some past participles and adjectives in Italian), ii) practical demands of fully automatic tagging.
The interjection and adverb categories are much broader and more variegated than usually assumed in traditional grammar. Eagles 98 provides two illustrative lists of the level of granularity at which both categories can be subclassified, taken from Sampson (1995) and the London-Lund tagset respectively. In both cases a fine-grained functional or semantic analysis of the role of each subclass in dialogue interaction is presupposed, which makes both proposals prohibitive for the purposes of automatic annotation. A practical strategy could be to add interjection to the Eagles inventory of part-of-speech categories and provide a rich feature matrix for subclassification, under the assumption that only the topmost attribute (part of speech) need be disambiguated in automatic tagging.
Eagles 98 recommends treating pauses and hesitators as punctuation marks, to be attached as high in the syntactic tree as possible during parsing.
No specific recommendations are provided for word partials; the suggestion is tentatively put forward to tag them with the peripheral part-of-speech category U ('unique' or 'unassigned'; see the list above). Non-standard forms (e.g. 'gonna') are recommended to be transcribed with standard spelling. Deviations from this practice should be documented and justified.
Eagles 98 leaves open the question of whether multi-word units should be assigned a single tag or a multi-tag. Representation issues are not addressed in any detail either.
Coding of mistakes is neither envisaged nor excluded by Eagles 98 recommendations.
Eagles 98 provides a couple of illustrative examples of how syntactic incompleteness could be annotated. In the first one (drawn from the British National Corpus) syntactic incompleteness is annotated by means of a special marker (a slash following the non terminal constituent label) tagging the incomplete constituent as a whole. In the second example (from Sampson 1995), no new label is introduced to mark the incomplete constituent, but only a place holder, '#', which marks the position of the missing element within the incomplete constituent.
It is emphasized that the examples provided are only indicative and should not be taken as standards in any way.
Only one example is provided by way of illustration. Once more, it is drawn from Sampson 1995 and recast into an Eagles-conformant style. Both the retrace and the repair are kept within the minimal superordinate constituent, with the marker '#' used to signal the interruption point:
and that [NPs any bonus
[RELCL he ] # money [RELCL he gets over that
]] is a bonus
It is not immediately clear from the example what word stretch the repair is meant to replace.
Cases of syntactic blending are illustrated by means
of a drastically incoherent sentence, annotated through maximal
parse brackets to enclose the whole parsable unit, and no information
about its internal structure. This is what the guidelines of the
British National Corpus call 'structure minimization principle':
[and this is what the # the <unclear>] # [ what's name now # now ] # <pause> [ that when it's opened in nineteen ninety-two <pause> the communist block will be able to come through Germany this way in ]
The syntactic annotation schemes developed within SPARKLE are an example of instantiation of Eagles recommendations at the morphosyntactic and syntactic levels, specifically geared towards the completion of two different tasks: i) use of morphosyntactically and syntactically annotated corpora for (semi)automatic acquisition of lexical information from them, and ii) use of annotated material for multi-lingual information retrieval and speech recognition. Both tasks are being carried out on four different languages (namely English, French, German and Italian).
In Sparkle, bootstrapping lexical information from a corpus is modelled as the process of extracting typical contexts of usage of a given lexical item in a shallow-parsed corpus. The acquired information is eventually put to use by either providing a lexicalized version of the shallow parser, or by augmenting the lexicon of another independent parser. In both cases, the ultimate goal of the lexicalized parser is to provide the analysis of a sentence in terms of functional relations holding between head words. Usefulness of this level of analysis is eventually assessed through industrial demonstrators for multilingual information retrieval and monolingual speech recognition.
Accordingly, Sparkle defines the following three possible levels of syntactic annotation:
i) chunking
ii) phrasal parsing
iii) functional parsing
In the following we will review in detail levels
i) and ii) only.
Coding book:
documentation available at
http://www.ilc.pi.cnr.it/sparkle.html
Number of annotators:
>5
Amount of annotated material:
600 annotated sentences of English, German and Italian
Evaluation of scheme:
Evaluation of automatic annotation over all levels
available at: http://www.ilc.pi.cnr.it/sparkle.html
Underlying Task:
Language modelling for Speech Recognition, Multilingual
Information Retrieval
List of phenomena annotated:
List of relevant phenomena provided below.
Examples:
Provided below.
Mark-up language:
SPARKLE's own format.
Existence of annotation tool:
Software available for English, German and Italian.
Usability:
Speech Recognition and Multilingual Information Retrieval.
Contact Person:
Vito Pirrelli (vito@ilc.pi.cnr.it)
SPARKLE did not develop a specific set of word-level tags; it simply built on pre-existing Eagles96-conformant part-of-speech encoding schemes. A straightforward extension of these schemes should make provision for the additional tags needed to cover phenomena specific to dialogue.
In SPARKLE, segmentation problems are dealt with differently, depending on which level of syntactic annotation one considers. For the specific purposes of the present overview, we will limit ourselves to chunking and functional annotation only. This is done for ease of exposition, as these two levels, unlike complete phrase-structure trees, are clearly complementary and exemplify two profoundly different perspectives on syntactic annotation: one based on the linear arrangement of word forms in a sentence and on the internal cohesion of relatively small syntactic islands, the other on an abstract representation of grammatical functions relative to a verb head. Traditionally, complete phrase-structure trees are assumed to convey both types of information simultaneously. For reasons that will become clear shortly, syntactic annotation of dialogue favors a view whereby linear adjacency of word forms on the one hand and encoding of functional relations on the other are dealt with separately.
In what follows, we first exemplify the SPARKLE approach to chunking through detailed illustration of the Italian chunking scheme.
The typology of phrase chunks in the Italian chunking
annotation scheme is summarised in the table below.
NAME | TYPE | POTGOV | EXAMPLES |
ADJ_C | adjectival chunk | adj | bello 'nice', molto bello 'very nice' |
BE_C | predicative chunk | adj, past part | è bello '(it/(s)he) is nice', è caduto '(it/he) fell' |
ADV_C | adverbial chunk | adv | sempre 'always' |
SUBORD_C | subordinating chunk | conj | quando 'when', dove 'where' |
N_C | nominal chunk | noun, pron, verb, adj | la mia casa 'my house', io 'I', questo 'this', l'aver fatto 'having done', il bello 'the nice (one)' |
P_C | prepositional chunk | noun, pron, verb, adj | di mio figlio 'of my son', di quello 'of that (one)', dell'aver fatto 'of having done', del bello 'of the nice (one)' |
FV_C | finite verbal chunk | verb | sono stati fatti '(they) have been done', rimangono '(they) remain' |
G_C | gerundival chunk | verb | mangiando 'eating' |
I_C | infinitival chunk | verb | per andare 'to go', per aver fatto 'to have done' |
PART_C | participial chunk | verb | finito 'finished' |
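For concreteness, the bracketed chunk notation used in the example sentences of this scheme can be generated from (label, words) pairs; this renderer is our own illustrative helper, not part of the SPARKLE tools:

```python
def render_chunks(chunks):
    """Render (label, words) pairs in the bracketed chunk notation
    used in the example sentences, e.g. [FV_C non so FV_C]."""
    return " ".join(f"[{label} {words} {label}]" for label, words in chunks)
```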
The following informal definitions are intended to
make the assumptions underlying this schema fully explicit. More
on this can be found in SPARKLE WP1 final report (Carroll et al.
1996), and related papers (Federici et al. 1996 and 1998).
ADJ_C
ADJ_Cs are chunks beginning with any premodifying adverbs and intensifiers and ending with a head adjective. This definition provides a necessary but not sufficient condition for identification of ADJ_C. In fact, adjectival phrases occurring in pre-nominal position are not marked as distinct chunks since their relationship to the governing noun is unambiguously identified within the nominal chunk (see example sentence above). The same holds in the case of predicate adjectival phrases governed by the verb essere 'be', which are part of BE_C (see below).
BE_C
BE_Cs consist of a form of the verb essere 'be' and an ensuing adjective/past participle including any intervening adverbial phrase. E.g.:
[BE_C è intelligente BE_C] '(he) is intelligent'
[BE_C è molto bravo BE_C] '(he) is very good'
[BE_C è appena arrivato BE_C] '(he) just arrived'
ADV_C
ADV_Cs extend from any adverbial pre-modifier to the head adverb. Once more, this definition provides a necessary but not sufficient condition for ADV_C. In fact, adverbial phrases that occur between an auxiliary and a past participle form are not identified as distinct chunks due to their unambiguous dependency on the verb. By the same token, adverbs which happen to immediately premodify verbs or adjectives are respectively part of a verbal chunk and an adjectival chunk. Finally, noun phrases used adverbially (e.g. questa mattina 'this morning') are treated as nominal chunks (see below). E.g.:
[FV_C ha sempre camminato FV_C] [ADV_C molto ADV_C] '(he) has always walked a lot'
[FV_C ha finito FV_C] [ADV_C molto rapidamente ADV_C] '(he) has finished very quickly'
SUBORD_C
SUBORD_Cs are chunks which include a subordinating conjunction. A subordinating conjunction is chunked as an independent unit in its own right only when it is not immediately followed by a verbal group. Compare, for example, the chunk structure of the following sentence
[FV_C non so FV_C] [SUBORD_C quando
SUBORD_C] [N_C il direttore N_C] [FV_C mi
riceverà FV_C] '(I) do not know when the director will
receive me'
with the chunk structure of the following sentence,
which differs from the previous one in having the subject of the
subordinate clause in postverbal position:
[FV_C non so FV_C] [FV_C quando mi riceverà FV_C] [N_C il direttore N_C].
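The decision rule just illustrated lends itself to a simple procedural sketch. The function below is our own illustration, not SPARKLE code; the function name and the coarse tag inventory (CONJ, CLITIC, VERB, AUX) are assumptions:

```python
# Sketch of the SUBORD_C decision rule: a subordinating conjunction
# forms a chunk of its own only when it is NOT immediately followed
# by a verbal group; otherwise it is absorbed into the following FV_C.

def chunk_subordinator(tokens, i):
    """tokens is a list of (word, pos) pairs; i points at a
    subordinating conjunction.  Returns the label under which
    the conjunction is chunked."""
    nxt = tokens[i + 1][1] if i + 1 < len(tokens) else None
    # Clitic pronouns ('mi', 'lo', ...) belong to the verbal chunk,
    # so they count here as the start of a verbal group.
    if nxt in ("VERB", "AUX", "CLITIC"):
        return "FV_C"          # e.g. 'quando mi riceverà'
    return "SUBORD_C"          # e.g. 'quando il direttore ...'

sent1 = [("quando", "CONJ"), ("il", "DET"), ("direttore", "NOUN")]
sent2 = [("quando", "CONJ"), ("mi", "CLITIC"), ("riceverà", "VERB")]
print(chunk_subordinator(sent1, 0))   # SUBORD_C
print(chunk_subordinator(sent2, 0))   # FV_C
```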
N_C
N_Cs extend from the beginning of the noun phrase to its head. They include nominal chunks headed by nouns, pronouns, verbs in their infinitival form when preceded by an article (i.e. Italian nominalised infinitival constructions) and proper names. Noun phrases functioning adverbially (e.g. questa mattina 'this morning') are also treated as nominal chunks. All kinds of modifiers and/or specifiers occurring between the beginning of the noun phrase and the head are included in N_Cs. E.g.:
[N_C un bravo bambino N_C] 'a good boy'
[N_C tutte le possibili soluzioni N_C] 'all possible solutions'
[N_C i sempre più frequenti contatti N_C] 'the always more frequent contacts'
[N_C questo N_C] 'this'
[N_C il camminare N_C] 'walking'
[N_C il bello N_C] 'the nice (one)'
In the chunking scheme, nominal chunks cover only a portion of the range of linguistic phenomena normally taken care of by nominal phrases: namely only noun phrases with prenominal complementation.
P_C
P_Cs go from a preposition to the head of the ensuing nominal group. Most of the criteria given for N_Cs also apply to this case. Typical instances of P_Cs are:
[P_C per i prossimi due anni P_C] 'for the next two years'
[P_C fino a un certo punto P_C] 'up to a certain point'
FV_C
FV_Cs include all intervening modals, ordinary and
causative auxiliaries as well as medial adverbs and clitic pronouns,
up to the head verb. E.g.:
verbal chunk with auxiliary or modal verb and medial adverb:
[FV_C può ancora camminare FV_C] '(he) can still walk'
verbal chunk with pre-modifying adverb:
[FV_C non ha mai fatto FV_C] [ADV_C così ADV_C] '(he) has never done so'
the auxiliary essere 'be' in periphrastic verb forms (whether active or passive) such as sono caduto 'I fell', sono stato colpito 'I was hit', or mi sono accorto 'I realized', is dealt with as part of a finite verb chunk, unless the verb essere is followed by a past participle which the dictionary also categorizes as an adjective; in the latter case it is chunked as a BE_C (see above).
[FV_C è FV_C] [N_C un simpatico ragazzo N_C] '(he) is a nice guy'
fronted auxiliaries constitute separate FV_Cs:
[FV_C può FV_C] [N_C la commissione N_C] [I_C deliberare I_C] [P_C su questa materia P_C]? 'can the Commission deliberate on this topic?'
periphrastic causative constructions:
[FV_C fece studiare FV_C] [N_C il bambino N_C] '(he) let the child study'
clitic pronouns are part of the chunk headed by the immediately adjacent verb:
[FV_C lo ha sempre fatto FV_C] '(he) has always done it'
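The FV_C definition above has a natural finite-state reading: optional negation, clitics, auxiliaries or modals, and medial adverbs, up to the head verb. The toy regular-expression chunker below is our own illustration, not the SPARKLE implementation; the POS tag inventory is assumed, and all verb forms are collapsed into a single tag V (after a modal the head would in fact be an infinitive):

```python
import re

# Toy pattern over POS tags for an Italian finite verbal chunk:
# (negation)? (clitics)* (auxiliaries/modals)* (medial adverbs)* head-verb
FV_C = re.compile(r"(NEG )?(CLI )*((AUX|MOD) )*(ADV )*V")

def is_fv_chunk(pos_tags):
    """True if the tag sequence matches the FV_C pattern exactly."""
    return FV_C.fullmatch(" ".join(pos_tags)) is not None

print(is_fv_chunk(["NEG", "AUX", "ADV", "V"]))  # 'non ha mai fatto' -> True
print(is_fv_chunk(["CLI", "AUX", "ADV", "V"]))  # 'lo ha sempre fatto' -> True
print(is_fv_chunk(["ADV", "NEG", "V"]))         # ill-formed order -> False
```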
G_C
G_Cs contain a gerund form. When part of a tensed verb group (e.g. in progressive constructions), the gerundival verb form is not marked independently. G_C also includes gerund forms functioning as noun phrases.
[FV_C sta studiando FV_C] '(he) is studying'
[G_C studiando G_C] [FV_C ho imparato FV_C] [ADV_C molto ADV_C] 'by studying (I) have learned a lot'
I_C
Infinitival chunks (I_Cs) include both bare infinitives and infinitives introduced by a preposition.
[FV_C ha promesso FV_C] [I_C di arrivare I_C] [ADV_C presto ADV_C] '(he) has promised to arrive early'
[FV_C desidera FV_C] [I_C partire I_C] [ADV_C domani ADV_C] '(he) wishes to leave tomorrow'
PART_C
A past participle chunk (PART_C) includes participial constructions such as:
[PART_C finito PART_C] [N_C il lavoro N_C] , [N_C Giovanni N_C] [FV_C andò FV_C] [P_C a casa P_C] '(having) finished the job, John went home'
In this section we exemplify the chunking of linguistic phenomena which are typical of dialogues. The examples are only indicative and represent an adaptation to English material of the principles underlying the Italian chunking scheme outlined above.
multi-words
Chunking presupposes prior identification and marking of multi-word units.
error coding
Chunking presupposes prior identification and marking of errors and non standard forms.
trailing off, interruption, completion
*SAR: [FV_C smells FV_C] [ADJ_C good enough ADJ_C] [P_C for P_C]
retrace-and-repair sequences
*BET: [FV_C I wanted FV_C] [filler_C uh filler_C] [FV_C I thought FV_C] [FV_C I wanted FV_C] [I_C to invite I_C] [N_C Margie N_C].
anacolutha (syntactic blending)
*BET: [FV_C I wanted FV_C] [filler_C uh filler_C] [WH_C when WH_C] [FV_C is Margie coming FV_C] [Punct_C ? Punct_C]
In EAGLES, a three-layered approach to the specification of grammatical dependencies for verbal arguments was followed (Sanfilippo et al., 1996). The first layer identifies the subject/complement and predicative distinctions as the most general specifications; this layer is regarded as encoding mandatory information. The second layer provides a further partition of complements into direct and indirect as recommended specifications. Finally, a more fine-grained distinction qualified as useful is envisaged introducing further labels for clausal complements and second objects.
The first step in tailoring the EAGLES standards to the needs of SPARKLE has been to make provision for modifiers. These were not treated in EAGLES, since only subcategorizable functions were taken into consideration. Secondly, the relationship among layers of grammatical dependency specifications has been interpreted in terms of hierarchical links.
In general, grammatical relations (GRs) are viewed as specifying the syntactic dependency which holds between a head and a dependent. In the event of morphosyntactic processes modifying head-dependent links (e.g. the passive, dative shift and causative-inchoative diatheses), two kinds of GRs can be expressed: the initial GR, holding before the process applies, and the final GR, holding after it.
For example, Paul in Paul was employed
by Microsoft is the final subject and initial object of employ.
The hierarchical organisation of GRs is shown graphically in Figure
2 below.
Each GR in the current version of the scheme is described individually below.
mod(type,head,dependent)
The relation between a head and its modifier; where appropriate, type indicates the word introducing the dependent; e.g.
mod(_,flag,red)
a red flag
mod(_,walk,slowly)
walk slowly
mod(with,walk,John)
walk with John
mod(while,walk,talk)
walk while talking
mod(_,Picasso,painter)
Picasso the painter
mod is also used to encode the relation between an event noun (including deverbal nouns) and its participants; e.g.
mod(of,gift,book)
the gift of a book
mod(by,gift,Peter)
the gift of a book by Peter
mod(of,examination,patient)
the examination of the patient
mod('s,doctor,examination)
the doctor's examination of the patient
cmod,xmod,ncmod
Clausal and non-clausal modifiers may (optionally) be distinguished by the use of cmod / xmod and ncmod respectively, each with the same slots as mod. The GR cmod is used when the adjunct is controlled from within, xmod when it is controlled from without. E.g.
cmod(because,eat,be)
he ate the cake because he was hungry
xmod(without,eat,ask)
he ate the cake without asking
arg_mod(type,head,dependent,initial_gr)
The relation between a head and a semantic argument which is syntactically realised as a modifier; thus a by-phrase can be analysed as a `thematically bound adjunct'. The
type slot indicates the word introducing the dependent: e.g.
arg_mod(by,kill,Brutus,subj)
killed by Brutus
subj(head,dependent,initial_gr)
The relation between a predicate and its subject; where appropriate, the initial_gr indicates the syntactic link between the predicate and subject before any
GR-changing process:
subj(arrive,John,_)
John arrived in Paris
subj(employ,Microsoft,_)
Microsoft employed 10 C programmers
subj(employ,Paul,obj)
Paul was employed by Microsoft
With pro-drop languages such as Italian, when the subject is not overtly realised the annotation is, for example, as follows:
subj(arrivare,Pro,_)
arrivai in ritardo '(I) arrived late'
where the dependent slot is filled by the abstract filler Pro, which indicates that person and number of the subject can be recovered from the inflection of the head verb
form.
csubj,xsubj,ncsubj
The GRs csubj and xsubj may be used for clausal subjects, controlled from within, or without, respectively. ncsubj is a non-clausal subject. E.g.
csubj(leave,mean,_)
that Nellie left without saying good-bye meant she was still angry
xsubj(win,require,_)
to win the America's Cup requires heaps of cash
dobj(head,dependent,initial_gr)
The relation between a predicate and its direct object--the first non-clausal complement following the predicate which is not introduced by a preposition (for English and German); initial_gr is iobj after dative shift; e.g.
dobj(read,book,_)
read books
dobj(mail,Mary,iobj)
mail Mary the contract
iobj(type,head,dependent)
The relation between a predicate and a non-clausal complement introduced by a preposition; type indicates the preposition introducing the dependent; e.g.
iobj(in,arrive,Spain)
arrive in Spain
iobj(into,put,box)
put the tools into the box
iobj(to,give,poor)
give to the poor
obj2(head,dependent)
The relation between a predicate and the second non-clausal complement in ditransitive constructions; e.g.
obj2(give,present)
give Mary a present
obj2(mail,contract)
mail Paul the contract
ccomp(type,head,dependent)
The relation between a predicate and a clausal complement which does have an overt subject; type indicates the complementiser / preposition, if any, introducing the clausal XP. E.g.
ccomp(that,say,accept)
Paul said that he will accept Microsoft's offer
ccomp(that,say,leave)
I said that he left
xcomp(type,head,dependent)
The relation between a predicate and a clausal complement which has no overt subject (for example a VP or predicative XP). The type slot is the same as for ccomp above.
E.g.
xcomp(to,intend,leave)
Paul intends to leave IBM
xcomp(_,be,easy)
Swimming is easy
xcomp(in,be,Paris)
Mary is in Paris
xcomp(_,be,manager)
Paul is the manager
Control of VPs and predicative XPs is expressed in terms of GRs. For example, the unexpressed subject of the clausal complement of a subject-control predicate is specified by saying that the subject of the main and subordinate verbs is the same:
Paul intends to leave IBM
subj(intend,Paul,_)
xcomp(to,intend,leave)
subj(leave,Paul,_)
dobj(leave,IBM,_)
arg(head,dependent)
The hierarchical organisation of GRs makes it possible to use underspecified GRs where no reliable bias is available for disambiguation. For example, both Gianni and Mario
can be subject or object in the Italian sentence
Mario, non l'ha ancora visto, Gianni
'Mario has not seen Gianni yet' / 'Gianni has not seen Mario yet'
In this case, the parser could avoid having to try to resolve the ambiguity by using the underspecified GR arg, e.g.
arg(vedere,Mario)
arg(vedere,Gianni)
dependent(introducer,head,dependent)
The most generic relation between a head and a dependent (i.e. it does not specify whether the dependent is an argument or a modifier). E.g.
dependent(in,live,Rome)
Marisa lives in Rome
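The hierarchical organisation of GRs shown in Figure 2 can be encoded as a simple parent map. The sketch below is our reconstruction of the hierarchy from the relations listed above (the intermediate nodes comp, obj and clausal are assumptions based on the layering described earlier); the helper shows how an underspecified GR such as arg covers the more specific ones:

```python
# Parent map over the GR inventory: each relation points to the
# more general relation it specialises.
PARENT = {
    "mod": "dependent", "arg_mod": "dependent", "arg": "dependent",
    "ncmod": "mod", "cmod": "mod", "xmod": "mod",
    "subj": "arg", "comp": "arg",
    "ncsubj": "subj", "csubj": "subj", "xsubj": "subj",
    "obj": "comp", "clausal": "comp",
    "dobj": "obj", "obj2": "obj", "iobj": "obj",
    "ccomp": "clausal", "xcomp": "clausal",
}

def subsumes(general, specific):
    """True if `general` is `specific` or an ancestor of it, i.e. an
    underspecified GR a parser may emit instead of `specific`."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

print(subsumes("arg", "dobj"))   # True: arg underspecifies dobj
print(subsumes("mod", "iobj"))   # False: iobj is an argument, not a modifier
```

This is exactly the mechanism exploited for the ambiguous Italian example: emitting arg(vedere, Mario) leaves open whether the specific relation is subj or dobj.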
It can be argued quite convincingly that the level of functional annotation (or any other syntactic representation which abstracts away dramatically from the surface ordering of syntactic units in a sentence) is relatively independent of the specific utterance through which grammatical functions happen to be concretely realized. For example, given the following orthographic transcription
i)
I I I go away
where the pronoun "I" is uttered thrice, it still makes sense to say that the subject of "go away" is one (namely the pronoun "I"), and that it just happens to be repeated more than once, owing to some extra-grammatical factors. The neat separation between chunked representations (where concretely realized syntactic units matter) on the one hand and the level of functional representation on the other hand, allows the annotator to get around somewhat puzzling issues such as "which one of the three overtly realized instances of 'I' is the subject of this utterance?". In fact it makes comparatively little sense to associate the label "subject" with any particular token of "I" in i) above. A level of annotation which abstracts away from the level of linear representation embodied in i) achieves this purpose:
subj(go, I,_)
Still, linking the functionally annotated material with elements of i) can be useful. This could be achieved as follows: a) first, the three pronouns in a row are signalled as a repetition at some level of "edited" orthographic transcription; b) a target form ("I") is then added to the surface representation; c) finally, the target form is linked to the functionally annotated material.
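Steps a)-c) can be sketched minimally as follows (the data structures are ours and purely illustrative):

```python
# a) the surface transcription, with the repetition marked by the
#    span of tokens it occupies in the edited transcription:
surface = ["I", "I", "I", "go", "away"]
edited = {"repetition_span": (0, 3), "target": "I"}   # tokens 0-2 collapse to "I"

# b)-c) the functional annotation points at the target form, not at
#    any particular surface token of "I":
gr = ("subj", "go", edited["target"], "_")
print(gr)   # ('subj', 'go', 'I', '_')
```

The point is that the subj relation is stated once, over the target form, while the link back to the surface span records which tokens it stands for.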
Coding book:
No coding book is publicly available. References can be found at http://grid.let.rug.nl:4321
See also Bod and Scha (1997).
Number of annotators:
missing information
Number of annotated dialogues:
21000 sentences, Dutch
Evaluation of scheme:
missing information
Underlying task:
Information-seeking, telephone-mediated human-machine
dialogues for travel/transport domain.
Examples:
no examples available
Mark-up Language:
missing information
Existence of annotation tools:
Annotation was done semi-automatically, using a tool
called SEMTAGS.
Usability:
Used in the OVIS interactive spoken language system
for travel information to users using public transport in the
Netherlands.
Contact person:
Rens Bod (Rens.Bod@let.uva.nl)
List of phenomena annotated:
The OVIS system aims at combining large-vocabulary, speaker-independent continuous speech recognition with natural language processing based on a probabilistic partial parsing approach. The Ovis NLP component is a statistically based language processing system built on the 'Data-Oriented Parsing' framework developed and implemented at the Department of Computational Linguistics of Amsterdam University.
Hesitations, false starts, and additional noises produced by speakers are annotated at the morpho-syntactic level. The following is a slightly more detailed description of information represented at the syntactic and semantic levels of analysis.
1. Syntactic annotation
Syntactic annotation starts from a minimum level consisting in the bracketing of constituents. Sentences are annotated with labelled constituent trees, as in the ATIS corpus. The syntactic categories have been reconsidered to fit the needs of the application, and the original linguistically inspired annotation convention has received considerable revision: in particular, certain rather broad categories were introduced that are non-standard in linguistic theories, for instance a notion of 'modifier-phrase' covering adverbs, PPs, various kinds of conjunctions, and combinations of such constituents. Other ad hoc categories have been introduced to deal with peculiarities of Dutch word order which do not fit well in a purely surface-based syntactic description without features.
The grammar covers most of the common verbal subcategorization types (intransitives, transitives, verbs selecting a pp, and modal and auxiliary verbs), np-syntax (including pre- and postnominal modification, with the exception of relative clauses), pp-syntax, the distribution of vp-modifiers, various clausal types (declaratives, yes/no and wh-questions, and subordinate clauses), all temporal expressions and locative phrases relevant to the domain, and various typical spoken language constructs.
2. Semantic/pragmatic annotation
Every meaningful node is annotated with a formula expressing that meaning; if the meaning of a node depends on its daughter nodes, this formula contains variables referring to those daughter node meanings. When a new tree is constructed out of subtrees with such annotations, it is obvious how to compute the meaning of this tree.
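The compositional computation just described can be illustrated with a toy evaluator. This is our own sketch, not the OVIS formalism: node formulas are modelled as Python functions of the daughter meanings, and the travel-domain slot names are invented for the example:

```python
def meaning(node):
    """node = (formula, daughters).  For a leaf, formula is a
    constant; otherwise it is a function whose parameters stand for
    the daughter meanings (the 'variables' of the annotation)."""
    formula, daughters = node
    if not daughters:
        return formula
    return formula(*[meaning(d) for d in daughters])

# A tree for something like 'from Amsterdam to Utrecht', whose root
# formula combines the two daughter meanings into a slot structure:
leaf = lambda v: (v, [])
tree = (lambda o, d: {"origin": o, "destination": d},
        [leaf("amsterdam"), leaf("utrecht")])
print(meaning(tree))   # {'origin': 'amsterdam', 'destination': 'utrecht'}
```

When a new tree is built out of annotated subtrees, its meaning falls out of the same bottom-up recursion, which is the property the annotation scheme is designed to guarantee.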
Annotation for the Spoken English Corpus (SEC) is
based on the LOB Corpus tag-set. Almost every SEC tag is identical
to its LOB equivalent. The major difference between the tag-sets
is that LOB differentiates between relative and interrogative
WH-pronouns whereas SEC does not. For example, the LOB tag pair
WP (WH-pronoun, interrogative, nominative or accusative) and WPR
(WH-pronoun, relative, nominative or accusative) are covered by
the same SEC tag. Confusingly, this tag is also called WP, but,
unlike for LOB, does not imply that the WH-pronoun is interrogative.
The following table details the major differences between LOB
and SEC with regard to WH-pronouns:
As its name implies, the Spoken English Corpus is composed of transcriptions of spoken English. This inherently means that there will be differences between it and the LOB corpus, which comprises written texts only. Phenomena that occur primarily in written English will not be found in SEC. A good example is written abbreviations: these were marked in LOB in a pre-automatic-tagging phase by adding the sequence '\0' to the start of the abbreviated token, whereas this is not required in SEC.
Some of the LOB tags do not appear in SEC even though, in theory, they would have been allowable. This is because, at just over 52 thousand words, SEC is much smaller than LOB which has over a million words. Naturally, in such a small corpus the coverage of rare parts-of-speech was reduced. This can also explain why annotation of SEC did not call for a significant extension of the LOB tagset.
Further information on the SEC can be found in Taylor
and Knowles (1988) and at the International Computer Archive of
Modern English (ICAME) corpus collection (http://nora.hd.uib.no/corpora.html).
Coding book:
Marie Meteer et al. 1995. Disfluency annotation stylebook for the Switchboard Corpus.
(ftp://ftp.cis.upenn.edu/pub/treebank/swbd/doc/DFL-book.ps)
Number of annotators:
missing information
Number of annotated dialogues:
2430 conversations, more than 240 hours, 3 million
words
Evaluations of scheme:
missing information
Underlying task:
missing information
List of phenomena annotated:
list of relevant phenomena provided below
Examples:
list of relevant phenomena provided below
Mark-up language:
missing information
Existence of annotation tools:
missing information
Usability:
missing information
Contact person:
Linguistic Data Consortium (ldc@ldc.upenn.edu)
Explicit editing terms (such as 'I mean') and discourse markers
(such as 'Well') are annotated respectively as '{E...}' and '{D...}'.
Use of curly brackets allows annotation of a sequence of words, by simply enclosing it in brackets.
Example:
{E I would say}
Filled pauses (hesitators) are marked by '{F}'.
Fragmented or incomplete words are marked in the transcription with '-'.
Example:
you kn-
Transcribed texts are subdivided primarily into so-called "slash units". A slash unit is maximally a sentence but can be a smaller unit. Slash units below the sentence level correspond to those parts of the narrative which are not sentential but which the annotator interprets as complete.
Annotation makes provision for marking sequences of more than one word with one label only by encompassing them between curly brackets.
No specific marker is envisaged for this purpose.
When a turn does not constitute a complete constituent,
it is marked as incomplete with the symbol '-/'. It is possible
for the speaker to continue over more than one turn. In this case,
the annotation guidelines make provision for use of the symbol
'- -'. Combinations of the two symbols mean the following:
'- - -/' interruption with constituent left incomplete and following
completion
Example:
A: I'll do it if - - - /
B: Yeah/
A: - - you wish/
'- - /' interruption with complete slash unit and
following completion
Example:
A: I'll do it - - /
B: Yeah/
A: - - if you wish/
'- -' interruption with neither incomplete constituent
nor complete slash unit, and following completion
Example:
A: If you wish - -
B: Yeah/
A: - - I'll do it/
The entire restart with its repair is contained in square brackets. The Interruption Point is marked by a '+'.
Example:
[ we're + at the same time we're ] real scared
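The bracketed restart notation can be unpacked mechanically; the following is a minimal sketch (the function name is ours) for a slash unit containing a single restart:

```python
def split_repair(slash_unit):
    """Return (reparandum, repair, rest) for a slash unit containing
    one '[ ... + ... ]' restart, with '+' at the interruption point."""
    pre, _, tail = slash_unit.partition("[")
    inside, _, rest = tail.partition("]")
    reparandum, _, repair = inside.partition("+")
    return reparandum.strip(), repair.strip(), (pre + rest).strip()

print(split_repair("[ we're + at the same time we're ] real scared"))
# ("we're", "at the same time we're", 'real scared')
```

Nested or multiple restarts would need a more careful parse; the sketch only covers the simple case shown in the example.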
Syntactic blending is treated as a kind of incomplete slash unit, if the speaker continues speaking but has obviously begun a new slash-unit.
Example:
when it comes to being alone -/ now if you give him the freedom to walk around, he likes that/
The TRAINS project at the University of Rochester Department of Computer Science is a long-term effort to develop an intelligent planning assistant that is conversationally proficient in natural language. The goal is a fully integrated system involving on-line spoken and typed natural language together with graphical displays and GUI-based interaction. The primary application has been a planning and scheduling domain involving a railroad freight system, where the human manager and the system must co-operate to develop and execute plans.
The current system prototype, named TRIPS (The Rochester
Interactive Planning System), involves a more realistic domain
and more complicated planning problems, while continuing the emphasis
on dialogue-based, mixed-initiative interaction.
Coding book:
No coding book is available, but information can
be found in Core and Schubert (1997).
Number of annotators:
missing information
Number of annotated material:
Altogether, the Trains-93 corpus includes 98 dialogs, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5900 speaker turns, and 55,000 transcribed words. The collection and transcription of the dialogues is documented in the technical note "The Trains 93 Dialogues"
(ftp://ftp.cs.rochester.edu/pub/papers/ai/94.tn2.Trains_93_dialogues.ps.gz)
The transcriptions themselves are available at http://www.cs.rochester.edu/research/speech/93dialogs
Evaluations of scheme:
missing information
Underlying task:
Task-driven, application-oriented problem solving
dialogues. The dialogues involve two participants: one who plays
the role of a user and has a certain task to accomplish, and another
who plays the role of the system by acting as a planning assistant.
List of phenomena annotated and examples:
For some of the phenomena annotated at the morpho-syntactic
level, see the general description below.
Mark-up language:
missing information
Existence of annotation tools
For collecting and annotating ``The Trains 93 Dialogues'', a set of tools has been developed for converting a DAT recording into a fully segmented and annotated dialogue. These tools allow the user to progress stepwise through this process: creating the initial dialogue audio file, breaking up the dialogue into a sequence of single-speaker utterance files that preserve the sequentiality of the dialogue, annotating the utterance files, printing the contents of the dialogue, and updating the breakup of the dialogue. These tools are described in the Trains technical note, "Dialogue Transcription Tools" (ftp://ftp.cs.rochester.edu/pub/papers/ai/94.tn1.Dialogue_transcription_tools.ps.Z)
and are available through ftp, as well as on the
CD-ROM. The toolset itself is available in a tar file at ftp://ftp.cs.rochester.edu/pub/packages/dialog-tools/toolset.tar.gz.
Usability:
Used in the TRAINS system.
The collected dialogues have played an integral part
in the Trains project. They have also been used to train a parser
that uses statistical preferences, and to train a part-of-speech
tagger that models speech repairs (cf. Heeman and Allen, 1994)
(ftp://ftp.cs.rochester.edu/pub/papers/ai/94.heeman.ARPA_HLT.ps.Z)
Contact person:
James Allen (james@cs.rochester.edu)
A short description
The TRAINS project is to be mentioned as an example of how the exigencies of spoken language can be accommodated in software development. In particular, the TRAINS project is especially relevant for our purposes in that it adopts an integration rather than normalization strategy (see section 5.1.1 of this report).
The traditional approach consists in removing disfluencies before they reach the parser or in having the parser skip over such material. However reasonable, this approach not only abstracts from real data but also neglects the important roles such segments can play in the dialogue structure. Repairs, for example, can contain referents that are needed to interpret subsequent text (e.g., Take the oranges to Elmira, uh, I mean, take them to Corning).
In contrast to the above strategy, the alternative adopted in TRAINS is a parser-level approach that includes in phrase structure those disfluencies (such as repairs, hesitations and overlapping backchannel acknowledgments) that constitute a common problem for parsers for mixed-initiative dialogues.
To handle the disfluencies in mixed-initiative dialogues caused by repairs, hesitations and acknowledgments, the dialogue parser uses metarules that allow the chart of a dialogue parser to contain parallel syntactic structures (what was first said and its correction) in the case of repairs, and interleaved syntactic structures in the case of interruptions.
The editing term metarule allows constituents to skip over words signaling turn keeping (um, ah) and repairs (I mean).
In the structure allowed by the metarule, a constituent may be interrupted between two subconstituents by one or more editing terms, and a constituent can be interrupted in more than one location.
In the case of overlapping acknowledgments and continuation prompts, such as 'okay', 'right' etc. uttered by the second speaker in overlap with the 'main' talk, the continuation metarule allows a constituent to overlap or be embedded inside another constituent to which it is unconnected. In this way, a constituent can be built across tracks.
An interruption metarule is used to deal with interjected corrections, questions, and comments separately from any repair that may follow. An example of interruption is the following:
u: then e1 will have
s: oh e1
u: right
two boxcars of oranges
In the case of repairs, a repair metarule operates on what is being corrected (or reparandum) and the correction (or the alteration), to build parallel phrase structure trees: one with the reparandum and one with the alteration. For example, for an utterance such as "Take the ban- um the oranges", the repair metarule would build two VPs: take the ban- and take the oranges.
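The effect of the repair metarule on the word strings kept in the chart can be sketched as follows (the index pairs and the function name are ours; the real metarule operates on chart constituents, not flat token lists):

```python
def parallel_strings(tokens, reparandum, alteration):
    """tokens: the full utterance; reparandum and alteration are
    (start, end) index pairs into tokens, the alteration replacing
    the reparandum.  Returns the two parallel word strings the
    parser keeps: one ending in the reparandum, one with the
    alteration substituted for it."""
    r0, r1 = reparandum
    a0, a1 = alteration
    with_reparandum = tokens[:r1] + tokens[a1:]
    with_alteration = tokens[:r0] + tokens[a0:]
    return with_reparandum, with_alteration

# "Take the ban- um the oranges": reparandum 'the ban-',
# alteration 'the oranges' (the editing term 'um' is skipped over
# by the editing term metarule and ignored here).
toks = ["Take", "the", "ban-", "the", "oranges"]
v1, v2 = parallel_strings(toks, (1, 3), (3, 5))
print(v1)   # ['Take', 'the', 'ban-']
print(v2)   # ['Take', 'the', 'oranges']
```

These correspond to the two VPs the repair metarule builds: take the ban- and take the oranges.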
This parsing framework has two relevant consequences. First, it allows the parser to accommodate disfluency phenomena, thus leaving important aspects of dialogue structure untouched. Second, the parser has access to information about the syntactic structure of the utterance and the range of allowed structures, information which is absent from preprocessing and normalizing routines; at the same time, the dialogue parser can still use acoustic cues, pattern matching, and the other sources of information used in preprocessing techniques.