NITE

Natural Interactivity Tools Engineering

Development of a best practice workbench for multi-level/cross-level annotation of natural interactivity data. Funded by EU/HLT, 2001-2003. 7 European partners. Coordinated by NISLab.


1 Objectives

NITE's overall objective is to build an integrated best practice workbench for multi-level, cross-level and cross-modality annotation, retrieval and exploitation of multi-party natural interactive human-human and human-machine dialogue data. NITE will: (i) building on the MATE workbench and the Noldus Observer software, specify the architecture, functionality and usability of the NITE workbench to support best practice handling of speech, gesture and facial expression; (ii) define best practice coding schemes for gesture, facial expression and cross-modality coding; (iii) specify a markup framework; (iv) construct a workbench according to the specification and incorporate the coding schemes for gesture, facial expression and cross-modality issues; (v) thoroughly document the NITE workbench; (vi) ensure usability for both expert and (computationally) novice user communities through extensive testing with differently skilled users; (vii) iteratively evaluate the workbench with users world-wide to establish it as a standard.

2 Description of the work

NITE (Natural Interactivity Tools Engineering) aims to build an integrated best practice workbench for multi-level, cross-level and cross-modality annotation, retrieval and exploitation of multi-party natural interactive human-human and human-machine dialogue data. This aim will be pursued in the following way:
(a) building on the MATE (Multilevel Annotation Tools Engineering) workbench and the Noldus Observer software, specify the architecture, functionality and usability of the NITE workbench to support best practice handling of speech, gesture and facial expression;
(b) define best practice coding schemes for gesture, facial expression and cross-modality (and cross-level) coding of speech, gesture and facial expression data;
(c) specify a markup framework (a purely illustrative sketch of such markup is given at the end of this section);
(d) construct a workbench according to the specification, incorporating best practice coding schemes for gesture, facial expression and cross-modality issues, tools for handling video, tools for integrated presentation of multimodal communication, and tools for extracting and viewing multimodal information from the annotated corpora;
(e) thoroughly document the NITE workbench;
(f) ensure usability for both expert and (computationally) novice user communities through extensive testing with differently skilled users;
(g) iteratively evaluate the workbench with users world-wide to promote it as a standard tool in the field.
In addition to making use of the results of the MATE project, whose maintenance is supported by ELSNET (the European Language and Speech Network), NITE will incorporate the results of the surveys of existing natural interactivity coding schemes, tools and corpora currently being produced by the ISLE (International Standards for Language Engineering) Working Group on Natural Interactivity and Multimodality. NITE will also cooperate with the German SmartKom project. In close cooperation with the ISLE Working Group just mentioned, NITE will take part in the world-wide discussion of standards for the field, involving such actors as the US NIST (National Institute of Standards and Technology), the US LDC (Linguistic Data Consortium), the European ELRA (European Language Resources Association) and the US TalkBank project, among others.
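To make the notions in (a)-(d) concrete, the fragment below sketches how a short exchange might be encoded: a time-aligned speech transcription level, a gesture level on the same timeline, and a cross-modality link tying the two together. Every element and attribute name here is an illustrative assumption made for this description only; the actual NITE markup framework is a project deliverable that will be fixed by the specification work in (c).

  <dialogue id="d1">
    <!-- Speech level: time-aligned orthographic transcription -->
    <speech>
      <utterance id="u1" speaker="A" start="0.00" end="1.80">
        <w id="w1" start="0.00" end="0.40">put</w>
        <w id="w2" start="0.40" end="0.65">it</w>
        <w id="w3" start="0.65" end="1.10">there</w>
      </utterance>
    </speech>
    <!-- Gesture level: annotated on the same timeline -->
    <gesture-track>
      <gesture id="g1" type="deictic" start="0.70" end="1.30" hand="right"/>
    </gesture-track>
    <!-- Cross-modality level: links the deictic word to the pointing gesture -->
    <cross-modal-link id="x1" speech-ref="w3" gesture-ref="g1" relation="complementary"/>
  </dialogue>

The cross-modality link records exactly the kind of relation at stake here: the spoken word "there" and the pointing gesture carry complementary information about the same referent, and neither level can be fully interpreted without the other.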

3 Milestones and expected results

Month 6:  NITE workbench specification
Month 12: Best practice coding schemes for gesture, facial expression and cross-modality annotation
Month 12: NITE workbench version 1
Month 14: Evaluation results for NITE workbench version 1
Month 21: NITE workbench version 2
Month 24: Evaluation results for NITE workbench version 2

4 Towards natural interactive systems

Since the first speaker-independent spoken language dialogue systems (SLDSs) appeared on the market ten years ago, SLDS technologies have made steady progress, tackling increasingly complex tasks in a growing range of languages, in noisier environments, etc. [Bernsen et al. 1998]. Speech interfaces no doubt represent a major step towards the creation of natural interactive systems which are able to communicate and exchange information with humans in the same ways in which humans communicate and exchange information with one another. As a modality (i.e. a form of information representation) for information exchange, speech is mastered by almost all people, children as well as adults: it is learnt from early on and used in most everyday communication. However, even if speech may be the most fundamental and natural way for people to exchange information, it is not the only one. In fact, in natural communication, the exchange of information using speech is accompanied by redundant and complementary exchanges of information using non-verbal communication behaviour, such as lip movements, deictic and other forms of gesture, facial expression, gaze and bodily posture. Just like speech, these modalities are learnt from early on and are used and understood by almost everybody. It seems obvious, therefore, that computer systems which understand and produce speech, such as SLDSs, are merely the first step towards systems which can produce and understand several or all of the mentioned modalities of natural communication when addressing, or being addressed by, humans.
Although SLDSs are still limited in many ways, for instance in being shared-goal-only and able to communicate only in limited domains, the next steps towards natural interactivity are well underway. Computer systems which recognise, understand and/or produce lip movements, gesture, facial expression and bodily posture have been researched for more than a decade [Parke 1982, Massaro and Cohen 1983, Thalmann and Thalmann 1991, Badler et al. 1993]. Lip-movement generation software already exists, together with software packages for the generation of facial expression, movement and bodily posture. Research prototypes exist which combine speech and gesture input using input devices such as mouse, pen, touch screen, data gloves and data suits, or which interpret facial expressions [Sannier et al. 1998, Popescu et al. 1999, Molet et al. 1997, Benoit et al. 2000]. There is still a long way to go, however, before computer output using speech, lip movements, gesture, facial expression, gaze and bodily posture can be generated synchronously on the fly with a naturalness close to that of natural human communication. And it may be assumed that there is an even longer way to go before systems are capable of decoding real-time 'communicator-independent' user input behaviour consisting of combinations of verbal and non-verbal communication. Still, the first commercial steps have already been taken, and the next steps, including the first multimodal SLDSs, are likely to follow with increasing rapidity. The application scope of such systems is virtually unlimited, not just for human-system communication purposes, including for people with special needs, but also for general use in sophisticated animation software, such as 3-D avatar software.

5 The need for standardisation of coding schemes, markup languages and coding technology

Today's SLDSs are the result of a long history of progress in speech understanding and generation. This progress has demanded, and increasingly demands, annotated data in order to proceed. For the efficient creation, use and re-use of annotated corpora, standardisation of coding schemes, markup languages and coding technology is such an obvious need that the objectives of the MATE project (see below) have received broad approval world-wide. For the efficient and successful development of natural interactive systems more generally, the need for appropriate coding schemes, markup languages, large amounts of annotated data, and easy-to-use coding technology is just as evident as it is for speech-only technology development. This need for support of the annotation of multimodal corpora, and for tools which facilitate the retrieval and exploitation of the annotated information, is being felt widely in research laboratories and increasingly among technology providers. As in the case of speech, only more so, these needs include standardisation, both to facilitate re-use and exploitation of existing resources and to ensure that novel data can be exploited to the full. To spare the general field of natural interactive systems the problems and bottlenecks caused by lack of standardisation and tools, which have been felt for years in the field of spoken language technologies, it seems timely to identify and/or develop state-of-the-art annotation schemes and technology for natural interactivity with a view to preparing standards in the field.
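To fix intuitions about what is to be standardised: a coding scheme, in this context, is essentially a controlled vocabulary of labels together with instructions for applying them, and a markup language gives the scheme a machine-readable form. As a purely hypothetical illustration (not a NITE deliverable), a minimal gesture coding scheme based on the four gesture types of McNeill's widely used classification could be declared as an XML DTD fragment along these lines:

  <!-- Hypothetical declaration of a minimal gesture coding scheme -->
  <!ELEMENT gesture-track (gesture*)>
  <!ELEMENT gesture EMPTY>
  <!ATTLIST gesture
    id    ID                               #REQUIRED
    type  (deictic|iconic|metaphoric|beat) #REQUIRED
    hand  (left|right|both)                #IMPLIED
    start CDATA                            #REQUIRED
    end   CDATA                            #REQUIRED>

Standardising at this level means that two laboratories annotating different corpora assign the same labels, with the same intended meaning, in the same machine-readable form, so that their corpora and tools can be exchanged and compared.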
Without standardisation and widely used tools, teams and projects have to create and annotate corpora using home-grown coding schemes and often primitive coding technology. This is very costly, and the results are unlikely to be re-used to any significant extent. The consequence is that fewer annotated corpora are created than are needed, fewer annotated corpora and tools are re-used than could be, and teams and projects tend to orient themselves towards the small pool of already existing annotated resources rather than creating the resources they really need. Progress, in other words, is significantly slower than it would have been with standardised tools across the board.

6 The MATE approach and its results

The LE (Language Engineering) Telematics project MATE (Multilevel Annotation Tools Engineering, project number LE4-8370), which ended on 31 December 1999, aimed to facilitate re-use of language resources by addressing the problems caused by lack of standardisation of markup schemes and tools for annotating and retrieving information from spoken language corpora. Taking several steps beyond the state of the art at the time, MATE successfully specified, designed, implemented and tested the MATE workbench, a toolset for multi-party, multi-level and cross-level spoken dialogue annotation [http://mate.nis.sdu.dk]. To do so, MATE developed and followed a methodology of (i) survey and evaluation of existing coding schemes at multiple levels, (ii) development of a standard framework for annotation, (iii) selection and development of state-of-the-art coding schemes, and (iv) state-of-the-art technology implementation using XML (Extensible Markup Language) for corpus file representation. The MATE workbench enables annotation at the levels of prosody, (morpho-)syntax, co-reference, dialogue acts and human-machine (spoken) communication problems, as well as cross-level annotation. A standard framework for multilevel markup and co-existing annotations has been developed and implemented in the workbench, which comes with a set of best practice coding schemes for the levels mentioned plus various tools for adding new coding schemes and levels, for querying, retrieving and viewing information, and for converting to/from different file formats. The MATE workbench is already being used to annotate corpora in Italian, English, German and French research projects. It will be made Open Source software in the autumn of 2000. ELSNET has kindly agreed to support communication with users during the first Open Source phase.
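To give a feel for what querying and retrieving information across levels means in practice, the sketch below answers the question 'which words co-occur in time with a deictic gesture?' over the illustrative annotation fragment from Section 2. It is a minimal toy in Python, assuming the hypothetical element and attribute names introduced there; it is not MATE's actual query language or implementation, only an indication of the kind of retrieval such a workbench supports.

  import xml.etree.ElementTree as ET

  # The illustrative annotation fragment from Section 2 (element and
  # attribute names are assumptions, not MATE's or NITE's actual format).
  ANNOTATION = """
  <dialogue id="d1">
    <speech>
      <utterance id="u1" speaker="A" start="0.00" end="1.80">
        <w id="w1" start="0.00" end="0.40">put</w>
        <w id="w2" start="0.40" end="0.65">it</w>
        <w id="w3" start="0.65" end="1.10">there</w>
      </utterance>
    </speech>
    <gesture-track>
      <gesture id="g1" type="deictic" start="0.70" end="1.30" hand="right"/>
    </gesture-track>
  </dialogue>
  """

  def span(elem):
      # The (start, end) time interval of an annotated element, in seconds.
      return float(elem.get("start")), float(elem.get("end"))

  def overlaps(a, b):
      # Two intervals overlap iff each one starts before the other ends.
      return a[0] < b[1] and b[0] < a[1]

  root = ET.fromstring(ANNOTATION)
  deictics = [span(g) for g in root.iter("gesture") if g.get("type") == "deictic"]

  # Cross-level query: which words co-occur in time with a deictic gesture?
  hits = [(w.get("id"), w.text) for w in root.iter("w")
          if any(overlaps(span(w), g) for g in deictics)]
  print(hits)  # [('w3', 'there')]: "there" overlaps the pointing gesture g1

In a real workbench this kind of query would of course be expressed in a dedicated query language rather than written by hand, but the underlying operation, intersecting time spans across annotation levels, is the same.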

7 The NITE objectives

NITE will address the ambitious and highly non-trivial problem of generalising the existing MATE workbench to provide users with a standardised environment in which to annotate corpora from the entire domain of natural interactivity. Thus, NITE will build an integrated best practice workbench for multi-level, cross-level and cross-modality annotation, retrieval and exploitation of multi-party natural interactive human-human and human-machine dialogue data. To do so, NITE will follow the methodology and most of the basic software standards that have proven successful in MATE, pursuing the steps (a)-(g) set out in Section 2: specification of the architecture, functionality and usability of the workbench, building on the MATE workbench and the Noldus Observer software; definition of best practice coding schemes for gesture, facial expression and cross-modality coding; specification of a markup framework; construction and thorough documentation of the workbench, including tools for handling video, for integrated presentation of multimodal communication, and for extracting and viewing multimodal information from annotated corpora; extensive usability testing with differently skilled expert and (computationally) novice users; and iterative world-wide evaluation to promote the workbench as a standard tool in the field.
By the end of the project, NITE will have produced a set of well-tested and well-documented tools, methods and guidelines for the creation, annotation, and application of multi-party natural interactive dialogue corpora. The results of NITE are relevant to a wide range of stakeholder groups, including: producers of multimodal SLDSs and intelligent information presentation systems; researchers and developers annotating single-party or multi-party speech, gesture or facial expression, or their combinations; producers of any technology incorporating parts of natural interactive communication, such as technology for the disabled and advanced computer graphics (animations, animated agents, avatars); and producers of data capture technologies.


Main Contact:

Christoph Lauer

Members of the NITE team in Saarbrücken:

Michael Kipp

Norbert Reithinger

Ralf Engel



Christoph Lauer, May 2001