Master's thesis, DFKI GmbH / Universität des Saarlandes, 2010.
In this thesis we develop an approach for determining contextually appropriate intonation of clarification requests raised during continuous and cross-modal learning in autonomous robots. Autonomous robots that self-understand and self-extend in the environment in which they find themselves learn continuously about their surroundings. In the course of learning, a robot might require additional information from its human interlocutor. Spoken dialogue is a means through which robots can ask their interlocutor for new information and clarify the knowledge they have acquired about the situated environment. The ability to self-initiate a dialogue, besides adding autonomy to a robot's behavior, also allows the robot to connect its belief state to that of its listener. This enables the participating agents to perform grounding and arrive at a common ground. A robot's grounding feedback is one of the means of arriving at a common ground. When a robot uses grounding feedback (e.g. a clarification request) in a given context, it must be clear how the utterance relates to the preceding context and what it focuses on. Intonation is one means of indicating this relation to context. The task of making the grounding feedback utterances of conversational robots contextually appropriate therefore inevitably also involves intonation assignment.

Following the analysis of Purver et al. on the forms of clarifications in human dialogue, we develop strategies for formulating clarification requests in human-robot dialogue. The form of a clarification request, its content, and its intonation are all strongly influenced by current contextual details. We represent these contextual factors, communicative intentions, and the corresponding utterance meanings at all levels of processing in ontologically rich relational structures based on Hybrid-Logic Dependency Semantics (HLDS).
As for intonation, we combine the approaches of Steedman [2000a], Lambrecht, and Engdahl to intonation assignment based on information structure (IS), an underlying partitioning of utterance content that reflects its relation to discourse context. The IS units are represented within the same HLDS structure. To achieve prosodic realization from the same grammar as is used for utterance realization, we extend our OpenCCG grammar to cover prosody. Following the model of combinatory intonation of Pierrehumbert and Hirschberg, we add categories for pitch accents and boundary tones to our grammar. The best realizations, in terms of contextual appropriateness of both the utterance content and its intonation contour, are then post-processed into MaryXML format. This format is finally fed to the MARY text-to-speech synthesizer for production.

For empirical verification of this approach, we set up psycholinguistic experiments to see whether differences in the placement of the main accent in clarification requests are perceivable in synthesized speech, and whether the situated context licenses these accent placements. The preliminary analysis of the data provides evidence that subjects prefer accent placements that are congruent with the visual scene over those that are not.
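
To make the post-processing step into MaryXML more concrete, the following is a minimal sketch, not the thesis implementation. It assumes that a realization is available as a list of (token, pitch accent) pairs plus one boundary tone, and it emits `t` elements carrying an `accent` attribute followed by a `boundary` element carrying a `tone` attribute. The function name `to_maryxml`, the example utterance, the ToBI labels chosen, and the `version` and `xml:lang` values are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace commonly used by MaryXML documents (assumed here, not taken from the thesis).
MARYXML_NS = "http://mary.dfki.de/2002/MaryXML"


def to_maryxml(words, boundary_tone, lang="en-GB"):
    """Serialize an accent-annotated realization as a MaryXML-style string.

    words: list of (token, accent) pairs; accent is a ToBI label or None.
    boundary_tone: ToBI boundary tone placed at the end of the phrase.
    """
    root = ET.Element(
        "maryxml",
        {"xmlns": MARYXML_NS, "version": "0.4", "xml:lang": lang},
    )
    sentence = ET.SubElement(ET.SubElement(root, "p"), "s")
    phrase = ET.SubElement(sentence, "phrase")
    for token, accent in words:
        t = ET.SubElement(phrase, "t")
        t.text = token
        if accent is not None:
            # Only accented words carry the attribute.
            t.set("accent", accent)
    # One phrase-final boundary with its tone.
    ET.SubElement(phrase, "boundary", {"tone": boundary_tone, "breakindex": "4"})
    return ET.tostring(root, encoding="unicode")


# Hypothetical clarification request with the main accent on the colour word.
example = to_maryxml(
    [("is", None), ("the", None), ("ball", None), ("red", "L+H*")],
    boundary_tone="H-H%",
)
print(example)
```

Moving the `L+H*` label to a different token (for instance `ball`) yields the kind of accent-placement contrast that the perception experiments compare.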