Contextually Appropriate Intonation of Clarification Requests in Situated Human-Robot Dialogue
Raveesh Meena
Master's thesis, DFKI GmbH / Universität des Saarlandes, 2010.
Abstract
In this thesis we develop an approach for determining the contextually appropriate intonation of clarification requests raised during continuous and cross-modal learning in autonomous robots.
Autonomous robots that self-understand and self-extend learn continuously about the environment in which they find themselves. In the course of this learning, a robot might require additional information from its human interlocutor. Spoken dialogue is a means through which a robot can ask its interlocutor for new information and clarify the knowledge it has acquired about its situated environment.
The ability to self-initiate a dialogue not only adds autonomy to a robot's behavior but also allows the robot to connect its belief state to that of its listener. This enables the participating agents to perform grounding and arrive at common ground. A robot's grounding feedback is one means to this end. When a robot produces grounding feedback (e.g. a clarification request) in a given context, it must be clear how the utterance relates to the preceding context and what it focuses on. Intonation is one means of indicating this relation to context. Making the grounding feedback utterances of conversational robots contextually appropriate therefore inevitably also involves intonation assignment.
Following the analysis by Purver et al. [2003] of the forms of clarification in human dialogue, we develop strategies for formulating clarification requests in human-robot dialogue. The form of a clarification request, its content, and its intonation are all strongly influenced by the current context. We represent these contextual factors, communicative intentions, and the corresponding utterance meanings at all levels of processing in ontologically rich relational structures based on Hybrid Logic Dependency Semantics (HLDS).
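To give a hypothetical flavor of such a structure (the ontological sorts and dependency relations below are illustrative assumptions, not the thesis's actual inventory), a polar clarification request such as "Is the ball red?" could be encoded as a hybrid logic formula in which nominals name discourse referents and modal relations encode semantic dependencies:

    % Illustrative HLDS-style logical form for the clarification
    % request "Is the ball red?"; sorts and relation labels are assumed.
    @_{a_1:\mathit{ascription}}\bigl(\mathit{be}
        \wedge \langle\mathit{Mood}\rangle\,\mathit{int}
        \wedge \langle\mathit{Cop\text{-}Restr}\rangle\,(b_1:\mathit{thing} \wedge \mathit{ball})
        \wedge \langle\mathit{Cop\text{-}Scope}\rangle\,(r_1:\mathit{quality} \wedge \mathit{red})\bigr)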
For intonation, we combine the approaches of Steedman [2000a], Lambrecht [1994], and Engdahl [2006] to intonation assignment based on information structure (IS), an underlying partitioning of utterance content that reflects its relation to the discourse context. The IS units are represented within the same HLDS structures.
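As a standard illustration in the spirit of Steedman [2000a] (not an example drawn from the thesis itself), an answer to "What colour is the ball?" partitions into a theme carrying the L+H* LH% tune and a rheme carrying the H* LL% tune:

    % Tune-to-IS mapping per Steedman [2000a]: L+H* LH% marks the theme,
    % H* LL% marks the rheme (the example sentence is illustrative).
    \underbrace{\text{The BALL}}_{\text{theme: L+H* LH\%}}
    \quad
    \underbrace{\text{is RED.}}_{\text{rheme: H* LL\%}}

Shifting the main (H*) accent to a different word would signal a different rheme; this is precisely the contextual appropriateness at stake here.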
To achieve prosodic realization from the same grammar used for utterance realization, we extend our OpenCCG grammar for prosody: following the intonational model of Pierrehumbert and Hirschberg [1990], we add categories for pitch accents and boundary tones to the grammar. The best realizations, in terms of the contextual appropriateness of both utterance content and intonation contour, are then post-processed into MaryXML format, which is finally fed to the MARY text-to-speech synthesizer for production.
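A schematic sketch of what such a MaryXML fragment might look like for the clarification request "Is the ball red?" (the document version, tokenization, and tone values here are assumptions; MaryXML allows ToBI pitch accents to be specified on tokens and boundary tones on boundary elements):

    <?xml version="1.0" encoding="UTF-8"?>
    <maryxml version="0.4" xmlns="http://mary.dfki.de/2002/MaryXML"
             xml:lang="en-US">
      <s>
        <!-- accent placement as chosen by the realizer (illustrative) -->
        <t>is</t>
        <t>the</t>
        <t accent="L+H*">ball</t>
        <t accent="H*">red</t>
        <!-- final rise typical of a polar question -->
        <boundary tone="H-H%" breakindex="4"/>
      </s>
    </maryxml>

Accent and boundary specifications of this kind let the synthesizer reproduce the contour chosen by the realizer.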
For empirical verification of this approach, we set up psycholinguistic experiments to test whether differences in the placement of the main accent in clarification requests are perceivable in synthesized speech, and whether the situated context licenses these accent placements. A preliminary analysis of the data provides evidence that subjects prefer accent placements that are congruent with the visual context over those that are not.