Humans intuitively combine language with spontaneous gesture to form multimodal utterances.
In such utterances, words and gestures appear highly coordinated and closely intertwined;
in other words, they are aligned to each other by the human speaker. These alignments concern the
meaning that the verbal and non-verbal behaviours convey, the form they take in doing so,
the manner in which they are performed, their relative temporal arrangement, and their coordinated
organization within the phrasal structure of the utterance. Their effects are essential for how meaning is
communicated by both modalities in concert. The resulting confluence of language and gesture
has led many researchers (e.g. McNeill, 1992) to believe that speech and gesture are products of
the same generative process, starting from one ideational complex and comprising significant
interactions between speech and gesture. Yet how language and gesture align in producing
a coherent multimodal utterance remains an open question.
Our goal is to systematically investigate the ways in which speech and gesture align within
multimodal utterances in dialogue, and we aim to achieve an understanding of the underlying,
intra-personal mechanisms allowing us to model the generation of coordinated language and
gesture for embodied conversational agents (ECA). We will focus on deictic gestures, which
directly point to a location or region in space, as well as on iconic gestures that impart visual
information to the utterance depicting what is being referred to (including the fusion of both
functions within single gestures). Concretely, we investigate the following research questions:
- What kind(s) of meaning do people convey in concurrent speech and gesture to pursue their
communicative intentions?
At the level of meaning construction, we want to find out about the composition, representation,
and distribution of meaning as it comes to be expressed in speech and gesture.
- What forms do speech and gesture take to convey this meaning in context?
Concerning deictic gestures, we study the pointers' "pointing cones", i.e. the domains singled out
by pointing gestures. For iconic gestures, the mapping between meaning and form is still poorly understood:
What particular gesture forms do speakers use to create a coverbal depiction of certain spatial aspects
of a referent? And what particular pieces of spatial meaning do speakers choose to convey?
- How are speech and gesture organized across as well as within incrementally produced, multimodal
deliveries?
We want to investigate to what extent self-monitoring can explain the portioning of communicative intentions
and content into idea units. Self-monitoring can be regarded as a special case of alignment: monitoring
one's own utterance as it is being produced creates representations that are constantly compared to the
intended representations, e.g. to detect failures in speech production.
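As a purely illustrative aid for the "pointing cone" notion used in the second research question, the following minimal sketch treats the cone as a geometric region defined by an apex (e.g. the fingertip), a pointing direction, and a half-angle, and tests whether a candidate referent position falls inside it. The function name `in_pointing_cone`, the coordinate convention, and the half-angle values are assumptions made for this sketch; they are not taken from the project's own model or data.

```python
import math

def in_pointing_cone(apex, direction, target, half_angle_deg):
    """Return True if `target` lies inside the cone that opens from `apex`
    along `direction` with the given half-angle (in degrees).

    All names and the cone parametrisation are hypothetical illustrations.
    """
    axis_len = math.sqrt(sum(c * c for c in direction))
    if axis_len == 0.0:
        raise ValueError("direction must be a non-zero vector")

    # Vector from the apex of the cone to the candidate referent.
    to_target = [t - a for t, a in zip(target, apex)]
    dist = math.sqrt(sum(c * c for c in to_target))
    if dist == 0.0:
        return True  # the target coincides with the apex

    # Angle between the pointing axis and the apex-to-target vector.
    cos_angle = sum(a * b for a, b in zip(to_target, direction)) / (dist * axis_len)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle)) <= half_angle_deg


# Example: pointing along the x-axis with an assumed 10-degree half-angle.
near_axis = in_pointing_cone((0, 0, 0), (1, 0, 0), (2.0, 0.1, 0.0), 10.0)
off_axis = in_pointing_cone((0, 0, 0), (1, 0, 0), (1.0, 1.0, 0.0), 10.0)
```

A wider half-angle singles out a larger domain, so an empirically estimated angle would directly determine how many candidate referents a single pointing gesture leaves open.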
Investigating these topics encompasses the empirical study and analysis of human behavior
as well as the conception of computational models of the processes involved and
their realisation in virtual humans. Our empirical studies are expected to elicit sets of dialogue
games, which will be annotated in order to apply statistical methodologies to extract significant
patterns in the data. Based on the patterns and behavioral units found in data analysis, we will model
the generation process that renders content representations and communicative intentions into verbal and
gestural behavior. As a starting point, our modelling approach will rely on the multi-stage production
process conceived for the generation of natural language (Reiter & Dale, 2000). Possible aligning
actions between the two modalities will be addressed both within each stage and between any two
stages. This model of the generation process will directly inform the implementation of a prototype
simulation system embedded in our virtual human MAX.
Project Team
Cooperations
- B1 will cooperate with "A1 - Modelling partners" and "A2 - Processing of implicit common ground" on representations of dialogue states, common grounds, and communicative intent.
- B1 will also cooperate with A1 and "C1 - Interaction Space" on the overall technical setup for the prototype implementation, including a multimodal humanoid agent in a VR environment.
- The empirical data gathered will also include check-backs, corrections and denials. It thus allows studying how implicit common ground is corrected, established and acknowledged, which are central research issues of the projects A2 and "C3 - Repairs and reformulations in dialogue".
- The annotation of gesture morphology conducted in B1 will inform the body coding representation to be developed in C1. Project C1, which is working on gesture imitation involving meaning-level analysis and representation of gestural behaviour, will also employ and test B1's results on meaning-form mappings in the gesture imitation scenario. Additionally, the mechanisms of outer-loop self-monitoring developed in B1 represent a starting point in C1 for the monitoring of others and ascribing them communicative intent.
- B1 relies on very fine-grained measurements of the alignment of gesture and speech. A case in point is the synchrony between the gesture stroke and the onset of NL expressions. Based on preliminary work, relevant statistical tools for obtaining these data will be developed and maintained within the project "X1 - Multimodal alignment corpora: statistical modeling and information management".
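To make the kind of synchrony measurement mentioned in the last item concrete, here is a minimal sketch that computes the temporal offset between the start of a gesture stroke and the onset of its affiliated verbal expression. The annotation format (pairs of integer millisecond timestamps), the function name `stroke_onset_offsets`, and the sample values are all invented for illustration; they do not reflect the project's corpus or the tools to be developed in X1.

```python
from statistics import mean

def stroke_onset_offsets(pairs):
    """Given (stroke_start_ms, speech_onset_ms) pairs, return the offset
    speech_onset - stroke_start for each pair, in milliseconds.

    Positive values mean the stroke starts before the verbal expression.
    The input format is a hypothetical simplification of annotated data.
    """
    return [speech_onset - stroke_start for stroke_start, speech_onset in pairs]


# Invented sample annotations: (gesture stroke start, speech onset) in ms.
annotations = [(1200, 1350), (4000, 3900), (7500, 7500)]
offsets = stroke_onset_offsets(annotations)
mean_offset = mean(offsets)
```

On real annotations, the distribution of such offsets (rather than only their mean) would be the starting point for any statistical analysis of synchrony.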
last modification: 26.07.2006