SFB 673, Project B1

Humans intuitively combine language with spontaneous gesture to form multimodal utterances. In such utterances, words and gestures appear highly coordinated and closely intertwined; in other words, the speaker aligns them with each other. These alignments concern the meaning that the verbal and non-verbal behaviours convey, the form they take in doing so, the manner in which they are performed, their relative temporal arrangement, and their coordinated organization in the phrasal structure of the utterance. Their effects are essential for how meaning is communicated by both modalities in concert. The resulting confluence of language and gesture has led many researchers (e.g. McNeill, 1992) to believe that speech and gesture are products of the same generative process, starting from one ideational complex and comprising significant interactions between speech and gesture. Yet it is still an open question how language and gesture align in producing a coherent multimodal utterance.

Our goal is to systematically investigate the ways in which speech and gesture align within multimodal utterances in dialogue. We aim to achieve an understanding of the underlying intra-personal mechanisms that allows us to model the generation of coordinated language and gesture for embodied conversational agents (ECAs). We will focus on deictic gestures, which point directly to a location or region in space, and on iconic gestures, which add visual information to the utterance by depicting what is being referred to (including the fusion of both functions within single gestures). Concretely, we investigate the following research questions:

What kind(s) of meaning do people convey in concurrent speech and gesture to pursue their communicative intentions?
At the level of meaning construction, we want to determine how meaning is composed, represented, and distributed as it comes to be expressed in speech and gesture.
What forms do speech and gesture take to convey this meaning in context?
For deictic gestures, we study the pointers' "pointing cones", i.e. the domains singled out by pointing gestures. For iconic gestures, the mapping from meaning to form is still poorly understood: What particular gesture forms do speakers use to create a coverbal depiction of certain spatial aspects of a referent? And what particular pieces of spatial meaning do speakers choose to convey?
How are speech and gesture organized both across and within incrementally produced, multimodal deliveries?
We want to investigate to what extent self-monitoring can explain the portioning of communicative intentions and content into idea units. Self-monitoring can be regarded as a special case of alignment: Monitoring one's own utterance as it is being produced creates representations that are constantly compared to the intended representations, e.g. to detect failures in speech production.

Investigating these topics encompasses the empirical study and analysis of human behaviour as well as the conception of computational models of the processes involved and their realisation in virtual humans. Our empirical studies are expected to elicit sets of dialogue games, which will be annotated so that statistical methods can be applied to extract significant patterns in the data. Based on the patterns and behavioural units found in data analysis, we will model the generation process that renders content representations and communicative intentions into verbal and gestural behaviour. As a starting point, our modelling approach will rely on the multi-stage production process conceived for the generation of natural language (Reiter & Dale, 2000). Possible aligning actions between the two modalities will be addressed both within each stage and between any two stages. This model of the generation process will directly inform the implementation of a prototype simulation system embedded in our virtual human MAX.

Project Team

Cooperations