Universität Bielefeld Technische Fakultät - AG Wissensbasierte Systeme

Lifelike Gesture Synthesis and Timing for Conversational Agents

Ipke Wachsmuth

Faculty of Technology
University of Bielefeld


Besides the inclusion of gesture recognition devices as an intuitive input modality, the synthesis of lifelike gesture is receiving growing attention in human-computer interface research. In particular, the generation of synthetic gesture in connection with text-to-speech systems is one of the goals for embodied conversational agents, which have become a new paradigm for the study of gesture and for the human-computer interface. Embodied conversational agents are computer-generated characters that exhibit properties similar to humans in verbal and nonverbal face-to-face conversation. Although promising work exists on the production of synthetic gestures, natural timing of the gesture stroke and its synchronization with speech output remain a research challenge.

A mid-range goal of our research is the conception of an "articulated communicator" that conducts multimodal dialogue with a human partner while cooperating on a model airplane construction task. In this context an operational model was developed that enables lifelike gesture animations to be rendered in real time from representations of spatiotemporal gesture knowledge. Based on various findings on the production of human gesture, the model provides means for motion representation, planning, and control to drive the kinematic skeleton of a figure that comprises 43 degrees of freedom (DOF) in 29 joints for the main body and 20 DOF for each hand. A movement plan is formed as a tree representation of a temporally ordered set of movement constraints in three steps:
(1)  retrieve a feature-based specification from a gestuary,
(2)  adapt it to the individual gesture context,
(3)  qualify temporal movement constraints in accordance with external timing constraints.
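The three-step formation of a movement plan can be sketched in code. This is a minimal illustrative sketch, not the system's actual API: the gestuary contents, the class names, and all feature strings are assumptions made up for this example; only the overall structure (a tree of temporally ordered movement constraints, filled in by retrieve/adapt/qualify steps) follows the description above.

```python
# Hypothetical sketch of the three-step movement-plan formation.
# All names and data here are illustrative assumptions, not the
# actual articulated communicator's interfaces.

from dataclasses import dataclass, field
from typing import List

@dataclass
class MovementConstraint:
    feature: str        # e.g. a hand-shape or wrist-location feature
    start: float = 0.0  # onset time in seconds, set in step (3)
    end: float = 0.0    # offset time in seconds, set in step (3)

@dataclass
class MovementPlan:
    """Tree node holding a temporally ordered set of movement constraints."""
    label: str
    constraints: List[MovementConstraint] = field(default_factory=list)
    children: List["MovementPlan"] = field(default_factory=list)

# A toy gestuary: a lexicon mapping gesture names to feature-based
# specifications (invented entries).
GESTUARY = {
    "pointing": ["index-finger-extended", "arm-extended-to-target"],
}

def retrieve(gesture_name: str) -> MovementPlan:
    """Step (1): retrieve the feature-based specification from the gestuary."""
    features = GESTUARY[gesture_name]
    return MovementPlan(gesture_name,
                        [MovementConstraint(f) for f in features])

def adapt(plan: MovementPlan, target: str) -> MovementPlan:
    """Step (2): adapt the specification to the individual gesture
    context, here by substituting the concrete target location."""
    for c in plan.constraints:
        c.feature = c.feature.replace("target", target)
    return plan

def qualify(plan: MovementPlan, stroke_onset: float,
            duration: float) -> MovementPlan:
    """Step (3): qualify the temporal movement constraints in accordance
    with an external timing constraint (e.g. from speech output)."""
    for c in plan.constraints:
        c.start, c.end = stroke_onset, stroke_onset + duration
    return plan

plan = qualify(adapt(retrieve("pointing"), "left-wing"), 1.2, 0.4)
```

The resulting plan could then be handed to motion control to drive the figure's kinematic skeleton, with the qualified onset and offset times governing the stroke.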

As the model is particularly conceived to enable natural cross-modal integration by taking into account temporal synchrony constraints, further work includes the integration of speech-synthesis techniques as well as run-time extraction of temporal constraints for the coordination of gesture and speech.

For first demos, see here; more generally, see the VR Lab Showcase.

2p abstract [pdf] - slides [pdf; 1.1M] - draft paper (comments welcome) [pdf]

Talk given: London 20-04-01, Heidelberg 25-06-01, Chicago 14-07-01, Edinburgh (Poster) 3-08-01

Ipke Wachsmuth, 2001-08-23