
Max and the Articulated Communicator Engine

The Articulated Communicator Engine (ACE) is a toolkit for building animated embodied agents that are able to generate human-like multimodal utterances. It provides easy means of building an agent by defining a kinematic body model, and of specifying the output the agent is to generate by describing the desired overt form of an utterance in an XML description language (MURML, a multimodal utterance representation markup language). The toolkit takes care of the rest, namely creating natural coverbal gestures, synthetic speech, and facial animations on the fly and coordinating them in a natural fashion. One demonstration application is Max, the Multimodal Assembly eXpert. Max is situated in a virtual environment for cooperative construction tasks, where he multimodally demonstrates to the user the construction of complex aggregates and guides the user through interactive assembly procedures.

Max in action:
  • See a sample interaction in assembly assistance (mpeg; 11 MB)
  • Demonstrating gesture recognition and synthesis capabilities in real-time gesture imitation (mpeg)

Max at the open house in our lab!

Nonverbal behaviors
Real-time gesture synthesis:
Max is able to create and execute gesture animations from MURML descriptions of their essential spatiotemporal features, i.e., of their meaningful "stroke" phases (see the example below). To this end, an anthropomorphic kinematic skeleton was defined for the agent, comprising 103 DOF in 57 joints, all subject to realistic joint limits. This articulated body is driven in real-time by a hierarchical gesture generation model that emphasizes the accurate and reliable reproduction of the prescribed features. It includes two main stages:
  1. High-level gesture planning:
    During gesture planning, the expressive phase of a gesture is defined by setting up a fully qualified set of movement constraints. This stage includes optionally selecting a gesture from a lexicon of abstract templates (formulated in MURML), allocating body parts, expanding two-handed symmetrical gestures, resolving deictic references, and defining the timing of the stroke phase.

  2. Motor planning and execution:
    During lower-level motor planning, a solution is sought for controlling movements of the agent's upper limbs that satisfy the given constraints. A kinematic model of human hand-arm movement is employed that is based on findings from human movement science and neurophysiology. Following a hierarchical organization of motor control, several motor programs for the execution of different submovements are instantiated and arranged during planning. At execution time, the motor programs activate and complete themselves and transfer activation to one another. That way, they coordinate the application of suitable motion generation techniques to control realistic movements of the hand, the wrist, and the arm.
The generality and flexibility of this approach allow, first, synthesizing a great variety of gestures and, second, integrating the gestural movement precisely with external temporal constraints, e.g., synchrony with pitch accents in simultaneous speech. (A simplified sketch of the two-stage pipeline follows below.)
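
To make the two stages more concrete, the following Python sketch (not taken from the ACE code base) mimics the interplay of planning and motor control under simplified assumptions: a hypothetical planner turns the stroke constraints into a chain of motor programs for preparation, stroke, and retraction of one arm, and during a simulated update loop each program activates and completes itself and hands activation over to its successor. All class names, method names, and timing values are invented for this illustration.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MovementConstraint:
    """One fully qualified constraint of the stroke phase (cf. MURML)."""
    slot: str      # e.g. "HandShape", "PalmOrientation", "HandLocation"
    value: str     # symbolic target value
    start: float   # stroke start time (s)
    end: float     # stroke end time (s)

@dataclass
class MotorProgram:
    """Controls one submovement over a time interval; activates and
    completes itself and hands activation over to its successor."""
    name: str
    start: float
    end: float
    successor: Optional["MotorProgram"] = None
    active: bool = False

    def update(self, t: float) -> None:
        if not self.active and self.start <= t < self.end:
            self.active = True
            print(f"{t:4.1f}s  {self.name} activated")
        elif self.active and t >= self.end:
            self.active = False
            print(f"{t:4.1f}s  {self.name} completed")
            if self.successor is not None:
                # Transfer activation: the successor may start immediately.
                self.successor.start = min(self.successor.start, t)

def plan_gesture(constraints: List[MovementConstraint]) -> List[MotorProgram]:
    """High-level planning (sketch): derive arm motor programs for
    preparation, stroke, and retraction from the stroke constraints
    (hand and wrist programs are omitted for brevity)."""
    stroke_start = min(c.start for c in constraints)
    stroke_end = max(c.end for c in constraints)
    prep = MotorProgram("arm preparation", stroke_start - 0.4, stroke_start)
    stroke = MotorProgram("arm stroke", stroke_start, stroke_end)
    retract = MotorProgram("arm retraction", stroke_end, stroke_end + 0.5)
    prep.successor, stroke.successor = stroke, retract
    return [prep, stroke, retract]

if __name__ == "__main__":
    stroke_constraints = [
        MovementConstraint("HandShape", "BSflat", 0.8, 1.4),
        MovementConstraint("PalmOrientation", "DirL", 0.8, 1.4),
    ]
    programs = plan_gesture(stroke_constraints)
    for step in range(21):                 # simulated 2 s real-time loop
        t = step * 0.1
        for p in programs:
            p.update(t)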

MURML specification of a two-handed iconic gesture:
<gesture id="gesture_0">
  <constraints>
    <symmetrical dominant="right_arm" symmetry="SymMS">
      <parallel>
        <static slot="PalmOrientation" value="DirL"/>
        <static slot="ExtFingerOrientation" value="DirA"/>
        <static slot="HandShape" value="BSflat"/>
        <static slot="HandLocation"
                value="LocUpperChest LocCenterRight LocNorm"/>
      </parallel>
    </symmetrical>
  </constraints>
</gesture>
Generated gesture (mpeg)

Further examples of dynamic, single-handed gestures (mpg):
Modifying the manner of movement:
Accentuation of a gesture is increased by superimposing additional motor programs that create beat-like movements in single joints, and by modulating the dynamic properties of movement trajectories (deferring the velocity peak). In the following examples (mpeg), a synchronized beat-like movement of the elbow joint is superimposed on the pointing gesture and the iconic gesture:
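
The sketch below (plain Python, not ACE code) illustrates both mechanisms in strongly simplified form: a bell-shaped velocity profile whose peak can be deferred along the stroke, and a beat-like offset in a single joint that is superimposed on the ongoing movement. The particular beta-like profile, the parameter names, and all numbers are assumptions made for this example only.

import math

def stroke_velocity(t: float, duration: float, peak: float = 0.5) -> float:
    """Bell-shaped velocity profile over [0, duration] whose maximum lies at
    the relative position `peak` (0.5 = symmetric; values > 0.5 defer the
    velocity peak).  The beta-like curve s**a * (1-s)**b peaks at a/(a+b);
    it is an illustrative stand-in, not the profile actually used in ACE."""
    s = min(max(t / duration, 0.0), 1.0)   # normalized time in [0, 1]
    sharpness = 4.0
    a, b = sharpness * peak, sharpness * (1.0 - peak)
    return s ** a * (1.0 - s) ** b

def elbow_beat(t: float, onset: float, period: float = 0.25,
               amplitude: float = 0.15) -> float:
    """Beat-like flexion/extension in a single joint (here: the elbow),
    superimposed on the ongoing stroke from `onset` onwards."""
    if t < onset or t > onset + period:
        return 0.0
    return amplitude * math.sin(math.pi * (t - onset) / period)

if __name__ == "__main__":
    duration = 1.0
    for i in range(11):
        t = i * duration / 10
        v = stroke_velocity(t, duration, peak=0.7)   # deferred velocity peak
        b = elbow_beat(t, onset=0.6)                 # accentuating elbow beat
        print(f"t={t:.1f}s  stroke velocity={v:.3f}  elbow offset={b:+.3f}")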
Concatenation of gestures:
Gestures can be concatenated by simply specifying the form and timing of their stroke phases. Fluent, appropriate transitions are created automatically and in real-time by the generation model. In the following example, the stroke phases of experimentally observed gestures have been transcribed along with their start and end times. The agent mimics the gestures and concatenates successive strokes in accordance with the specified timings.
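
As a rough illustration of this behavior, the following Python sketch decides, for a list of transcribed stroke phases, whether there is enough time for a full retraction between two strokes or whether a direct, fluent transition has to be generated. The timing constants and the textual output are invented; the actual generation model derives such decisions from the concrete movements at run-time.

from typing import List, Tuple

# Illustrative timing parameters (seconds), assumed for this sketch only.
RETRACTION_TIME = 0.5
PREPARATION_TIME = 0.4

def plan_transitions(strokes: List[Tuple[float, float]]) -> List[str]:
    """Given transcribed stroke phases as (start, end) times, decide for each
    pair of successive strokes whether there is time for a full retraction to
    a rest pose or whether a direct, fluent transition has to be created."""
    plan = []
    for (_, prev_end), (next_start, _) in zip(strokes, strokes[1:]):
        gap = next_start - prev_end
        if gap >= RETRACTION_TIME + PREPARATION_TIME:
            plan.append(f"retract after {prev_end:.1f}s, start preparation at "
                        f"{next_start - PREPARATION_TIME:.1f}s")
        else:
            plan.append(f"direct transition from stroke ending at {prev_end:.1f}s "
                        f"into stroke starting at {next_start:.1f}s")
    return plan

if __name__ == "__main__":
    # Stroke timings as they might be transcribed from observed gestures.
    observed = [(0.8, 1.4), (1.7, 2.3), (3.6, 4.1)]
    for step in plan_transitions(observed):
        print(step)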
Facial expression:
The agent's face model comprises 21 different muscles that systematically deform the facial geometry. Facial actions like eye blinks, animated speech, and emotional expressions employ muscle contractions in a coordinated way.
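
A minimal sketch of how such coordinated facial actions could be combined is given below (Python, not ACE code): each action is a set of muscle contractions, and concurrent actions are blended into one contraction value per muscle. Additive blending and the muscle indices are assumptions made for this illustration.

from typing import Dict

NUM_MUSCLES = 21  # as in the agent's face model

def combine(*actions: Dict[int, float]) -> Dict[int, float]:
    """Blend the contraction patterns of several concurrent facial actions
    (e.g. eye blink + viseme + smile) into one contraction value per muscle,
    clamped to [0, 1].  Additive blending is an assumption of this sketch."""
    result = {m: 0.0 for m in range(NUM_MUSCLES)}
    for action in actions:
        for muscle, contraction in action.items():
            result[muscle] = min(1.0, result[muscle] + contraction)
    return result

if __name__ == "__main__":
    # Hypothetical muscle indices; the real model defines 21 specific muscles.
    blink  = {3: 1.0, 4: 1.0}           # e.g. eyelid muscles, both eyes
    viseme = {12: 0.6, 13: 0.4}         # jaw/lip muscles for an open vowel
    smile  = {8: 0.5, 9: 0.5, 12: 0.2}  # e.g. cheek muscles, both sides
    frame = combine(blink, viseme, smile)
    print({m: round(c, 2) for m, c in frame.items() if c > 0.0})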

Verbal behaviors
Speech synthesis:
Our Text-to-Speech system builds on and extends the capabilities of txt2pho and MBROLA: it controls prosodic parameters like speech rate and intonation, facilitates contrastive stress, and offers mechanisms to synchronize pitch peaks in the speech output with external events.
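
As an illustration of the last point, the sketch below generates MBROLA .pho input (one phoneme per line with its duration in milliseconds and optional position/pitch pairs) and places a pitch peak on the phoneme that contains a given target time, e.g. the onset of a gesture stroke. The phoneme sequence, durations, and pitch values are invented for this example; only the .pho line format itself is MBROLA's.

from typing import List, Tuple

Phoneme = Tuple[str, int]  # (SAMPA symbol, duration in ms)

def to_pho(phonemes: List[Phoneme], peak_time_ms: int,
           peak_hz: int = 140, base_hz: int = 100) -> str:
    """Render MBROLA .pho lines ("phoneme duration [pos% pitchHz] ...") and
    place a pitch peak on the phoneme that contains `peak_time_ms`, so that
    the accent can be synchronized with an external event such as the onset
    of a gesture stroke."""
    lines, elapsed = [], 0
    for symbol, duration in phonemes:
        if elapsed <= peak_time_ms < elapsed + duration:
            pos = round(100 * (peak_time_ms - elapsed) / duration)
            lines.append(f"{symbol} {duration} 0 {base_hz} {pos} {peak_hz} 100 {base_hz}")
        else:
            lines.append(f"{symbol} {duration} 50 {base_hz}")
        elapsed += duration
    return "\n".join(lines)

if __name__ == "__main__":
    # Invented SAMPA transcription and durations for the German word "Leiste".
    word = [("l", 60), ("aI", 140), ("s", 90), ("t", 60), ("@", 70)]
    print(to_pho(word, peak_time_ms=120))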


Examples (German; wav-files):

Speech animation:
Lip-sync speech animations are created from the phonetic representation provided by the TTS system. To this end, realistic face postures during the articulation of different phonemes (visemes) have been defined in terms of muscle contractions and are interpolated in a timed fashion.
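
The following Python sketch illustrates such timed interpolation: viseme key poses are defined as muscle-contraction vectors and blended linearly along a timeline derived from the phonetic representation. The viseme names, muscle indices, and contraction values are invented for this example.

from typing import Dict, List, Tuple

# Hypothetical viseme key poses: contraction values for a few of the 21
# facial muscles (muscle indices and values are invented for this sketch).
VISEMES: Dict[str, Dict[int, float]] = {
    "sil": {},                    # neutral face
    "a":   {12: 0.8, 13: 0.3},    # open jaw
    "o":   {12: 0.5, 14: 0.7},    # rounded lips
    "m":   {13: 0.9},             # closed lips
}

def lerp_pose(p: Dict[int, float], q: Dict[int, float], u: float) -> Dict[int, float]:
    """Linear interpolation between two muscle-contraction poses."""
    return {m: (1 - u) * p.get(m, 0.0) + u * q.get(m, 0.0) for m in set(p) | set(q)}

def pose_at(timeline: List[Tuple[float, str]], t: float) -> Dict[int, float]:
    """Timed interpolation through (time, viseme) keys, as they could be
    derived from the TTS system's phonetic representation."""
    if t <= timeline[0][0]:
        return VISEMES[timeline[0][1]]
    for (t0, v0), (t1, v1) in zip(timeline, timeline[1:]):
        if t0 <= t <= t1:
            return lerp_pose(VISEMES[v0], VISEMES[v1], (t - t0) / (t1 - t0))
    return VISEMES[timeline[-1][1]]

if __name__ == "__main__":
    keys = [(0.00, "sil"), (0.10, "m"), (0.25, "a"), (0.40, "o"), (0.55, "sil")]
    for t in (0.05, 0.18, 0.30, 0.50):
        print(t, {m: round(c, 2) for m, c in pose_at(keys, t).items()})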
Multimodal behaviors
Multimodal utterances:
Speech and gesture can simply be combined in a MURML definition of the desired outer form of a multimodal utterance. Taking such specifications as input, Max generates all verbal and nonverbal behaviors on the fly and in synchrony. MURML descriptions contain the verbal part, augmented by nonverbal behaviors along with their possible affiliation to certain (co-expressive) linguistic elements. For each coverbal gesture, a narrowly focused word within the affiliate, the pitch accent employed for prosodic focus, and the temporal offset between the affiliate/focus and the gesture stroke can optionally be specified in addition. The production model is based on the assumption that multimodal utterances are ideally produced in successive chunks, each pairing an intonation phrase with a gesture phrase. Cross-modal coordination between gesture and speech then takes place within each chunk (by adapting the gesture's timing) as well as across adjacent chunks (by adjusting the onsets of both the gestural and the verbal phrase).
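
A minimal sketch of the within-chunk coordination, assuming hypothetical timing values, is given below (Python, not ACE code): the gesture's stroke onset is aligned with the affiliate's pitch accent plus an optionally specified offset, and the preparation onset follows from the time the hand needs to get into stroke position.

from dataclasses import dataclass

@dataclass
class Chunk:
    """One chunk: an intonation phrase paired with a gesture phrase."""
    accent_time: float       # pitch accent of the focused word, relative to speech onset (s)
    stroke_duration: float   # duration of the gesture's stroke phase (s)
    preparation_time: float  # time needed to bring the hand into stroke position (s)

def schedule_within_chunk(chunk: Chunk, offset: float = 0.0) -> dict:
    """Within-chunk coordination (sketch): adapt the gesture's timing so that
    the stroke onset coincides with the affiliate's pitch accent plus an
    optionally specified offset.  All names and values are illustrative."""
    stroke_onset = chunk.accent_time + offset
    return {
        "preparation_onset": stroke_onset - chunk.preparation_time,
        "stroke_onset": stroke_onset,
        "stroke_end": stroke_onset + chunk.stroke_duration,
    }

if __name__ == "__main__":
    chunk = Chunk(accent_time=0.9, stroke_duration=0.5, preparation_time=0.4)
    print(schedule_within_chunk(chunk))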

MURML utterance specification:
<utterance>
  <specification>
     And now take <time id="t1"/> this bar <time id="t2" chunkborder="true"/>
     and make it <time id="t3"/> this big. <time id="t4"/>
  </specification>

  <behaviorspec id="gesture_1">
    <gesture>
      <affiliate onset="t1" end="t2" focus="this"/>
      <function name="refer_to_loc">
        <param name="refloc" value="$Loc-Bar_1"/>
      </function>
    </gesture>
  </behaviorspec>

  <behaviorspec id="gesture_2">
    <gesture>
      <affiliate onset="t3" end="t4"/>
      <constraints>
        <symmetrical dominant="right_arm" symmetry="SymMS">
          <parallel>
            <static slot="HandShape" value="BSflat (FBround all o) (ThCpart o)"/>
            <static slot="ExtFingerOrientation" value="DirA"/>
            <static slot="PalmOrientation" value="DirL"/>
            <static slot="HandLocation" value="LocLowerChest LocCenterRight LocNorm"/>
          </parallel>
        </symmetrical>
      </constraints>
    </gesture>
  </behaviorspec>
</utterance>
Generated utterance (mpeg; German speech)


Co-articulation and transition effects emerge depending on the placement of the gesture's affiliate within the intonation phrase as well as the current movement conditions of the gesture (required preparation time). As a result, the onsets of both gesture and phonation covary in time with the position of the corresponding element in the other modality (affiliate or gesture stroke), as observed in humans. For example, when the verbal phrase is shortened, it may not be possible to fully retract the first gesture, and a fluent transition is created automatically. Likewise, the duration of the silent pause between successive intonation phrases is extended in case a more time-consuming gesture preparation becomes necessary.
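
The sketch below illustrates this cross-chunk case under the same simplified assumptions: if the silent pause leaves enough time, the first gesture is fully retracted; if not, a direct transition is used; and if even the preparation of the next gesture does not fit, the onset of the next intonation phrase is delayed accordingly. Parameter names and values are illustrative only.

def coordinate_chunk_boundary(prev_stroke_end: float, retraction_time: float,
                              preparation_time: float,
                              next_stroke_onset: float) -> dict:
    """Cross-chunk coordination (sketch): the next gesture's preparation must
    be finished when its stroke (and thus the affiliate in the next intonation
    phrase) is due.  If the pause is too short for a full retraction, a direct
    transition is used; if even the preparation does not fit, the onset of the
    next intonation phrase is delayed.  Names and values are illustrative."""
    available = next_stroke_onset - prev_stroke_end
    if available >= retraction_time + preparation_time:
        return {"transition": "full retraction", "onset_delay": 0.0}
    if available >= preparation_time:
        return {"transition": "direct, no full retraction", "onset_delay": 0.0}
    # Preparation does not fit: extend the silent pause between the chunks.
    return {"transition": "direct, no full retraction",
            "onset_delay": preparation_time - available}

if __name__ == "__main__":
    print(coordinate_chunk_boundary(prev_stroke_end=2.0, retraction_time=0.5,
                                    preparation_time=0.6, next_stroke_onset=2.3))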

Reproducing experimental data: see here.


Selected Publications

Contact
Dr. Stefan Kopp

For general information about the Max project, you can also contact Prof. Ipke Wachsmuth.

Stefan Kopp, 2008-11-20