Gudrun Socher

Research


Research Interests



The research described on this page was conducted for my PhD within the joint research project Situated Artificial Communicators (SFB 360) at the University of Bielefeld, Germany.
I worked in the SFB 360 subproject Interaction of Speech and Image Understanding (Interaktion sprachlicher und visueller Informationsverarbeitung, B1) and was also a visiting graduate student in the Computer Vision Research Group at the California Institute of Technology in Pasadena, USA.

Integration of Speech and Image Understanding

My goal was to build a high-level image understanding component for an integrated speech and image understanding system. The system is designed so that a human can interact with it as naturally as possible, giving spoken instructions as if instructing another human. The scenario is the assembly of toys from toy blocks, screws, etc. (somewhat like Lego). The human gives instructions to the system, and the system should carry them out. A typical instruction is "take the green block, and put it on the red one next to the blue cube". To understand the human better, the system is equipped with a stereo camera that observes the scene (the construction platform).
Publications:

  • Bayesian Reasoning on Qualitative Descriptions from Images and Speech.
    G. Socher, G. Sagerer, and P. Perona. In H. Buxton and A. Mukerjee (Eds.), ICCV'98 Workshop on Conceptual Description of Images, Bombay, India, to appear 1998.

  • Talking about 3D Scenes: Integration of Image and Speech Understanding in a Hybrid Distributed System.
    G. Socher, G. Sagerer, F. Kummert, and T. Fuhr. In Proc. International Conference on Image Processing (ICIP-96), Lausanne, Sept. 16-19, 1996, pp. 18A2.

  • Generation of Language Models Using the Results of Image Analysis.
    U. Naeve, G. Socher, G.A. Fink, F. Kummert, and G. Sagerer. In Proc. of Eurospeech'95, 4th European Conference on Speech Communication and Technology, Madrid, Spain, 18-21 Sep, pp. 1739-1742, 1995.

3D Reconstruction and Camera Calibration

We developed a method for camera calibration and metric reconstruction of the three-dimensional structure of scenes containing several, possibly small and nearly planar, objects observed in one or more images. The projection of object models is formulated explicitly according to the pin-hole camera model so that the pose parameters of all objects, the relative poses, and the focal lengths of the cameras can be estimated. Pose estimation is accomplished by minimizing a multivariate non-linear cost function with the Levenberg-Marquardt method. Necessary prerequisites are simple geometric models describing each object as a set of vertices, edges, and ellipses, together with the correspondences between model and image features. Ellipses are projected in an elegant way using projective invariants.
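
As a rough illustration of this estimation step, the following Python sketch (not the original implementation) recovers the pose of a single object together with the focal length by minimizing the pin-hole reprojection error with a Levenberg-Marquardt solver; scipy's generic solver stands in for the original code, the cube data are invented, and edge and ellipse features are omitted for brevity.

    import numpy as np
    from scipy.optimize import least_squares
    from scipy.spatial.transform import Rotation

    def project(params, model_pts):
        # Pin-hole projection under a pose (rotation vector, translation)
        # and focal length f; all seven parameters are estimated jointly.
        rvec, t, f = params[:3], params[3:6], params[6]
        R = Rotation.from_rotvec(rvec).as_matrix()
        cam = model_pts @ R.T + t            # model -> camera coordinates
        return f * cam[:, :2] / cam[:, 2:3]  # perspective division

    def residuals(params, model_pts, image_pts):
        # Reprojection error: the multivariate non-linear cost to minimize.
        return (project(params, model_pts) - image_pts).ravel()

    # Invented data: the eight vertices of a cube and noisy observations.
    model_pts = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
    true_params = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 6.0, 800.0])
    image_pts = project(true_params, model_pts) + np.random.normal(0.0, 0.5, (8, 2))

    guess = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 700.0])
    fit = least_squares(residuals, guess, args=(model_pts, image_pts), method='lm')
    print("estimated pose and focal length:", fit.x)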

Publications:

  • 3-D Reconstruction and Camera Calibration from Images with known Objects.
    G. Socher, T. Merz, and S. Posch. In D. Pycock (Ed.), Proc. British Machine Vision Conference (BMVC-95), Birmingham, UK, Sept. 11-14, Volume I, pp. 167-176, 1995.

  • Ellipsenbasierte 3-D Rekonstruktion (Ellipse-based 3-D Reconstruction).
    G. Socher, T. Merz, and S. Posch. In G. Sagerer, S. Posch, & F. Kummert (Eds.), 17. DAGM-Symposium Mustererkennung, Bielefeld, Sept. 13-15, pp. 252-259. Springer-Verlag, Berlin, Heidelberg, New York/NY, 1995.


Image Understanding

Image understanding denotes the ability to extract specific, non-numerical information from images; it is a key problem in computer vision and artificial intelligence.

High-level image understanding is accomplished in our system by reconstructing the 3D scene from uncalibrated stereo images and by computing qualitative object properties as well as spatial relations. Non-numerical information is thus derived at several levels of abstraction.

The object identification module reasons on the derived qualitative information using Bayesian networks.

Publications:

  • Talking about 3D Scenes: Integration of Image and Speech Understanding in a Hybrid Distributed System.
    G. Socher, G. Sagerer, F. Kummert, and T. Fuhr. In Proc. International Conference on Image Processing (ICIP-96), Lausanne, Sept. 16-19, 1996, pp. 18A2.

  • Semantic Models and Object Recognition in Computer Vision.
    G. Sagerer, F. Kummert, and G. Socher. In K. Kraus & P. Waldhäusel (Eds.), International Archives of Photogrammetry and Remote Sensing, Volume XXXI, Part B3, Commission 3, Vienna, pp. 710-723, 1996.

Bayesian Networks

We use a qualitative representation of image understanding results that is suitable for reasoning with Bayesian networks. The representation is enhanced with probabilistic information to capture uncertainties and errors in the interpretation of noisy sensory data. An object is not assigned a single value for each of its properties (e.g. "the object's color is orange") but a vector of probabilities (degrees of membership) over all categories of a property space. For example, the color space is characterized by the categories red, yellow, orange, blue, green, purple, wooden, and white. The color of an object is then represented as, for example, color(rhomb-nut) = (0.4, 0.3, 0.8, 0.1, 0.09, 0.2, 0.15, 0.05). This expresses that the object rhomb-nut is most likely orange. However, orange is also somewhat red and somewhat dark yellow, so the degrees of membership for the categories red and yellow are higher than for the remaining categories.
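
As a small Python illustration (an invented helper, not project code), the membership vector above can be stored and queried like this:

    COLOR_CATEGORIES = ("red", "yellow", "orange", "blue",
                        "green", "purple", "wooden", "white")

    # Degrees of membership for the rhomb-nut, in category order.
    color = {"rhomb-nut": (0.4, 0.3, 0.8, 0.1, 0.09, 0.2, 0.15, 0.05)}

    def most_likely_color(memberships):
        # Return the category with the highest degree of membership.
        best = max(range(len(memberships)), key=lambda i: memberships[i])
        return COLOR_CATEGORIES[best], memberships[best]

    print(most_likely_color(color["rhomb-nut"]))   # -> ('orange', 0.8)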
This probabilistic information is fed into a Bayesian network in order to find the most plausible interpretation. We want to identify the object that is addressed in an instruction by the human, and therefore search for the object with the highest probability of being both named in the instruction and observed in the scene.

The identified object is modeled as depending on the instruction and the scene. An instruction consists of type, color, size, and shape specifications. The scene depends on the objects it contains; each object is described by its type (e.g. cube, bar) and its color.
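
The sketch below approximates this identification step with a naively factored score rather than the full Bayesian network of the publications: each scene object is scored by the product of its memberships for the attribute values mentioned in the instruction, and the scores are normalized into a distribution over the objects. All object names and membership values are invented for illustration.

    CATEGORIES = {
        "type":  ("cube", "bar", "rhomb-nut", "socket"),
        "color": ("red", "yellow", "orange", "blue",
                  "green", "purple", "wooden", "white"),
    }

    # Membership vectors per scene object, as delivered by image
    # understanding (values invented for this example).
    scene = {
        "object-1": {"type":  (0.05, 0.05, 0.8, 0.1),
                     "color": (0.4, 0.3, 0.8, 0.1, 0.09, 0.2, 0.15, 0.05)},
        "object-2": {"type":  (0.1, 0.1, 0.1, 0.7),
                     "color": (0.05, 0.1, 0.05, 0.1, 0.1, 0.05, 0.1, 0.8)},
    }

    def identify(instruction, scene):
        # Score each object by the product of its memberships for every
        # attribute value named in the instruction, then normalize.
        scores = {}
        for name, attributes in scene.items():
            score = 1.0
            for attr, value in instruction.items():
                score *= attributes[attr][CATEGORIES[attr].index(value)]
            scores[name] = score
        total = sum(scores.values())
        return {name: score / total for name, score in scores.items()}

    # "the orange rhomb-nut": object-1 comes out as the most plausible referent.
    print(identify({"type": "rhomb-nut", "color": "orange"}, scene))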

Example

Instruction: "I want the small round and white thing above the orange rhomb-nut"

Scene: (figure)

Search for the "small round and white thing": (figure)

Search for the "orange rhomb-nut": (figure)

Spatial Relations:

IO (intended object)   RO (reference object)    left   right  above  below  behind  in-front
Socket (205,287)       Rhomb-nut (199,211)      0.002  0.196  0.059  0.060  0.000   0.754
Socket (207,178)       Rhomb-nut (199,211)      0.164  0.103  0.506  0.002  0.181   0.082
Result: (figure)
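
Reading off the table: for the preposition above, the socket at (207,178) scores 0.506 with respect to the orange rhomb-nut, whereas the socket at (205,287) scores only 0.059 (it in fact lies below the rhomb-nut, 0.754). Combined with the attribute scores for the "small round and white thing", the socket at (207,178) is thus the most plausible referent.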

Publications:

  • Bayesian Reasoning on Qualitative Descriptions from Images and Speech.
    G. Socher, G. Sagerer, and P. Perona. In H. Buxton and A. Mukerjee (Eds.), ICCV'98 Workshop on Conceptual Description of Images, Bombay, India, to appear 1998.

Spatial Relations

We developed an approach for generating and understanding relative spatial positions in a natural three-dimensional scene in terms of six spatial prepositions: left, right, in-front, behind, above, and below. The three-dimensional structure of the scene is reconstructed from stereo images (see 3D Reconstruction and Camera Calibration above).

Our spatial model has two layers. First, a symbolic spatial description of the scene that is independent of any reference frame is computed. In the second layer, the meaning of each of the six prepositions is defined with respect to the current reference frame, based on the description from the first layer. The meaning definitions of the prepositions can be used in two ways: they allow the system to judge the degree of goodness of each of the six prepositions between two 3D objects on a graduated scale, and, given the 3D pose of one object, the admissible 2D image region of the other object can be inferred.
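
As a minimal Python sketch of the second layer (a strong simplification, not the published model): the graded applicability of each preposition is approximated here by the cosine between the normalized direction from the reference object to the intended object and an axis of an assumed camera-aligned reference frame, with negative values clipped to zero.

    import numpy as np

    # Axes of an assumed reference frame: x to the right, y upwards,
    # z pointing away from the viewer.
    PREPOSITION_AXES = {
        "left":     np.array([-1.0, 0.0, 0.0]),
        "right":    np.array([1.0, 0.0, 0.0]),
        "above":    np.array([0.0, 1.0, 0.0]),
        "below":    np.array([0.0, -1.0, 0.0]),
        "behind":   np.array([0.0, 0.0, 1.0]),
        "in-front": np.array([0.0, 0.0, -1.0]),
    }

    def applicability(io_center, ro_center):
        # Graded score of "IO is <preposition> RO" for all six prepositions.
        direction = np.asarray(io_center, float) - np.asarray(ro_center, float)
        direction /= np.linalg.norm(direction)
        return {prep: round(max(0.0, float(axis @ direction)), 3)
                for prep, axis in PREPOSITION_AXES.items()}

    # A socket centred above and slightly to the right of a rhomb-nut:
    print(applicability([0.2, 1.0, 0.0], [0.0, 0.0, 0.0]))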
The spatial model has been extensively tested in psycholinguistic experiments (Vorwerg et al., 1997).

Publications:

  • Projective relations for 3D space: Computational model, application, and psychological evaluation.
    C. Vorwerg, G. Socher, T. Fuhr, G. Sagerer, and G. Rickheit. In AAAI'97, Providence, Rhode Island, pp. 159-164, 1997.

  • A three-dimensional spatial model for the interpretation of image data.
    T. Fuhr, G. Socher, C. Scheering, and G. Sagerer. In IJCAI-95 Workshop on Representation and Processing of Spatial Expressions, Montreal, 1995.

Gudrun Socher - gudrun@vision.caltech.edu
Last modified: Wed Dec 10 18:16:38 PST 1997