Research Overview

Our goal is to create 3D virtual humans that can see, move, and behave just like real people by capturing 3D human behavior at scale from video and using this to train 3D human foundation models.

Humans have evolved to interact with humans, not computers. Can we make our interactions with computers more human-like? To answer this question, we capture human behavior at scale and use this data to train digital humans that see us, understand us, and behave like us. We call this the Human Foundation Agent. We believe that, in the near future, (1) computers will see us, (2) AI will be embodied in the form of digital humans, and (3) this will fundamentally change our relationship with machines. The research of the Perceiving Systems department is structured to achieve these goals.

The surprise of the last three years is the outsized impact that language has had on solving long-standing problems in vision. At the beginning of the reporting period it was already clear that nearly all our work would combine vision and language in one way or another. Today, large models are central to achieving our goals. For example, we exploit the physical world knowledge implicit in LLMs and video diffusion models for problems as diverse as inverse graphics programming [], image relighting, human-object interaction reasoning, and bounded video generation []. This is a trend that will only accelerate.

Behavior capture

In training AI systems, scale matters – in particular the scale (and quality) of data. Human behavior is typically captured in a motion capture (mocap) studio. In Perceiving Systems, we have built, and continue to expand, the world's largest mocap dataset (AMASS), which enabled the field of generative human motion to emerge. AMASS, however, is not enough. Studio data is neither realistic nor scalable. Consequently, our goal is to capture human behavior from video at a massive scale along with rich contextual information about the scene and people’s interactions in the scene. To that end, we are pushing the state of the art in 3D human capture from video. Key innovations include:

Humans in context:

Most methods that regress human pose and shape (HPS) from an image do so from a tightly cropped image region around the person. As a result, the network cannot exploit scene context, which makes it hard to place people in the 3D scene. With BEV [] and TRACE [], we introduced methods that exploit the full image and integrate detection with pose estimation and tracking, resulting in improved 3D reasoning.

Humans in world coordinates:

The key problem of prior methods is that they estimate humans in camera coordinates rather than world coordinates. To estimate humans in world coordinates, we need information about the camera, such as its focal length. In scenes containing human motion, however, traditional camera calibration methods can fail. We observe that humans themselves can serve as a form of “calibration object”. WHAM [] exploits human motion over time to estimate the camera’s angular velocity and 3D human pose in a global coordinate system with minimal foot sliding. WHAM is the first video-based method to outperform all single-frame and video methods. With CameraHMR [], we train a method to regress the camera field of view from a single image of a person and integrate this into our training and inference methods, resulting in state-of-the-art accuracy for single-image HPS.
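To make the role of the camera concrete, the following minimal sketch shows two standard pinhole-camera relations: converting a field of view into a focal length in pixels, and mapping points from camera coordinates to world coordinates given a camera pose. It is not the WHAM or CameraHMR implementation; the function names, the toy rotation, and the toy translation are assumptions chosen purely for illustration.

```python
import numpy as np

def focal_length_from_fov(fov_deg: float, image_size_px: float) -> float:
    """Pinhole focal length (pixels) for a field of view spanning image_size_px."""
    return 0.5 * image_size_px / np.tan(np.radians(fov_deg) / 2.0)

def camera_to_world(points_cam: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Map Nx3 camera-frame points to the world frame: X_w = R_wc @ X_c + t_wc."""
    return points_cam @ R_wc.T + t_wc

# Hypothetical example: a 90-degree horizontal field of view on a 1920-pixel-wide image.
focal_px = focal_length_from_fov(90.0, 1920.0)

# Hypothetical camera pose: rotated 90 degrees about the z-axis, shifted 2 units along z.
R_wc = np.array([[0.0, -1.0, 0.0],
                 [1.0,  0.0, 0.0],
                 [0.0,  0.0, 1.0]])
t_wc = np.array([0.0, 0.0, 2.0])

# One 3D point expressed in camera coordinates (e.g., a body joint).
joints_cam = np.array([[1.0, 0.0, 0.0]])
joints_world = camera_to_world(joints_cam, R_wc, t_wc)
```

The sketch illustrates why the camera matters: the same camera-frame estimate lands at different world positions depending on the focal length and camera motion that are assumed.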

Faces:

The face is critical for communication and our methods capture 3D emotional content (EMOCA []), metrically accurate faces (MICA []), facial details (SMIRK []), and perform precise 3D tracking from video (SPARK []).

Contact:

Human-object, human-scene, and human-human contact are foundational for understanding and modeling human behavior. We introduced datasets (DAMON [], RICH [], INTERCAP [], HOT [], ARCTIC []) that enable the study of contact in 3D and 2D as well as methods that reason about contact from images (DECO []) and exploit contact in inferring 3D humans and scenes (MOVER []).

Synthetic data:

To enable accurate behavior capture, we created BEDLAM [], the first synthetic training dataset that enables state-of-the-art HPS results without any real training data. Groups worldwide are using BEDLAM and have verified that it is the single most important dataset in the field for achieving accurate results. We are hard at work on a significantly expanded version.

Behavior generation

Given captured human behavior, our goal is to model it such that we can generate it. For example, TEMOS [] is a text-conditioned generative model that leverages a variational autoencoder and transformer embeddings of text and motion. TEMOS is a foundation for TMR [], which embeds text and motion in latent spaces enabling text-based queries of large mocap libraries without manual labelling. SAMP generates human movement conditioned on a scene while MIME [] does the opposite – it generates a full 3D scene from human movement. GOAL [] and GraspXL [] generate hand-object grasping, while EMOTE [] and AMUSE [] generate full body motion from audio.
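The idea of querying a motion library with text can be sketched as nearest-neighbor search in a shared embedding space. The example below is a simplified illustration and not the TMR code: it assumes that a text encoder and a motion encoder have already produced vectors of the same dimension, and it uses small hand-written arrays as stand-ins for those embeddings. The function names are hypothetical.

```python
import numpy as np

def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def retrieve_motions(text_embedding: np.ndarray, motion_embeddings: np.ndarray, top_k: int = 3):
    """Return the indices and cosine similarities of the top_k closest motions."""
    query = l2_normalize(text_embedding)
    library = l2_normalize(motion_embeddings)
    scores = library @ query              # cosine similarity per motion
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

# Toy stand-ins for encoder outputs (assumed, not real embeddings).
motion_embeddings = np.array([[1.0, 0.0],
                              [0.0, 1.0],
                              [0.7, 0.7]])
text_embedding = np.array([0.9, 0.1])

indices, scores = retrieve_motions(text_embedding, motion_embeddings, top_k=2)
```

Because similarity is computed directly between embeddings, no manual labels on the motion clips are needed at query time.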

Behavior understanding

Large multi-modal vision-language models (VLMs) understand a lot about humans. Can we leverage this for 3D behavior capture, and can we train these models to understand 3D humans? ChatPose [] is the first method that fine-tunes a VLM to understand 3D human pose. When asked about human pose, the method is trained to output a special pose token; the embedding of this token is then decoded by a simple projection layer to produce continuous SMPL pose and shape parameters. We think this is the future. ChatPose is able to reason beyond the image about what 3D human pose means and it can answer questions about what poses people might adopt in the future. It combines, for the first time, the broad general knowledge of large models with the 3D world of humans.
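The decoding step described above, in which a token embedding is mapped to body parameters by a projection layer, can be sketched as a single linear map. This is an illustrative sketch only and not the ChatPose implementation: the hidden size, the random weights, and the split into 72 pose values (24 joints with 3 axis-angle values each) and 10 shape coefficients are assumptions based on common SMPL conventions.

```python
import numpy as np

HIDDEN_DIM = 64   # hypothetical embedding size; real VLM hidden sizes are much larger
NUM_JOINTS = 24   # SMPL convention: global orientation plus 23 body joints
NUM_POSE = NUM_JOINTS * 3   # axis-angle rotation per joint
NUM_SHAPE = 10    # commonly used number of SMPL shape coefficients (assumption)

rng = np.random.default_rng(0)
# Randomly initialized weights stand in for a trained projection layer.
W = rng.normal(scale=0.01, size=(HIDDEN_DIM, NUM_POSE + NUM_SHAPE))
b = np.zeros(NUM_POSE + NUM_SHAPE)

def decode_pose_token(token_embedding: np.ndarray):
    """Project one pose-token embedding to (pose, shape) parameter arrays."""
    params = token_embedding @ W + b
    pose = params[:NUM_POSE].reshape(NUM_JOINTS, 3)
    shape = params[NUM_POSE:]
    return pose, shape

# A random vector stands in for the embedding of the special pose token.
pose, shape = decode_pose_token(rng.normal(size=HIDDEN_DIM))
```

The appeal of this design is that the language model itself is unchanged at the output: it emits an ordinary token, and only a small extra layer turns that token's embedding into continuous parameters.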

Broader Impact

While our focus is on capturing, generating, and understanding 3D humans, our foundational work contributes to other disciplines. For example, we continue to push the state of the art in animal shape and motion capture (PFERD [], VAREN [], BARC [], BITE []). We collaborate with doctors, biomechanics researchers, and psychologists so that our work has an impact outside vision and graphics; examples include predicting the inside of the body from the outside (SKEL [], OSSO [], HIT []), treating eating disorders [][][][], and designing custom surgical plates [] to heal broken limbs. And, while we focus on behavior, the modeling of human appearance is also important. During the reporting period we have developed neural models of clothing (HOOD [], ContourCraft [], GaussianGarments []), hair (HAAR [], MonoHair [], GaussianHaircut []), and overall appearance (TeCH [], TECA [], ECON []).

To have a wide impact on society, we make software and data available for research purposes. Our Software and Data Teams help us acquire the best possible data and share code widely; they are critical to our success. We also actively patent and license our technology. For the first time during this reporting period, several papers (TokenHMR [], ChatPose [], ChatHuman []) were collaborations between MPI and Meshcapade (under the terms of a cooperation agreement); a joint patent was filed for ChatHuman. Cooperations like this are increasingly important because they provide the scale necessary to be competitive today.

Research Fields & Projects

Inferring and exploiting contact

Humans use touch to interact with each other and the world. While we use our hands and feet to support grasping and locomotion, we also leverage our entire body surface in our daily interactions with objects. We greet, comfort,... Read more

Datasets for understanding humans and animals

Perceiving Systems has a long history of creating and supporting datasets that advance our research and that of the broader research community. In computer vision today, high-quality data at scale is often the key to success. Over... Read more

Human health and the 3D body

Our 3D body models and tools for estimating them from data are the world's most accurate and are widely used for evaluating body composition and associated health risk factors. Can we leverage body shape to further improve... Read more

The AI animator

Generative AI is evolving rapidly and many argue that GenAI will fully replace traditional graphics. There is nothing really wrong, however, with traditional graphics except that it requires extensive experience and time to create... Read more

Language, Vision, and World Models

Can we leverage large language models (LLMs) to teach computers to see? How much do multi-modal vision-language models (VLMs) really understand about the 3D world and how it works? We were using language to help solve vision problems... Read more

Human pose, shape, and motion capture

The Perceiving Systems department has pioneered the field of 3D human pose and shape (HPS) estimation and has driven global development of the field through the release of models, code, data, and benchmarks. Our SMPL body model and... Read more

Generating human motion

Animation of 3D humans today is mostly achieved through motion capture (mocap) or hand animation. We envision a very different world in which animation is ubiquitous, on demand, and in context. We are developing the datasets and... Read more

Robot Perception Group

Our focus is on vision-based perception in multi-robot systems. Our goal is to understand how teams of robots, especially flying robots, can act (navigate, cooperate and communicate) in... Read more

Data Team

Computer vision research today is driven by data. Capturing and processing data is both specialized and time-consuming. Our unique multi-disciplinary Data Team supports researchers in collecting and processing data ranging from 4D body... Read more

Completed Projects

These projects represent work in Perceiving Systems between Jan 2011 and the present that has been superseded by new work or that we are no longer pursuing. Read more

Clothing Capture and Modeling

While body models like SMPL lack clothing, people in images and videos are typically clothed. Modeling clothing on the body is hard, because of the variety of garments, varied topology of clothing, and the complex physical properties of... Read more

Modeling Human Movement

A key goal of Perceiving Systems is to model human behavior. One way of testing our models is by generating movement. One class of motion generation methods takes a short segment of human motion and predicts future motions; this is... Read more

Action and Behavior

Human behavior can be described at multiple levels. At the lowest level, we observe the 3D pose of the body over time. Poses can be organized into primitives that capture coordinated activity of different body parts. These further form... Read more

Differentiable Rendering

A long-standing and conceptually elegant view of computer vision is to use a generative model of the physical image formation process and posterior inference to infer or explain the image observations. A key problem in this ... Read more

Image Segmentation and Semantics

Semantic segmentation is a fundamental problem of computer vision that requires answering what is where in a given image, video or 3D point cloud. The best performing recent techniques require human annotations to obtain ground truth... Read more

Groups and Crowds

People are often a central element of visual scenes. It has been a long-standing goal in computer vision to develop computational models that enable machines to detect crowds of people, analyze their motion and poses, infer their actions... Read more

Learning from Synthetic Data

Deep learning has brought rapid progress for many computer vision problems but current methods require large training datasets with annotated ground truth. Human annotators tend to be reasonably efficient for tasks like sparse 2D... Read more

Learning Deep Representations of 3D

Much of our work focuses on 3D models of objects and scenes. We would like to take advantage of current deep learning approaches in representing and reasoning about 3D. Unfortunately, the standard 2D convolutional models do not readily... Read more