Perceiving Systems

Research Overview

Behavior generation

Given captured human behavior, our goal is to model it well enough that we can generate it. For example, TEMOS is a text-conditioned generative model that leverages a variational autoencoder and transformer embeddings of text and motion. TEMOS is the foundation for TMR, which embeds text and motion in a common latent space, enabling text-based queries of large mocap libraries without manual labelling. SAMP generates human movement conditioned on a scene, while MIME does the opposite: it generates a full 3D scene from human movement. GOAL and GraspXL generate hand-object grasping, while EMOTE and AMUSE generate full-body motion from audio.
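As a rough illustration of the kind of text-based mocap query described above, the sketch below ranks motions by cosine similarity between a text embedding and precomputed motion embeddings. It is not TMR's code: the function names, the three-dimensional toy embeddings, and the choice of cosine similarity as the ranking score are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(text_embedding, motion_library, top_k=2):
    """Return the names of the top_k motions closest to the text embedding.

    motion_library maps a motion name to its embedding (hypothetical format).
    """
    scored = [
        (cosine_similarity(text_embedding, embedding), name)
        for name, embedding in motion_library.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy embeddings; a real system would obtain these from trained encoders.
motion_library = {
    "walk_forward": [0.9, 0.1, 0.0],
    "jump_in_place": [0.0, 1.0, 0.2],
    "sit_down": [0.1, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]  # stand-in for the embedding of a text query

print(retrieve(query, motion_library))
```

Because the motions are matched in embedding space, the library itself needs no text labels; only the query is text.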

Behavior understanding

Large multi-modal vision-language models (VLMs) understand a lot about humans. Can we leverage this for 3D behavior capture, and can we train these models to understand 3D humans? ChatPose is the first method that fine-tunes a VLM to understand 3D human pose. When asked about human pose, the model is trained to output a special pose token; the embedding of this token is then decoded by a simple projection layer to produce continuous SMPL pose and shape parameters. We think this is the future. ChatPose is able to reason beyond the image about what 3D human pose means, and it can answer questions about what poses people might adopt in the future. It combines, for the first time, the broad general knowledge of large models with the 3D world of humans.
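The decoding step described above can be sketched as a single linear projection from the pose-token embedding to a vector that is split into pose and shape parameters. This is a minimal sketch under stated assumptions, not ChatPose's implementation: the embedding size, the untrained random weights, the use of one linear layer, and the 72 + 10 output layout (24 joints with 3 axis-angle values each, plus 10 shape coefficients, as in the standard SMPL parameterization) are all chosen for illustration.

```python
import random

EMBED_DIM = 16   # hypothetical embedding size, kept tiny for illustration
NUM_POSE = 72    # assumed layout: 24 joints x 3 axis-angle values
NUM_SHAPE = 10   # assumed number of SMPL shape coefficients


def make_projection(in_dim, out_dim, seed=0):
    """Create untrained weights and a zero bias for a linear layer."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def decode_pose_token(embedding, weights, bias):
    """Project a token embedding and split the result into (pose, shape)."""
    out = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    return out[:NUM_POSE], out[NUM_POSE:]


weights, bias = make_projection(EMBED_DIM, NUM_POSE + NUM_SHAPE)
token_embedding = [0.1] * EMBED_DIM  # stand-in for the pose token's embedding
pose, shape = decode_pose_token(token_embedding, weights, bias)

print(len(pose), len(shape))  # prints: 72 10
```

The point of the design is that the language model only has to emit one discrete token, while the continuous body parameters come from the projection of that token's embedding.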

Broader Impact

While our focus is on capturing, generating, and understanding 3D humans, our foundational work contributes to other disciplines. For example, we continue to push the state of the art in animal shape and motion capture (PFERD, VAREN, BARC, BITE). We collaborate with doctors, biomechanics researchers, and psychologists so that our work has an impact outside vision and graphics: predicting the inside of the body from the outside (SKEL, OSSO, HIT), treating eating disorders, and designing custom surgical plates to heal broken limbs. And, while we focus on behavior, the modeling of human appearance is also important. During the reporting period we developed neural models of clothing (HOOD, ContourCraft, GaussianGarments), hair (HAAR, MonoHair, GaussianHaircut), and overall appearance (TeCH, TECA, ECON).

To have a wide impact on society, we make software and data available for research purposes. Our Software and Data Teams help us acquire the best possible data and share code widely; they are critical to our success. We also actively patent and license our technology. During this reporting period, for the first time, several papers (TokenHMR, ChatPose, ChatHuman) were collaborations between MPI and Meshcapade (under the terms of a cooperation agreement); a joint patent was filed for ChatHuman. Cooperations like this are increasingly important because they provide the scale necessary to be competitive today.

Research Fields & Projects

Inferring and exploiting contact

Datasets for understanding humans and animals

Human health and the 3D body

The AI animator

Language, Vision, and World Models

Human pose, shape, and motion capture

Generating human motion

Robot Perception Group

Data Team

Completed Projects

Clothing Capture and Modeling

Modeling Human Movement

Action and Behavior

Differentiable Rendering

Image Segmentation and Semantics

Groups and Crowds

Learning from Synthetic Data

Learning Deep Representations of 3D

Efficient and Scalable Inference
