Bimanual manipulation, fundamental to human daily activities, remains challenging due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook agent-specific information crucial for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while remaining agent-agnostic. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects such as ropes. It outperforms all baseline methods and even surpasses policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.
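As a rough illustration of how a learned visual representation can replace an engineered reward for zero-shot skill learning, the sketch below computes a dense reward as the negative embedding distance between the current frame and a goal frame. This is a minimal, hypothetical stand-in: `encode` here is a fixed random projection, not the actual Ag2x2 encoder, and the reward shape is only one common choice for representation-based rewards.

```python
import numpy as np

# Stand-in for a frozen, learned visual encoder (the real Ag2x2 encoder
# would jointly capture object states and hand motion; this is a fixed
# random projection purely for illustration).
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64 * 64 * 3, 128)) / np.sqrt(64 * 64 * 3)

def encode(frame: np.ndarray) -> np.ndarray:
    """Map an HxWx3 frame to a unit-norm embedding."""
    z = frame.reshape(-1) @ PROJ
    return z / (np.linalg.norm(z) + 1e-8)

def reward(frame_t: np.ndarray, frame_goal: np.ndarray) -> float:
    """Dense RL reward: negative distance to the goal frame's embedding.

    The reward is maximal (zero) when the current observation's
    embedding matches the goal embedding, so policy optimization is
    driven toward goal-reaching without any hand-engineered terms.
    """
    return -float(np.linalg.norm(encode(frame_t) - encode(frame_goal)))
```

In a pipeline like the one described above, such a reward would be queried at every environment step by an off-the-shelf RL algorithm, with no human demonstrations or per-task reward engineering.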
Comparison and Ablation Results
Successes out of 9 trials per task; tasks (a)–(f) are from Bi-DexHands, tasks (g)–(m) from PerAct2.

| Method | (a) | (b) | (c) | (d) | (e) | (f) | Avg. | (g) | (h) | (i) | (j) | (k) | (l) | (m) | Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eureka | 0 | 0 | 0 | 2 | 1 | 5 | 14.8% | 0 | 1 | 0 | 0 | 7 | 2 | 0 | 15.9% | 15.4% |
| R3M | 0 | 0 | 3 | 0 | 1 | 0 | 7.4% | 2 | 0 | 4 | 2 | 3 | 3 | 0 | 22.2% | 15.4% |
| VIP | 1 | 3 | 1 | 7 | 2 | 0 | 25.9% | 0 | 0 | 4 | 5 | 5 | 3 | 0 | 27.0% | 26.5% |
| Ag2Manip | 6 | 9 | 7 | 4 | 3 | 7 | 66.7% | 2 | 3 | 3 | 3 | 9 | 6 | 4 | 47.6% | 56.4% |
| Expert Reward | 8 | 9 | 6 | 6 | 8 | 9 | 85.2% | 5 | 0 | 6 | 3 | 5 | 3 | 6 | 44.4% | 63.2% |
| Ours (w/o hands) | 7 | 4 | 7 | 7 | 4 | 9 | 70.4% | 5 | 4 | 3 | 5 | 8 | 3 | 3 | 46.0% | 57.3% |
| Ours (full) | 7 | 6 | 9 | 8 | 7 | 9 | 85.2% | 6 | 5 | 2 | 7 | 9 | 6 | 5 | 63.5% | 73.5% |
@inproceedings{xiong2025ag2x2,
title = {Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation},
author = {Xiong, Ziyin and Chen, Yinghan and Li, Puhao and Zhu, Yixin and Liu, Tengyu and Huang, Siyuan},
  booktitle = {Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2025}
}