Bimanual manipulation, fundamental to human daily activities, remains challenging due to the inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook agent-specific information crucial for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation built on coordination-aware visual representations that jointly encode object states and hand motion patterns while remaining agent-agnostic. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects such as ropes. It outperforms all baseline methods and even surpasses policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.
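As a rough illustration of how a learned visual representation can replace an engineered reward for zero-shot skill learning, the sketch below computes a dense reward as the negative embedding distance between the current frame and a goal frame. This is a minimal, hypothetical stand-in: `encode` here is a fixed random projection, not the actual Ag2x2 encoder, and the reward shape is only one common choice for representation-based rewards.

```python
import numpy as np

# Stand-in for a frozen, learned visual encoder (the real Ag2x2 encoder
# would jointly capture object states and hand motion; this is a fixed
# random projection purely for illustration).
rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64 * 64 * 3, 128)) / np.sqrt(64 * 64 * 3)

def encode(frame: np.ndarray) -> np.ndarray:
    """Map an HxWx3 frame to a unit-norm embedding."""
    z = frame.reshape(-1) @ PROJ
    return z / (np.linalg.norm(z) + 1e-8)

def reward(frame_t: np.ndarray, frame_goal: np.ndarray) -> float:
    """Dense RL reward: negative distance to the goal frame's embedding.

    The reward is maximal (zero) when the current observation's
    embedding matches the goal embedding, so policy optimization is
    driven toward goal-reaching without any hand-engineered terms.
    """
    return -float(np.linalg.norm(encode(frame_t) - encode(frame_goal)))
```

In a pipeline like the one described above, such a reward would be queried at every environment step by an off-the-shelf RL algorithm, with no human demonstrations or per-task reward engineering.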
Comparison and Ablation Results
Successes out of 9 trials per task; tasks (a)–(f) are from Bi-DexHands, tasks (g)–(m) from PerAct2.

| Method | (a) | (b) | (c) | (d) | (e) | (f) | Avg. | (g) | (h) | (i) | (j) | (k) | (l) | (m) | Avg. | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eureka | 0 | 0 | 0 | 2 | 1 | 5 | 14.8% | 0 | 1 | 0 | 0 | 7 | 2 | 0 | 15.9% | 15.4% |
| R3M | 0 | 0 | 3 | 0 | 1 | 0 | 7.4% | 2 | 0 | 4 | 2 | 3 | 3 | 0 | 22.2% | 15.4% |
| VIP | 1 | 3 | 1 | 7 | 2 | 0 | 25.9% | 0 | 0 | 4 | 5 | 5 | 3 | 0 | 27.0% | 26.5% |
| Ag2Manip | 6 | 9 | 7 | 4 | 3 | 7 | 66.7% | 2 | 3 | 3 | 3 | 9 | 6 | 4 | 47.6% | 56.4% |
| Expert Reward | 8 | 9 | 6 | 6 | 8 | 9 | 85.2% | 5 | 0 | 6 | 3 | 5 | 3 | 6 | 44.4% | 63.2% |
| Ours (w/o hands) | 7 | 4 | 7 | 7 | 4 | 9 | 70.4% | 5 | 4 | 3 | 5 | 8 | 3 | 3 | 46.0% | 57.3% |
| Ours (full) | 7 | 6 | 9 | 8 | 7 | 9 | 85.2% | 6 | 5 | 2 | 7 | 9 | 6 | 5 | 63.5% | 73.5% |
@inproceedings{xiong2025ag2x2,
title = {Ag2x2: Robust Agent-Agnostic Visual Representations for Zero-Shot Bimanual Manipulation},
author = {Xiong, Ziyin and Chen, Yinghan and Li, Puhao and Zhu, Yixin and Liu, Tengyu and Huang, Siyuan},
  booktitle = {Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
year = {2025}
}