Motivation
Break Banana (Single-Arm)
Break Banana (Dual-Arm)
Pull Piston (Single-Arm)
Pull Piston (Dual-Arm)
Push Piston (Single-Arm)
Push Piston (Dual-Arm)
Fetch Book (Gripper)
Fetch Book (Hand)
Flip Book (Gripper)
Flip Book (Hand)
Roll Dough (Gripper)
Roll Dough (Hand)
Use Pen (Gripper)
Use Pen (Hand)
Twist Cap (6-DoF Hand)
Twist Cap (12-DoF Hand)
The operator teleoperates the physical robot and its MuJoCo digital twin, so apple→plate demonstrations are collected in the real world and in simulation under the same interface, which reduces the sim-to-real gap.
Low-latency teleoperation demonstration: In this video, the human operator performs fast, abrupt motions to evaluate the system's end-to-end responsiveness. The robot tracks the operator's movements in real time. Quantitative analysis shows an average end-to-end delay of only 11 ms, which enables responsive and precise manipulation even under aggressive teleoperation conditions.
Simulation
297 objects
30 categories
200 tasks
100K trajectories
6.5M frames
361h video
Open Source
LeRobot v2.1 format
Real World
347 objects
17 categories
200 tasks
10K trajectories
3.2M frames
177.5h video
Open Source
LeRobot v2.1 format
Method
(a) Data filtering: From the real-world dataset we pre-screen demonstrations by kinematic smoothness (low acceleration and jerk), then replay them for post-validation and keep the clips that complete the task without collisions, forming a high-quality subset.
(b) Discriminator training: With the pretrained diffusion-transformer policy frozen, we compute a log-π proxy for each clip and train a discriminator that, conditioned on observations and language, outputs a quality score d(C_t) ∈ (0,1].
(c) Data-quality-aware post-training: During post-training, the score d(C_t) is converted to weights w_i and used in the diffusion loss L_π. At inference time, only the policy is used.
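Steps (a) and (c) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: smoothness is measured as RMS acceleration and jerk obtained by finite differences, the thresholds max_acc and max_jerk are hypothetical, the score-to-weight mapping (a power with a hypothetical temperature, normalized to mean 1) is one plausible choice among many, and the diffusion loss is taken to be a noise-prediction mean squared error. All function names are illustrative.

```python
import numpy as np


def smoothness_stats(q, dt):
    """RMS acceleration and RMS jerk of a joint trajectory q with shape (T, D)."""
    vel = np.gradient(q, dt, axis=0)     # finite-difference velocity
    acc = np.gradient(vel, dt, axis=0)   # finite-difference acceleration
    jerk = np.gradient(acc, dt, axis=0)  # finite-difference jerk
    rms_acc = float(np.sqrt(np.mean(acc ** 2)))
    rms_jerk = float(np.sqrt(np.mean(jerk ** 2)))
    return rms_acc, rms_jerk


def prescreen(trajs, dt, max_acc, max_jerk):
    """Indices of trajectories whose RMS acceleration and jerk are below
    the given (hypothetical) thresholds."""
    keep = []
    for i, q in enumerate(trajs):
        rms_acc, rms_jerk = smoothness_stats(q, dt)
        if rms_acc <= max_acc and rms_jerk <= max_jerk:
            keep.append(i)
    return keep


def scores_to_weights(d, temperature=1.0):
    """Map quality scores d in (0, 1] to per-sample weights with mean 1.
    This mapping is an assumption; the source does not specify it."""
    d = np.asarray(d, dtype=float)
    w = d ** (1.0 / temperature)
    return w / w.mean()


def weighted_diffusion_loss(eps_pred, eps_true, w):
    """Weighted noise-prediction MSE: sum_i w_i * l_i / sum_i w_i,
    where l_i is the per-sample mean squared error."""
    reduce_axes = tuple(range(1, eps_pred.ndim))
    per_sample = np.mean((eps_pred - eps_true) ** 2, axis=reduce_axes)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * per_sample) / np.sum(w))


if __name__ == "__main__":
    dt = 0.01
    t = np.arange(0.0, 1.0 + dt / 2, dt)
    smooth = np.sin(2 * np.pi * t)[:, None]                  # smooth clip
    rng = np.random.default_rng(0)
    shaky = smooth + 0.05 * rng.standard_normal(smooth.shape)  # jittery clip
    print("kept clips:", prescreen([smooth, shaky], dt, max_acc=100.0, max_jerk=1000.0))
    print("weights:", scores_to_weights([0.5, 1.0]))
```

In this sketch the pre-screen only covers the kinematic check; the replay-based post-validation in (a) and the discriminator in (b) would sit between the two stages and are not modeled here.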
Basic Tasks Demonstration
Place Apple on Plate
Place Bowl into Bowl
Put Two Eggs into Box (Bimanual)
Lift Basket (Bimanual)
Move Block to Right Plate (Bimanual)
Stack Ring Blocks
Grab Square Blocks
Place Kettle on Base
Remove Pen Cap (Bimanual)
Separate Nested Bowls (Bimanual)
Open Cabinet Door
Open Laptop (Bimanual)
Dexterous Manipulation Sequences
Cut Leek
Use Pen
Roll Dough
Twist Cap
Fetch Book
Place Plates
Out-of-Distribution Generalization
Unseen Background
Unseen Lighting
Unseen Object
Occlusion
Clutter
Height Change
Cross-embodiment Generalization
w/o Discriminator
Shaking; Fail
w/ Discriminator
Smooth; Succeed
Joint Curve
Analysis
w/o Discriminator
Shaking; Fail
w/ Discriminator
Smooth; Succeed
Joint Curve
Analysis
Failure Cases
Semantic Misgrounding due to Instruction Ambiguity
The language-conditioned policy misinterprets the instruction “put the apple into the bowl”, confusing the referent (“small bowl” vs. “big bowl”), and thus selects the incorrect target for placement.
Visual Misclassification under Object Similarity
When executing the instruction “put the red apple into the bowl”, the vision encoder confuses visually similar red objects (apple vs. tomato) under mixed lighting, causing a target identification error.
Grasp Instability
The grasp lacks sufficient frictional stability or normal force, leading to slippage.
Bimanual Coordination Failure
Desynchronized joint trajectories between the two manipulators result in torque imbalance and object tilt.
Pose Drift and Accumulated Alignment Error
A small initial orientation deviation propagates across sequential steps, causing final assembly failure.
Workspace Boundary Violation
The target lies beyond the feasible workspace, leading to joint saturation.