EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning


Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning. We propose EquiBot, a robust, data-efficient, and generalizable approach for robot manipulation task learning. Our approach combines SIM(3)-equivariant neural network architectures with diffusion models. This ensures that our learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning, such as multi-modality and robustness. On a suite of 6 simulation tasks, we show that our method reduces data requirements and improves generalization to novel scenarios. In the real world, across a total of 10 variations of 6 mobile manipulation tasks, we show that our method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations per task.
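To make the SIM(3) property concrete, here is a minimal sketch in plain Python. A map f on point clouds is SIM(3)-equivariant if f(s R x + t) = s R f(x) + t for any scale s > 0, rotation R, and translation t. The centroid function below is only a toy stand-in chosen because it satisfies this property exactly; it is not the EquiBot network, and the helper names (rot_z, sim3_apply, centroid) are hypothetical.

```python
import math


def rot_z(theta):
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def sim3_apply(points, scale, R, t):
    """Apply the SIM(3) transform x -> scale * R @ x + t to each 3D point."""
    out = []
    for p in points:
        rotated = [sum(R[i][j] * p[j] for j in range(3)) for i in range(3)]
        out.append([scale * rotated[i] + t[i] for i in range(3)])
    return out


def centroid(points):
    """Toy equivariant map: the mean of the point cloud (NOT the EquiBot policy)."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(3)]


# Hypothetical point cloud and transform, chosen only for illustration.
cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
scale, R, t = 1.5, rot_z(math.pi / 3), [0.2, -0.4, 1.0]

# Equivariance check: transform-then-map should equal map-then-transform.
lhs = centroid(sim3_apply(cloud, scale, R, t))
rhs = sim3_apply([centroid(cloud)], scale, R, t)[0]
max_err = max(abs(a - b) for a, b in zip(lhs, rhs))
print("equivariant up to rounding:", max_err < 1e-9)
```

The same check (compare the output on a transformed input against the transformed output) can in principle be applied to any candidate policy whose outputs are 3D positions.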


Real Robot Setup


In our real robot experiments, we train mobile robots to perform everyday manipulation tasks from 5 minutes of single-view human demonstration videos. We select a suite of 6 tasks that involve diverse everyday objects, including rigid, articulated, and deformable objects (see Figure 6): (1) Push Chair: A robot pushes a chair towards a desk; (2) Luggage Packing: A robot picks up a pack of clothes and places it inside an open suitcase; (3) Luggage Closing: A robot closes an open suitcase on the floor; (4) Laundry Door Closing: A robot pushes the door of a laundry machine to close it; (5) Bimanual Folding: Two robots collaboratively fold a piece of cloth on a couch; (6) Bimanual Make Bed: Two robots unfold a comforter so that it covers the bed completely.

Data Collection

We collect 15 human demonstration videos for each real robot task. We use a ZED 2 stereo camera to record, at 15 Hz, a human operator manipulating the objects of interest with their fingers. After data collection, we use an off-the-shelf hand detection model, an object segmentation model, and a stereo-to-depth model to extract the human hand poses and object point clouds in each frame of the collected demos. We then subsample this data to 3 Hz and convert it into a format supported by our policy training algorithm. Below, we show sample human demonstrations for each task. In each task, we only collect human demos on a single object.
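As a rough illustration of the 15 Hz to 3 Hz subsampling step, the sketch below keeps every fifth frame. This is an assumption: the text does not say whether frames are dropped, averaged, or interpolated, and the function name subsample is hypothetical.

```python
def subsample(frames, src_hz=15, dst_hz=3):
    """Keep every (src_hz // dst_hz)-th frame; assumes src_hz is a multiple of dst_hz."""
    if dst_hz <= 0 or src_hz % dst_hz != 0:
        raise ValueError("src_hz must be a positive integer multiple of dst_hz")
    stride = src_hz // dst_hz  # 15 Hz -> 3 Hz gives a stride of 5
    return frames[::stride]


# Two seconds of placeholder frame indices recorded at 15 Hz.
frames = list(range(30))
kept = subsample(frames)
print(kept)  # [0, 5, 10, 15, 20, 25]
```

Under this assumption, each second of recorded video contributes 3 frames to the training data.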

Training and Evaluation

We train all methods for 1,000 epochs. After training, we evaluate each method for 10 episodes and record its success rate. How the evaluation scenarios differ from the training scenarios varies by task. In Laundry Door Closing, we evaluate in-distribution. In Push Chair, Luggage Closing, and Bimanual Make Bed, we evaluate with novel objects, which makes the evaluation out-of-distribution with respect to the training data. In Luggage Packing and Bimanual Folding, we not only switch to novel objects but also translate and rotate the layout of the scene.
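The success-rate metric described above can be sketched as follows. The episode outcomes here are made-up placeholders for illustration only, not results from the experiments, and the function name success_rate is hypothetical.

```python
def success_rate(outcomes):
    """Fraction of episodes marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(outcomes) / len(outcomes)


# Placeholder outcomes for 10 evaluation episodes (NOT reported results).
episodes = [True, True, False, True, True, True, False, True, True, True]
rate = success_rate(episodes)
print(f"{rate:.0%}")  # 80%
```

With 10 episodes per method, the success rate can only change in steps of 10 percentage points.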

Real Robot Results

[Videos: Ours (EquiBot) on each of the following task variations]

Push Chair to Longer Desk
Push Chair to Circular Desk
Packing T-shirts
Packing Towel
Packing Cap
Packing Shorts
Close Luggage
Laundry Door Closing
Bimanual Fold
Bimanual Make Bed


      title={EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning}, 
      author={Jingyun Yang and Zi-ang Cao and Congyue Deng and Rika Antonova and Shuran Song and Jeannette Bohg},