EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

Abstract

Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning. We propose EquiBot, a robust, data-efficient, and generalizable approach for robot manipulation task learning. Our approach combines SIM(3)-equivariant neural network architectures with diffusion models. This ensures that our learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning, such as multi-modality and robustness. We show in a suite of 6 simulation tasks that our proposed method reduces data requirements and improves generalization to novel scenarios. In the real world, we show on a total of 10 variations of 6 mobile manipulation tasks that our method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations per task.
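To make the idea of SIM(3) invariance concrete, below is a minimal sketch (not the actual EquiBot architecture, which uses SIM(3)-equivariant network layers) of how a point cloud observation could be canonicalized so that a downstream policy ignores translation and scale; all function names are illustrative. Rotation, the remaining component of SIM(3), is handled inside the network in our method rather than by explicit canonicalization.

import numpy as np

def canonicalize_point_cloud(points):
    # points: (N, 3) array of object points in the world frame.
    # Subtracting the centroid removes translation; dividing by the mean
    # distance to the centroid removes scale.
    centroid = points.mean(axis=0)
    scale = np.linalg.norm(points - centroid, axis=1).mean()
    canonical = (points - centroid) / scale
    return canonical, centroid, scale

def uncanonicalize_positions(positions, centroid, scale):
    # Map predicted end-effector positions from the canonical frame
    # back into the world frame.
    return positions * scale + centroid

Because the policy only ever sees canonicalized inputs and its positional outputs are mapped back with the stored centroid and scale, shifting or resizing the scene leaves its behavior unchanged, which is the intuition behind the generalization claims above.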

Video

Real Robot Setup

Tasks

In our real robot experiments, we train mobile robots to perform everyday manipulation tasks from 5 minutes of single-view human demonstration videos per task. We select a suite of 6 tasks that involve diverse everyday objects, including rigid, articulated, and deformable objects (see Figure 6): (1) Push Chair: A robot pushes a chair towards a desk; (2) Luggage Packing: A robot picks up a pack of clothes and places it inside an open suitcase; (3) Luggage Closing: A robot closes an open suitcase on the floor; (4) Laundry Door Closing: A robot pushes the door of a laundry machine to close it; (5) Bimanual Folding: Two robots collaboratively fold a piece of cloth on a couch; (6) Bimanual Make Bed: Two robots unfold a comforter to make it cover the bed completely.

Data Collection

We collect 15 human demonstration videos for each real robot task. We use a ZED 2 stereo camera to record the movement of a human operator using their fingers to manipulate the objects of interest at 15 Hz. After data collection, we use an off-the-shelf hand detection model, an object segmentation model, and a stereo-to-depth model to parse out the human hand poses and object point clouds in each frame of the collected demos. We then subsample this data to 3 Hz and convert it into a format supported by our policy training algorithm. Below, we show sample human demonstrations for each task. In each task, we only collect human demos on a single object.
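To illustrate the final step of this pipeline, the sketch below shows one way the parsed 15 Hz hand poses and object point clouds might be subsampled to 3 Hz and packed into observation-action pairs for policy training. This is a hedged illustration, not the actual preprocessing code; the function name and data layout are assumptions.

import numpy as np

RECORD_HZ = 15   # camera recording rate
POLICY_HZ = 3    # rate expected by the policy training code

def subsample_demo(hand_poses, object_clouds):
    # hand_poses:    (T, 7) array of hand poses (position + quaternion),
    #                one per recorded frame.
    # object_clouds: list of T (N_t, 3) object point clouds.
    # Keeps every (RECORD_HZ // POLICY_HZ)-th frame and pairs each
    # observation with the next kept hand pose as the action target.
    stride = RECORD_HZ // POLICY_HZ
    steps = []
    for t in range(0, len(hand_poses) - stride, stride):
        steps.append({
            "obs_point_cloud": object_clouds[t],
            "obs_hand_pose": hand_poses[t],
            "action": hand_poses[t + stride],
        })
    return steps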

Training and Evaluation

We train all methods for 1,000 epochs. After training, we evaluate each method for 10 episodes and record its success rate. How the evaluation scenarios differ from the training scenarios varies by task. In Laundry Door Closing, we evaluate in-distribution. In Push Chair, Luggage Closing, and Bimanual Make Bed, we evaluate with novel objects, making the evaluation out-of-distribution with respect to the training data. In Luggage Packing and Bimanual Folding, we not only switch to novel objects but also translate and rotate the layout of the scene.
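As a minimal illustration of this evaluation protocol (not the actual evaluation code), the sketch below rolls out a trained policy for 10 episodes and reports the success rate; run_episode and policy are hypothetical stand-ins for the task-specific rollout logic.

NUM_EVAL_EPISODES = 10

def evaluate(policy, run_episode):
    # run_episode: hypothetical callable that executes one evaluation
    # episode with the given policy and returns True on success.
    successes = sum(run_episode(policy) for _ in range(NUM_EVAL_EPISODES))
    return successes / NUM_EVAL_EPISODES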

Real Robot Results

We evaluate on 10 real-world scenarios spanning the 6 tasks: Push Chair to Longer Desk, Push Chair to Circular Desk, Packing T-shirts, Packing Towel, Packing Cap, Packing Shorts, Close Luggage, Laundry Door Closing, Bimanual Fold, and Bimanual Make Bed. For each scenario, we show rollout videos comparing three methods: DP, DP+Aug, and Ours (EquiBot).

BibTeX

@misc{yang2024equibot,
      title={EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning}, 
      author={Jingyun Yang and Zi-ang Cao and Congyue Deng and Rika Antonova and Shuran Song and Jeannette Bohg},
      year={2024}
}