
Overview

Manipulating volumetric deformable objects in the real world, such as plush toys and pizza dough, brings substantial challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit representations for action-conditional dynamics and geodesics-based contrastive learning. To represent deformable dynamics from partial RGB-D observations, we learn implicit representations of occupancy and flow-based forward dynamics. To accurately identify state change under large non-rigid deformations, we learn a correspondence embedding field through a novel geodesics-based contrastive loss. To evaluate our approach, we develop a simulation framework for manipulating complex deformable shapes in realistic scenes, along with a benchmark containing over 17,000 action trajectories with six types of plush toys and 78 variants. Our model achieves the best performance in geometry, correspondence, and dynamics predictions over existing approaches. When applied to goal-conditioned deformable manipulation tasks, the ACID dynamics model yields a 30% increase in task success rate over the strongest baseline.

ACID Model

[Pull figure: overview of the ACID model]

ACID learns a set of structured implicit representations:

  1. Implicit Geometry Module: following Peng et al., we use a PointNet + U-Net encoder to encode the RGB-D input into three canonical feature planes. From these feature planes, we decode a coordinate-based implicit occupancy function (see the first sketch after this list).
  2. Implicit Correspondence Module: from the canonical feature planes, we decode an implicit correspondence embedding field. The embedding is supervised with a novel geodesics-based contrastive loss that minimizes the embedding distance of positive pairs while requiring the embedding distance of negative pairs to respect geodesic distance on the object surface (see the second sketch after this list).
  3. Implicit Dynamics Module: the action is encoded into canonical feature planes. Features from the implicit geometry module are fused in, and we decode a forward-flow field that predicts per-point motion.
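
Below is a minimal sketch of how the geometry module's occupancy decoding could look, in the spirit of Peng et al.'s convolutional occupancy networks: query points are projected onto the three canonical planes, features are bilinearly sampled and summed, and an MLP decodes occupancy. The plane keys, the decoder interface, and the coordinate conventions are illustrative assumptions, not the released API.

import torch
import torch.nn.functional as F

def query_occupancy(planes, decoder, pts):
    # planes: dict of (1, C, H, W) canonical feature planes ("xy", "xz", "yz")
    # decoder: hypothetical MLP mapping per-point features + 3D coords to logits
    # pts: (N, 3) query coordinates, assumed normalized to [-1, 1]^3
    feats = 0
    for axes, plane in zip([[0, 1], [0, 2], [1, 2]],
                           [planes["xy"], planes["xz"], planes["yz"]]):
        # Project each query point onto the plane and bilinearly sample features.
        uv = pts[:, axes].view(1, 1, -1, 2)              # (1, 1, N, 2)
        sampled = F.grid_sample(plane, uv, align_corners=True)
        feats = feats + sampled.squeeze(0).squeeze(1).T  # (N, C)
    return decoder(feats, pts)  # occupancy logits per query point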

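Next, a minimal sketch of one plausible form of the geodesics-based contrastive loss: corresponding (positive) pairs are pulled together, while non-corresponding (negative) pairs are pushed apart with a margin that grows with their geodesic distance on the object surface. The exact formulation in the paper may differ; the tensor shapes and margin scaling here are assumptions.

import torch

def geodesic_contrastive_loss(emb_a, emb_b, geo_dist, pos_mask, margin_scale=1.0):
    # emb_a, emb_b: (N, D) correspondence embeddings of two observations
    # geo_dist: (N, N) pairwise geodesic distances on the object surface
    # pos_mask: (N, N) boolean mask, True where points i and j correspond
    d = torch.cdist(emb_a, emb_b)  # (N, N) pairwise embedding distances
    # Positive pairs: minimize the embedding distance of corresponding points.
    pos_loss = d[pos_mask].pow(2).mean()
    # Negative pairs: embedding distance should respect geodesics, so the
    # hinge margin scales with the geodesic distance between the two points.
    neg_mask = ~pos_mask
    neg_loss = torch.clamp(margin_scale * geo_dist[neg_mask] - d[neg_mask],
                           min=0).pow(2).mean()
    return pos_loss + neg_loss
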
PlushSim Data from Nvidia Omniverse

We build the PlushSim manipulation environment on top of Nvidia Omniverse Kit, which offers high-fidelity deformable object simulation and photorealistic sensor signals. We collect the results of 17,665 action commands for model training.

Planning Results

Here, we visualize example planned action sequences. Each visualization shows the ACID model rolled out from a start state over a three-action sequence. For more details, please check out our paper. A rough planning sketch follows the examples below.

Dog
Teddy bear
Elephant
Octopus
Rabbit
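
As a rough illustration of how such sequences could be planned with a learned flow model, here is a minimal random-shooting sketch: sample candidate action sequences, advect the predicted point set through the forward flow one action at a time, and keep the sequence whose final shape best matches the goal. The model.flow interface, the action dimensionality, and the Chamfer-style cost are illustrative assumptions, not the paper's exact planner.

import torch

def rollout_cost(model, pts, actions, goal_pts):
    # pts, goal_pts: (N, 3) current and goal point sets
    # actions: (H, A) candidate action sequence
    for a in actions:
        pts = pts + model.flow(pts, a)  # apply predicted per-point forward flow
    # Chamfer-style distance between the final predicted and goal point sets.
    d = torch.cdist(pts, goal_pts)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def plan(model, pts, goal_pts, horizon=3, n_samples=256, action_dim=7):
    # Random-shooting planner: sample sequences, keep the lowest-cost one.
    best_cost, best_seq = float("inf"), None
    for _ in range(n_samples):
        seq = torch.randn(horizon, action_dim)  # candidate action sequence
        cost = rollout_cost(model, pts, seq, goal_pts)
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq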

Team

Bokui Shen, Zhenyu Jiang, Christopher Choy, Leonidas J. Guibas, Silvio Savarese, Anima Anandkumar, Yuke Zhu

Citation

@article{shen2022acid,
  title={ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation},
  author={Shen, Bokui and Jiang, Zhenyu and Choy, Christopher and Guibas, Leonidas J. and Savarese, Silvio and Anandkumar, Anima and Zhu, Yuke},
  journal={Robotics: Science and Systems (RSS)},
  year={2022}
}