

Manipulating volumetric deformable objects in the real world, like plush toys and pizza dough, brings substantial challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit representations for action-conditional dynamics and geodesics-based contrastive learning. To represent deformable dynamics from partial RGB-D observations, we learn implicit representations of occupancy and flow-based forward dynamics. To accurately identify state change under large non-rigid deformations, we learn a correspondence embedding field through a novel geodesics-based contrastive loss. To evaluate our approach, we develop a simulation framework for manipulating complex deformable shapes in realistic scenes and a benchmark containing over 17,000 action trajectories with six types of plush toys and 78 variants. Our model achieves the best performance in geometry, correspondence, and dynamics predictions over existing approaches. The ACID dynamics models are successfully applied to goal-conditioned deformable manipulation tasks, resulting in a 30% increase in task success rate over the strongest baseline.

ACID Model


ACID learns a set of structured implicit representations:

  1. Implicit Geometry Module: following Peng et al., we use a PointNet + U-Net encoder to encode the RGB-D input into three canonical feature planes. From the feature planes, we decode a coordinate-based implicit occupancy function.
  2. Implicit Correspondence Module: from the canonical feature planes, we decode an implicit correspondence embedding field. The correspondence embedding is supervised with a novel geodesics-based contrastive loss: the embedding distance of positive pairs is minimized, while the embedding distance of negative pairs must respect the geodesic distance between the points.
  3. Implicit Dynamics Module: the action is encoded into canonical feature planes. Features from the implicit geometry module are fused with them, and we decode a forward-flow field.
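
The idea behind the geodesics-based contrastive loss in the correspondence module can be illustrated with a small sketch. The exact formulation is given in the paper; the form below is only one plausible instance, and it assumes that positive pairs are marked by a geodesic distance of zero, that negative pairs are pushed apart by a hinge whose margin grows with their geodesic distance, and that `alpha` (a hypothetical scale factor) converts geodesic distance into embedding distance.

```python
import math


def embedding_distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def geodesic_contrastive_loss(pairs, alpha=1.0):
    """Mean contrastive loss over (emb_a, emb_b, geodesic_distance) triples.

    Assumed form (not taken from the paper):
      positive pair (geodesic == 0): dist ** 2
      negative pair (geodesic  > 0): max(0, alpha * geodesic - dist) ** 2
    """
    total = 0.0
    for emb_a, emb_b, geodesic in pairs:
        dist = embedding_distance(emb_a, emb_b)
        if geodesic == 0.0:
            # Same surface point seen in two states: pull embeddings together.
            total += dist ** 2
        else:
            # Different surface points: keep them at least alpha * geodesic apart.
            total += max(0.0, alpha * geodesic - dist) ** 2
    return total / len(pairs)


pairs = [
    ((0.0, 0.0), (0.0, 0.0), 0.0),  # positive pair, already aligned
    ((0.0, 0.0), (1.0, 0.0), 2.0),  # negative pair, too close for its geodesic
    ((0.0, 0.0), (3.0, 0.0), 2.0),  # negative pair, already far enough apart
]
loss = geodesic_contrastive_loss(pairs)
print(loss)
```

With this form, points that are far apart along the object surface are penalized more strongly for having similar embeddings than points that are merely neighbors.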

ACID Real Robot Demo

We directly test the ACID model trained in PlushSim in real-world scenarios. We use a simple one-step horizon Model Predictive Control (MPC) to control the robot arm. The model performs reasonably well at manipulating real plush animals into the desired target configurations.
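
A one-step horizon MPC loop can be sketched as follows: score each candidate action by the cost of its predicted next state and execute the best one. This is a minimal illustration, not the controller used in the demo. The candidate set, `toy_dynamics`, and `toy_cost` are hypothetical stand-ins; in the real system the learned ACID dynamics model would play the role of `dynamics`, and the cost would compare the predicted state against the target configuration.

```python
def one_step_mpc(state, goal, candidates, dynamics, cost):
    """Return (best_action, best_cost) over a finite set of candidate actions."""
    best_action, best_cost = None, float("inf")
    for action in candidates:
        predicted = dynamics(state, action)  # one forward-model call per candidate
        c = cost(predicted, goal)
        if c < best_cost:
            best_action, best_cost = action, c
    return best_action, best_cost


def toy_dynamics(points, action):
    """Hypothetical dynamics: every 2D point is translated by the action."""
    dx, dy = action
    return [(x + dx, y + dy) for x, y in points]


def toy_cost(points, goal):
    """Hypothetical cost: mean squared distance between matched points."""
    total = sum((px - gx) ** 2 + (py - gy) ** 2
                for (px, py), (gx, gy) in zip(points, goal))
    return total / len(points)


state = [(0.0, 0.0), (1.0, 0.0)]
goal = [(1.0, 1.0), (2.0, 1.0)]
candidates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

action, predicted_cost = one_step_mpc(state, goal, candidates, toy_dynamics, toy_cost)
print(action, predicted_cost)  # (1.0, 1.0) 0.0
```

Because the horizon is a single step, the controller replans from a fresh observation after every executed action rather than committing to a long action sequence.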

ACID Model Qualitative Visualization

We visualize the ACID model's predictions qualitatively. The left side shows the input observation video. Each frame is processed by the ACID model, and the result is visualized on the right. As we can see, the ACID model correctly captures object shape and correspondence throughout the manipulation trajectory.

PlushSim Data from Nvidia Omniverse

We build the PlushSim manipulation environment on Nvidia Omniverse Kit (documentation), which offers high-fidelity deformable object simulation and photorealistic sensor signals. We collect the results of 17,665 action commands for model training.

ACID Planning Results in PlushSim

We show here some ACID planning results in PlushSim. We use a simple one-step horizon Model Predictive Control (MPC) to control the robot arm.

ACID Roll-out Results in PlushSim

Here, we visualize some additional action-sequence roll-outs for ACID. Each visualization shows the ACID model rolled out from a start state over a 3-action sequence. For more details, please check out our paper.
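
A roll-out of a flow-based forward model can be sketched as repeated application of predicted per-point flow. The code below is an illustrative sketch under simplifying assumptions: the state is a plain list of 3D points, and `toy_predict_flow` is a hypothetical stand-in for the learned forward-flow field, which in ACID is decoded from the fused feature planes.

```python
def apply_flow(points, flow):
    """Move each 3D point by its predicted displacement."""
    return [(x + fx, y + fy, z + fz)
            for (x, y, z), (fx, fy, fz) in zip(points, flow)]


def rollout(points, actions, predict_flow):
    """Return the list of states visited, starting with the initial state."""
    states = [points]
    for action in actions:
        flow = predict_flow(states[-1], action)  # flow conditioned on state and action
        states.append(apply_flow(states[-1], flow))
    return states


def toy_predict_flow(points, action):
    """Hypothetical flow model: every point moves by the action vector."""
    return [action for _ in points]


start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
actions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

states = rollout(start, actions, toy_predict_flow)
print(len(states), states[-1])
```

Each step consumes the previous prediction as its input, so prediction errors can accumulate over longer action sequences.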



  title={ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation},
  author={Shen, Bokui and Jiang, Zhenyu and Choy, Christopher and Guibas, Leonidas J. and Savarese, Silvio and Anandkumar, Anima and Zhu, Yuke},
  journal={Robotics: Science and Systems (RSS)},