General In-Hand Object Rotation with Vision and Touch
Haozhi Qi, Brent Yi, Sudharshan Suresh, Mike Lambeta, Yi Ma, Roberto Calandra, Jitendra Malik
UC Berkeley, Meta AI, CMU, TU Dresden
The Centre for Tactile Internet with Human-in-the-Loop (CeTI)
Conference on Robot Learning (CoRL), 2023
Abstract
We introduce RotateIt, a system that enables fingertip-based object rotation along multiple axes by leveraging multimodal sensory inputs. Our system is trained in simulation, where it has access to ground-truth object shapes and physical properties. Then we distill it to operate on realistic yet noisy simulated visuotactile and proprioceptive sensory inputs. These multimodal inputs are fused via a visuotactile transformer, enabling online inference of object shapes and physical properties during deployment. We show significant performance improvements over prior methods and highlight the importance of visual and tactile sensing.
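As a rough illustration of the fusion step, the sketch below shows one way visual and tactile inputs could be combined with a transformer encoder. The projection layers, token counts, and mean pooling are assumptions for illustration, not the paper's exact architecture.

# A minimal sketch (not the paper's exact architecture) of fusing visual and
# tactile tokens with a transformer encoder, assuming point-cloud coordinates
# and discretized contact signals are projected to a shared token width.
import torch
import torch.nn as nn


class VisuoTactileFusion(nn.Module):
    def __init__(self, dim=128, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.vision_proj = nn.Linear(3, dim)   # xyz points -> tokens (assumed)
        self.touch_proj = nn.Linear(1, dim)    # binary contacts -> tokens (assumed)

    def forward(self, points, contacts):
        # points: (B, N, 3) object point cloud; contacts: (B, T, 1) discretized touch
        tokens = torch.cat(
            [self.vision_proj(points), self.touch_proj(contacts)], dim=1
        )
        fused = self.encoder(tokens)           # (B, N + T, dim)
        return fused.mean(dim=1)               # pooled multimodal embedding


# Example usage with random inputs.
model = VisuoTactileFusion()
emb = model(torch.randn(2, 256, 3), torch.randint(0, 2, (2, 16, 1)).float())
print(emb.shape)  # torch.Size([2, 128])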
Many Objects, Many Axes
Interaction Visualization
We show interactive visualizations of rotation about multiple axes.
Use your mouse to control the viewing angle!
More examples: [X-axis] [Y-axis] [Z-axis] [Irregular-axis]
Visual Sim-to-Real
Object depth serves as the bridge between simulation and the real world.
In the real world, we use Segment Anything to segment the object.
The background is ignored in both simulation and the real world.
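Below is a minimal sketch of this bridge under simple assumptions: given a depth image and a binary object mask (e.g., from Segment Anything in the real world, or rendered directly in simulation), we zero out the background and back-project the remaining pixels to a point cloud. The helper name and the camera intrinsics are hypothetical, for illustration only.

# A minimal sketch of the depth-based sim-to-real bridge. The intrinsics
# fx, fy, cx, cy below are placeholders, not the values used in the paper.
import numpy as np


def masked_depth_to_points(depth, mask, fx, fy, cx, cy):
    """Zero out the background and back-project the object depth to a point cloud."""
    depth = np.where(mask, depth, 0.0)        # ignore background pixels
    v, u = np.nonzero(depth)                  # pixel coordinates on the object
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)       # (num_points, 3)


# Example with a synthetic 64x64 depth map and a square object mask.
depth = np.full((64, 64), 0.5, dtype=np.float32)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True
points = masked_depth_to_points(depth, mask, fx=60.0, fy=60.0, cx=32.0, cy=32.0)
print(points.shape)  # (400, 3)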
Tactile Sim-to-Real
Vision-based tactile sensing is hard to simulate, so we approximate it with discrete contact locations.
In simulation, we can directly query contact points.
In the real world, we use simple color tracking to measure pixel displacements and then discretize them.
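A minimal sketch of the real-world discretization step is shown below, assuming marker centers have already been tracked by color (e.g., with OpenCV thresholding). The displacement threshold and the 4x4 marker grid are placeholder values, not the actual sensor configuration.

# A minimal sketch: turn tracked marker displacements into binary contact signals.
# The 1.5-pixel threshold and the 4x4 marker grid are illustrative assumptions.
import numpy as np


def discretize_contacts(ref_markers, cur_markers, threshold_px=1.5):
    """Return a binary contact flag per marker from its pixel displacement."""
    displacement = np.linalg.norm(cur_markers - ref_markers, axis=-1)  # (num_markers,)
    return (displacement > threshold_px).astype(np.float32)


# Example: a 4x4 grid of markers where one marker moved noticeably.
ref = np.stack(np.meshgrid(np.arange(4), np.arange(4)), axis=-1).reshape(-1, 2) * 10.0
cur = ref.copy()
cur[5] += np.array([2.0, 1.0])        # simulated shift of one marker under contact
print(discretize_contacts(ref, cur))  # 1.0 at index 5, 0.0 elsewhere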
Vision and Touch Improve Manipulation of Hard Objects
We plot the relative improvements across object shapes for x-axis rotation. We find that the point cloud gives the largest improvement on objects with non-uniform w/d/h (width/depth/height) ratios and on objects with irregular shapes, such as the bunny and the light bulb. The improvements on regular objects are smaller but still over 40%.
Similar to what we find in oracle policy training, we observe that the visuotactile policy shows larger improvements on irregular and non-uniform objects.
Vision and Touch Improve Out-of-Distribution (OOD) Generalization
We show that not using point clouds results in a 22% performance drop on out-of-distribution objects, while using point clouds reduces the drop to only 8%. Visuotactile information is critical for OOD generalization: using proprioception alone leads to a 41% performance drop, while using vision and touch reduces it to 15%.
Emergent Meaningful Latent Representation
After training, we freeze the policy and then try to predict 3D shapes from the learned embedding space.
In the stage 1 (w/o shape) vs. stage 1 (w/ shape) comparison, we find that shape information is preserved even when the only learning signal is the task reward.
In the stage 2 (proprioception only) vs. stage 2 (visuotactile) comparison, we find that the learned latent space can successfully reconstruct rough 3D shapes.
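The sketch below illustrates one way such a probing experiment could be run: keep the policy frozen and train a small decoder to map its embeddings to point clouds with a chamfer loss. The decoder size, the simple chamfer implementation, and the random stand-in data are assumptions for illustration, not the paper's exact setup.

# A minimal probing sketch: predict 3D shapes from frozen policy embeddings.
import torch
import torch.nn as nn


def chamfer(pred, gt):
    # pred, gt: (B, N, 3); symmetric nearest-neighbor distance
    d = torch.cdist(pred, gt)                  # (B, N, N)
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()


class ShapeDecoder(nn.Module):
    def __init__(self, emb_dim=128, num_points=256):
        super().__init__()
        self.num_points = num_points
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.ReLU(), nn.Linear(512, num_points * 3)
        )

    def forward(self, emb):
        return self.mlp(emb).view(-1, self.num_points, 3)


# Training loop with random stand-in data; the policy itself stays frozen,
# so only the decoder parameters are updated.
decoder = ShapeDecoder()
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)
emb, gt_points = torch.randn(8, 128), torch.randn(8, 256, 3)
for _ in range(10):
    loss = chamfer(decoder(emb), gt_points)
    opt.zero_grad()
    loss.backward()
    opt.step()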
Bibtex
@inproceedings{qi2023general,
  author={Qi, Haozhi and Yi, Brent and Ma, Yi and Suresh, Sudharshan and Lambeta, Mike and Calandra, Roberto and Malik, Jitendra},
  title={{General In-Hand Object Rotation with Vision and Touch}},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2023}
}
Acknowledgement
The interactive visualizations and the mesh visualizations in the paper are created with Viser.
This research was supported as a BAIR Open Research Common Project with Meta. In their academic roles at UC Berkeley, Haozhi Qi and Jitendra Malik are supported in part by DARPA Machine Common Sense (MCS), Brent Yi is supported by the NSF Graduate Research Fellowship Program under Grant DGE 2146752, and Haozhi Qi, Brent Yi, and Yi Ma are partially supported by ONR N00014-22-1-2102 and the InnoHK HKCRC grant. Roberto Calandra is funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden. We thank Shubham Goel, Eric Wallace, Angjoo Kanazawa, and Raunaq Bhirangi for their feedback. We thank Austin Wang and Tingfan Wu for their help with the hardware. We thank Xinru Yang for her help with the real-world videos.