Contact-rich tasks present significant challenges for robotic manipulation policies due to the complex dynamics of contact and the need for precise control. Vision-only policies often struggle with such tasks, as they typically lack critical contact feedback modalities like force/torque information. To address this issue, we propose FoAR, a force-aware reactive policy that combines high-frequency force/torque sensing with visual inputs to enhance performance in contact-rich manipulation. Built upon the RISE policy, FoAR incorporates a multimodal feature fusion mechanism guided by a future contact predictor, enabling dynamic adjustment of force/torque data usage between non-contact and contact phases. Its reactive control strategy also allows FoAR to accomplish contact-rich tasks accurately through simple position control. Experimental results demonstrate that FoAR significantly outperforms all baselines across various challenging contact-rich tasks while maintaining robust performance under unexpected dynamic disturbances.
FoAR consists of a point cloud encoder, a force/torque encoder, a future contact predictor, and a diffusion action head. The scene features and force features are fused under the guidance of the future contact predictor.
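The predictor-guided fusion described above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact implementation: the function name `fuse_features` and the simple scalar gate (scaling the force feature by the predicted contact probability before concatenation with the scene feature) are assumptions for clarity.

```python
import numpy as np

def fuse_features(scene_feat: np.ndarray,
                  force_feat: np.ndarray,
                  contact_prob: float) -> np.ndarray:
    """Illustrative sketch of predictor-guided fusion.

    The predicted contact probability gates the force/torque feature,
    so force information dominates during contact phases and is
    suppressed during non-contact phases; the gated force feature is
    then concatenated with the scene feature for the action head.
    """
    gated_force = contact_prob * force_feat  # hypothetical scalar gate
    return np.concatenate([scene_feat, gated_force], axis=-1)
```

With `contact_prob` near 0 (non-contact phase) the fused feature reduces to the scene feature plus near-zero force components; near 1 (contact phase) the force feature passes through at full strength.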
We design three challenging contact-rich tasks across two categories: surface force control (Wiping and Peeling) and instantaneous force impact (Chopping). These tasks require different capabilities in terms of the direction, intensity, and precision of applied contact forces. Moreover, these tasks are designed to have both non-contact phases and contact phases for thorough evaluations. For the Wiping task, we design two variants: one with a fixed orientation of the whiteboard, and another that allows arbitrary orientations, denoted as Wiping (General).
We evaluate our proposed approach against five baseline methods, including the vision-based policy RISE and three ablation variants: (1) RISE (force-token): incorporates encoded force/torque information as additional tokens within the RISE transformer, akin to \cite{vtt, seehearfeel, maniwav, octo}; (2) RISE (force-concat): directly concatenates the force feature with the vision feature for action generation; (3) FoAR (3D-cls): uses scene features directly in the future contact predictor, instead of a separate image encoder.
@article{he2024foar,
  title   = {FoAR: Force-Aware Reactive Policy for Contact-Rich Robotic Manipulation},
  author  = {He, Zihao and Fang, Hongjie and Chen, Jingjing and Fang, Hao-Shu and Lu, Cewu},
  journal = {arXiv preprint arXiv:2411.15753},
  year    = {2024}
}