MSDP: Self-Supervised Multisensory Pretraining for Contact-Rich Robot
Reinforcement Learning

¹Interactive Robot Perception and Learning, TU Darmstadt · ²Hessian.AI · ³Robotics Institute Germany

TL;DR

Multisensory pretraining enhances RL for contact-rich tasks by learning expressive representations through masked autoencoding.

MSDP Teaser


Multisensory Dynamic Pretraining

MSDP Overview

The MSDP framework, with the MSDP encoder (left), pretraining (top right), and downstream RL (bottom right): the current multisensory observation is projected into the embedding space with a CNN stem and linear layers. The MSDP encoder fuses all sensor embeddings into our expressive multisensory latent representation. The encoder is trained via the decoder by reconstructing the (next) sensor observations from a subset of sensor embeddings (masked autoencoding). This pretraining yields dynamic cross-sensor prediction, shaping and fusing the sensor representations. For downstream RL, we extract multisensory task-specific features via a single cross-attention layer for the critic and via pooling for the actor; sensor embeddings are masked only during pretraining. Our framework provides an expressive and robust multisensory representation for complex contact-rich manipulation tasks in simulation and the real world.
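The pretraining step described above, reconstructing (next) sensor observations from a random subset of sensor embeddings, can be sketched as follows. This is a toy illustration, not the authors' implementation: the sensor names, embedding size, mean-pooling in place of the transformer encoder, and the parameter-free "decoder" are all assumptions chosen to show only the masking and reconstruction-target logic.

```python
# Toy sketch of MSDP-style masked multisensory pretraining.
# All names/shapes are illustrative assumptions, not the paper's code.
import random

SENSORS = ["rgb", "proprio", "force_torque"]  # assumed sensor set
EMB_DIM = 4                                   # toy embedding size


def embed(obs):
    """Stand-in for the CNN stem / linear projections: one token per sensor."""
    return {s: [float(v) for v in obs[s]] for s in SENSORS}


def mask_tokens(tokens, keep_ratio=0.5, rng=random):
    """Randomly keep a subset of sensor embeddings (masked autoencoding)."""
    keep = max(1, int(len(tokens) * keep_ratio))
    kept = rng.sample(sorted(tokens), keep)
    return {s: tokens[s] for s in kept}


def fuse(visible):
    """Stand-in for the fusing encoder: mean-pool the visible sensor tokens."""
    n = len(visible)
    return [sum(tok[i] for tok in visible.values()) / n for i in range(EMB_DIM)]


def reconstruction_loss(latent, next_obs):
    """Stand-in decoder objective: score the fused latent against every
    sensor's next observation with a mean-squared error."""
    loss = 0.0
    for s in SENSORS:
        target = next_obs[s]
        loss += sum((latent[i] - target[i]) ** 2 for i in range(EMB_DIM))
    return loss / len(SENSORS)


# One pretraining step on a toy transition (o_t, o_{t+1}).
rng = random.Random(0)
o_t = {s: [rng.random() for _ in range(EMB_DIM)] for s in SENSORS}
o_next = {s: [rng.random() for _ in range(EMB_DIM)] for s in SENSORS}
visible = mask_tokens(embed(o_t), keep_ratio=0.5, rng=rng)
latent = fuse(visible)
print(len(visible), reconstruction_loss(latent, o_next) >= 0.0)
```

In the actual framework the pooled latent would be produced by the learned transformer encoder and decoded per sensor, with the loss backpropagated to shape the sensor embeddings; the sketch only fixes where the masking sits relative to encoding and reconstruction.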


Real World Experiments

Peg Insertion Setup
Peg Insertion Results
Push Cube Setup
Push Cube Results

Real-world setup and experimental results. MSDP enables training RL policies directly in the real world, with first successful episodes after only 2,000 / 1,000 online interactions, outperforming various baselines. Task success is detected via the end-effector position or the ArUco marker on the cube. Force-torque readings are essential for consistently pushing the cube to the goal and for inserting the peg under occlusion, improving task success by 14% (cf. MSDP-noFT). Policies are learned directly on the pretrained multisensory representation, without any sim-to-real transfer, within only 6,000 online interactions.

Robustness Evaluation

We evaluate the final MSDP-P policy on the Peg Insertion task under various disturbances to showcase the robustness of our pretrained multisensory encoder and policy, running 20 trials for each condition that was not observed during training. Trained with Cartesian stiffness \(K_c = 2000\), the policy achieves a 90% success rate with decreased stiffness (\(K_c = 1500\)) and 100% with increased stiffness (\(K_c = 2500\)). MSDP shows remarkable robustness to changed lighting, e.g., back light (100%), front light (100%), and disco lights (100%), to visual occlusion (partly blocked camera view, 95%), and to external forces.

MSDP Back Light
MSDP Front Light
MSDP Vision Occlusion
Disco

Simulation Experiments

Simulation Results

BibTeX

@misc{msdp_krohn2025,
  title = {Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning},
  author = {Rickmer Krohn and Vignesh Prasad and Gabriele Tiboni and Georgia Chalvatzaki},
  year = {2025},
  eprint = {2511.14427},
  archivePrefix = {arXiv},
  primaryClass = {cs.RO},
  url = {https://arxiv.org/abs/2511.14427},
}

Acknowledgments

This research is funded by the German Research Foundation (DFG) Emmy Noether Programme (CH 2676/1-1), the EU’s Horizon Europe project ARISE (Grant no.: 101135959), the German Federal Ministry of Education and Research (BMBF) project “RiG” (Grant no.: 16ME1001) and the European Research Council (ERC) project “SIREN” (Grant No.: 101163933). The authors gratefully acknowledge the computing time provided to them on the high-performance computer Lichtenberg II at TU Darmstadt, funded by the German Federal Ministry of Education and Research (BMBF) and the State of Hesse.
