In this work, we focus on learning useful and robust deep world models using multiple, possibly unreliable, sensors. We find that methods that maximize a reconstruction-based variational evidence lower bound (ELBO) do not sufficiently encourage a shared representation between modalities; this can cause poor performance on downstream tasks and over-reliance on specific sensors.

Consider a simple illustrative example: a robot in a 2D world with two sensory modalities, laser rangers that give its (x, y) position and a ceiling camera that provides a scene image. A deep latent state-space model (SSM) trained using Product-of-Experts (PoE) fusion and a reconstruction-based loss did not learn a robust latent space from the gathered data (left plots; colors indicate ground-truth position): the overlap between the two modality-specific latent spaces is small, and the model is over-reliant on the position sensor.

The experts were “miscalibrated”: the camera expert predicts a much higher variance than the position expert and thus has little influence during PoE fusion.
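To see why a high-variance expert is effectively ignored, the sketch below fuses Gaussian experts via Product-of-Experts, where the fused precision is the sum of the expert precisions and each expert's mean is precision-weighted. This is a minimal illustration with made-up numbers, not the paper's model; the function name `poe_fuse` is our own.

```python
import numpy as np

def poe_fuse(mus, sigmas):
    """Product-of-Experts fusion of Gaussian experts (means, std devs).

    The fused precision is the sum of expert precisions, so an expert
    with large variance contributes little to the fused mean.
    """
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    precisions = 1.0 / sigmas**2
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mu = fused_var * (precisions * mus).sum(axis=0)
    return fused_mu, np.sqrt(fused_var)

# A sharp "position" expert vs. a miscalibrated high-variance "camera" expert:
mu, sigma = poe_fuse(mus=[0.0, 10.0], sigmas=[0.1, 10.0])
# the fused mean stays near the low-variance expert's prediction
```

Even though the two experts disagree strongly (0.0 vs. 10.0), the fused estimate sits almost exactly at the low-variance expert's mean, mirroring the over-reliance on the position sensor described above.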

Our technique, MuMMI, encourages a consistent latent space across the different modalities (right plots) with calibrated experts.

Fig 1. A simple illustrative example of a robot in a 2D world with two sensory modalities: laser rangers that give its (x, y) position and a ceiling camera that provides a scene image.

MuMMI uses an MI-based lower bound that is optimized via the InfoNCE loss. Within this contrastive framework, we explicitly encourage the different modality networks to be consistent with one another via a specially-designed density ratio estimator.

Fig 2. MuMMI training uses a density ratio estimator $f^m_\theta$ that acts to minimize the squared distances between the mean of each modality expert and a transformed fused latent code. This encourages the experts to project to points in a shared latent space.
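The sketch below illustrates the idea behind such a contrastive objective: an InfoNCE loss whose density ratio estimator scores a pair by the negative squared distance between a transformed fused code and a modality expert's mean, so minimizing the loss pulls matched pairs together in a shared space. This is a simplified, hypothetical rendition (the names `info_nce` and `W` are ours), not the authors' actual estimator $f^m_\theta$.

```python
import numpy as np

def info_nce(fused, expert_mus, W):
    """InfoNCE with a squared-distance density ratio estimator (sketch).

    Score f(z, mu) = -||W z - mu||^2 is high when the transformed fused
    code lands near the modality expert's mean, so minimizing the loss
    pulls the experts toward a shared latent space.

    fused:      (B, D) fused latent codes
    expert_mus: (B, D) one modality's expert means (row i is the
                positive for fused code i; other rows are negatives)
    W:          (D, D) learned transform (here just a fixed matrix)
    """
    z = fused @ W.T                                   # transformed fused codes
    # pairwise scores: scores[i, j] = -||z_i - mu_j||^2
    d2 = ((z[:, None, :] - expert_mus[None, :, :]) ** 2).sum(-1)
    scores = -d2
    # InfoNCE = cross-entropy with the matched pair as the "correct class"
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
B, D = 8, 4
fused = rng.normal(size=(B, D))
# expert means close to their matched fused codes => small loss
loss = info_nce(fused, fused + 0.01 * rng.normal(size=(B, D)), np.eye(D))
```

With well-aligned positives the loss is close to zero; as the experts drift away from the fused code, the squared-distance term grows and the loss increases, which is what drives the experts toward a consistent shared latent space.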

Our final loss combines these terms and lower-bounds the log-likelihood of observations and rewards.

We tasked our method with learning policies (in a self-supervised manner) on multi-modal Natural MuJoCo benchmarks and a challenging Table Wiping task. Experiments show that our method significantly outperforms state-of-the-art deep reinforcement learning methods, particularly in the presence of missing observations.

Fig 3. Natural MuJoCo environments and modalities used in our experiments. (top) RGB images for walker stand/walk/run, finger spin, and cup catch; the background images change continuously. (bottom) Two additional modalities for the walker: depth image and tactile sensor.

Fig 4. Model performance across training episodes. The performance curves are steeper for MuMMI on a majority of the tasks, indicating faster learning compared to competing approaches (Dreamer and CVRL).

## Code

Code for reproducing the experiments presented in this paper can be found at this GitHub repo.

## Citation

If you find our code or the ideas presented in our paper useful for your research, please consider citing our paper.

Kaiqi Chen, Yong Lee, and Harold Soh. “Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning.” In IEEE International Conference on Robotics and Automation (ICRA), 2021.

@inproceedings{Chen2021MuMMI,
  title={Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning},
  author={Kaiqi Chen and Yong Lee and Harold Soh},
  year={2021},
  booktitle={IEEE International Conference on Robotics and Automation (ICRA)}
}


## Contact

If you have questions or comments, please contact Kaiqi Chen.

## Acknowledgements

This work was supported by the Science and Engineering Research Council, Agency of Science, Technology and Research, Singapore, through the National Robotics Program under Grant No. 192 25 00054.

## References

[1] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” arXiv preprint arXiv:1912.01603, 2019.

[2] X. Ma, S. Chen, D. Hsu, and W. S. Lee, “Contrastive variational model-based reinforcement learning for complex observations,” arXiv preprint arXiv:2008.02430, 2020.
