Inspired by Social Projection Theory, we use the robot's self-model to model humans efficiently.
We construct a shared latent space from different sensory modalities via contrastive learning.
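
As an illustration of how a shared latent space can be learned contrastively, the following minimal sketch computes a symmetric InfoNCE loss between embeddings from two modalities. It is a sketch under stated assumptions, not the method used here: the text does not specify the loss, the encoders, or the modalities, so the choice of InfoNCE, the function names (`info_nce`, `symmetric_info_nce`), the temperature of 0.1, and the toy embeddings are all hypothetical. Row `i` of each input is assumed to come from the same underlying observation, so matching rows are treated as positives and all other rows as negatives.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting candidates[i] as the match for anchors[i]."""
    anchors = [normalize(v) for v in anchors]
    candidates = [normalize(v) for v in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        # Similarity of this anchor to every candidate, sharpened by the temperature.
        logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


def symmetric_info_nce(z_a, z_b, temperature=0.1):
    """Average the loss in both directions (modality A to B and B to A)."""
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))


# Toy embeddings (assumed, for illustration only): 8 samples, 16 dimensions.
random.seed(0)
shared = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# A second "modality" that is nearly aligned with the first.
paired = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in shared]
# A second "modality" that is unrelated to the first.
unrelated = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

aligned_loss = symmetric_info_nce(shared, paired)
random_loss = symmetric_info_nce(shared, unrelated)
```

In this toy setting the loss for the nearly aligned pair is lower than for the unrelated pair. Minimizing it with respect to encoder parameters would pull matching cross-modal embeddings together, which is the sense in which the latent space is shared.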