Vision Model: Variational Autoencoder (VAE)
Overview
The Vision Model in World Models uses a Variational Autoencoder (VAE) to compress each high-dimensional visual observation into a compact latent vector. In the original paper, every 64×64 RGB frame is encoded as a small latent vector z_t (32 dimensions in the CarRacing experiment), which the downstream memory and controller components consume in place of raw pixels.
Why Use a VAE?
Raw visual observations are:
- High-dimensional: Computationally expensive to process
- Redundant: Many pixels contain similar information
- Noisy: Not all visual details are relevant for decision-making
A VAE addresses these issues by learning to:
- Encode observations into a low-dimensional latent space
- Decode latent vectors back to reconstructed observations
- Regularize the latent space for smooth interpolation
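The sketch below shows how these three pieces fit together. It is a minimal PyTorch example (illustrative code, not the authors' implementation); the layer sizes roughly follow the convolutional architecture described in the World Models paper, with 64×64 RGB inputs and a 32-dimensional latent.

```python
import torch
import torch.nn as nn

class ConvVAE(nn.Module):
    """Minimal convolutional VAE: 64x64 RGB frame <-> 32-dim latent z."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        # Encoder: four strided convolutions downsample 64x64 -> 2x2.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),     # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),    # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),   # 14 -> 6
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),  # 6 -> 2
            nn.Flatten(),                                 # 256*2*2 = 1024
        )
        self.fc_mu = nn.Linear(1024, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(1024, latent_dim)  # log-variance of q(z|x)
        # Decoder mirrors the encoder with transposed convolutions.
        self.fc_dec = nn.Linear(latent_dim, 1024)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, 5, stride=2), nn.ReLU(),  # 1 -> 5
            nn.ConvTranspose2d(128, 64, 5, stride=2), nn.ReLU(),    # 5 -> 13
            nn.ConvTranspose2d(64, 32, 6, stride=2), nn.ReLU(),     # 13 -> 30
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),   # 30 -> 64
        )

    def encode(self, x):
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_logvar(h)

    def decode(self, z):
        h = self.fc_dec(z).view(-1, 1024, 1, 1)
        return self.decoder(h)

    def forward(self, x):
        mu, logvar = self.encode(x)
        # Reparameterization trick (explained in the next section).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decode(z), mu, logvar
```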
The Reparameterization Trick
A key innovation in VAEs is the reparameterization trick, which enables backpropagation through the stochastic sampling step. Sampling z directly from N(μ, σ²) is not differentiable with respect to μ and σ, so the sample is rewritten as z = μ + σ ⊙ ε with ε ~ N(0, I): the randomness comes from the fixed noise source ε, while μ and σ enter through a deterministic, differentiable expression.
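In isolation, the trick looks like this. This is the standard formulation as a self-contained sketch, not code specific to World Models:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Differentiable sample z ~ N(mu, sigma^2)."""
    std = torch.exp(0.5 * logvar)   # sigma = exp(log(sigma^2) / 2)
    eps = torch.randn_like(std)     # eps ~ N(0, I); carries no gradient
    return mu + std * eps           # z = mu + sigma * eps
```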
Loss Function
The VAE is trained to minimize:
L = Reconstruction Loss + KL Divergence
Reconstruction Loss
Measures how well the decoder reconstructs the input, typically as mean squared error over pixels, or binary cross-entropy when pixel values are normalized to [0, 1].
KL Divergence
Regularizes the encoder's distribution N(μ, σ²) to stay close to a standard normal N(0, I). Without this term the encoder could scatter inputs into isolated regions of latent space, defeating the smooth-interpolation property described below.
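Putting the two terms together, a common implementation looks like the following sketch. The choice of MSE and the sum reduction are assumptions here, not prescribed by the paper:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, logvar):
    """Total VAE loss = reconstruction loss + KL divergence."""
    # Reconstruction: MSE between decoded and original pixels.
    recon = F.mse_loss(recon_x, x, reduction="sum")
    # KL(N(mu, sigma^2) || N(0, I)) in closed form for diagonal Gaussians:
    # -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```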
Latent Space Properties
A well-trained VAE produces a latent space with desirable properties:
- Continuity: Similar inputs map to nearby points
- Completeness: Every point in latent space decodes to a valid output
- Smoothness: Interpolation between points produces meaningful transitions
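One quick way to check these properties is latent interpolation: encode two frames, walk along the line between their latent means, and decode each point. A smooth latent space yields a gradual morph rather than abrupt jumps. This sketch assumes the illustrative ConvVAE class from the overview above:

```python
import torch

@torch.no_grad()
def interpolate(vae, x_a, x_b, steps: int = 8):
    """Decode evenly spaced points between the encodings of x_a and x_b."""
    mu_a, _ = vae.encode(x_a)   # use the posterior means as endpoints
    mu_b, _ = vae.encode(x_b)
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * mu_a + t * mu_b   # linear path in latent space
        frames.append(vae.decode(z))
    return torch.cat(frames)            # (steps, 3, 64, 64)
```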