Unsupervised learning for computer vision

Kamil Rzechowski

Senior Machine Learning Engineer
Jan 22, 2026|19 min read

Figure 1. Autoencoder architecture. An autoencoder consists of an encoder, a bottleneck, and a decoder. Source: https://en.wikipedia.org/wiki/Autoencoder
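The encoder–bottleneck–decoder structure in Figure 1 can be sketched in a few lines. This is a minimal toy example using plain NumPy with single linear maps and hypothetical dimensions (a flattened 8×8 input compressed to a 4-dimensional bottleneck); real autoencoders stack nonlinear layers and train the weights by minimizing the reconstruction loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: flattened 8x8 image -> 4-dim bottleneck.
input_dim, bottleneck_dim = 64, 4

# Encoder and decoder as single linear maps (real models use deep nonlinear stacks).
W_enc = rng.normal(scale=0.1, size=(input_dim, bottleneck_dim))
W_dec = rng.normal(scale=0.1, size=(bottleneck_dim, input_dim))

def encode(x):
    return x @ W_enc   # compress the input into the bottleneck

def decode(z):
    return z @ W_dec   # reconstruct the input from the bottleneck code

x = rng.normal(size=(1, input_dim))   # one flattened image
z = encode(x)                         # latent code at the bottleneck
x_hat = decode(z)                     # reconstruction

# Training would minimize this reconstruction error via gradient descent.
recon_loss = np.mean((x - x_hat) ** 2)
```

Because the bottleneck is much smaller than the input, the network cannot simply copy its input and is forced to learn a compressed representation.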

Figure 2. SimMIM model architecture. Source: https://arxiv.org/pdf/2111.09886
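The SimMIM training objective can be illustrated with a small NumPy sketch: randomly mask a subset of image patches, replace them with a mask token, and score the prediction of the raw pixel values with an L1 loss computed on the masked patches only. The patch sizes are hypothetical, and the identity "encoder" is a stand-in for the ViT/Swin backbone plus linear prediction head used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 16 patches of 4 pixels each.
num_patches, patch_dim = 16, 4
patches = rng.normal(size=(num_patches, patch_dim))

# Randomly mask a large fraction of patches (SimMIM uses a high mask ratio, ~0.6).
mask_ratio = 0.6
num_masked = int(num_patches * mask_ratio)
masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
mask = np.zeros(num_patches, dtype=bool)
mask[masked_idx] = True

# Replace masked patches with a mask token (zeros here for simplicity;
# the real model uses a learnable token).
corrupted = patches.copy()
corrupted[mask] = 0.0

# Stand-in for the encoder + one-layer prediction head: identity here,
# so the "prediction" is just the corrupted input.
pred = corrupted

# L1 reconstruction loss on the raw pixels of the masked patches only.
l1_loss = np.abs(pred[mask] - patches[mask]).mean()
```

The key design choice in SimMIM is this simplicity: a random mask, direct raw-pixel regression, and a loss restricted to masked positions, with no extra tokenizer or contrastive machinery.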

“We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.”

From: https://arxiv.org/abs/2211.07636


Figure 3. MAE model architecture. Source: https://arxiv.org/pdf/2111.06377
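The asymmetry in the MAE architecture of Figure 3 can be sketched as follows: the encoder processes only the visible patches (25% of the image at the paper's 75% mask ratio), while a lightweight decoder receives the encoded tokens plus mask tokens and reconstructs the missing pixels. The patch dimensions are hypothetical, and the identity encoder/decoder are stand-ins for the ViT backbone and shallow transformer decoder used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 16 patches, 75% masked as in the MAE paper.
num_patches, patch_dim = 16, 4
mask_ratio = 0.75
patches = rng.normal(size=(num_patches, patch_dim))

# Random shuffle; keep the first 25% of patches as the visible set.
perm = rng.permutation(num_patches)
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

# Key asymmetry: the encoder sees ONLY the visible patches (stand-in: identity;
# the real encoder is a large ViT), so it runs on a quarter of the tokens.
encoded_visible = patches[visible_idx]

# The lightweight decoder receives encoded visible tokens plus mask tokens
# (zeros here) at the masked positions, then predicts all patches.
decoder_input = np.zeros_like(patches)
decoder_input[visible_idx] = encoded_visible
pred = decoder_input   # stand-in decoder: identity

# MSE loss against the original pixels, on masked patches only.
mse = ((pred[masked_idx] - patches[masked_idx]) ** 2).mean()
```

Skipping masked tokens in the encoder is what makes MAE pre-training cheap: the expensive backbone touches only a small fraction of the sequence, while the full-length sequence is handled by a much smaller decoder.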