Published: Jan 22, 2026 | 19 min read
Traditional machine learning for computer vision relies on meticulously labeled datasets—where humans tag every image with its correct category. This process creates a significant bottleneck: manual labeling can cost millions of dollars for large-scale projects, require months of expert time, and remain prone to human error and inconsistency. Unsupervised learning offers a fundamentally different approach, allowing algorithms to discover structure and patterns in visual data without any human-provided labels.
In machine learning, ML engineers often face the challenge of having data but no ground truth. Labeling datasets is costly and time-consuming, often requires specialized domain expertise, and is prone to human errors and subjectivity.
Therefore, unsupervised, self-supervised, and semi-supervised approaches have gained huge attention and are widely used in industry. The most successful examples are Large Language Models such as OpenAI GPT or Google Gemini: autoregressive models pretrained on massive unlabeled text corpora with a next-token prediction task.
In this article, we will discuss approaches used for training image models with no labels given a priori.
In unsupervised settings, no labels are given and no pseudo-labels are created. The most classic examples of unsupervised learning are:
dimensionality reduction techniques like Principal Component Analysis (PCA) or t-SNE (t-Distributed Stochastic Neighbor Embedding)
clustering algorithms like K-means clustering or DBSCAN.
Unsupervised methods use the structure of the data itself, such as its density, to find similarities and patterns. Self-supervised methods, in contrast, are a subclass of supervised learning: they create labels from the data itself. This can be done, for example, by:
rotating the image and predicting the image rotation, or solving a jigsaw puzzle
Masked Image Modeling (MIM), where we mask part of the image and try to predict the missing part
autoencoders, where we compress the image through a bottleneck in the network and try to reconstruct it from its compressed representation
maximizing the Mutual Information (MI) between different representations, such as between augmented versions of the same input
generative models, such as GANs or Stable Diffusion. In the case of GANs, an auxiliary network guides the image-generation network: the generator creates a fake image, attempting to deceive the discriminator, while the discriminator tries to distinguish fake images from real ones. The pseudo labels for this approach are 0 for a fake image and 1 for a real image. In the case of Stable Diffusion, the input image itself guides the generation process.
autoregressive video modeling
Semi-supervised techniques use partially labelled data together with a massive amount of unlabeled data to train the model. This approach can:
label unlabeled data using similarities to labeled data
combine the self-supervised approach with supervised learning. For example, attach a task-specific head to the backbone network and train it on a downstream task using the labelled data, while training the backbone alone on the unlabeled data with a mutual-information (MI) objective (a rough sketch follows below).
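As a rough sketch of the second option (generic PyTorch code; the backbone, head, and loss function here are illustrative placeholders, not taken from any specific paper):

import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Flatten(), nn.Linear(4096, 512), nn.ReLU())  # placeholder feature extractor
head = nn.Linear(512, 10)                                                # task-specific head (10 classes assumed)

def semi_supervised_step(labeled_x, labels, view_a, view_b, self_sup_loss):
    # supervised branch: labeled data flows through backbone + head
    sup_loss = F.cross_entropy(head(backbone(labeled_x)), labels)
    # self-supervised branch: two augmented views of unlabeled data, backbone only
    unsup_loss = self_sup_loss(backbone(view_a), backbone(view_b))
    return sup_loss + unsup_loss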
It is worth noting that unsupervised learning techniques typically require significantly larger datasets than supervised methods.
Let’s take as an example a grayscale image of size 64x64 pixels. In order to apply dimensionality reduction, we treat it not as an image, but as a point in 64x64 = 4096-dimensional space. Using PCA, we can then find new axes (principal components) ordered from the highest variance to the lowest. By keeping the subset of components with the highest variance, we obtain a reduced image representation that preserves the maximum amount of information about the original image. This condensed representation is useful for further downstream tasks, such as image classification using clustering algorithms.
import numpy as np
from sklearn.decomposition import PCA

x = np.load('dataset.npy')
x.shape                          # (10000, 64, 64)
x[0].shape                       # (64, 64)
x = x.reshape((x.shape[0], -1))  # flatten each image into a 4096-dim vector
pca = PCA(n_components=50)
x_reduced = pca.fit_transform(x)
x_reduced.shape                  # (10000, 50)
x_recovered = pca.inverse_transform(x_reduced)
Instead of specifying an arbitrary number of dimensions, we can choose it in a smarter way: for example, keep as many dimensions as needed to preserve 90% of the original variance.
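With scikit-learn's PCA, passing a float as n_components is interpreted as the fraction of variance to preserve:

pca = PCA(n_components=0.90)       # keep enough components to preserve 90% of the variance
x_reduced = pca.fit_transform(x)   # x is the flattened image array from above
pca.n_components_                  # number of components actually selected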
Having the condensed image representation, we can easily apply K-Means. Applying K-Means directly to the images before dimensionality reduction would yield worse results and take much more time.
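For instance, using scikit-learn's KMeans on the reduced representation (the number of clusters here is an assumption for illustration):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0)
cluster_ids = kmeans.fit_predict(x_reduced)   # one cluster id per image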
The deep learning approach to dimensionality reduction is the autoencoder. An autoencoder consists of an encoder, a bottleneck, and a decoder, and it reconstructs the input image at its output. The bottleneck forces the model to learn a lower-dimensional representation of the input. After training, the encoder can be used as a feature extractor for downstream tasks.
Figure 1. Autoencoder architecture. An autoencoder consists of an encoder, a bottleneck, and a decoder. Source: https://en.wikipedia.org/wiki/Autoencoder
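A minimal PyTorch sketch of this idea (the layer sizes and the MSE reconstruction loss are illustrative assumptions):

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=4096, bottleneck_dim=50):
        super().__init__()
        # encoder compresses the flattened image into the bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),
        )
        # decoder reconstructs the image from the bottleneck
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # condensed representation
        return self.decoder(z)     # reconstruction

model = Autoencoder()
criterion = nn.MSELoss()           # reconstruction loss against the input itself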
MIM aims to recover the original image signal. In most cases, some fraction of the original image is masked. Sometimes an informed masking strategy is used, where additional information or heuristics (for example, block masking) determine which patches to mask. The additional information can come from auxiliary neural networks or from features obtained with classical computer vision algorithms.
Usually, MIM operates on low-level features such as raw pixel values, and the model recovers them via a regression task. In that case, it has been shown empirically that successful MIM learning requires masking most of the image. This can be achieved either with a large number of masked patches when the patch size is small, or with a large patch size; in the latter case the masking ratio can be lower, but most of the image should still be masked. For example, MAE uses a masking ratio of 75%, leaving only 49 out of 196 patches visible for a patch size of 16, and SimMIM uses a wide range of masking ratios (10%-70%) for a masked patch size of 32, achieving competitive results.
Figure 2. SimMIM model architecture. Source: https://arxiv.org/pdf/2111.09886
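A minimal sketch of random patch masking in PyTorch (the patch count and masking ratio here are illustrative, not tied to a specific paper):

import torch

def random_mask(num_patches=196, mask_ratio=0.75):
    # randomly select which patches are hidden from the model
    num_masked = int(num_patches * mask_ratio)
    ids = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[ids[:num_masked]] = True   # True = masked patch
    return mask

mask = random_mask()
mask.sum()   # 147 of 196 patches masked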
Some models instead predict high-level features. For example, EVA (Exploring the Limits of Masked Visual Representation Learning at Scale) uses CLIP vision features as the prediction target and MIM as the pretext task: it learns to reconstruct the image-text aligned (CLIP) features of masked patches, with 40% of the image patches masked. In the paper "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", the authors conclude that:
“We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.”
From: https://arxiv.org/abs/2211.07636
Figure 3. MAE model architecture. Source: https://arxiv.org/pdf/2111.06377
Apart from the commonly used regression loss, some approaches use a classification or contrastive loss. For example, MUST (Li et al., ICLR 2023) uses a cross-entropy loss, Mask-Point (Liu et al., ECCV 2022) uses a focal loss, ExtreMA (Wu et al., TMLR 2023) uses a cosine distance, and MAE-CLIP (Weers et al., CVPR 2023) uses an InfoNCE loss.
The MI task creates two versions of the same image and asks the neural network to maximise the shared information between their representations. In the case of DINO, the model minimises the cross-entropy between the outputs produced for different crops of the same image. The authors of the DINO paper add a few tricks to boost performance and stabilize training. They adopt a teacher-student setup in which both networks share the same architecture, but gradients are only backpropagated through the student network, while the teacher weights are updated as an exponential moving average (EMA) of the student weights. Building the teacher from past iterations of the student stabilizes training, since the student's performance fluctuates throughout training; the teacher can be seen as a kind of model soup that guides the student. Additionally, the authors apply sharpening and centering to the teacher output to avoid collapse: centering prevents one dimension from dominating, while sharpening has the opposite effect. Two trivial collapse modes would be a uniform output distribution or one concentrated on a single dimension; sharpening and centering together aim to avoid both.
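A minimal sketch of the EMA teacher update (generic code, not DINO's implementation; the momentum value is an assumption):

import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # teacher weights track an exponential moving average of the student weights
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)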
The authors of the paper Explicit Mutual Information Maximization for Self-Supervised Learning use a Siamese network consisting of two identical encoders. The same image, with a different augmentation applied to each copy, is propagated through each branch. They maximise the mutual information between the two output vectors Z and Z′ by minimizing their joint entropy H(Z, Z′) while also maximising the marginal entropies H(Z) and H(Z′); since I(Z; Z′) = H(Z) + H(Z′) − H(Z, Z′), this maximises the mutual information and naturally avoids trivial constant solutions.
Generative models create an image from noise. We can name two main architectures: Generative Adversarial Networks (GANs) and newer Diffusion Models.
GANs
Generative Adversarial Networks consist of two models: a generator and a discriminator. The two models play a game: the generator tries to fool the discriminator by generating fake images, while the discriminator tries to classify a given image as real or fake. Over the course of training, both models improve. However, this architecture is prone to training collapse: if one model starts to dominate, the other stops learning and the whole training process collapses. Partly because of this drawback, Diffusion Models were proposed; they provide more stable training and therefore better performance (image quality and resolution). GANs require only fake-vs-real image labels for training, so no labeling needs to be done a priori.
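A minimal sketch of one GAN training step in PyTorch (the generator, discriminator, and optimizers are assumed to exist, with the discriminator outputting probabilities; labels follow the 1 = real, 0 = fake convention described above):

import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, real_images, opt_g, opt_d, latent_dim=128):
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)    # pseudo label 1 = real
    fake_labels = torch.zeros(batch, 1)   # pseudo label 0 = fake

    # discriminator step: classify real vs. generated images
    fake_images = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = F.binary_cross_entropy(discriminator(real_images), real_labels) + \
             F.binary_cross_entropy(discriminator(fake_images), fake_labels)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # generator step: try to make the discriminator output "real"
    fake_images = generator(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy(discriminator(fake_images), real_labels)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()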
Stable Diffusion
Diffusion Models start from a noise image sampled from a Gaussian distribution and gradually remove the noise until the target high-quality image is obtained. The model learns to predict the noise between two neighboring timesteps; after completing n denoising steps, the final image is obtained. During training, the model's input is simply the original image with noise added, and the training target is the added noise itself. Therefore, no additional labels are required.
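A minimal sketch of the noise-prediction training objective (a simplified DDPM-style step; the model, noise schedule, and timestep count are assumptions, not taken from the Stable Diffusion codebase):

import torch
import torch.nn.functional as F

def diffusion_training_step(model, images, alphas_cumprod, num_timesteps=1000):
    batch = images.size(0)
    t = torch.randint(0, num_timesteps, (batch,))              # random timestep per image
    noise = torch.randn_like(images)                           # the added noise is the training target
    a = alphas_cumprod[t].view(batch, 1, 1, 1)                 # cumulative noise schedule at timestep t
    noisy_images = a.sqrt() * images + (1 - a).sqrt() * noise  # forward diffusion: image + noise
    predicted_noise = model(noisy_images, t)                   # model predicts the added noise
    return F.mse_loss(predicted_noise, noise)                  # no human labels needed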
Autoregressive Video Modeling
A similar approach to autoregressive model training in Natural Language Processing (NLP) can be applied in the vision domain: an image can be generated token by token, and a video can be generated frame by frame in an iterative manner. This way, model training does not require any additional labeling. Although autoregressive models for image generation exist, such as PixelRNN, VQ-VAE, or VAR, focusing on next-pixel, next-token, and next-scale prediction respectively, they have not attracted as much attention from the broader community. SOTA video generation architectures like SORA, WAN, or CNTopB usually follow the denoising diffusion process. A few autoregressive video generation architectures do exist, and I find them promising, at least from a self-supervised perspective.
NOVA can generate video frames in an autoregressive manner, conditioned on text, image, and optical flow. It achieves comparable performance to State of the Art (SOTA) Stable Diffusion models on image generation tasks, and although it requires labels to learn the conditioning, it can be trained solely on videos, without access to any labeled samples. MAGI-1: Autoregressive Video Generation at Scale generates video autoregressively in 24-frame chunks, where each chunk generation follows the denoising diffusion process. ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models generates frame by frame autoregressively and builds on top of the Stable Diffusion architecture for frame generation.
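A minimal sketch of the frame-by-frame autoregressive idea (a generic illustration only, not the architecture of the models described above):

import torch

def generate_video(model, first_frame, num_frames=16):
    frames = [first_frame]                    # first_frame: tensor of shape [C, H, W]
    for _ in range(num_frames - 1):
        context = torch.stack(frames, dim=0)  # all frames generated so far
        next_frame = model(context)           # predict the next frame from the context
        frames.append(next_frame)
    return torch.stack(frames, dim=0)         # video as a [T, C, H, W] tensor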
Unsupervised learning in your business
Unsupervised learning techniques enable the identification of learning patterns from unlabeled data, making them suitable for a wide range of applications where labeling is not possible or too expensive.
Among others, it can be applied to:
anomaly detection (for example, fraud detection, insurance claims)
Combining unsupervised learning with supervised techniques allows applying AI to an even wider range of scenarios, while labeling only a small subset of your data.
We have reviewed multiple approaches to learning patterns from unlabeled data. Unsupervised learning can produce a powerful feature extractor that can later be fine-tuned on a downstream task. It can significantly boost model performance and make use of data that is too difficult or too expensive to label. It can also significantly speed up ML model development: the trained model can be used to iteratively label the unlabeled dataset and improve its quality. With a better dataset, we can train a better model and iteratively improve the system's performance. If you wonder whether your massive, messy data can be used to improve your business and customer experience, please reach out to us. We will analyse your data and propose the best solution.