
Unsupervised learning for computer vision


Kamil Rzechowski

Senior Machine Learning Engineer
Jan 22, 2026 | 19 min read
import numpy as np
from sklearn.decomposition import PCA

x = np.load('dataset.npy')
x.shape # [10000, 64, 64]
x[0].shape # [64, 64]
x = x.reshape((x.shape[0], -1)) # flatten each 64x64 image into a 4096-dimensional vector
pca = PCA(n_components=50)
x_reduced = pca.fit_transform(x)
x_reduced.shape # [10000, 50]
x_recovered = pca.inverse_transform(x_reduced) # map the 50-dimensional codes back to image space
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(x_train)
cumulative_sum = np.cumsum(pca.explained_variance_ratio_)
dims_to_keep = np.argmax(cumulative_sum >= 0.9) + 1 # smallest number of components explaining 90% of the variance
dims_to_keep # => 161
pca = PCA(n_components=dims_to_keep)
x_reduced = pca.fit_transform(x_train)
x_reduced.shape # [10000, 161]
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10, random_state=0, n_init="auto").fit(x_reduced)
kmeans.labels_ # array([9, 1, 1, 6, 5, 5, ...], dtype=int32)
kmeans.predict(pca.transform(x_test)) # project test images with the same fitted PCA before assigning clusters
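
If the projection and clustering are meant to travel together, so that new images are always reduced with the same fitted PCA before being assigned a cluster, the two steps can be chained with a scikit-learn Pipeline. Below is a minimal sketch reusing the component and cluster counts from the snippets above; x and x_test are assumed to be the same flattened arrays as before.

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(PCA(n_components=50),
                         KMeans(n_clusters=10, random_state=0, n_init="auto"))
train_labels = pipeline.fit_predict(x) # fit PCA and KMeans on the training images
test_labels = pipeline.predict(x_test) # x_test is projected with the fitted PCA before clustering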

Figure 1. Autoencoder architecture. An autoencoder consists of an encoder, a bottleneck, and a decoder. Source: https://en.wikipedia.org/wiki/Autoencoder
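
To make the encoder-bottleneck-decoder pattern from Figure 1 concrete, here is a minimal sketch of a fully connected autoencoder. The article's snippets use NumPy and scikit-learn, so the choice of PyTorch, the layer sizes, and the 50-dimensional bottleneck (mirroring the PCA example above) are illustrative assumptions rather than the article's own code.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=64 * 64, bottleneck_dim=50):
        super().__init__()
        # encoder compresses the flattened image into a low-dimensional code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 512), nn.ReLU(),
            nn.Linear(512, bottleneck_dim),
        )
        # decoder reconstructs the image from the bottleneck code
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_dim, 512), nn.ReLU(),
            nn.Linear(512, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)    # latent code, analogous to x_reduced in the PCA example
        return self.decoder(z) # reconstruction, analogous to x_recovered

model = Autoencoder()
x_batch = torch.randn(32, 64 * 64) # stand-in for a batch of flattened images
loss = nn.functional.mse_loss(model(x_batch), x_batch) # reconstruction objective

Training the network to minimise this reconstruction loss forces the bottleneck to keep only the information needed to rebuild the input, which is what makes the learned code useful as a compact image representation.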


Figure 2. SimMIM model architecture. Source: https://arxiv.org/pdf/2111.09886
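
The SimMIM recipe in Figure 2 comes down to three ingredients: replace a random subset of patch embeddings with a learnable mask token, run the result through a transformer encoder, and regress the raw pixels of the masked patches with an L1 loss through a one-layer prediction head. The sketch below illustrates this in PyTorch; the patch size, masking ratio, and model widths are toy values chosen to match the 64x64 images used earlier, not the paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySimMIM(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, in_chans=1):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2
        patch_dim = in_chans * patch * patch
        self.embed = nn.Linear(patch_dim, dim) # linear patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, patch_dim) # one-layer head predicting raw pixels

    def forward(self, imgs, mask_ratio=0.6):
        # imgs: [B, C, H, W] -> patches: [B, N, C*patch*patch]
        patches = F.unfold(imgs, self.patch, stride=self.patch).transpose(1, 2)
        tokens = self.embed(patches)
        # randomly pick patches to mask and swap in the learnable mask token
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        pred = self.head(self.encoder(tokens + self.pos))
        # L1 loss computed only on the masked patches
        return (pred - patches).abs()[mask].mean()

loss = TinySimMIM()(torch.randn(4, 1, 64, 64)) # random batch just to exercise the forward pass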

“We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.”

From: https://arxiv.org/abs/2211.07636


Figure 3. MAE model architecture. Source: https://arxiv.org/pdf/2111.06377
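
The distinctive part of MAE, visible in Figure 3, is its asymmetric design: the encoder processes only the small visible subset of patches (around 25% of them), while a lightweight decoder receives the encoded tokens together with mask tokens and reconstructs the missing pixels. The sketch below illustrates that flow in PyTorch; depths, widths, and the 75% masking ratio are illustrative defaults, not a faithful reproduction of the paper's ViT configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMAE(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, dec_dim=64, in_chans=1):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2
        patch_dim = in_chans * patch * patch
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True), num_layers=4)
        # lightweight decoder: narrower and shallower than the encoder
        self.to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, 4, dec_dim * 4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)

    def forward(self, imgs, mask_ratio=0.75):
        # imgs: [B, C, H, W] -> patches: [B, N, C*patch*patch]
        patches = F.unfold(imgs, self.patch, stride=self.patch).transpose(1, 2)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        # random per-image shuffle; the first n_keep shuffled patches stay visible
        ids_shuffle = torch.rand(B, N, device=imgs.device).argsort(dim=1)
        ids_restore = ids_shuffle.argsort(dim=1)
        keep = ids_shuffle[:, :n_keep]
        tokens = self.embed(patches) + self.pos
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        encoded = self.encoder(visible) # encoder sees only the visible ~25% of patches
        # decoder input: encoded visible tokens plus mask tokens, restored to original order
        dec_tokens = torch.cat(
            [self.to_dec(encoded), self.mask_token.expand(B, N - n_keep, -1)], dim=1)
        dec_tokens = torch.gather(
            dec_tokens, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dec_tokens.size(-1)))
        pred = self.head(self.decoder(dec_tokens + self.dec_pos))
        # mean-squared error on the masked patches only
        mask = torch.ones(B, N, device=imgs.device)
        mask[:, :n_keep] = 0 # 0 = visible, 1 = masked (still in shuffled order)
        mask = torch.gather(mask, 1, ids_restore) # back to original patch order
        return (((pred - patches) ** 2).mean(dim=-1) * mask).sum() / mask.sum()

loss = TinyMAE()(torch.randn(4, 1, 64, 64)) # random batch just to exercise the forward pass

Because the encoder never sees the mask tokens, most of the compute is spent on roughly a quarter of the sequence, which is what makes MAE pre-training comparatively cheap for large vision transformers.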
