Unsupervised Learning (II): Dimensionality Reduction
Objectives
Understand the motivation for dimensionality reduction.
Explain key dimensionality reduction techniques (PCA, t-SNE, UMAP).
Describe the idea of PCA (a linear method that keeps most of the variance) and apply PCA to relate original features to principal components.
Interpret explained variance ratio and decide how many components to keep.
Use PCA loadings and correlation circles to see how features contribute.
Recognize t-SNE and UMAP as nonlinear methods for complex data visualization.
Connect dimensionality reduction with clustering.
Instructor note
40 min teaching/demonstration
40 min exercises
From Clustering to Dimensionality Reduction: Simplifying Complexity
In the last episode, we talked about clustering, which is a core unsupervised ML technique that groups similar data points into clusters based on their features, without requiring labeled data. The fundamental value of clustering lies in its ability to reveal segments and patterns that are not immediately obvious, with applications ranging from customer segmentation in marketing to anomaly detection in network security.
Despite its usefulness, clustering comes with notable limitations. A major challenge is determining the appropriate number of clusters in advance, as in K-Means, where results can vary depending on initialization. Clustering results are also highly sensitive to the choice of distance metric and scaling of features, which can significantly alter outcomes. Furthermore, clustering often struggles with high-dimensional data due to the “curse of dimensionality”, where distance measures lose their discriminative power, making it harder to identify meaningful groups. Given these challenges, especially around data sensitivity and interpretability, it is often crucial to preprocess data with another form of unsupervised learning before clustering: Dimensionality Reduction.
Where clustering seeks to group samples, dimensionality reduction focuses on simplifying the feature space itself. By transforming a high-dimensional dataset into a lower-dimensional subspace while preserving its most critical relationships, dimension reduction can mitigate noise, reduce computational cost, and reveal the most discriminative features that define the data’s structure. This process not only addresses clustering’s sensitivity to irrelevant features but also provides a powerful foundation for visualizing potential clusters in two or three dimensions, making the entire analytical process more robust and insightful.
Methods such as PCA (Principal Component Analysis), t-SNE (t-Distributed Stochastic Neighbor Embedding), and UMAP (Uniform Manifold Approximation and Projection) make data more manageable, reduce noise, and improve clustering performance.
Since the Penguins dataset includes multiple features (bill length, bill depth, flipper length, body mass, species labels, etc.), plotting it directly in two dimensions can make it difficult to capture all the relationships and structures hidden in the dataset. That is why we prepare pairplots between all pairs of features to achieve a clear visualization of their correlations.

In this episode, we will explore dimensionality reduction techniques and apply them to the Penguins dataset. Our goal is to take the dataset’s multiple features, and project them into a simpler, lower-dimensional space for a better visualization. This process not only helps us better understand the data but also prepares it for downstream tasks, such as clustering or classification, by reducing noise and highlighting the most informative aspects of the dataset.
Data Preparation
Following the procedures used in previous episodes, we apply the same preprocessing steps to the Penguins dataset, including handling missing values and detecting outliers. For the dimensionality reduction task, categorical features are not required, so encoding them is unnecessary.
Data Processing
The data processing is straightforward for the dimensionality reduction task: we simply extract the numerical variables and apply standardization.
import seaborn as sns
import pandas as pd
from sklearn.preprocessing import StandardScaler

# load the dataset and drop rows with missing values
penguins = sns.load_dataset('penguins')
penguins_dimR = penguins.dropna()
# check for duplicated rows
penguins_dimR.duplicated().value_counts()
# keep the species labels for later visualization
species = penguins_dimR["species"]
# select the numerical features and standardize them
X = penguins_dimR[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Training Model & Evaluating Model Performance
Principal Component Analysis
We start with Principal Component Analysis (PCA), a powerful statistical technique used to reduce the dimensionality of data while preserving as much variability as possible. This is achieved by transforming the original variables into a new set of uncorrelated variables called principal components. Each principal component is a linear combination of the original variables, and they are ordered such that the first few retain most of the variation present in the original variables.
1. A simplified view: PCA with two components
We begin with a simple subspace consisting of two principal components. That is, we will transform the Penguins dataset, which originally contains four numerical features (bill length, bill depth, flipper length, and body mass), into a reduced subdataset represented by just two composite variables. These two new variables, called principal components, form a simplified version of the Penguins dataset and can essentially capture the most significant variance in the Penguins dataset while reducing complexity. This allows us to visualize the structure of the data in a two-dimensional space and better understand the relationships among the penguins.
We build a model using the PCA class from sklearn.decomposition, specifying that the X_scaled data should be reduced to two principal components.
from sklearn.decomposition import PCA
# construct a PCA model with 2 PCs
pca_2 = PCA(n_components=2)
X_pca_2 = pca_2.fit_transform(X_scaled)
explained_var_2 = pca_2.explained_variance_ratio_
print(f'''The explained variance of PC1 is {explained_var_2[0]:.2%}
The explained variance of PC2 is {explained_var_2[1]:.2%}
The explained variance (PC1+PC2) is {explained_var_2.sum():.2%}''')
The term explained variance ratio tells us how much of the total variability in the original Penguins dataset is captured by each principal component. If the first principal component has an explained variance ratio of 0.72, it means that 72% of the variability in the dataset can be represented by that single component.
For each penguin in the dataset, the values of the two new components (computed from its four original features) are available in the X_pca_2 array.
X_pca_2_species = penguins_dimR.join(pd.DataFrame(X_pca_2,
index = penguins_dimR.index,
columns = ['PC_1', 'PC_2']))
What does the data in two principal components represent?
When we have two components,
PC1 is the direction in feature space along which the data varies the most,
PC2 is the direction orthogonal to PC1 that captures the next largest variance.
When we have three components, PC3 captures the next largest source of variance, orthogonal to both PC1 and PC2.
In such a situation, we have to use 3D visualization to visualize the three components.
When we have four components, PC4 captures the next largest source of variance, orthogonal to PC1, PC2, and PC3.
It is difficult for us to visualize the four-dimensional space.
A short summary
fewer components → lower-dimensional representation → some information is lost, but main patterns remain.
more components → higher-dimensional representation → more information retained.
After reducing the dimensionality of the Penguins dataset to two principal components, we can now visualize the transformed data. Each penguin, originally described by four numerical features (bill length, bill depth, flipper length, and body mass), is now represented by just two composite variables.
By plotting the two components, we can examine how penguins cluster in this reduced space.
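As a minimal plotting sketch (assuming matplotlib and seaborn are available, and using the X_pca_2_species dataframe created above), we can color each penguin by its species:

import matplotlib.pyplot as plt
import seaborn as sns

# scatter plot of the first two principal components, colored by species
sns.scatterplot(data=X_pca_2_species, x='PC_1', y='PC_2', hue='species')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Penguins projected onto the first two principal components')
plt.show()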

2. Preserving full variance: PCA with four components
The two-component model does a good job in capturing the main structure of the original features in the Penguins dataset, but it naturally raises an important question: how many principal components are truly necessary to represent the dataset effectively? Choosing the optimal number of components requires balancing simplicity against information retention — fewer components make visualization and interpretation easier, while more components preserve a greater share of the original variance. By examining metrics such as the explained variance ratio, we can make an informed choice about the number of components needed to capture the essential patterns in the Penguins dataset.
Here, we consider another case in which the new dataset has four principal components, preserving all of the variance in the original Penguins features. By setting the parameter n_components=4 and running the code example, we obtain a full representation of the Penguins data in the new component space. This can be verified by examining the explained_variance_ratio_ attribute, which shows how much variance each component contributes.
X_scaled_temp = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
# build a new PCA model with 4 PCs
pca_4 = PCA(n_components=4)
X_pca_4 = pca_4.fit_transform(X_scaled_temp)
explained_var_4 = pca_4.explained_variance_ratio_
print(f'''The explained variance of PC1 is {explained_var_4[0]:.2%}
The explained variance of PC2 is {explained_var_4[1]:.2%}
The explained variance of PC3 is {explained_var_4[2]:.2%}
The explained variance of PC4 is {explained_var_4[3]:.2%}
The explained variance of ALL is {explained_var_4.sum():.2%}''')
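As a quick check of how many components are really needed, the following sketch (assuming matplotlib is available) plots the cumulative explained variance against the number of components:

import numpy as np
import matplotlib.pyplot as plt

# cumulative share of the total variance captured as components are added
cum_var = np.cumsum(explained_var_4)
plt.plot(range(1, 5), cum_var, marker='o')
plt.axhline(0.9, color='grey', linestyle='--', label='90% threshold')
plt.xticks(range(1, 5))
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.legend()
plt.show()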
In PCA, each original variable (feature) is correlated with each principal component (PC). This correlation, referred to here as corr_var_comp, quantifies how strongly each original variable contributes to a component. The value of corr_var_comp ranges from -1 to 1:
+1 means a perfect positive correlation between the variable and the component.
0 indicates no linear correlation, and the component does not explain that variable at all.
-1 suggests a perfect negative correlation.
In practical applications, the square of the correlation between a variable and a component is often denoted cos² (cosine squared). The cosine of the angle between the variable vector and the component axis tells you how much of the variable’s variance is explained by that component. Squaring it gives the proportion of variance of that variable explained by the component, hence cos².
We first calculate corr_var_comp
and square it to get the proportion of each variable’s variance explained by the component, as shown in the figure below.
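One common way to compute corr_var_comp and cos² is sketched below (the notebook's implementation may differ; for standardized inputs, the correlation of a feature with a component is approximately the component loading scaled by the square root of that component's variance):

import numpy as np
import pandas as pd

# correlations between the original features and the components:
# eigenvectors (rows of components_) scaled by the square root of each component's variance
corr_var_comp = pd.DataFrame(
    pca_4.components_.T * np.sqrt(pca_4.explained_variance_),
    index=X.columns,
    columns=['PC_1', 'PC_2', 'PC_3', 'PC_4'])

# cos²: proportion of each feature's variance explained by each component
cos2 = corr_var_comp ** 2
print(cos2.round(2))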

This allows us to examine and visualize in detail how each original penguin feature contributes to the four principal components. As discussed in the previous subsection, each principal component is a linear combination of the original features and can be expressed mathematically as follows:
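In generic form, writing $w_{kj}$ for the weight (loading) of the $j$-th standardized feature on component $k$ (notation introduced here for illustration):

$$\mathrm{PC}_k = w_{k1}\,x_{\text{bill length}} + w_{k2}\,x_{\text{bill depth}} + w_{k3}\,x_{\text{flipper length}} + w_{k4}\,x_{\text{body mass}}, \qquad k = 1, \dots, 4.$$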
By analyzing the PCA loadings, we can see that the first two principal components explain approximately 90% of the total variance in the Penguins dataset. This indicates that most of the important information in the original four features is already captured by these two components. Consequently, we can conclude that reducing the dataset to two principal components is sufficient for visualization and analysis, striking a balance between simplicity and information retention while effectively summarizing the underlying structure of the data.
3. Correlation circle
To gain deeper insight into how the original variables relate to the principal components, we can use a correlation circle, also known as a variable factor map. This graphical tool provides a visual representation of the contribution and correlation of each original variable with the components.
In a 2D PCA plot (PC1 vs. PC2 in the left subplot, and PC3 vs. PC4 in the right subplot), each original feature is represented as a vector (arrow) pointing from the origin.
The direction of the arrow indicates whether the variable (feature) is positively or negatively correlated with the principal components.
The length of the arrow indicates the strength of correlation — longer arrows mean the variable (feature) contributes strongly to the components.
The circle itself (radius = 1) represents the maximum possible correlation between a variable (feature) and the components, since PCA projects standardized variables (features).
In addition, the correlations between variables (features) can also be read from the plot.
Variables (features) pointing in similar directions are positively correlated (e.g., body mass and flipper length), which is why we performed the regression task yesterday using these two features.
Variables (features) pointing in opposite directions are negatively correlated (for specific pairs of components).
Variables (features) at roughly 90° to each other are nearly uncorrelated.
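A minimal sketch of how such a correlation circle (shown in the figure below) can be drawn, assuming matplotlib and the corr_var_comp dataframe from the previous subsection (the notebook's version may differ):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    # draw one arrow per feature, from the origin to its correlations with the two components
    for feature, row in corr_var_comp.iterrows():
        ax.arrow(0, 0, row.iloc[i], row.iloc[j], head_width=0.03, color='tab:blue')
        ax.text(row.iloc[i] * 1.1, row.iloc[j] * 1.1, feature, ha='center')
    # unit circle: the maximum possible correlation
    ax.add_patch(plt.Circle((0, 0), 1, fill=False, linestyle='--', color='grey'))
    ax.axhline(0, color='grey', linewidth=0.5)
    ax.axvline(0, color='grey', linewidth=0.5)
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.set_xlabel(f'PC{i + 1}')
    ax.set_ylabel(f'PC{j + 1}')
    ax.set_aspect('equal')
plt.show()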

t-SNE
Since PCA is based on linear combinations of all features, it has some inherent limitations. In particular, it may fail to capture complex, non-linear relationships in the data, and it primarily focuses on maximizing global variance, which can overlook subtle local structures, intricate clusters, or hierarchical patterns – that are quite common in real-world datasets.
To address these challenges, we move beyond linear methods and explore advanced non-linear dimensionality reduction techniques. Algorithms such as t-SNE (t-Distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection) are particularly effective at capturing the non-linear structure of the data. These methods are designed to preserve local neighborhood relationships, revealing intricate patterns that PCA might miss and providing a more detailed view of the underlying structure of the penguin population.
In this subsection, we apply the t-SNE algorithm to visualize local structures and potential clusters, and we will explore UMAP for complementary insights in the next subsection.
t-SNE is a powerful dimensionality reduction technique primarily used for visualizing high-dimensional data in two or three dimensions. Unlike linear methods such as PCA, t-SNE is non-linear and focuses on preserving the local structure of the data, meaning that points that are close in the high-dimensional space remain close in the lower-dimensional embedding. To achieve this, t-SNE models the pairwise similarities between data points in both the high-dimensional and low-dimensional spaces.
We build a t-SNE model with two components using the TSNE class from the sklearn.manifold module, and then fit it to the scaled data.
The hyperparameter perplexity
controls the number of effective neighbors each point considers when learning the embedding. Higher values emphasize global structure, lower values emphasize local relationships.
from sklearn.manifold import TSNE
# build a t-SNE model with a 2-dimensional embedding
tsne = TSNE(n_components=2, perplexity=50)
X_tsne = tsne.fit_transform(X_scaled)
After fitting the t-SNE model, we can visualize the Penguin dataset in two dimensions, allowing us to explore the relationships between individual data points, identify potential clusters, and observe how different species are distributed across the embedding.
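A minimal plotting sketch (assuming matplotlib and seaborn, and reusing the species labels kept earlier) is:

import matplotlib.pyplot as plt
import seaborn as sns

# scatter plot of the 2-D t-SNE embedding, colored by species
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=species)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.title('t-SNE embedding of the Penguins dataset')
plt.show()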

Warning
Note that t-SNE is primarily a visualization tool rather than a feature extraction method, and the resulting embedding should not be used directly for downstream tasks such as classification. Each time you run t-SNE, the coordinates can change slightly due to randomness, even with the same data. As such, the features produced by t-SNE may not be stable or meaningful for predictive modeling.
Exercise
Here, we will perform a clustering task using K-Means on a dataset reduced to two components obtained from the t-SNE model (code examples are available in the Jupyter Notebook; a possible starting point is also sketched after the list below).
Group the data into 9 clusters with kmeans = KMeans(n_clusters=9).
Repeat the clustering several times to observe how the cluster assignments change.
Change the number of clusters (e.g., 5 or 11) to see how it affects the result.
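A possible starting point (a sketch, assuming the X_tsne embedding computed above) is:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# cluster the 2-D t-SNE embedding into 9 groups
kmeans = KMeans(n_clusters=9)
labels = kmeans.fit_predict(X_tsne)

# plot the embedding colored by the K-Means cluster labels
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, cmap='tab10')
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')
plt.title('K-Means clusters on the t-SNE embedding')
plt.show()

Because both t-SNE and K-Means involve randomness, repeated runs can produce noticeably different assignments, which illustrates the warning above.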

UMAP
UMAP (Uniform Manifold Approximation and Projection) is a non-linear dimensionality reduction technique similar in purpose to t-SNE, but with some key differences:
Preserves both local and some global structure of the data.
Generally faster and more scalable than t-SNE, especially for large datasets.
Can be used for visualization (2D/3D) or as a preprocessing step for downstream tasks like clustering or classification.
UMAP is based on manifold learning and graph theory.
Manifold learning means it assumes that even though our data might have many dimensions, the real structure of the data lies on a lower-dimensional surface (like a curve or sheet) inside that high-dimensional space. UMAP tries to find and keep that structure.
Graph theory means UMAP represents the data as a graph: each point is a node, and connections (edges) show how close points are to each other. Then it uses math to squeeze this graph down into 2D or 3D while keeping the structure as much as possible.
UMAP constructs a high-dimensional graph representation of the data and then projects it onto a low-dimensional space by applying a series of optimization steps. This results in a visual representation of the data that is both accurate and interpretable. In comparison to PCA and t-SNE, UMAP offers a good balance of accuracy, efficiency, and scalability, making it a popular choice for dimensionality reduction in machine learning and data analysis.
Using the same procedure as in the t-SNE subsection, we build and fit the UMAP model, and visualize the Penguin dataset in two dimensions by species, island, and sex.
import umap
# build a UMAP model with 2 components; n_neighbors controls the local/global balance
umap_model = umap.UMAP(n_components=2, n_neighbors=10)
X_umap = umap_model.fit_transform(X_scaled)
# attach the two UMAP dimensions to the original dataframe
# (column names kept as 'PC_1'/'PC_2' for consistency with the earlier plots)
X_umap_species = penguins_dimR.join(pd.DataFrame(X_umap,
                                                 index=penguins_dimR.index,
                                                 columns=['PC_1', 'PC_2']))
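A minimal sketch of this visualization (assuming matplotlib and seaborn, and the X_umap_species dataframe just created) is:

import matplotlib.pyplot as plt
import seaborn as sns

# one panel per categorical variable: species, island, and sex
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, column in zip(axes, ['species', 'island', 'sex']):
    sns.scatterplot(data=X_umap_species, x='PC_1', y='PC_2', hue=column, ax=ax)
    ax.set_xlabel('UMAP dimension 1')
    ax.set_ylabel('UMAP dimension 2')
    ax.set_title(f'UMAP embedding colored by {column}')
plt.tight_layout()
plt.show()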

Exercise
Here we perform the same clustering task using K-Means on a dataset reduced to two components obtained from the UMAP model (code examples are available in the Jupyter Notebook).

Keypoints
Representative methods include PCA, t-SNE, and UMAP.
PCA is a method to reduce dimensions via creating new variables called principal components (PCs), which are linear combinations of the original features.
Perform the PCA task and then decide the optimal number of components using the explained variance ratio.
t-SNE and UMAP are both nonlinear dimensionality reduction methods mainly used to visualize high-dimensional data in 2D or 3D.
t-SNE focuses on keeping similar points close and dissimilar points apart in the lower-dimensional space, but it is not suitable for feature extraction or predictive modeling.
UMAP preserves both local relationships (nearby points stay close) and some global structure, scales well to large datasets, and is mainly used for exploration and visualization.