Data Overview

The BRCA dataset contains gene expression measurements from 445 breast cancer tumor samples and 353 genes obtained from The Cancer Genome Atlas (TCGA). Each row represents a single tumor sample, while each column corresponds to a specific gene's expression level. Larger expression values indicate higher activation of that gene within the tumor.

In addition to the gene expression data, six clinical variables are provided for each patient: Subtype (one of Basal-like, Luminal A, Luminal B, HER2-enriched, or Normal-like), ER-Status (estrogen receptor), PR-Status (progesterone receptor), HER2-Status (human epidermal growth factor receptor 2), Node (number of lymph nodes involved), and Metastasis (indicator of whether the cancer has metastasized).

The heatmap below visualizes the expression levels of all 353 genes across the 445 tumor samples. It is transposed relative to the data matrix: each row represents a gene, and each column represents a tumor. Consistent horizontal bands indicate genes expressed at similar levels across nearly all samples, while variations in color intensity highlight differences in gene activation patterns between tumor subtypes.

Gene expression heatmap for BRCA dataset

Model Selection

To explore gene-expression patterns within the BRCA dataset, several dimensionality reduction techniques were applied. Each method offers distinct advantages in uncovering latent tumor structure and subtype separability.

PCA (Principal Component Analysis)

Provides a linear projection that captures maximum variance in gene expression, serving as a strong baseline for identifying dominant global patterns.
Facilitates interpretation by ranking genes according to their contribution to major variance directions, highlighting biologically influential genes.
Useful for visualizing broad subtype separations and noise structure, and serves as a benchmark before applying nonlinear methods.
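The gene-ranking idea above can be sketched with scikit-learn. This is a minimal illustration, not the analysis code: the matrix is synthetic random data with the same shape as the BRCA matrix, the gene names are hypothetical placeholders, and standardizing each gene before PCA is an assumption that the report does not confirm.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix: 445 samples x 353 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(445, 353))
gene_names = np.array([f"gene_{i}" for i in range(X.shape[1])])  # hypothetical names

# Assumption: standardize each gene so scale alone does not drive the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=0)
embedding = pca.fit_transform(X_scaled)  # shape (445, 2)

# pca.components_ has shape (n_components, n_genes); a large absolute
# loading means the gene contributes strongly to that component.
pc1_loadings = pca.components_[0]
top_idx = np.argsort(np.abs(pc1_loadings))[::-1][:10]
top_genes = gene_names[top_idx]

print(embedding.shape)
print(pca.explained_variance_ratio_)
print(top_genes)
```

On real data, the genes in top_genes would be the candidates to inspect for biological relevance; on this random matrix they carry no meaning.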

UMAP (Uniform Manifold Approximation and Projection)

Aims to preserve local neighborhoods while retaining more global structure than t-SNE, making it well suited to capturing smooth transitions between tumor subtypes.
Efficient and scalable for high-dimensional gene-expression data, while maintaining cluster integrity.
Reveals manifold structure underlying nonlinear subtype differences.

t-SNE (t-Distributed Stochastic Neighbor Embedding)

Optimized for local neighborhood preservation, effectively uncovering compact and well-separated subtype clusters.
Highlights nonlinear patterns in gene co-expression that linear models like PCA may fail to capture.
Particularly effective at visualizing distinct Basal-like clusters due to strong local manifold sensitivity.

Isomap

Extends MDS by preserving geodesic distances, capturing nonlinear global structure in gene-expression manifolds.
Balances global and local geometry better than purely local methods like t-SNE.

Spectral Embedding

Leverages graph Laplacian structure to extract low-dimensional embeddings that respect sample connectivity patterns.
Captures community-like grouping in expression similarity graphs, uncovering intrinsic subtype clusters.
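The nonlinear methods above that ship with scikit-learn share a fit_transform interface, so they can be run side by side. The sketch below uses a small synthetic matrix, and every hyperparameter value shown is a placeholder rather than a tuned setting from this analysis. UMAP is left out only to keep the example free of extra dependencies; the umap-learn package provides umap.UMAP with the same fit_transform pattern.

```python
import numpy as np
from sklearn.manifold import TSNE, Isomap, SpectralEmbedding

# Small synthetic stand-in (200 samples x 50 features) to keep the run fast.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Placeholder hyperparameters, not the tuned values from the report.
methods = {
    "t-SNE": TSNE(n_components=2, perplexity=30, init="pca", random_state=0),
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    "Spectral Embedding": SpectralEmbedding(
        n_components=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
    ),
}

# Each method maps the samples to a (n_samples, 2) array.
embeddings = {name: model.fit_transform(X) for name, model in methods.items()}

for name, emb in embeddings.items():
    print(name, emb.shape)
```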

Hyperparameter Selection Strategy (Using Jaccard)

Because this is an unsupervised exploratory task, the goal is to identify meaningful structure rather than maximize predictive accuracy. All dimensionality reduction methods were therefore constrained to a two-dimensional embedding (n_components = 2), which provides a clear and consistent visual summary of tumor-level variation in gene expression. Additional dimensions would add little interpretive value, since the axes produced by most methods (e.g., t-SNE, UMAP, Isomap) are not ordered or nested by variance explained the way PCA components are.

Hyperparameter tuning was guided by the principle of neighborhood preservation. Specifically, parameter sets were evaluated using the mean Jaccard similarity between each sample's nearest neighbors in the original high-dimensional space and those in the low-dimensional embedding. This metric quantifies how faithfully local relationships are maintained after projection. The parameters yielding the highest Jaccard index were selected for each method.
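One way to compute this metric is sketched below. The function names and the neighborhood size k = 10 are assumptions for illustration; the report does not state which k was used. For each sample, the k nearest neighbors are found in both spaces, and the Jaccard index is the size of the intersection of the two neighbor sets divided by the size of their union.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row, excluding the row itself."""
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    # Calling kneighbors() without a query returns neighbors of the
    # training points and does not count a point as its own neighbor.
    return nn.kneighbors(return_distance=False)

def mean_jaccard(X_high, X_low, k=10):
    """Mean Jaccard similarity between neighbor sets in the two spaces."""
    high = knn_indices(X_high, k)
    low = knn_indices(X_low, k)
    scores = []
    for h, l in zip(high, low):
        h_set, l_set = set(h), set(l)
        scores.append(len(h_set & l_set) / len(h_set | l_set))
    return float(np.mean(scores))

# Synthetic check: an embedding identical to the input preserves every
# neighborhood, while keeping only two coordinates loses most of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
score_identity = mean_jaccard(X, X, k=10)
score_truncated = mean_jaccard(X, X[:, :2], k=10)
print(score_identity, score_truncated)
```

The score ranges from 0 (no shared neighbors) to 1 (identical neighbor sets), so higher values indicate better local structure preservation.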

This approach ensures that all methods are compared on equal footing: each produces a two-dimensional embedding optimized for local structure preservation, allowing direct assessment of which algorithm most effectively reveals the intrinsic organization of breast cancer gene expression data.
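The selection rule can be sketched as a simple grid search, shown here for Isomap because it is fast and deterministic. The candidate grid, the neighborhood size k = 10, and the synthetic data are all assumptions for illustration, not the settings used in this analysis.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.neighbors import NearestNeighbors

def mean_jaccard(X_high, X_low, k=10):
    """Mean Jaccard overlap of k-nearest-neighbor sets in the two spaces."""
    def knn(X):
        return NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    return float(np.mean([
        len(set(h) & set(l)) / len(set(h) | set(l))
        for h, l in zip(knn(X_high), knn(X_low))
    ]))

# Synthetic stand-in for the expression matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Hypothetical candidate values for Isomap's n_neighbors.
grid = [5, 10, 20, 40]
scores = {}
for n_neighbors in grid:
    emb = Isomap(n_neighbors=n_neighbors, n_components=2).fit_transform(X)
    scores[n_neighbors] = mean_jaccard(X, emb, k=10)

# Keep the setting with the highest neighborhood preservation.
best = max(scores, key=scores.get)
print(scores)
print("selected n_neighbors:", best)
```

The same loop applies to the other methods by swapping in the relevant estimator and parameter grid, for example perplexity for t-SNE.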

The visualization below shows the latent 2D representations of the BRCA gene expression data across all dimensionality reduction methods. Each embedding shows the corresponding model after hyperparameter tuning with the unsupervised criterion (Jaccard neighborhood preservation).

Interpretation Using Labels

In this section, I interpret the 2D patterns from the tuned models using the clinical labels. The tuned latent 2D representations are plotted against each clinical label to interpret the discovered patterns.

Subtype

Across all methods, Basal-like tumors form a well-defined and separable cluster, reflecting their strong distinction from the other subtypes. Linear methods such as PCA struggle to capture nonlinear subtype boundaries. In particular, HER2-enriched cases overlap with the Luminal clusters, suggesting that the leading linear components do not fully explain the variance associated with HER2 amplification. In contrast, nonlinear embeddings such as UMAP and t-SNE produce a more coherent HER2-enriched grouping, indicating that preserving local manifold structure is important for capturing this subtype. Neither linear nor nonlinear approaches clearly separate Luminal A from Luminal B, which reflects their biological similarity and limited expression contrast. Further work may be needed to disentangle these two hormonally driven phenotypes, potentially through feature weighting or latent representations in a higher-dimensional space.

HER2-Status

Nonlinear methods such as UMAP and t-SNE show clear separation between HER2-positive and HER2-negative samples, while linear projections like PCA show no distinct partitioning. This pattern is consistent with the observations for the HER2-enriched subtype, indicating that nonlinear manifold learning better reflects the biological variance associated with HER2 amplification. The clear clustering in nonlinear embeddings supports the idea that HER2 expression introduces complex, nonlinear relationships in the transcriptomic space that are not adequately represented by linear subspace models.

Node and Metastasis

For both Node and Metastasis, the patterns across all embedding methods appear largely uninformative, with no consistent clustering or separability between classes. Both linear and nonlinear methods fail to uncover structure related to nodal involvement or metastatic spread. The Node variable (0-3) remains diffusely distributed in each embedding space, suggesting it does not align with the main sources of transcriptional variance. The Metastasis feature is dominated by class imbalance, with only a small number of positive samples, making any grouping statistically unreliable. Together, these results indicate that neither nodal status nor metastasis is strongly reflected in the unsupervised latent spaces, implying that they are driven more by downstream clinical and spatial progression than by the molecular expression patterns captured by these methods.

Nonlinear Quantitative Evaluation

Because the observed subtype patterns were clearly nonlinear, linear PCA was excluded from the quantitative comparison. I focused instead on four nonlinear embedding methods: t-SNE, UMAP, Isomap, and Spectral Embedding. Each 2D embedding was evaluated using a 5-nearest-neighbors (KNN) classifier trained on 80% of the samples and tested on the remaining 20%. The resulting accuracy reflects how well local subtype neighborhoods are preserved after dimensionality reduction.
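The evaluation protocol can be sketched as follows. The 2D embedding and subtype labels here are synthetic (five Gaussian blobs standing in for the five subtypes), and the stratified split and fixed random seed are assumptions; the report specifies only the 80/20 split and the 5-nearest-neighbors classifier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2D "embedding": 5 classes x 89 samples = 445 points.
rng = np.random.default_rng(0)
n_per_class = 89
centers = rng.normal(scale=5.0, size=(5, 2))
embedding = np.vstack([c + rng.normal(size=(n_per_class, 2)) for c in centers])
labels = np.repeat(np.arange(5), n_per_class)

# 80/20 split; stratifying by label is an assumption, not stated in the report.
X_train, X_test, y_train, y_test = train_test_split(
    embedding, labels, test_size=0.2, stratify=labels, random_state=0
)

# 5-nearest-neighbors classifier fit on the embedding coordinates.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(f"KNN accuracy on held-out samples: {accuracy:.2f}")
```

Note that scikit-learn's TSNE has no transform method for new samples, so in a setup like this the embedding is computed on all samples first and only the classifier sees the train/test split.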

As expected, t-SNE achieved the highest KNN accuracy (approximately 0.71), consistent with its design objective of preserving local neighborhoods in the low-dimensional space. The other nonlinear methods (UMAP, Isomap, and Spectral Embedding) performed similarly to one another, with accuracies of roughly 0.62-0.64. This suggests that while t-SNE is quantitatively strongest for local consistency, the remaining methods also retain meaningful structure within their embeddings.

From a visual standpoint, however, UMAP and Isomap produce clearer and more interpretable spatial organization of samples, balancing cluster separation with smooth transitions across subtypes. These methods therefore offer the most informative and visually interpretable low-dimensional summaries of the BRCA expression data.