The BRCA dataset contains expression measurements for 353 genes across 445 breast cancer tumor samples obtained from The Cancer Genome Atlas (TCGA). Each row represents a single tumor sample, while each column corresponds to a specific gene's expression level. Larger expression values indicate higher activation of that gene within the tumor.
In addition to the gene expression data, six clinical variables are provided for each patient: Subtype (one of Basal-like, Luminal A, Luminal B, HER2-enriched, or Normal-like), ER-Status (estrogen receptor), PR-Status (progesterone receptor), HER2-Status (human epidermal growth factor receptor 2), Node (number of lymph nodes involved), and Metastasis (indicator of whether the cancer has metastasized).
The heatmap below visualizes the expression levels of all 353 genes across the 445 tumor samples. Each row represents a gene, and each column represents a tumor. The consistent horizontal bands indicate genes that are expressed at similar levels across nearly all samples, while variations in color intensity highlight differences in gene activation patterns between tumor subtypes.
To explore gene-expression patterns within the BRCA dataset, several dimensionality reduction techniques were applied. Each method offers distinct advantages in uncovering latent tumor structure and subtype separability.
Because this is an unsupervised exploratory task, the goal is to identify meaningful structure rather than maximize predictive accuracy. All dimensionality reduction methods were therefore constrained to a two-dimensional embedding (n_components = 2), which provides a clear and consistent visual summary of tumor-level variation in gene expression. Higher dimensions would add little interpretive value, since the axes of methods such as t-SNE and UMAP are not ordered or nested by variance explained the way principal components are.
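As a rough illustration of the two-dimensional constraint, the sketch below fits several scikit-learn reducers with n_components=2 on a synthetic stand-in matrix. The helper name embed_2d, the standardization step, and the perplexity and n_neighbors values are assumptions made for the example, not the settings used in this analysis. UMAP is left out only because it lives in the separate umap-learn package.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, Isomap, SpectralEmbedding
from sklearn.preprocessing import StandardScaler


def embed_2d(X, random_state=0):
    """Return a dict mapping method name -> (n_samples, 2) embedding.

    Hypothetical helper; hyperparameter values are placeholders.
    """
    # Assumption: genes are standardized before embedding.
    X_std = StandardScaler().fit_transform(X)
    methods = {
        "PCA": PCA(n_components=2, random_state=random_state),
        "t-SNE": TSNE(n_components=2, perplexity=15, random_state=random_state),
        "Isomap": Isomap(n_components=2, n_neighbors=10),
        "Spectral": SpectralEmbedding(
            n_components=2, n_neighbors=10, random_state=random_state
        ),
    }
    return {name: model.fit_transform(X_std) for name, model in methods.items()}


# Synthetic stand-in for the expression matrix (rows = tumors, columns = genes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 30))
embeddings = embed_2d(X_demo)
```

Every entry of embeddings has one row per sample and exactly two columns, so all methods can be plotted on the same kind of scatter plot.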
Hyperparameter tuning was guided by the principle of neighborhood preservation. Specifically, parameter sets were evaluated using the mean Jaccard similarity between each sample's nearest neighbors in the original high-dimensional space and those in the low-dimensional embedding. This metric quantifies how faithfully local relationships are maintained after projection. The parameters yielding the highest Jaccard index were selected for each method.
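The neighborhood-preservation score described above can be sketched in a few lines of NumPy. The function names and the neighborhood size k=10 are assumptions for illustration; the text does not state which k or which distance metric was used, so Euclidean distance is assumed here.

```python
import numpy as np


def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row (Euclidean, self excluded)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def mean_jaccard(X_high, X_low, k=10):
    """Mean Jaccard similarity between k-NN sets before and after embedding."""
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    scores = []
    for a, b in zip(nn_high, nn_low):
        inter = len(set(a) & set(b))
        scores.append(inter / (2 * k - inter))  # |A intersect B| / |A union B|
    return float(np.mean(scores))


# Synthetic check: identical data versus an unrelated 2D layout.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
score_same = mean_jaccard(X, X)
score_noise = mean_jaccard(X, rng.normal(size=(100, 2)))
```

Under this sketch, tuning amounts to computing mean_jaccard for the embedding produced by each candidate parameter set and keeping the set with the highest score.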
This approach ensures that all methods are compared on equal footing: each produces a two-dimensional embedding optimized for local structure preservation, allowing direct assessment of which algorithm most effectively reveals the intrinsic organization of breast cancer gene expression data.
The visualization below shows the latent 2D representations of the BRCA gene expression data across all dimensionality reduction methods. Each embedding is shown with the hyperparameters selected by the unsupervised tuning criterion (Jaccard neighborhood preservation).
In this section I interpret the 2D patterns from the tuned models using the clinical data labels. To do so, I plotted the tuned latent 2D representations colored by each clinical label.
Across all methods, Basal-like tumors form a well-defined and separable cluster, reflecting their strong distinction from the other subtypes. Linear methods such as PCA struggle to capture nonlinear subtype boundaries. In particular, HER2-enriched cases overlap with the Luminal clusters, suggesting that the leading linear components do not fully explain the variance associated with HER2 amplification. In contrast, nonlinear embeddings such as UMAP and t-SNE produce a more coherent HER2-enriched grouping, indicating that preserving local manifold structure is important for capturing this subtype. Neither linear nor nonlinear approaches clearly separate Luminal A from Luminal B, which reflects their biological similarity and limited expression contrast. Further work may be needed to disentangle these two hormonally driven phenotypes, potentially through feature weighting or latent representations in a higher-dimensional space.
When samples are colored by HER2-Status, nonlinear methods such as UMAP and t-SNE show clear separation between HER2-positive and HER2-negative samples, while linear projections like PCA show no distinct partitioning. This pattern is consistent with the observations for the HER2-enriched subtype, suggesting that nonlinear manifold learning better reflects the biological variance associated with HER2 amplification. The clear clustering in nonlinear embeddings supports the idea that HER2 expression introduces complex, nonlinear relationships in the transcriptomic space that are not adequately represented by linear subspace models.
For both the Node and Metastasis features, the patterns across all embedding methods appear largely uninformative, with no consistent clustering or separability observable between classes. Both linear and nonlinear methods fail to uncover structure related to nodal involvement or metastatic spread. The Node variable (0-3) remains diffusely distributed in each embedding space, suggesting it does not align with the main sources of transcriptional variance. The Metastasis feature is dominated by class imbalance, with only a small number of positive samples, making any grouping statistically unreliable. Together, these results indicate that neither nodal status nor metastasis is strongly reflected in the unsupervised latent spaces, implying they are driven more by downstream clinical and spatial progression than by the molecular expression patterns captured by these methods.
Because the observed subtype patterns were clearly nonlinear, linear PCA was excluded from the quantitative comparison. I focused instead on four nonlinear embedding methods: t-SNE, UMAP, Isomap, and Spectral Embedding. Each was evaluated using a 5-nearest-neighbors (KNN) classifier trained on 80% of the samples and tested on the remaining 20%. The resulting accuracy reflects how well local subtype neighborhoods are preserved after dimensionality reduction.
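The evaluation described above can be sketched with scikit-learn as follows. The helper name knn_accuracy, the stratified split, and the fixed random seed are assumptions for the example, and the data are synthetic blobs rather than the actual BRCA embeddings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_accuracy(embedding, labels, random_state=0):
    """Hold-out accuracy of a 5-NN classifier fitted on 2D embedding coordinates."""
    # Assumption: the 80/20 split is stratified by label.
    X_train, X_test, y_train, y_test = train_test_split(
        embedding, labels, test_size=0.2, stratify=labels, random_state=random_state
    )
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    return clf.score(X_test, y_test)


# Synthetic stand-in: two well-separated groups in a 2D embedding.
rng = np.random.default_rng(0)
emb_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
labels_demo = np.array(["Basal-like"] * 50 + ["Luminal A"] * 50)
acc_demo = knn_accuracy(emb_demo, labels_demo)
```

Because a single 80/20 split of a few hundred samples leaves a small test set, repeating the split with different seeds would give a sense of how much such an accuracy varies.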
As expected, t-SNE achieved the highest KNN accuracy (approximately 0.71), consistent with its design objective of preserving local neighborhoods in the low-dimensional space. The other nonlinear methods (UMAP, Isomap, and Spectral Embedding) performed similarly to one another, all at around 0.62-0.64 accuracy. This suggests that while t-SNE is quantitatively strongest for local consistency, the remaining methods also retain meaningful structure within their embeddings.
From a visual standpoint, however, UMAP and Isomap produce clearer and more interpretable spatial organization of samples, balancing cluster separation with smooth transitions across subtypes. These methods therefore offer the most informative and visually interpretable low-dimensional summaries of the BRCA expression data.