
Preserving Clusters and Correlations: Dimensionality reduction with exceptionally high global structure preservation

Dimensionality reduction with very high global structure preservation

License: MIT


PrePrint: https://arxiv.org/abs/2503.07609
Authors: Jacob Gildenblat, Jens Pahnke

pip install pccdr

from pcc import PCUMAP
pcumap_embedding = PCUMAP(device='cuda').fit_transform(X)

⭐ This is a Python package for dimensionality reduction (DR) with high global structure preservation.

⭐ That means that, unlike in popular DR methods such as UMAP, the distances between transformed points actually mean something.

⭐ Implements the Landmark Mantel Correlation (LMC) objective, which maximizes the correlation between the distances of all points from landmarks in the high-dimensional data and in the low-dimensional data.

⭐ Use UMAP + the LMC objective to enhance the widely used UMAP method with global structure preservation.

⭐ Implements the Multi-resolution Clustering Supervision (MiCS) objective, which creates a dimensionality reduction from which cluster assignments at multiple resolutions are predictable.

⭐ Use the MiCS + LMC objective for extremely high global structure preservation and competitive local structure.

A few visual examples

Mammoth

Fashion MNIST: an example on the Fashion-MNIST dataset.
MSI: an application to Mass Spectrometry Imaging.
Macosko single cell dataset: an application illustrating global structure preservation on the Macosko single-cell dataset, compared to UMAP.

PCC is built on the idea of sampling reference points, measuring the distances of all data points from those reference points, and maximizing the correlation between these distances in the high-dimensional data and in the transformed low-dimensional data.
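To make the idea concrete, here is a minimal NumPy sketch of the quantity being maximized: the Pearson correlation between point-to-landmark distances before and after the transformation. The function name `landmark_correlation` and its parameters are hypothetical and are not part of the pcc API; the package itself may sample landmarks and compute the correlation differently.

```python
import numpy as np

def landmark_correlation(X_high, X_low, n_landmarks=100, seed=0):
    """Pearson correlation between point-to-landmark distances in the
    original space and in the embedding (illustrative, not the pcc API)."""
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    # Sample reference points (landmarks) from the data itself.
    idx = rng.choice(n, size=min(n_landmarks, n), replace=False)
    # Distances from every point to every landmark, shape (n, n_landmarks).
    d_high = np.linalg.norm(X_high[:, None, :] - X_high[idx][None, :, :], axis=-1)
    d_low = np.linalg.norm(X_low[:, None, :] - X_low[idx][None, :, :], axis=-1)
    # Correlate the two flattened distance matrices.
    return float(np.corrcoef(d_high.ravel(), d_low.ravel())[0, 1])
```

A value close to 1 means that the distances in the embedding are close to a linear function of the original distances. The same function can also serve as a rough check of how well any embedding preserves global structure.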

Usage examples

See examples/macosko.ipynb for a more detailed explanation and further usage examples.

There are two modes:

Plugging Landmark Mantel Correlation into UMAP, to get a meaningful transformation where distances between points mean something

Here we use the excellent recent TorchDR library and plug our objective into UMAP.

from pcc import PCUMAP
pcumap_embedding = PCUMAP(device='cuda', n_components=2).fit_transform(X)

MiCS + LMC as a standalone DR method with a multi task objective

This optimizes a multi-task objective: a local structure preservation loss that tries to predict which cluster each point belongs to, at several clustering resolutions, and a global structure preservation loss that maximizes the correlations between the distances of all points to sampled reference points.
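The shape of such a combined objective can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the package's implementation: the linear prediction heads, the names `multitask_loss` and `global_weight`, and the simple weighted sum are all hypothetical choices made for this example.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def multitask_loss(embedding, heads, clusterings, d_high, landmark_idx,
                   global_weight=1.0):
    """Cluster-prediction loss summed over resolutions, minus a weighted
    distance correlation (assumed form, for illustration only)."""
    # Local term: one linear head (W, b) per clustering resolution.
    local = sum(
        softmax_cross_entropy(embedding @ W + b, labels)
        for (W, b), labels in zip(heads, clusterings)
    )
    # Global term: correlation of point-to-landmark distances.
    d_low = np.linalg.norm(
        embedding[:, None, :] - embedding[landmark_idx][None, :, :], axis=-1
    )
    corr = np.corrcoef(d_high.ravel(), d_low.ravel())[0, 1]
    # Lower is better: clusters are predictable and distances are correlated.
    return float(local - global_weight * corr)
```

Here `d_high` holds the precomputed distances from every point to the landmarks in the original space, with one column per entry of `landmark_idx`. In practice the embedding and the heads would be optimized jointly with a gradient-based framework; NumPy is used here only to show the structure of the loss.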

First, let's cluster the points with KMeans at several different numbers of clusters:

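import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)

# Cluster X at several resolutions; each run yields one label array.
clusters = []
n_clusters_list = [4, 8, 16, 32, 64]
for n_clusters in n_clusters_list:
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto")
    cluster_labels = kmeans.fit_predict(X)
    clusters.append(cluster_labels)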

Then we can call PCC:

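# Assumption: PCC is exported from the pcc package, like PCUMAP above.
from pcc import PCC

pcc_reducer = PCC(n_components=2, num_epochs=2000, num_points=1000, pearson=True,
                  spearman=False, beta=5, k_epoch=2)
# clusters is the list of label arrays, one per clustering resolution.
pcc_embedding = pcc_reducer.fit_transform(X, clusters)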