Transform not giving expected results for similar data #741

jh83775 · 2021-07-28T13:53:50Z

There seems to be an issue where the transform on new data does not give the expected results.
If we fit_transform on some data X, then transform Y = X + e (for some small perturbation e), the transforms are quite different (see picture and reproducing code below).

# Import modules
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import make_blobs
from umap import UMAP

# Create the data set
np.random.seed(314159)
X,y = make_blobs(n_samples=1000,centers=6,random_state=10)

# Y is X plus random small displacement
Y = (X+(2*np.random.random(size=X.shape)-1)/1000)

# Transform with UMAP
projection = UMAP(min_dist=0,random_state=16180339)
U = projection.fit_transform(X)
V = projection.transform(Y)

# Plot results
plt.style.use('fivethirtyeight')
def plot_scatter(ax,A,B,lA,lB,title,xlim=None,ylim=None):
    scatter1 = ax.scatter(*A.T,c=2*y+1,
               cmap = plt.cm.Paired,
               vmin=0,vmax=12,
               marker='o')

    scatter2 = ax.scatter(*B.T,c=2*y,
               cmap = plt.cm.Paired,
               vmin=0,vmax=12,
               alpha=1,
               marker='.')

    elements = np.array([(a,b) for a,b in zip(scatter1.legend_elements()[0],
                                              scatter2.legend_elements()[0])
                        ]).flatten()
    
    legend = ax.legend(elements,[lA+'0',lB+'0',lA+'1',lB+'1',lA+'2',lB+'2',
                                 lA+'3',lB+'3',lA+'4',lB+'4',lA+'5',lB+'5',],
                       title="Labels")
    ax.add_artist(legend)
    ax.set_title(title)
    if xlim:
        ax.set_xlim(xlim)
    if ylim:
        ax.set_ylim(ylim)
    return

fig,axs = plt.subplots(1,2,figsize=(16,8))
axs=axs.flatten()
ax=axs[0]
plot_scatter(ax,X,Y,'X','Y','Original Data')
ax=axs[1]
plot_scatter(ax,U,V,'U','V','Transformed Data',xlim=(-10,28))

plt.show()

We see that the two datasets X and Y, which are approximately the same in the original space, are mapped to different regions by UMAP. This could be a problem for people using UMAP in an ML pipeline (e.g. light purple points are closer to dark red than dark purple, and may be misclassified)

Upon some investigation, I think that this is caused by two problems: One where the initial embedding is off (init_graph_transform); and another where optimize_layout... is called.

In init_graph_transform we have the line

result[row_index, d] += (
                    graph[row_index, col_index]
                    / num_neighbours
                    * embedding[col_index, d]
                )

Which gives an initial embedding for Y which can be different to that of X, even though Y≈X. This is fixed by normalising via the row sum instead of the num_neighbours.

row_sum = np.sum(graph[row_index])
⋮
result[row_index, d] += (
                    graph[row_index, col_index]
                    / row_sum
                    * embedding[col_index, d]
                )

After this change, the initial embeddings for X and Y are practically identical, and this greatly improves the resulting transform:

However, it's still not quite right (you'll notice that the blue points in particular are slightly off). This is due to the optimize_layout... functions, and when their move_other argument is activated. I think that move_other should only be True in the fit_transform(), not in the transform(). The way this is currently coded is a check on the shape of the tail/head embeddings:

move_other = head_embedding.shape[0] == tail_embedding.shape[0]

However, this is inaccurate when new 'to be transformed' data is the same shape as the old fit_transform data. I suggest that move_other should be passed in as an argument depending on where the optimize_layout functions are called from. If we force move_other=False in the transform step, we arrive at the final embedding for the new data, which seems to be accurate (that is to say, it is what one might intuitively expect):

I have a pull request ready with these changes implemented that I'll put in shortly, but do let me know if I've got the wrong end of the stick somewhere.

Many thanks

lmcinnes · 2021-07-28T17:06:22Z

This is an excellent analysis -- and yes, it all looks right. Thanks for taking the time to look into this, and also for providing such a well documented report on the issue. I'll look at the PR soon and hopefully we can get it merged very quickly. Again, many thinks for this!

Potential Fix for Issue #741 (Transform not giving expected results for similar data #741)

jh83775 mentioned this issue Jul 28, 2021

Potential Fix for Issue #741 (Transform not giving expected results for similar data #741) #742

Merged

lmcinnes closed this as completed in #742 Jul 29, 2021

lmcinnes added a commit that referenced this issue Jul 29, 2021

Merge pull request #742 from jh83775/master

dd415cc

Potential Fix for Issue #741 (Transform not giving expected results for similar data #741)

jh83775 mentioned this issue Sep 3, 2021

Shift of UMAP points when using .transform() #755

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transform not giving expected results for similar data #741

Transform not giving expected results for similar data #741

jh83775 commented Jul 28, 2021 •

edited

Loading

lmcinnes commented Jul 28, 2021

Transform not giving expected results for similar data #741

Transform not giving expected results for similar data #741

Comments

jh83775 commented Jul 28, 2021 • edited Loading

lmcinnes commented Jul 28, 2021

jh83775 commented Jul 28, 2021 •

edited

Loading