Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The neighborhood graph is not connected #53

Closed
Julien-Devillars opened this issue May 22, 2020 · 6 comments
Closed

The neighborhood graph is not connected #53

Julien-Devillars opened this issue May 22, 2020 · 6 comments
Assignees
Labels

Comments

@Julien-Devillars
Copy link

Hi there, I am currently using the tapkee executable as a pipeline in my project in python, I am creating a text file with all my values to reduce wich looks like this :

0.301961,0.376471,0.443137,0,0.992157,0 0.188235,0.235294,0.298039,0.847059,0.752941,0 0.192157,0.239216,0.294118,0.854902,0.72549,0 0.266667,0.282353,0.380392,0.901961,0.505882,0.8 0.215686,0.211765,0.215686,0.913725,0.498039,0.8 0.25098,0.262745,0.286275,0.917647,0.494118,0.8 0.309804,0.305882,0.333333,0.905882,0.419608,0.8 0.258824,0.321569,0.396078,0.831373,0.811765,0 0.239216,0.298039,0.376471,0.831373,0.807843,0 0.231373,0.290196,0.368627,0.819608,0.741176,0 0.290196,0.360784,0.443137,0.784314,0.780392,0 0.301961,0.376471,0.443137,0,0.992157,0 0.188235,0.235294,0.298039,0.847059,0.752941,0 0.192157,0.239216,0.294118,0.854902,0.72549,0 0.266667,0.282353,0.380392,0.901961,0.505882,0.8 0.215686,0.211765,0.215686,0.913725,0.498039,0.8 0.25098,0.262745,0.286275,0.917647,0.494118,0.8 0.309804,0.305882,0.333333,0.905882,0.419608,0.8 0.258824,0.321569,0.396078,0.831373,0.811765,0

This example works fine but if I add one more line like

0.1,0.2,0.3,0.4,0.5,0

I get the message "[warning] The neighborhood graph is not connected"

image

So I check for solution and the only help I could find is this message that you wrote in your documentation.

Please note that "[warning] The neighborhood graph is not connected" message in most cases means
that ’tapkee’ run was unsuccessful. As a result, Tapkee() might return the matrix of NaN’s. One of
possible workarounds is to specify the higher number of neigbors (’-k’ option, default is 10). See
below for the example.

Then i tried to change the k values and it works when it's equal or superior to the number of rows divised by 2. So if I have 30 rows I have to make the neighbour values to 15 or higher.

Is it right or something else is wrong ? I am supposed to use this method with a lot of rows (4000 to
million), it gonna be long no ? I may not understand perfectly how it works.

Thanks for your help.

@Julien-Devillars Julien-Devillars changed the title Isomap - The neighborhood graph is not connected The neighborhood graph is not connected May 22, 2020
@lisitsyn
Copy link
Owner

Hey! Sorry it took a while to respond you.

The problem is that the neighborhood strategy being used in Tapkee is dead simple: it finds the nearest neighbors for each point and then uses these connections exclusively. This means that if you add a point that is not a neighbor for any other point it gets 'dropped' out of graph (never reachable).

To solve the problem it might make sense to allow increasing the number of connections if some points lack them to be connected. Other mitigation might be to notify what are the disconnected points so that you can drop them from the dataset. I am not aware what mitigation might be the best, though.

@iglesias iglesias self-assigned this Apr 8, 2024
@iglesias
Copy link
Collaborator

iglesias commented Apr 8, 2024

Hi,

Only now I have gone through this. Just in case, excuses for the years-long lead time.

I am wondering what is the meaning of that a point is not a neighbor of any other point?
Even if a point is very, very far away from all the others, at least one of the others must be the closest to it. Unless there's some sort of maximum distance between neighbors, I am guessing.

I will start running the investigation together with one for ** On entry to DLASCL parameter number 4 had an illegal value I've been encountering and keep you posted.

@iglesias
Copy link
Collaborator

iglesias commented Apr 8, 2024

I am wondering what is the meaning of that a point is not a neighbor of any other point?

All right, I think this little inquietude doesn't really matter: even if every point has at least one neighbor the overall neighborhood graph could of course have more than one connected component

@lisitsyn
Copy link
Owner

lisitsyn commented Apr 8, 2024

It is just a matter of a form of the distance matrix. Once it becomes a block matrix the whole thing stops working because eigendecomposition is not unique, I think.

Basically, every connected component should be embedded separately. But it doesn't make a lot of sense since we would not have any idea how they are positioned against each other. So that's why I think it is better to either:

  • Drop the smallest components
  • Try to fix the graph so that it is connected
  • Drop some outliers, maybe
  • Throw an exception

@iglesias
Copy link
Collaborator

iglesias commented Apr 8, 2024

Basically, every connected component should be embedded separately. But it doesn't make a lot of sense since we would not have any idea how they are positioned against each other.

Ah, that's a good point, I didn't think about the separate embeddings.

So that's why I think it is better to either:

  • Drop the smallest components
  • Try to fix the graph so that it is connected
  • Drop some outliers, maybe
  • Throw an exception

This is what I thought earlier, it would the in the second point:

The issue is that given a number of neighbors (k) k_0 by the user (or the default one, if unprovided) the neighbor graph can be, let's say, unconnected and it is expected the result won't be trustworthy then.
If we would use a k equal to the number of points (or minus 1), then it would be surely connected, right?
From there, I think something that can be automated (perhaps as an optional routine enabled by default) is to rerun the neighbors binary-searching between k_0 and the number of points until one that produces a connected graph is found. It could of course be refined looking for the minimum k that makes it connected.

iglesias added a commit to iglesias/tapkee that referenced this issue Apr 8, 2024
@iglesias
Copy link
Collaborator

iglesias commented Apr 8, 2024

I opened #83 with a simple fix to recompute the neighbors if the neighborhood graph is not connected.

By the way, I didn't manage to reproduce with the data in the opening comment. The number of vectors I got was either 19 or 20 (with the extra one), different to the 21 in the screenshot. I used the swissroll5000 from the jmlr repo to test this. In that one the binary search went from the initial 10 to 2505, which worked well for the graph, but then stopped quickly due to out of memory usage in the eigendecomposition. Then, I updated it to just multiplying by 2.

@iglesias iglesias closed this as completed May 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants