Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Projection to Random Gaussian #45

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

issararab
Copy link
Collaborator

No description provided.

@issararab issararab linked an issue Jul 24, 2024 that may be closed by this pull request
Copy link
Owner

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Mainly a few questions to briefly discuss and minor typos before merging.

src/ann_solo/config.py Show resolved Hide resolved
@@ -214,6 +215,57 @@ def spectrum_to_vector(spectrum: MsmsSpectrum, min_mz: float, max_mz: float,
return vector


def spectrum_to_vector(spectrum: MsmsSpectrum, transformation: ss.csr_matrix,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: I assume that the sparse vectors need to be converted to dense vectors to be compatible with the Faiss index? Is there a benefit to using SparseRandomProjection over GaussianRandomProjection?

Copy link
Collaborator Author

@issararab issararab Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct. Both use random projections, but each has its own advantages.
SparseRandomProjection is more efficient in calculation and requires less memory, so very ideal to very large vectors.
GaussianRandomProjection is not sparse and main advantage as known in the community is its ability to maintain pairwise distance between data points, after transformation. And I think that's what we want to aim for. Let's use a matrix using GaussianRandomProjection for transformation of spectra to low-dim vectors.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Scikit-Learn documentation says this:

Sparse random matrices are an alternative to dense Gaussian random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

Neither this statement nor that the Gaussian random projections should be better at conserving the pairwise distance is immediately obvious to me. Let's evaluate both for our specific context, then we can make an informed decision.

src/ann_solo/spectrum.py Show resolved Hide resolved
Returns
-------
np.ndarray
The low-dimensional transformed spectrum vector with unit length.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: Unit length is only true if norm=True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's what you had in the old version of the spectrum_to_vector docstring :).
It is obvious, from the docstring, the parameters of the function, and the code that you get a unit length vector if the norm parameter is True.

We can modify it with sthg else if you like.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's just remove "with unit length" to make the documentation a bit more correct.

spectrum = spectrum.set_mz_range(min_mz, max_mz)
# Convert a spectrum to a binned sparse vector
data = np.array(spectrum.intensity, dtype=np.float32)
indices = np.array(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: Nice way to avoid converting it to a dense vector.

indices = np.array(
[math.floor((mz - min_mz) / bin_size) for mz in spectrum.mz],
dtype=np.int32)
indptr = np.array([0, len(spectrum.mz)], dtype=np.int32)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: I think you can use np.arange instead.

(data, indices, indptr), (1, dim), np.float32, False)

# Transform
transformed_vector = (sparse_vector @ transformation).toarray()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: This is pretty cool, I've probably never used this operator myself in code yet. 🙂 Is this matrix multiplication preferable over using transform()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are all similar, all vectorized alternatives.
We generate a random guassian matrix and transposed it, so we can use the @ operator, np.dot() function, or pass the fitted model instead and use transform() . I choose the first option :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I think that transform() adds some safety checks, so maybe that's slightly preferable.

# Transform
transformed_vector = (sparse_vector @ transformation).toarray()
if norm:
transformed_vector /= np.linalg.norm(transformed_vector)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment: Maybe there could be a small performance increase by computing the norm on the sparse vector and only afterwards converting to a dense vector?

Copy link
Collaborator Author

@issararab issararab Jul 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll modify the transformation to Gaussian projection (given its advantage), and we'll need no further conversion to dense vector after the last dot product.

src/ann_solo/spectrum.py Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spectrum to vector conversion
2 participants