Modify Projection to Random Gaussian #45
base: master
Conversation
Looks good. Mainly a few questions to briefly discuss and minor typos before merging.
@@ -214,6 +215,57 @@ def spectrum_to_vector(spectrum: MsmsSpectrum, min_mz: float, max_mz: float,
    return vector


def spectrum_to_vector(spectrum: MsmsSpectrum, transformation: ss.csr_matrix,
question: I assume that the sparse vectors need to be converted to dense vectors to be compatible with the Faiss index? Is there a benefit to using SparseRandomProjection over GaussianRandomProjection?
That is correct. Both are random projections, but each has its own advantages.
SparseRandomProjection is computationally more efficient and requires less memory, so it is well suited to very large vectors.
GaussianRandomProjection is dense, and its main advantage, as commonly reported, is that it preserves the pairwise distances between data points after the transformation. I think that's what we want to aim for, so let's use a GaussianRandomProjection matrix to transform the spectra to low-dimensional vectors.
The Scikit-Learn documentation says this:
Sparse random matrices are an alternative to dense Gaussian random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.
Neither this statement nor the claim that Gaussian random projections are better at preserving the pairwise distances is immediately obvious to me. Let's evaluate both for our specific context; then we can make an informed decision.
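A rough sketch of how such an evaluation could look, comparing how well both projections preserve pairwise distances. Everything here (number of spectra, dimensions, density) is made up for illustration and is not the project's actual settings:

```python
import numpy as np
import scipy.sparse as ss
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import (GaussianRandomProjection,
                                       SparseRandomProjection)

# Stand-ins for binned spectrum vectors: random sparse rows (illustrative only).
n_spectra, dim, low_dim = 200, 10_000, 400
vectors = ss.random(n_spectra, dim, density=0.005, format='csr',
                    dtype=np.float32, random_state=42)
original_dist = pairwise_distances(vectors)

for projection in (GaussianRandomProjection(low_dim, random_state=42),
                   SparseRandomProjection(low_dim, random_state=42)):
    projected_dist = pairwise_distances(projection.fit_transform(vectors))
    # Ratios close to 1 mean the pairwise distances are well preserved.
    mask = original_dist > 0
    ratio = projected_dist[mask] / original_dist[mask]
    print(f'{type(projection).__name__}: '
          f'mean ratio = {ratio.mean():.3f} ± {ratio.std():.3f}')
```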
    Returns
    -------
    np.ndarray
        The low-dimensional transformed spectrum vector with unit length.
typo: Unit length is only true if norm=True.
I think that's what you had in the old version of the spectrum_to_vector docstring :).
It is obvious from the docstring, the parameters of the function, and the code that you only get a unit-length vector if the norm parameter is True.
We can modify it with something else if you like.
Yes, let's just remove "with unit length" to make the documentation a bit more correct.
    spectrum = spectrum.set_mz_range(min_mz, max_mz)
    # Convert a spectrum to a binned sparse vector
    data = np.array(spectrum.intensity, dtype=np.float32)
    indices = np.array(
praise: Nice way to avoid converting it to a dense vector.
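A tiny worked example of the (data, indices, indptr) CSR construction used here, with made-up peak values, just to illustrate the layout:

```python
import math
import numpy as np
import scipy.sparse as ss

# Made-up spectrum: peaks at m/z 101.2 and 250.7 with intensities 5 and 3.
min_mz, bin_size, dim = 100.0, 1.0, 1000
mz, intensity = [101.2, 250.7], [5.0, 3.0]

data = np.array(intensity, dtype=np.float32)           # [5., 3.]
indices = np.array([math.floor((m - min_mz) / bin_size) for m in mz],
                   dtype=np.int32)                     # bins [1, 150]
indptr = np.array([0, len(mz)], dtype=np.int32)        # a single row with 2 non-zeros
sparse_vector = ss.csr_matrix((data, indices, indptr), shape=(1, dim),
                              dtype=np.float32)        # 1 x 1000 binned vector
```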
    indices = np.array(
        [math.floor((mz - min_mz) / bin_size) for mz in spectrum.mz],
        dtype=np.int32)
    indptr = np.array([0, len(spectrum.mz)], dtype=np.int32)
todo: I think you can use np.arange instead.
        (data, indices, indptr), (1, dim), np.float32, False)

    # Transform
    transformed_vector = (sparse_vector @ transformation).toarray()
comment: This is pretty cool, I've probably never used this operator myself in code yet. 🙂 Is this matrix multiplication preferable over using transform()?
They are all similar, vectorized alternatives.
We generate a random Gaussian matrix and transpose it, so we can use the @ operator, the np.dot() function, or pass the fitted model instead and use transform(). I chose the first option :)
But I think that transform() adds some safety checks, so maybe that's slightly preferable.
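For reference, a minimal sketch of the two options being discussed; the zero-filled fit input and the shapes are only there to make the snippet self-contained:

```python
import numpy as np
import scipy.sparse as ss
from sklearn.random_projection import GaussianRandomProjection

dim, low_dim = 10_000, 400
projection = GaussianRandomProjection(n_components=low_dim, random_state=42)
# fit() only needs the number of features to draw the random components.
projection.fit(ss.csr_matrix((1, dim), dtype=np.float32))

sparse_vector = ss.random(1, dim, density=0.001, format='csr',
                          dtype=np.float32, random_state=42)

# Option 1: multiply by the transposed components matrix directly.
transformation = projection.components_.T              # shape (dim, low_dim)
vec_matmul = np.asarray(sparse_vector @ transformation)

# Option 2: use transform(), which adds input validation on top.
vec_transform = projection.transform(sparse_vector)

np.testing.assert_allclose(vec_matmul, vec_transform, rtol=1e-5)
```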
    # Transform
    transformed_vector = (sparse_vector @ transformation).toarray()
    if norm:
        transformed_vector /= np.linalg.norm(transformed_vector)
comment: Maybe there could be a small performance increase by computing the norm on the sparse vector and only afterwards converting to a dense vector?
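A minimal sketch of this suggestion, assuming the transformation matrix is itself sparse so the product stays sparse until the end; the names mirror the diff, the values are stand-ins:

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.linalg import norm as sparse_norm

dim, low_dim, norm = 10_000, 400, True
sparse_vector = ss.random(1, dim, density=0.001, format='csr', dtype=np.float32)
transformation = ss.random(dim, low_dim, density=0.01, format='csr',
                           dtype=np.float32)

transformed = sparse_vector @ transformation            # still sparse (1 x low_dim)
if norm:
    # The Frobenius norm of a single-row matrix equals its vector 2-norm.
    transformed = transformed / sparse_norm(transformed)
transformed_vector = transformed.toarray()              # densify only at the end
```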
I'll modify the transformation to a Gaussian projection (given its advantage), and then no further conversion to a dense vector will be needed after the last dot product.
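Illustrating that last point: with a dense Gaussian projection matrix, the product of a sparse row vector and the dense matrix already comes back as a dense array, so the trailing .toarray() call can simply be dropped. The shapes and the way the matrix is generated here are only illustrative:

```python
import numpy as np
import scipy.sparse as ss

dim, low_dim, norm = 10_000, 400, True
sparse_vector = ss.random(1, dim, density=0.001, format='csr', dtype=np.float32)
# Stand-in for a dense Gaussian random projection matrix of shape (dim, low_dim)
# (in practice it would come from a fitted GaussianRandomProjection, transposed).
transformation = np.random.default_rng(42).normal(
    scale=1.0 / np.sqrt(low_dim), size=(dim, low_dim)).astype(np.float32)

transformed_vector = np.asarray(sparse_vector @ transformation)  # already dense
if norm:
    transformed_vector /= np.linalg.norm(transformed_vector)
```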