Modify Projection to Random Gaussian #45

Status: Open. Wants to merge 1 commit into base: master.
4 changes: 4 additions & 0 deletions src/ann_solo/config.py
@@ -179,6 +179,10 @@ def __init__(self) -> None:
self._parser.add_argument(
'--bin_size', default=0.04, type=float,
help='ANN vector bin width (default: %(default)s Da)')
# ANN vector length after Gaussian random projection.
self._parser.add_argument(
'--low_dim', default=400, type=int,
help='ANN vector length (default: %(default)s)')
# ANN vector length after hashing.
self._parser.add_argument(
'--hash_len', default=800, type=int,
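For readers unfamiliar with the `%(default)s` interpolation used in the options above, here is a minimal standalone sketch of the same argparse pattern. The parser object is hypothetical; only the option names and defaults are taken from this diff, the surrounding Config class is omitted.

```python
import argparse

# Minimal stand-in for ANN-SoLo's config parser: same option names and
# defaults as the diff above, without the surrounding Config class.
parser = argparse.ArgumentParser()
parser.add_argument('--bin_size', default=0.04, type=float,
                    help='ANN vector bin width (default: %(default)s Da)')
# ANN vector length after Gaussian random projection.
parser.add_argument('--low_dim', default=400, type=int,
                    help='ANN vector length (default: %(default)s)')

args = parser.parse_args([])  # empty argv -> use the defaults
print(args.low_dim)           # 400
```

Passing an explicit `['--low_dim', '256']` list to `parse_args` overrides the default, which is how the new option reaches the index-building code.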
33 changes: 26 additions & 7 deletions src/ann_solo/spectral_library.py
@@ -13,12 +13,14 @@
import numexpr as ne
import numpy as np
import tqdm
from sklearn.random_projection import SparseRandomProjection
from spectrum_utils.spectrum import MsmsSpectrum

from ann_solo import reader
from ann_solo import spectrum_match
from ann_solo import utils
from ann_solo.config import config
from ann_solo.spectrum import get_dim
from ann_solo.spectrum import process_spectrum
from ann_solo.spectrum import spectrum_to_vector
from ann_solo.spectrum import SpectrumSpectrumMatch
@@ -115,6 +117,12 @@ def __init__(self, filename: str) -> None:
if create_ann_charges:
self._create_ann_indexes(create_ann_charges)

# Gaussian vector projection
_vec_len, _, _ = get_dim(config.min_mz, config.max_mz, config.bin_size)
self._transformation = (
SparseRandomProjection(config.low_dim, random_state=0).fit(
np.zeros((1, _vec_len))).components_.astype(np.float32).T)

def _get_hyperparameter_hash(self) -> str:
"""
Get a unique string representation of the hyperparameters used to
@@ -155,10 +163,15 @@ def _create_ann_indexes(self, charges: List[int]) -> None:
smoothing=0.1):
charge = lib_spectrum.precursor_charge
if charge in charge_vectors.keys():
spectrum_to_vector(process_spectrum(lib_spectrum, True),
config.min_mz, config.max_mz,
config.bin_size, config.hash_len, True,
charge_vectors[charge][i[charge]])
charge_vectors[charge][i[charge]] = spectrum_to_vector(
process_spectrum(lib_spectrum, True),
self._transformation,
config.min_mz,
config.max_mz,
config.bin_size,
config.low_dim,
norm=True,
)
i[charge] += 1
# Build an individual FAISS index per charge.
logging.info('Build the spectral library ANN indexes')
@@ -435,9 +448,15 @@ def _get_library_candidates(self, query_spectra: List[MsmsSpectrum],
query_vectors = np.zeros((len(query_spectra), config.low_dim),
                         np.float32)
for i, query_spectrum in enumerate(query_spectra):
spectrum_to_vector(
query_spectrum, config.min_mz, config.max_mz,
config.bin_size, config.hash_len, True, query_vectors[i])
query_vectors[i] = spectrum_to_vector(
query_spectrum,
self._transformation,
config.min_mz,
config.max_mz,
config.bin_size,
config.low_dim,
norm=True,
)
mask = np.zeros_like(candidate_filters)
# noinspection PyArgumentList
for mask_i, ann_filter in zip(mask, ann_index.search(
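The `_transformation` matrix built in `__init__` above is fit once and then reused as a plain matrix multiply for both library and query vectors. A numpy-only sketch of that pattern, where a dense Gaussian matrix stands in for the fitted `SparseRandomProjection(...).components_.astype(np.float32).T`, and `vec_len`/`low_dim` are toy sizes rather than ANN-SoLo's actual dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes standing in for get_dim(...) and config.low_dim.
vec_len, low_dim = 2000, 400

# Dense stand-in for the fitted projection components: a (vec_len, low_dim)
# matrix scaled so projected norms are roughly preserved.
transformation = (rng.standard_normal((vec_len, low_dim))
                  / np.sqrt(low_dim)).astype(np.float32)

def to_low_dim(vector: np.ndarray, norm: bool = True) -> np.ndarray:
    """Project a binned spectrum vector down to low_dim dimensions."""
    low = vector @ transformation
    if norm:
        low /= np.linalg.norm(low)
    return low

# Toy binned query spectrum with three non-zero peak bins.
query = np.zeros(vec_len, np.float32)
query[[10, 57, 1203]] = [0.5, 1.0, 0.25]
low = to_low_dim(query)
print(low.shape)  # (400,)
```

Fitting once and multiplying everywhere keeps library and query spectra in the same projected space, which is what the FAISS index search requires.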
54 changes: 53 additions & 1 deletion src/ann_solo/spectrum.py
@@ -5,6 +5,7 @@
import mmh3
import numba as nb
import numpy as np
import scipy.sparse as ss
from spectrum_utils.spectrum import MsmsSpectrum

from ann_solo.config import config
@@ -163,7 +164,7 @@ def hash_idx(bin_idx: int, hash_len: int) -> int:
return mmh3.hash(str(bin_idx), 42, signed=False) % hash_len


def spectrum_to_vector(spectrum: MsmsSpectrum, min_mz: float, max_mz: float,
def _spectrum_to_vector(spectrum: MsmsSpectrum, min_mz: float, max_mz: float,
bin_size: float, hash_len: int, norm: bool = True,
vector: np.ndarray = None) -> np.ndarray:
"""
@@ -214,6 +215,57 @@
return vector


bittremieux (Owner): question: I assume that the sparse vectors need to be converted to dense vectors to be compatible with the Faiss index? Is there a benefit to using SparseRandomProjection over GaussianRandomProjection?

issararab (Collaborator Author, Jul 28, 2024): That is correct. Both use random projections, but each has its own advantages. SparseRandomProjection is computationally more efficient and requires less memory, which makes it ideal for very large vectors. GaussianRandomProjection is not sparse; its main advantage, as known in the community, is its ability to maintain pairwise distances between data points after transformation. I think that is what we want to aim for, so let's use a GaussianRandomProjection matrix to transform spectra to low-dimensional vectors.

bittremieux (Owner): The Scikit-Learn documentation says this: "Sparse random matrices are an alternative to dense Gaussian random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data." Neither this statement nor the claim that Gaussian random projections are better at conserving pairwise distances is immediately obvious to me. Let's evaluate both for our specific context, then we can make an informed decision.

def spectrum_to_vector(spectrum: MsmsSpectrum, transformation: ss.csr_matrix,
                       min_mz: float, max_mz: float, bin_size: float, dim: int,
                       norm: bool) -> np.ndarray:
"""
Convert a single spectrum to a dense NumPy vector.

Peaks are first discretized to mass bins of width `bin_size` starting from
`min_mz`, after which they are transformed using sparse random projections.

Parameters
----------
spectrum : MsmsSpectrum
The spectrum to be converted to a vector.
transformation : ss.csr_matrix
Sparse random projection transformation to convert sparse spectrum
vectors to low-dimensional dense vectors.
min_mz : float
The minimum m/z to include in the vector.
max_mz : float
The maximum m/z to include in the vector.
bin_size : float
The bin size in m/z used to divide the m/z range.
dim : int
The high-resolution vector dimensionality.
norm : bool
Normalize the vector to unit length or not.
Returns
-------
np.ndarray
The low-dimensional transformed spectrum vector with unit length.
bittremieux (Owner): typo: Unit length is only true if norm=True.

issararab (Collaborator Author): I think that's what you had in the old version of the spectrum_to_vector docstring :). It is obvious from the docstring, the function's parameters, and the code that you get a unit-length vector if the norm parameter is True. We can change it to something else if you like.

bittremieux (Owner): Yes, let's just remove "with unit length" to make the documentation a bit more correct.

"""
# Set the spectrum m/z range between the minimum and maximum m/z.
spectrum = spectrum.set_mz_range(min_mz, max_mz)
# Convert a spectrum to a binned sparse vector
data = np.array(spectrum.intensity, dtype=np.float32)
    indices = np.array(
        [math.floor((mz - min_mz) / bin_size) for mz in spectrum.mz],
        dtype=np.int32)

bittremieux (Owner): praise: Nice way to avoid converting it to a dense vector.
indptr = np.array([0, len(spectrum.mz)], dtype=np.int32)
bittremieux (Owner): todo: I think you can use np.arange instead.


# Instantiate the sparse matrix
sparse_vector = ss.csr_matrix(
(data, indices, indptr), (1, dim), np.float32, False)

# Transform
transformed_vector = (sparse_vector @ transformation).toarray()
bittremieux (Owner): comment: This is pretty cool, I've probably never used this operator myself in code yet. 🙂 Is this matrix multiplication preferable over using transform()?

issararab (Collaborator Author): They are all similar, vectorized alternatives. We generate a random Gaussian matrix and transpose it, so we can use the @ operator, the np.dot() function, or pass the fitted model and use transform(). I chose the first option :)

bittremieux (Owner): But I think that transform() adds some safety checks, so maybe that's slightly preferable.
if norm:
transformed_vector /= np.linalg.norm(transformed_vector)
bittremieux (Owner): comment: Maybe there could be a small performance increase by computing the norm on the sparse vector and only afterwards converting to a dense vector?

issararab (Collaborator Author, Jul 28, 2024): I'll modify the transformation to a Gaussian projection (given its advantage), and then no further conversion to a dense vector will be needed after the final dot product.


return transformed_vector.ravel()
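Putting the pieces of the new `spectrum_to_vector` together, here is a self-contained sketch of the binning-plus-projection path. The toy peak lists and the random dense projection matrix are hypothetical stand-ins for the real `MsmsSpectrum` and the fitted transformation:

```python
import math

import numpy as np
import scipy.sparse as ss

min_mz, max_mz, bin_size = 100.0, 120.0, 0.04
dim = round((max_mz - min_mz) / bin_size)  # 500 high-resolution bins
low_dim = 8                                # toy low dimensionality

# Toy peaks standing in for spectrum.mz / spectrum.intensity.
mz = [100.02, 105.40, 119.98]
data = np.array([1.0, 0.5, 0.25], dtype=np.float32)

# Bin the peaks directly into a single-row CSR vector, avoiding a dense
# intermediate, as in the function above.
indices = np.array([math.floor((m - min_mz) / bin_size) for m in mz],
                   dtype=np.int32)
indptr = np.array([0, len(mz)], dtype=np.int32)
sparse_vector = ss.csr_matrix((data, indices, indptr), (1, dim))

# Any (dim, low_dim) matrix works for the sketch; the real code uses the
# fitted random projection components.
rng = np.random.default_rng(42)
transformation = rng.standard_normal((dim, low_dim)).astype(np.float32)

transformed = sparse_vector @ transformation  # dense (1, low_dim) result
transformed /= np.linalg.norm(transformed)
print(transformed.ravel().shape)  # (8,)
```

Because the spectrum stays sparse until the final matrix multiply, the cost scales with the number of peaks rather than with the 500-bin vector length.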

class SpectrumSpectrumMatch:

def __init__(
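The review above leaves the SparseRandomProjection-vs-GaussianRandomProjection question to be settled empirically. A numpy-only sketch of how such an evaluation could start, with hand-rolled stand-ins for the two scikit-learn classes (dense Gaussian entries versus Achlioptas-style sparse +/-1 entries with density 1/3):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 400  # toy high/low dimensionalities

# Dense Gaussian projection: i.i.d. N(0, 1/k) entries.
gaussian = rng.standard_normal((d, k)) / np.sqrt(k)

# Sparse (Achlioptas) projection: entries in {-1, 0, +1} with density 1/s,
# scaled by sqrt(s / k) so distances are preserved in expectation.
s = 3.0
sparse = rng.choice([-1.0, 0.0, 1.0], size=(d, k),
                    p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]) * np.sqrt(s / k)

# Compare how well each projection preserves one pairwise distance.
x, y = rng.standard_normal(d), rng.standard_normal(d)
orig = np.linalg.norm(x - y)
ratios = {name: np.linalg.norm((x - y) @ proj) / orig
          for name, proj in [('gaussian', gaussian), ('sparse', sparse)]}
print(ratios)  # both ratios close to 1.0
```

A real evaluation for ANN-SoLo would replace the random `x`, `y` with binned spectrum vectors and average the distortion over many spectrum pairs, which is what the Johnson-Lindenstrauss guarantee behind both projections is about.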