[WIP] v0.11.0 RC #132

Open

jspaezp wants to merge 44 commits into main from feature/auto_pin_handling2

Conversation

jspaezp
Collaborator

@jspaezp jspaezp commented Dec 5, 2024

What does this PR do:

  1. Adds back the FlashLFQ support.
  2. Fixes the Python API to work with the docs.
  3. Centralizes linting/formatting expectations in a Makefile.
  4. Migrates dependency management to uv.
  5. Adds a PR template.
    Addresses:

Notes:

Most of the added lines come from a single file: data/phospho_rep1.traditional.pin.
It is a backport of the original testing file (phospho_rep1.pin, which was "fixed" in another PR by removing the ragged aspect of the protein column, PROT1\tPROT2 -> PROT1:PROT2).
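
For context, a minimal illustration with hypothetical values (the real file has many more feature columns). A data row in the traditional, ragged file ends with the proteins spilling over into extra tab-separated fields:

target_0_2_1\t1\t2\t...\tR.PEPTIDEK.A\tPROT1\tPROT2

while the "fixed" file joins them into a single field, so every row has the same number of columns:

target_0_2_1\t1\t2\t...\tR.PEPTIDEK.A\tPROT1:PROT2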

Blockers:

Unhandled things:

  1. No idea why the Windows tests are failing.
  2. Update docs/docstrings/vignettes.

jspaezp and others added 12 commits September 6, 2024 16:48
* ✨ cherry picks internal fixes from !68 and !70

* Cherry pick feature/confidence_streaming branch

* ✨ adds filelock dependency for tests

* 💄 linting

* 💄 reformat to satisfy linter

* ✨ imports type annotations from future for python 3.9

* ✨ make pytest and cli behave with type annotations in Python 3.9

* ✨ test dropping Python 3.9 support

- inspired by
  https://github.com/wfondrie/mokapot/pull/126/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957fL20

* Set scale_to_one to false in *all* cases

* Fixed path problems probably causing errors under windows

* Fix more possible path issues

* Fix warning about bitwise not in python 3.12

* Fix problem with numpy 2.x's different str rep of floats

* Make hashing of rows for splitting independent of numpy version and spectra columns

* Feature/streaming fix windows (wfondrie#48)

* ✨ log more infos
* ✨ uses uv for env setup; fix dependencies

---------

Co-authored-by: Elmar Zander <elmar.zander@googlemail.com>
Fixed retention time division by 60.
Time is required in minutes for FlashLFQ, and it's already in minutes

Co-authored-by: William Fondrie <fondriew@gmail.com>
)

CSV_SUFFIXES = [".csv", ".pin", ".tab", ".csv"]
CSV_SUFFIXES = [
Collaborator Author

For the record... I still dislike naming so many tab-delimited file formats as "comma separated values (csv)"

Contributor

I absolutely agree. I just don't see a better way, as those other extensions are already out there in the wild.

Collaborator Author

I don't recall, off the top of my head, any tool that generates a tab-delimited .csv. Do you happen to have an example? (I won't deal with it in this PR, but in the future we could split CSV/TSV formats internally.)

Contributor

Sorry, you're right. I somehow misread your initial comment. Yes, since we really never have "comma-separated" values anywhere, why not get rid of it completely and replace "comma separated/CSV" with "tab separated/TSV" everywhere.

For the record: when I started on this code base, there was something "comma separated" everywhere, but a separator variable sep was passed around, which was always set to "\t". I got rid of all the explicit file reading/writing stuff and moved that into the readers/writers, set the separator (I think) unconditionally to "\t", but did not rename the variables/classes. So: my bad ;)

Collaborator Author

To be clear, I think adding support for .csv (actual comma-separated files) would be a good idea in the future.
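
If we ever do split them, a minimal sketch of the direction I have in mind (hypothetical names, not the current reader code): pick the delimiter from the suffix instead of treating every tabular file as "CSV".

from pathlib import Path

# Hypothetical mapping: the suffix picks the delimiter, so ".csv" would
# finally mean commas again, while the proteomics suffixes stay tab-delimited.
SUFFIX_DELIMITERS = {
    ".csv": ",",
    ".tsv": "\t",
    ".pin": "\t",
    ".tab": "\t",
}


def delimiter_for(path: Path) -> str:
    """Pick a delimiter from the file suffix, defaulting to tab."""
    return SUFFIX_DELIMITERS.get(path.suffix.lower(), "\t")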

@jspaezp
Collaborator Author

jspaezp commented Dec 5, 2024

Edit: a7401c3 makes some progress; I figured out the confidence part, but I still need to "pipe" some columns needed by FlashLFQ, since _optional_columns was removed as an attribute from the confidence object.

@gessulat and @ezander

I might need help with this one to understand how to update the documentation.

Right now if I try to do this (part of tests/unit_tests/test_writer_flashlfq.py):

import mokapot


# Using the psms_ondisk fixture from your tests ...
def test_sanity(psms_ondisk, tmp_path):
    """Run simple sanity checks."""

    mods, scores = mokapot.brew([psms_ondisk])
    conf = mokapot.assign_confidence(
        [psms_ondisk],
        scores_list=scores,
        eval_fdr=0.05,
        # RN it fails with deduplication=True, with an error saying that
        # the column "ExpMass" does not exist.
        deduplication=False,
    )

# When set to deduplication=False it fails with `KeyError: 'proteinIds'`

So ... where are these columns specified? And how can one assign confidence without proteins?

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/mokapot/confidence.py#L380-L388

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/tests/unit_tests/test_writer_flashlfq.py#L8-L19

@jspaezp
Collaborator Author

jspaezp commented Dec 7, 2024

Note:

There seems to be a difference in what 'OnDiskPsmDataset' and 'LinearPsmDataset' mean by spectra:

The on-disk PSM dataset uses all of these:

    ...
    labels = find_required_column("label", columns)

    # Optional columns
    filename = find_optional_column(filename_column, columns, "filename")
    calcmass = find_optional_column(calcmass_column, columns, "calcmass")
    expmass = find_optional_column(expmass_column, columns, "expmass")
    ret_time = find_optional_column(rt_column, columns, "ret_time")
    charge = find_optional_column(charge_column, columns, "charge_column")
    spectra = [c for c in [filename, scan, ret_time, expmass] if c is not None]

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/parsers/pin.py#L223-L232

and the linear PSM dataset defines it as:

spectrum_columns : str or tuple of str
        The column(s) that collectively identify unique mass spectra. Multiple
        columns can be useful to avoid combining scans from multiple mass
        spectrometry runs.

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/dataset.py#L255-L260

which would seem closer to the OnDisk dataset's specId_column (the linear PSM dataset uses as its index the compound index built from the columns defined by 'spectrum_columns', whilst the on-disk dataset assumes there is a single column that can be used as a primary index).
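
To make the difference concrete, a minimal pandas sketch (column names are hypothetical, not tied to any particular test file):

import pandas as pd

psms = pd.DataFrame({
    "SpecId": ["f1_s1_t", "f1_s1_d", "f1_s2_t"],   # single primary key (OnDisk-style)
    "FileName": ["run1.mzML", "run1.mzML", "run1.mzML"],
    "ScanNr": [1, 1, 2],
    "ExpMass": [800.4, 800.4, 912.5],
})

# LinearPsmDataset-style: several columns collectively identify a spectrum,
# so target/decoy PSMs from the same scan share one compound key.
spectrum_columns = ["FileName", "ScanNr", "ExpMass"]
compound_key = psms.set_index(spectrum_columns).index
print(compound_key.duplicated().any())  # True: two PSMs compete for the same spectrum

# OnDiskPsmDataset-style: a single column is assumed to be a usable primary index.
print(psms["SpecId"].is_unique)  # True: one row per PSM, not per spectrum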

@ezander
Contributor

ezander commented Dec 8, 2024

@jspaezp I think there have been some changes in the meaning of some of the code entities (variables, functions, etc.) that are not reflected in the naming, and also quite some divergence between code and documentation. Maybe it makes sense to discuss these issues on Teams or Skype next week?

@jspaezp jspaezp force-pushed the feature/auto_pin_handling2 branch from c487e00 to 7109058 Compare December 8, 2024 14:12
@jspaezp
Collaborator Author

jspaezp commented Dec 8, 2024

@ezander 100%, Emailed @gessulat to get a meeting scheduled.

@jspaezp jspaezp force-pushed the feature/auto_pin_handling2 branch from 7109058 to 9841733 Compare December 8, 2024 14:29
    proteins = conf._protein_column
else:
    proteins = None
# TODO: make this work again ...
Collaborator Author

Note to self:

# todo: nice to have: move this column renaming stuff into the
# column defs module, and further, have standardized columns
# directly from the pin reader (applying the renaming itself)

level_column_names = [
    "PSMId",
Collaborator Author

Question (@ezander and @gessulat): Is it a dealbreaker for you to have this column hard-coded? I believe that the current behavior in the published version of mokapot is to preserve all the "spectrum columns" (which usually will be PSM id + file name + rank ... all columns that uniquely identify a PSM). Let me know if having a single "PSMId" column is a requirement.

Contributor

Hi @jspaezp
I looked again at the current state of column handling in mokapot, which is a little bit confusing. This may not answer your question directly, but may give a basis for moving forward.

Input columns are determined by parsers/pin.py

  • All are case-insensitive
  • Required
    • specid (later becomes PSMId; not identical to the "spectrum")
    • peptide
    • proteins
    • label (target column)
    • scannr
  • Optional
    • rollup columns: modifiedpeptides, precursor, peptidegroup
    • filename
    • calcmass
    • expmass
    • ret_time
    • charge
  • Spectra is [filename, scannr, ret_time, expmass]. Since all except scannr are optional, it can
    span between 1 and 4 columns.
    It is used for two things (IIRC): in model training for competition (only PSMs with the same "spectrum"
    compete) and later in the deduplication process during confidence writing.
  • From this an OnDiskPsmDataset is created and the names of the inferred columns are set
  • IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so
    that in the code, only hard-coded names (or names defined as constants) can be used (except maybe for the feature
    columns, but there could also be some mapping scheme e.g. from "foo" -> "feature_001", "bar" -> "feature_002", ...)
  • Also: any conversion necessary for the input file should be done here (e.g. convert label from int to bool)

The level_columns (yeah, the naming...) are the columns of the intermediate files that get written for each rollup
level before confidence assignment is made.

  • During writing of the "level files", deduplication is done on the columns given by the "spectra columns", i.e. some
    subset of [filename, scannr, ret_time, expmass].
  • Columns are renamed from the level_input_column_names to the level_column_names. Most importantly the "specId"
    column is renamed to "PSMId".
    (It's not quite clear why the "target_column" is not renamed; probably the name taken from the input file is usually
    correct by some coincidence.) This whole renaming business should IMO move to the input file reader, as mentioned
    earlier.

output_column_names are the names of the columns after confidence estimation on the "(rollup) level files".

  • The names are hard-coded except for the extra_output_columns, which (as of now) only contain the optional rollup
    levels. One more point for defining those as string constants somewhere.
  • Additionally there are the score, q-value, and posterior_error_prob columns. The - is problematic when writing
    to SQL (it should be q_value, but that clashed with other code at the moment).
  • The names of the columns could be inferred from reader columns when writing the output files. However, due to the
    possibility of multiple input files going into one output file (and the inherited code structure) it was necessary
    to write the column header before they could be inferred from a reader (but this could be changed, of course).

Note: it is still somewhat messy, but can be cleared up. The special treatment of proteins (which, according to some,
shouldn't be mokapot's business anyway) makes that somewhat harder.
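
On the "dataset should rather get a reader wrapped in a ColumnMappedReader" point above: a minimal sketch of the shape meant here, with hypothetical class and method names rather than the actual mokapot readers (their real signatures may differ). The idea is that the renaming lives in exactly one place and the rest of the code only ever sees the standardized names.

# Sketch only: hypothetical wrapper, not the real CSVFileReader/ColumnMappedReader.
class RenamingReader:
    """Wrap a reader that exposes get_column_names()/get_row_iterator(columns)
    and present standardized column names to the rest of the code."""

    STANDARD_NAMES = {"specid": "PSMId", "label": "Label", "scannr": "ScanNr"}

    def __init__(self, reader):
        self._reader = reader
        # Case-insensitive rename map, built once from the input header.
        self._rename = {
            col: self.STANDARD_NAMES.get(col.lower(), col)
            for col in reader.get_column_names()
        }
        self._unrename = {new: old for old, new in self._rename.items()}

    def get_column_names(self):
        return [self._rename[col] for col in self._reader.get_column_names()]

    def get_row_iterator(self, columns):
        # Ask the underlying reader for the original names, then yield rows
        # keyed by the standardized ones.
        raw_columns = [self._unrename.get(col, col) for col in columns]
        for row in self._reader.get_row_iterator(raw_columns):
            yield {self._rename.get(key, key): value for key, value in row.items()}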

Collaborator Author

Columns are renamed from the level_input_column_names to the level_column_names. Most importantly the "specId"
column is renamed to "PSMId".

This assumes there is a single column that uniquely identifies the spectrum. Since we are not writing, say, the file name, there is a clear assumption that that information would be redundant with the PSMId.

I believe the point I want to make is: SINCE we are requesting and supporting multiple columns denoting the uniqueness of the spectrum, why are we not writing them? IMHO, if the user wants a single column called PSMId, they should pass it as an input (input columns should be the same as output columns).

IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so
that in the code, only hard-coded names (or names defined as constants) can be used (except maybe for the feature
columns, but there could also be some mapping scheme e.g. from "foo" -> "feature_001", "bar" -> "feature_002", ...)

I don't think it matters; the interface should be defined in the ABC, not in the OnDiskPsmDataset.

Additionally there are the score, q-value, and posterior_error_prob columns. The - is problematic when writing

Cool, I will make sure the columns are SQL-friendly.

The names of the columns could be inferred from reader columns when writing the output files. However, due to the
possibility of multiple input files going into one output file (and the inherited code structure) it was necessary
to write the column header before they could be inferred from a reader (but this could be changed, of course).

But when multiple datasets are passed, we check that all of them have the same columns ... (I don't think we should support multiple input files where each can have different column names.)
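
That check can stay as simple as something like this (a hypothetical helper, just to illustrate; not existing code):

def assert_same_columns(readers) -> None:
    """Raise if the input readers do not share an identical column header."""
    headers = {tuple(reader.get_column_names()) for reader in readers}
    if len(headers) > 1:
        raise ValueError(f"Input files have differing columns: {headers}")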

My proposal (see the sketch after this list):

target/label_column: str
	Column that marks whether an element is a target or a decoy.
	Bool? (I think it has to be a bool RN.)

spectrum_columns: list[str]
	The series of columns that denote which PSMs compete with each other. For example:
        1. We can have both a target and a decoy coming from the same scan number + file.
        2. We can have multiple peptides suggested as targets for each scan.
	We should also warn if a float column is passed here, since testing equality on floats is not reliable.

peptide_column: str
	A column that denotes a peptide; this column + spectrum_columns should uniquely identify a PSM (thus have no duplicates).

extra_rollup_columns: list[str]
	Other columns that can be used to compete the spectra for summary.
	Each of those columns will be used to summarize the output at a different level.
	For example: modifiedpeptides, precursor, peptidegroup

feature_columns: list[str]
	List of all the columns that can be used as possible scores to build a model.

  1. Trim the PSMDataset API to contain only those groups (that includes the LinearPsmDataset and the OnDiskPsmDataset).
  2. Request other columns as needed for other outputs (move to 'to_flashlfq' or so ...).
  3. Store the best scores internally in the Dataset (instead of being passed by brew and then used again in the confidence estimation).
  4. Write columns as-is (maybe replace spaces + punctuation/hyphenation with '_').
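
As referenced above, a minimal sketch of the proposed grouping as a hypothetical dataclass (not an API that exists in mokapot today):

from dataclasses import dataclass, field


@dataclass
class PsmColumnGroups:
    """Hypothetical grouping of the columns a PSMDataset would need."""

    target_column: str                # bool-like target/decoy indicator
    spectrum_columns: list[str]       # together they define which PSMs compete
    peptide_column: str               # spectrum_columns + peptide_column must be unique
    feature_columns: list[str]        # candidate scores for model training
    extra_rollup_columns: list[str] = field(default_factory=list)
    # e.g. modifiedpeptides, precursor, peptidegroup

    def __post_init__(self):
        overlap = set(self.spectrum_columns) & set(self.feature_columns)
        if overlap:
            raise ValueError(
                f"Columns cannot be both spectrum and feature columns: {overlap}"
            )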

Collaborator Author

Cleaning up the confidence API over here: jspaezp#2

Collaborator Author

Looking back at it... supporting the variable compound key is especially hard for the SQLite output, at least with the current implementation, where the writer requires hard-coded columns.
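
For what it's worth, the compound primary key itself is cheap on the SQLite side; the hard part really is the hard-coded columns in the writer. A minimal sketch with hypothetical table/column names:

import sqlite3

spectrum_columns = ["FileName", "ScanNr", "ExpMass"]  # a variable compound key

column_defs = ", ".join(f'"{col}"' for col in spectrum_columns)
primary_key = ", ".join(f'"{col}"' for col in spectrum_columns)

con = sqlite3.connect(":memory:")
con.execute(
    f'CREATE TABLE psms ({column_defs}, "score" REAL, PRIMARY KEY ({primary_key}))'
)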

@jspaezp
Collaborator Author

jspaezp commented Dec 16, 2024

Why is there a "do_rollup" separate from the Confidence ??

mokapot/mokapot/rollup.py

Lines 59 to 241 in 0d1f437

@typechecked
def do_rollup(config):
    # todo: refactor: this function is far too long. Should be split. Probably
    # at least one function to configure the input readers, one to write the
    # intermediate/temp files, and one that computes the statistics (q-values
    # and peps and writes the output files)
    base_level: str = config.level
    src_dir: Path = config.src_dir
    dest_dir: Path = config.dest_dir
    file_root: str = config.file_root + "."

    # Determine input files
    if len(list(src_dir.glob(f"*.{base_level}s.parquet"))) > 0:
        if len(list(src_dir.glob(f"*.{base_level}s.csv"))) > 0:
            raise RuntimeError(
                "Only input files of either type CSV or type Parquet should "
                f"exist in '{src_dir}', but both types were found."
            )
        suffix = ".parquet"
    else:
        suffix = ".csv"

    target_files: list[Path] = sorted(
        src_dir.glob(f"*.targets.{base_level}s{suffix}")
    )
    decoy_files: list[Path] = sorted(
        src_dir.glob(f"*.decoys.{base_level}s{suffix}")
    )
    target_files = [
        file for file in target_files if not file.name.startswith(file_root)
    ]
    decoy_files = [
        file for file in decoy_files if not file.name.startswith(file_root)
    ]
    in_files: list[Path] = sorted(target_files + decoy_files)
    logging.info(f"Reading files: {[str(file) for file in in_files]}")
    if len(in_files) == 0:
        raise ValueError("No input files found.")

    # Configure readers (read targets/decoys and adjoin is_decoy column)
    target_readers = [
        get_target_decoy_reader(path, False) for path in target_files
    ]
    decoy_readers = [
        get_target_decoy_reader(path, True) for path in decoy_files
    ]
    reader = MergedTabularDataReader(
        target_readers + decoy_readers,
        priority_column="score",
        reader_chunk_size=10000,
    )

    # Determine out levels
    levels = compute_rollup_levels(base_level, DEFAULT_PARENT_LEVELS)
    levels_not_found = [
        level for level in levels if level not in reader.get_column_names()
    ]
    levels = [level for level in levels if level in reader.get_column_names()]

    logging.info(f"Rolling up to levels: {levels}")
    if len(levels_not_found) > 0:
        logging.info(
            f"  (Rollup levels not found in input: {levels_not_found})"
        )

    # Determine temporary files
    temp_files = {
        level: dest_dir / f"{file_root}temp.{level}s{suffix}"
        for level in levels
    }
    logging.debug(
        "Using temp files: "
        f"{ {level: str(file) for level, file in temp_files.items()} }"
    )

    # Determine columns for output files and intermediate files
    in_column_names = reader.get_column_names()
    in_column_types = reader.get_column_types()

    temp_column_names, temp_column_types = remove_columns(
        in_column_names, in_column_types, ["q_value", "posterior_error_prob"]
    )

    # Configure temp writers
    merge_row_type = BufferType.Dicts
    temp_buffer_size = 1000
    temp_writers = {
        level: TabularDataWriter.from_suffix(
            temp_files[level],
            columns=temp_column_names,
            column_types=temp_column_types,
            buffer_size=temp_buffer_size,
            buffer_type=merge_row_type,
        )
        for level in levels
    }

    # todo: discuss: We need an option to write parquet or sql for example
    # (also, the output file type could depend on the input file type)

    # Write temporary files which contain only the best scoring entity of a
    # given level
    logging.debug(
        "Writing temp files: %s", [str(file) for file in temp_files.values()]
    )
    timer = make_timer()
    score_stats = OnlineStatistics()
    with auto_finalize(temp_writers.values()):
        count = 0
        seen_entities: dict[str, set] = {level: set() for level in levels}
        for data_row in reader.get_row_iterator(
            temp_column_names, row_type=merge_row_type
        ):
            count += 1
            if count % 10000 == 0:
                logging.debug(
                    f"  Processed {count} lines ({timer():.2f} seconds)"
                )
            for level in levels:
                seen = seen_entities[level]
                id_col = level
                if merge_row_type == BufferType.DataFrame:
                    id = data_row.loc[0, id_col]
                else:
                    id = data_row[id_col]
                if id not in seen:
                    seen.add(id)
                    temp_writers[level].append_data(data_row)
            score_stats.update_single(data_row["score"])

        logging.info(f"Read {count} PSMs")
        logging.debug(f"Score statistics: {score_stats.describe()}")

        for level in levels:
            seen = seen_entities[level]
            logging.info(
                f"Rollup level {level}: found {len(seen)} unique entities"
            )

    # Determine output files
    out_files_map = {
        level: [
            dest_dir / f"{file_root}targets.{level}s{suffix}",
            dest_dir / f"{file_root}decoys.{level}s{suffix}",
        ]
        for level in levels
    }

    # Configure temp readers and output writers
    buffer_size = 1000
    output_columns, output_types = remove_columns(
        in_column_names, in_column_types, ["is_decoy"]
    )
    output_options = dict(
        columns=output_columns,
        column_types=output_types,
        buffer_size=buffer_size,
    )

    def create_writer(path: Path):
        return TabularDataWriter.from_suffix(path, **output_options)

    for level in levels:
        output_writers = list(map(create_writer, out_files_map[level]))
        writer = TargetDecoyWriter(
            output_writers, write_decoys=True, decoy_column="is_decoy"
        )
        with auto_finalize(output_writers):
            temp_reader = temp_writers[level].get_associated_reader()
            compute_and_write_confidence(
                temp_reader,
                writer,
                config.qvalue_algorithm,
                config.peps_algorithm,
                config.stream_confidence,
                score_stats,
                peps_error=True,
                level=level,
                eval_fdr=0.01,
            )

@jspaezp jspaezp changed the title [WIP] Feature/auto pin handling2 [WIP] v0.11.0 RC Jan 2, 2025