[WIP] v0.11.0 RC #132

jspaezp · 2024-12-05T04:00:33Z

What does this pr do:

adds back the flashlfq support.
fixes the python API to work with the docs.
centralizes linting/formatting expectations to a makefile.
migrates dependency management to uv.
Adds PR template.
Addresses:

Notes:

Most of the lines added are just this file: data/phospho_rep1.traditional.pin
Which is a backport of the original testing file (phospho_rep1.pin which was ""fixed"" in another PR by removing the ragged aspect of the protein column PROT1\tPROT2 -> PROT1:PROT2)

Blockers:

Windows tests fail bc of numpy v1.x, updating to v2 needs triqler to bump the debendency. PR is accepted but not released. chore: update numpy to v2 statisticalbiotechnology/triqler#34

Unhandled things:

No idea why windows is failing tests.
Update docs/docstrings/vignettes

* ✨ cherry picks internal fixes from !68 and !70 * Cherry pick feature/confidence_streaming branch * ✨ adds filelock dependency for tests * 💄 linting * 💄 reformat to satisfy linter k * ✨ imports type annotations from future for python 3.9 * ✨ make pytest and cli behave with type annotations in Python 3.9 * ✨ test dropping Python 3.9 support - inspired by https://github.com/wfondrie/mokapot/pull/126/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957fL20 * Set scale_to_one to false in *all* cases * Fixed path problems probably causing errors under windows * Fix more possible path issues * Fix warning about bitwise not in python 3.12 * Fix problem with numpy 2.x's different str rep of floats * Make hashing of rows for splitting independent of numpy version and spectra columns * Feature/streaming fix windows (wfondrie#48) * ✨ log more infos * ✨ uses uv for env setup; fix dependencies --------- Co-authored-by: Elmar Zander <elmar.zander@googlemail.com>

Fixed retention time division by 60. Time is required in minutes for FlashLFQ, it's already in minutues Co-authored-by: William Fondrie <fondriew@gmail.com>

jspaezp · 2024-12-05T19:24:17Z

mokapot/tabular_data/format_chooser.py

 )

-CSV_SUFFIXES = [".csv", ".pin", ".tab", ".csv"]
+CSV_SUFFIXES = [


For the record... I still dislike naming so many tab-delimited file formats as "comma separated values (csv)"

I absolute agree. I just don't see a better way, as those other extensions are already out their in the wild.

I don't recall any tool that generates a tab delimited .csv of the top of my head. Do you happen to have an example? (I wont deal with it in this PR but in the future we could split csv-tsv formats internally)

Sorry, you're right. I somehow misread your initial comment. Yes, since we really never have "comma-separated" values anywhere, why not get completely rid of it and replace "comma separated/CSV" with "tab separated/TSV" everywhere.

For the record: when I started on this code base, it was something with "comma separated" everywhere, but a separator variable sep was passed around, which was always set to "\t". I got rid of all the explicit file reading/writing stuff and moved that into the readers/writers, set the separator (I think) unconditionally to "\t", but did not rename the variables/classes. So: my bad ;)

To be clear, I think adding support for .csv would be a good idea in the future (comma separated file)

jspaezp · 2024-12-05T21:19:02Z

Edit: a7401c3 does some progress, figured out the confidence but still need to "pipe" some columns needed by flashlfq, since _optional_columns was removed as an attribute from the confidence object

@gessulat and @ezander

I might need help with this one to understand how to update the documentation.

Right now if I try to do this (part of tests/unit_tests/test_writer_flashlfq.py):

# Using the psms_ondisk fixture from your tests ...
def test_sanity(psms_ondisk, tmp_path):
    """Run simple sanity checks"""

    mods, scores = mokapot.brew([psms_ondisk])
    conf = mokapot.assign_confidence(
        [psms_ondisk],
        scores_list=scores,
        eval_fdr=0.05,
        deduplication=False,  # RN fails with deduplication = True with an error saying that the column "ExpMass" does not exist 
    )

# When set to dedup=False it fails with error ` KeyError: 'proteinIds'`

so .... where are these columns specified? how can one assign confidence without proteins?

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/mokapot/confidence.py#L380-L388

https://github.com/jspaezp/mokapot/blob/08d73afec23a072642f37ba510bc6d2a7d3577db/tests/unit_tests/test_writer_flashlfq.py#L8-L19

jspaezp · 2024-12-07T03:55:11Z

Note:

There seems to be a difference on what 'OnDiskDataset' and 'LinearPsmDataset' mean by spectra:

on disk psm is all of these:

  ... labels = find_required_column("label", columns)

  # Optional columns
    filename = find_optional_column(filename_column, columns, "filename")
    calcmass = find_optional_column(calcmass_column, columns, "calcmass")
    expmass = find_optional_column(expmass_column, columns, "expmass")
    ret_time = find_optional_column(rt_column, columns, "ret_time")
    charge = find_optional_column(charge_column, columns, "charge_column")
    spectra = [c for c in [filename, scan, ret_time, expmass] if c is not None]

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/parsers/pin.py#L223-L232

and the linear psm defines it as

spectrum_columns : str or tuple of str
        The column(s) that collectively identify unique mass spectra. Multiple
        columns can be useful to avoid combining scans from multiple mass
        spectrometry runs.

https://github.com/jspaezp/mokapot/blob/73a0e14df017dcb0d8ba5c2ed2cfa2d17d581eab/mokapot/dataset.py#L255-L260

which would seem more like the OnDisk ... of specId_column (the linear psm uses as an index the compound index made from the columns defined by 'spectrum_columns' whilst the on disk dataset assumes there is a single column that can be used as a primary index).

ezander · 2024-12-08T13:00:27Z

@jspaezp I think there have been some changes in the meaning in some of the code entities (variables, functions etc.) that are not reflected in the naming and also quite some divergence between code and documentation. Maybe it makes sense to discuss these issues on teams or skype next week?

jspaezp · 2024-12-08T14:16:16Z

@ezander 100%, Emailed @gessulat to get a meeting scheduled.

jspaezp · 2024-12-09T14:10:03Z

mokapot/writers/flashlfq.py

-        proteins = conf._protein_column
-    else:
-        proteins = None
+    # TODO: make this work again ...


Note to self:

jspaezp · 2024-12-15T01:13:28Z

mokapot/confidence.py

        # todo: nice to have: move this column renaming stuff into the
        #   column defs module, and further, have standardized columns
        #   directly from the pin reader (applying the renaming itself)
+
        level_column_names = [
            "PSMId",


Question (@ezander and @gessulat) Is it a dealbreaker for you to have this column hard-coded? I believe that the current behavior in the published version of mokapot is to preserve all the "spectrum columns" (which usually will be psm id + file name + rank ... all columns that uniquely identify a PSM). Let me know if having a single "PSMId" column is a requirement.

Hi @jspaezp
I looked again at the current state of column handling in mokapot, which is a little bit confusing. This may not answer your question directly, but may give a basis for moving forward.

Input columns are determined by parsers/pin.py

All are case-insensitive

Required

specid (becomes later PSMId, not identical to the "spectrum")

peptide

proteins

label (target column)

scannr

Optional

rollup columns: modifiedpeptides, precursor, peptidegroup

filename

calcmass

expmass

ret_time

charge

Spectra is [filename, scannr, ret_time, expmass]. Since all except scannr are optional, can
span between 1 and 4 columns.
Is used for two things (IIRC): in model training for competition (only those with same "spectrum" compete) and later
in the deduplication process in confidence writing.

From this an OnDiskPsmDataset is created and the names of the inferred columns are set

IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so
that in the code, only hard-coded names (or names defined as constants) can be used (except maybe for the feature
columns, but there could also be some mapping scheme e.g. from "foo" -> "feature_001", "bar" -> "feature_002", ...)

Also: any conversion necessary for the input file should be done here (e.g. convert label from int to bool)

The level_columns (yeah, the naming...) are the columns of the intermediate files, that get written for each rollup
level before confidence assignment is made.

During writing of the "level files" deduplication is done on the columns as given by the "spectra columns" i.e. some
subset of [filename, scannr, ret_time, expmass]

Columns are renamed from the level_input_column_names to the level_column_names. Most importantly the "specId"
column is renamed to "PSMId".
(It's not quite clear why the "target_column" is not renamed, probably the name taken from the input file is usually
correct by some coincidence). This whole renaming business (as mentions should IMO move to the input file reader as
mentioned earlier)

output_column_names are the names of the columns after confidence estimation on the "(rollup) level files".

The names are hard-coded except for the extra_output_columns, which (as of now) only contain the optional rollup
levels. One more point for defining those as string constants somewhere.

Additionally there are the score, q-value, and posterior_error_prob columns. The - is problematic when writing
to SQL (should be q_value, but that clashed with other code momentarily)

The names of the columns could be inferred from reader columns when writing the output files. However, due to the
possibility of multiple input files going into one output file (and the inherited code structure) it was necessary
to write the column header before they could be inferred from a reader (but this could be changed, of course).

Note: it is still somewhat messy, but can be cleared up. The special treatment of proteins (which, according to some,
shouldn't be mokapot's business anyway) makes that somewhat harder.

Columns are renamed from the level_input_column_names to the level_column_names. Most importantly the "specId"
column is renamed to "PSMId".

This assumes there is a single column that uniquely identifies the spectrum. Since we are not writing say ... the file name, there is the clear assumption that the information would be redundant with the PSMId.

I believe that the point I want to make is... SINCE we are requesting and supporting multiple columns denoting uniqueness of the spectrum why are we not writing them? IMHO If the user wants a single column called PSMId, they should pass it as an input. (input columns should be the same as output columns)

IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so
that in the code, only hard-coded names (or names defined as constants) can be used (except maybe for the feature
columns, but there could also be some mapping scheme e.g. from "foo" -> "feature_001", "bar" -> "feature_002", ...)

I don't think it matters, the interface should be defined in the ABC, not in the OnDiskPSMDataset

Additionally there are the score, q-value, and posterior_error_prob columns. The - is problematic when writing

Cool, I will make sure columns are sql-friendly.

The names of the columns could be inferred from reader columns when writing the output files. However, due to the
possibility of multiple input files going into one output file (and the inherited code structure) it was necessary
to write the column header before they could be inferred from a reader (but this could be changed, of course).

But we check when multiple datasets are passed that all of them have the same columns ... (I dont think we should support multiple input files where each can have different column names)

My proposal:

Target/Label_column: [str] Column that marks whether an element is a target/decoy. Bool? (I think it has to be a bool RN) Spectrum_columns: list[str] The series of columns that denote which PSMs are competing with each other. For example: 1. We can have both a target and a decoy coming from the same scan number + file. 2. We can have multiple peptides suggested as targets for each scan. We should also warn if a float column is passed here, since testing equality in floats is not reliable. Peptide_column: [str] A column that denotes a peptide, this column + spectrum_columns should identify uniquely a PSM (thus have no duplicates) Extra rollup columns: list[str] Other columns that can be used to compete the spectra for summary. Each of those columns will be used to summarize the output as at a different level. For example: modifiedpeptides, precursor, peptidegroup Feature_columns: list[str] List of all the columns that can be used as possible scores to build a model.

Trim the PSMDataset API to contain only those groups (that includes the LinearPSMDataset and the OnDiskPsmDataset).

Request other columns as needed for other outputs (move to ‘to_flashlfq’ or so …)

Store the best scores internally in the Dataset (instead of being passed by brew and then used again in the confidence estimation)

Write columns as-is (maybe replace spaces + punctuation/hyphenation to '_')

cleaning up the confidence api over here: jspaezp#2

Looking back at it... having the support for the variable compound key is especially hard for the sqlite output ... at least with the current implementation, where the writer requires hard-coded columns.

jspaezp · 2024-12-16T18:35:20Z

Why is there a "do_rollup" separate from the Confidence ??

mokapot/mokapot/rollup.py

Lines 59 to 241 in 0d1f437

    
           @typechecked 
        
           def do_rollup(config): 
        
               # todo: refactor: this function is far too long. Should be split. Probably 
        
               #  at least one function to configure the input readers, one to write the 
        
               #  intermediate/temp files, and one that computes the statistics (q-values 
        
               #  and peps and writes the output files) 
        
               base_level: str = config.level 
        
               src_dir: Path = config.src_dir 
        
               dest_dir: Path = config.dest_dir 
        
               file_root: str = config.file_root + "." 
        
               # Determine input files 
        
               if len(list(src_dir.glob(f"*.{base_level}s.parquet"))) > 0: 
        
                   if len(list(src_dir.glob(f"*.{base_level}s.csv"))) > 0: 
        
                       raise RuntimeError( 
        
                           "Only input files of either type CSV or type Parquet should " 
        
                           f"exist in '{src_dir}', but both types were found." 
        
                       ) 
        
                   suffix = ".parquet" 
        
               else: 
        
                   suffix = ".csv" 
        
               target_files: list[Path] = sorted( 
        
                   src_dir.glob(f"*.targets.{base_level}s{suffix}") 
        
               ) 
        
               decoy_files: list[Path] = sorted( 
        
                   src_dir.glob(f"*.decoys.{base_level}s{suffix}") 
        
               ) 
        
               target_files = [ 
        
                   file for file in target_files if not file.name.startswith(file_root) 
        
               ] 
        
               decoy_files = [ 
        
                   file for file in decoy_files if not file.name.startswith(file_root) 
        
               ] 
        
               in_files: list[Path] = sorted(target_files + decoy_files) 
        
               logging.info(f"Reading files: {[str(file) for file in in_files]}") 
        
               if len(in_files) == 0: 
        
                   raise ValueError("No input files found.") 
        
               # Configure readers (read targets/decoys and adjoin is_decoy column) 
        
               target_readers = [ 
        
                   get_target_decoy_reader(path, False) for path in target_files 
        
               ] 
        
               decoy_readers = [ 
        
                   get_target_decoy_reader(path, True) for path in decoy_files 
        
               ] 
        
               reader = MergedTabularDataReader( 
        
                   target_readers + decoy_readers, 
        
                   priority_column="score", 
        
                   reader_chunk_size=10000, 
        
               ) 
        
               # Determine out levels 
        
               levels = compute_rollup_levels(base_level, DEFAULT_PARENT_LEVELS) 
        
               levels_not_found = [ 
        
                   level for level in levels if level not in reader.get_column_names() 
        
               ] 
        
               levels = [level for level in levels if level in reader.get_column_names()] 
        
               logging.info(f"Rolling up to levels: {levels}") 
        
               if len(levels_not_found) > 0: 
        
                   logging.info( 
        
                       f"  (Rollup levels not found in input: {levels_not_found})" 
        
                   ) 
        
               # Determine temporary files 
        
               temp_files = { 
        
                   level: dest_dir / f"{file_root}temp.{level}s{suffix}" 
        
                   for level in levels 
        
               } 
        
               logging.debug( 
        
                   "Using temp files: " 
        
                   f"{ {level: str(file) for level, file in temp_files.items()} }" 
        
               ) 
        
               # Determine columns for output files and intermediate files 
        
               in_column_names = reader.get_column_names() 
        
               in_column_types = reader.get_column_types() 
        
               temp_column_names, temp_column_types = remove_columns( 
        
                   in_column_names, in_column_types, ["q_value", "posterior_error_prob"] 
        
               ) 
        
               # Configure temp writers 
        
               merge_row_type = BufferType.Dicts 
        
               temp_buffer_size = 1000 
        
               temp_writers = { 
        
                   level: TabularDataWriter.from_suffix( 
        
                       temp_files[level], 
        
                       columns=temp_column_names, 
        
                       column_types=temp_column_types, 
        
                       buffer_size=temp_buffer_size, 
        
                       buffer_type=merge_row_type, 
        
                   ) 
        
                   for level in levels 
        
               } 
        
               # todo: discuss: We need an option to write parquet or sql for example 
        
               #  (also, the  output file type could depend on the input file type) 
        
               # Write temporary files which contain only the best scoring entity of a 
        
               # given level 
        
               logging.debug( 
        
                   "Writing temp files: %s", [str(file) for file in temp_files.values()] 
        
               ) 
        
               timer = make_timer() 
        
               score_stats = OnlineStatistics() 
        
               with auto_finalize(temp_writers.values()): 
        
                   count = 0 
        
                   seen_entities: dict[str, set] = {level: set() for level in levels} 
        
                   for data_row in reader.get_row_iterator( 
        
                       temp_column_names, row_type=merge_row_type 
        
                   ): 
        
                       count += 1 
        
                       if count % 10000 == 0: 
        
                           logging.debug( 
        
                               f"  Processed {count} lines ({timer():.2f} seconds)" 
        
                           ) 
        
                       for level in levels: 
        
                           seen = seen_entities[level] 
        
                           id_col = level 
        
                           if merge_row_type == BufferType.DataFrame: 
        
                               id = data_row.loc[0, id_col] 
        
                           else: 
        
                               id = data_row[id_col] 
        
                           if id not in seen: 
        
                               seen.add(id) 
        
                               temp_writers[level].append_data(data_row) 
        
                       score_stats.update_single(data_row["score"]) 
        
                   logging.info(f"Read {count} PSMs") 
        
                   logging.debug(f"Score statistics: {score_stats.describe()}") 
        
                   for level in levels: 
        
                       seen = seen_entities[level] 
        
                       logging.info( 
        
                           f"Rollup level {level}: found {len(seen)} unique entities" 
        
                       ) 
        
               # Determine output files 
        
               out_files_map = { 
        
                   level: [ 
        
                       dest_dir / f"{file_root}targets.{level}s{suffix}", 
        
                       dest_dir / f"{file_root}decoys.{level}s{suffix}", 
        
                   ] 
        
                   for level in levels 
        
               } 
        
               # Configure temp readers and output writers 
        
               buffer_size = 1000 
        
               output_columns, output_types = remove_columns( 
        
                   in_column_names, in_column_types, ["is_decoy"] 
        
               ) 
        
               output_options = dict( 
        
                   columns=output_columns, 
        
                   column_types=output_types, 
        
                   buffer_size=buffer_size, 
        
               ) 
        
               def create_writer(path: Path): 
        
                   return TabularDataWriter.from_suffix(path, **output_options) 
        
               for level in levels: 
        
                   output_writers = list(map(create_writer, out_files_map[level])) 
        
                   writer = TargetDecoyWriter( 
        
                       output_writers, write_decoys=True, decoy_column="is_decoy" 
        
                   ) 
        
                   with auto_finalize(output_writers): 
        
                       temp_reader = temp_writers[level].get_associated_reader() 
        
                       compute_and_write_confidence( 
        
                           temp_reader, 
        
                           writer, 
        
                           config.qvalue_algorithm, 
        
                           config.peps_algorithm, 
        
                           config.stream_confidence, 
        
                           score_stats, 
        
                           peps_error=True, 
        
                           level=level, 
        
                           eval_fdr=0.01, 
        
                       )

Chore/fix confidence api

jspaezp and others added 12 commits September 6, 2024 16:48

(feat) added auto handling of traditional pin and testing

d9d91dc

(fix) added handling of default direction

31dff36

(fix) changed intermediate files pin->tsv and fixed tests accordingly

5940500

(chore) formatting on docs and removed T20 from them

a64c1f2

(chore) upgraded to upstream actions

1ff50c5

(chore) removed unused dependency in docs

1809edd

(chore) reformatted tests

3214c24

Small changes for FlashLFQ writer (wfondrie#131)

6db5964

Fixed retention time division by 60. Time is required in minutes for FlashLFQ, it's already in minutues Co-authored-by: William Fondrie <fondriew@gmail.com>

wip: formatting and rebasing fixes

a17fa4c

chore: merge main

8c779fa

chore: ruff format

d6ef287

jspaezp mentioned this pull request Dec 5, 2024

Storey's method for q-value computation (draft) #133

Closed

wip,chore: re-adding flashlfq support

1d7475f

jspaezp commented Dec 5, 2024

View reviewed changes

jspaezp added 2 commits December 5, 2024 14:33

ci,fix: fixed confidence out and ci migration

0074f88

format: eof newline

08d73af

jspaezp mentioned this pull request Dec 5, 2024

(feat) added auto handling of traditional pin and testing #126

Closed

jspaezp added 5 commits December 5, 2024 17:05

wip,fix: progress to re-add flashlfq output

a7401c3

chore: uv lock and formatting

35cb9d8

chore: added pr template

ce53dee

wip: make brew generic again

d6f58ac

wip,fix: added deleter to on psm dataset

73a0e14

jspaezp added 5 commits December 7, 2024 17:08

feat: re-added flashlfq support

f7e8dbd

chore: linting + formatting

6045ade

fix: fixtures and progess in definition of cols

b02567a

test, fix: annotated/commented new fixtures

b06b01e

lint: formatting

26c4c78

jspaezp added 2 commits December 8, 2024 06:37

ci: removed fail on codecov fail

874e2f6

ci: test speedup replacing RF with dtree

2e0a5bd

jspaezp force-pushed the feature/auto_pin_handling2 branch from c487e00 to 7109058 Compare December 8, 2024 14:12

ci: attempt to fix test docker build

9841733

jspaezp force-pushed the feature/auto_pin_handling2 branch from 7109058 to 9841733 Compare December 8, 2024 14:29

jspaezp commented Dec 9, 2024

View reviewed changes

jspaezp added 4 commits December 13, 2024 14:02

refactor: change stats to dataclass

6749f84

refactor: extracted output writer factory

1a95140

refactor: extracted level manager in confidence

0466b71

refactor: extracted level writer group

692a9f2

jspaezp commented Dec 15, 2024

View reviewed changes

jspaezp added 2 commits December 15, 2024 12:08

refactor: extracted more writer builder work to class

d4d3e8c

feat: score propagation and unscored confidence

789f0b5

jspaezp and others added 10 commits December 16, 2024 15:35

feat(confidence): add data reading api

59e649d

feat,experiment: Experimental qvalue-fdr estimation

2e43ce2

chore,docs: updated basic docs to curr api and updated typing

be91528

chore: updated basic n joint model docs code (md in progress)

680fc5b

chore: updated notebook

4ece548

chore,confidence: update docstrings

409d98d

chore,qvalue: removed commented out code

100ec58

Merge pull request #2 from jspaezp/chore/fix_confidence_api

c246df7

Chore/fix confidence api

chore: fixed line length lints in docstrings

1e7d68c

fix,sqlite: fixed path for sqlite writer

a10cb31

jspaezp changed the title ~~[WIP] Feature/auto pin handling2~~ [WIP] v0.11.0 RC Jan 2, 2025

jspaezp mentioned this pull request Jan 8, 2025

feat, wip: compound key on spectrum jspaezp/mokapot#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] v0.11.0 RC #132

[WIP] v0.11.0 RC #132

jspaezp commented Dec 5, 2024 •

edited

Loading

jspaezp Dec 5, 2024

ezander Dec 8, 2024

jspaezp Dec 8, 2024

ezander Dec 8, 2024

jspaezp Dec 9, 2024

jspaezp commented Dec 5, 2024 •

edited

Loading

jspaezp commented Dec 7, 2024 •

edited

Loading

ezander commented Dec 8, 2024

jspaezp commented Dec 8, 2024

jspaezp Dec 9, 2024

jspaezp Dec 15, 2024

ezander Dec 16, 2024

jspaezp Dec 16, 2024

jspaezp Dec 16, 2024

jspaezp Jan 2, 2025

jspaezp commented Dec 16, 2024

[WIP] v0.11.0 RC #132

Are you sure you want to change the base?

[WIP] v0.11.0 RC #132

Conversation

jspaezp commented Dec 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jspaezp commented Dec 5, 2024 • edited Loading

jspaezp commented Dec 7, 2024 • edited Loading

ezander commented Dec 8, 2024

jspaezp commented Dec 8, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

My proposal:

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jspaezp commented Dec 16, 2024

jspaezp commented Dec 5, 2024 •

edited

Loading

jspaezp commented Dec 5, 2024 •

edited

Loading

jspaezp commented Dec 7, 2024 •

edited

Loading