[WIP] v0.11.0 RC #132
base: main
Conversation
* ✨ cherry picks internal fixes from !68 and !70
* Cherry pick feature/confidence_streaming branch
* ✨ adds filelock dependency for tests
* 💄 linting
* 💄 reformat to satisfy linter
* ✨ imports type annotations from future for python 3.9
* ✨ make pytest and cli behave with type annotations in Python 3.9
* ✨ test dropping Python 3.9 support - inspired by https://github.com/wfondrie/mokapot/pull/126/files#diff-1db27d93186e46d3b441ece35801b244db8ee144ff1405ca27a163bfe878957fL20
* Set scale_to_one to false in *all* cases
* Fixed path problems probably causing errors under windows
* Fix more possible path issues
* Fix warning about bitwise not in python 3.12
* Fix problem with numpy 2.x's different str rep of floats
* Make hashing of rows for splitting independent of numpy version and spectra columns
* Feature/streaming fix windows (wfondrie#48)
* ✨ log more infos
* ✨ uses uv for env setup; fix dependencies

Co-authored-by: Elmar Zander <elmar.zander@googlemail.com>
Fixed retention time division by 60. FlashLFQ requires time in minutes, and the value is already in minutes.

Co-authored-by: William Fondrie <fondriew@gmail.com>
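A minimal sketch of the fix described above (the function name is hypothetical, not mokapot's actual API): FlashLFQ expects retention time in minutes, and the incoming value is already in minutes, so no division by 60 should happen.

```python
def flashlfq_retention_time(rt_minutes: float) -> float:
    """Retention time to write into a FlashLFQ input row.

    FlashLFQ expects minutes and the value is already in minutes, so it
    is passed through unchanged. (The buggy version divided by 60 again,
    silently turning minutes into hours.)
    """
    return rt_minutes
```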
```diff
-CSV_SUFFIXES = [".csv", ".pin", ".tab", ".csv"]
+CSV_SUFFIXES = [
```
For the record... I still dislike naming so many tab-delimited file formats as "comma separated values (csv)"
I absolutely agree. I just don't see a better way, as those other extensions are already out there in the wild.
I don't recall any tool that generates a tab-delimited .csv off the top of my head. Do you happen to have an example? (I won't deal with it in this PR, but in the future we could split csv/tsv formats internally.)
Sorry, you're right. I somehow misread your initial comment. Yes, since we really never have "comma-separated" values anywhere, why not get completely rid of it and replace "comma separated/CSV" with "tab separated/TSV" everywhere.
For the record: when I started on this code base, it said "comma separated" everywhere, but a separator variable `sep` was passed around, which was always set to "\t". I got rid of all the explicit file reading/writing stuff and moved that into the readers/writers, set the separator (I think) unconditionally to "\t", but did not rename the variables/classes. So: my bad ;)
To be clear, I think adding support for .csv would be a good idea in the future (comma separated file)
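A sketch of what splitting csv/tsv handling internally could look like (hypothetical names, not mokapot's current API): map each known suffix to its delimiter instead of funneling everything through a "CSV" reader.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to field delimiter. The ".pin"
# and ".tab" formats are tab-delimited despite historically being
# routed through the "CSV" reader; a real ".csv" gets a comma.
SUFFIX_DELIMITERS = {
    ".csv": ",",
    ".tsv": "\t",
    ".pin": "\t",
    ".tab": "\t",
}


def delimiter_for(path: str) -> str:
    """Pick a field delimiter from the file suffix, defaulting to tab."""
    return SUFFIX_DELIMITERS.get(Path(path).suffix.lower(), "\t")
```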
Edit: a7401c3 makes some progress. I figured out the confidence part, but still need to "pipe" some columns needed by FlashLFQ; I might need help with this one to understand how to update the documentation. Right now, if I try to do this (part of tests/unit_tests/test_writer_flashlfq.py):
So... where are these columns specified? How can one assign confidence without proteins?
Note: there seems to be a difference in what 'OnDiskDataset' and 'LinearPsmDataset' mean by spectra: the on-disk PSM dataset uses all of these:
and the linear PSM dataset defines it as
which would seem more like the OnDisk one...
@jspaezp I think there have been some changes in the meaning of some of the code entities (variables, functions, etc.) that are not reflected in the naming, and also quite some divergence between code and documentation. Maybe it makes sense to discuss these issues on Teams or Skype next week?
```
proteins = conf._protein_column
else:
    proteins = None
# TODO: make this work again ...
```
Note to self:
```
# todo: nice to have: move this column renaming stuff into the
# column defs module, and further, have standardized columns
# directly from the pin reader (applying the renaming itself)
```
```
level_column_names = [
    "PSMId",
```
Question (@ezander and @gessulat) Is it a dealbreaker for you to have this column hard-coded? I believe that the current behavior in the published version of mokapot is to preserve all the "spectrum columns" (which usually will be psm id + file name + rank ... all columns that uniquely identify a PSM). Let me know if having a single "PSMId" column is a requirement.
Hi @jspaezp
I looked again at the current state of column handling in mokapot, which is a little bit confusing. This may not answer your question directly, but may give a basis for moving forward.
Input columns are determined by `parsers/pin.py`:
- All are case-insensitive.
- Required:
  - specid (later becomes PSMId; not identical to the "spectrum")
  - peptide
  - proteins
  - label (target column)
  - scannr
- Optional:
  - rollup columns: modifiedpeptides, precursor, peptidegroup
  - filename
  - calcmass
  - expmass
  - ret_time
  - charge
- Spectra is [filename, scannr, ret_time, expmass]. Since all except scannr are optional, it can span between 1 and 4 columns. It is used for two things (IIRC): in model training for competition (only those with the same "spectrum" compete) and later in the deduplication process in confidence writing.
- From this an OnDiskPsmDataset is created and the names of the inferred columns are set.
- IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so that in the code only hard-coded names (or names defined as constants) are used (except maybe for the feature columns, but there could also be some mapping scheme, e.g. "foo" -> "feature_001", "bar" -> "feature_002", ...).
- Also: any conversion necessary for the input file should be done here (e.g. convert label from int to bool).
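The ColumnMappedReader idea above could look roughly like this (a sketch under the assumptions in this comment; the class and mapping names are illustrative, not mokapot's actual API):

```python
from typing import Iterator


class ColumnMappedReader:
    """Wraps a row iterator and renames columns so downstream code can
    rely on fixed canonical names (e.g. "specid" -> "PSMId"). Input
    names are matched case-insensitively, mirroring the pin parser.
    """

    def __init__(self, rows: Iterator[dict], mapping: dict):
        self._rows = rows
        self._mapping = mapping  # lowercase input name -> canonical name

    def __iter__(self):
        for row in self._rows:
            yield {
                self._mapping.get(key.lower(), key): value
                for key, value in row.items()
            }
```

A plain CSVFileReader (or `csv.DictReader`) would be wrapped in this, so only the canonical names appear in the rest of the code.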
The `level_columns` (yeah, the naming...) are the columns of the intermediate files that get written for each rollup level before confidence assignment is made.
- During writing of the "level files", deduplication is done on the columns given by the "spectra columns", i.e. some subset of [filename, scannr, ret_time, expmass].
- Columns are renamed from the `level_input_column_names` to the `level_column_names`. Most importantly, the "specId" column is renamed to "PSMId". (It's not quite clear why the "target_column" is not renamed; probably the name taken from the input file is usually correct by some coincidence.) This whole renaming business should IMO move to the input file reader, as mentioned earlier.

The `output_column_names` are the names of the columns after confidence estimation on the "(rollup) level files".
- The names are hard-coded except for the `extra_output_columns`, which (as of now) only contain the optional rollup levels. One more point for defining those as string constants somewhere.
- Additionally there are the `score`, `q-value`, and `posterior_error_prob` columns. The `-` is problematic when writing to SQL (it should be `q_value`, but that clashed with other code momentarily).
- The names of the columns could be inferred from the reader columns when writing the output files. However, due to the possibility of multiple input files going into one output file (and the inherited code structure), it was necessary to write the column header before it could be inferred from a reader (but this could be changed, of course).

Note: it is still somewhat messy, but can be cleared up. The special treatment of proteins (which, according to some, shouldn't be mokapot's business anyway) makes that somewhat harder.
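The `q-value`-to-SQL problem mentioned above could be handled by a small sanitizer like this (a sketch; the function name is hypothetical):

```python
import re


def sql_safe(name: str) -> str:
    """Replace runs of characters that are awkward in SQL identifiers
    (spaces, hyphens, other punctuation) with a single underscore,
    e.g. "q-value" -> "q_value".
    """
    return re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_")
```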
> Columns are renamed from the level_input_column_names to the level_column_names. Most importantly the "specId" column is renamed to "PSMId".

This assumes there is a single column that uniquely identifies the spectrum. Since we are not writing, say, the file name, there is the clear assumption that the information would be redundant with the `PSMId`. The point I want to make is: SINCE we are requesting and supporting multiple columns denoting uniqueness of the spectrum, why are we not writing them? IMHO, if the user wants a single column called `PSMId`, they should pass it as an input (input columns should be the same as output columns).

> IMHO the dataset should rather get a reader, and the plain CSVFileReader should be wrapped in a ColumnMappedReader so that in the code, only hard-coded names (or names defined as constants) can be used (except maybe for the feature columns, but there could also be some mapping scheme e.g. from "foo" -> "feature_001", "bar" -> "feature_002", ...)

I don't think it matters; the interface should be defined in the ABC, not in the OnDiskPSMDataset.

> Additionally there are the score, q-value, and posterior_error_prob columns. The - is problematic when writing

Cool, I will make sure columns are SQL-friendly.

> The names of the columns could be inferred from reader columns when writing the output files. However, due to the possibility of multiple input files going into one output file (and the inherited code structure) it was necessary to write the column header before they could be inferred from a reader (but this could be changed, of course).

But we check, when multiple datasets are passed, that all of them have the same columns... (I don't think we should support multiple input files where each can have different column names.)
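The check described here could be as simple as the following (a sketch; the function name is hypothetical):

```python
def check_same_columns(column_sets) -> list:
    """Verify that every input file exposes the same column list before
    their rows are merged into one output file."""
    first = None
    for index, columns in enumerate(column_sets):
        columns = list(columns)
        if first is None:
            first = columns
        elif columns != first:
            raise ValueError(
                f"Input file {index} has columns {columns}, expected {first}"
            )
    return first or []
```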
My proposal:

- Target/Label_column: str
  Column that marks whether an element is a target or a decoy. Bool? (I think it has to be a bool right now.)
- Spectrum_columns: list[str]
  The series of columns that denote which PSMs are competing with each other. For example:
  1. We can have both a target and a decoy coming from the same scan number + file.
  2. We can have multiple peptides suggested as targets for each scan.
  We should also warn if a float column is passed here, since testing equality on floats is not reliable.
- Peptide_column: str
  A column that denotes a peptide; this column + spectrum_columns should uniquely identify a PSM (thus have no duplicates).
- Extra rollup columns: list[str]
  Other columns that can be used to compete the spectra for summary. Each of these columns will be used to summarize the output at a different level. For example: modifiedpeptides, precursor, peptidegroup.
- Feature_columns: list[str]
  List of all the columns that can be used as possible scores to build a model.

Then:
- Trim the PSMDataset API to contain only those groups (this includes the LinearPSMDataset and the OnDiskPsmDataset).
- Request other columns as needed for other outputs (move to `to_flashlfq` or so…).
- Store the best scores internally in the Dataset (instead of being passed by `brew` and then used again in the confidence estimation).
- Write columns as-is (maybe replace spaces + punctuation/hyphenation with '_').
cleaning up the confidence api over here: jspaezp#2
Looking back at it... having the support for the variable compound key is especially hard for the sqlite output ... at least with the current implementation, where the writer requires hard-coded columns.
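For what it's worth, a variable compound key is expressible in SQLite if the writer builds its schema from the key columns instead of hard-coding them; a sketch (table and column names are hypothetical, not the current implementation):

```python
import sqlite3


def create_psm_table(conn: sqlite3.Connection, key_columns: list) -> None:
    """Create a PSM table whose primary key is the variable compound key
    identifying a spectrum (e.g. file name + scan number)."""
    columns = ", ".join(f'"{name}" TEXT' for name in key_columns)
    primary_key = ", ".join(f'"{name}"' for name in key_columns)
    conn.execute(
        f"CREATE TABLE psms ({columns}, score REAL, q_value REAL, "
        f"PRIMARY KEY ({primary_key}))"
    )
```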
Why is there a "do_rollup" separate from the … (Lines 59 to 241 in 0d1f437)
Chore/fix confidence api
What does this PR do:

Addresses:
- `mokapot.picked_protein.strip_peptides` #130

Notes:
- Most of the lines added are just this file: data/phospho_rep1.traditional.pin, which is a backport of the original testing file (`phospho_rep1.pin`), which was ""fixed"" in another PR by removing the ragged aspect of the protein column (`PROT1\tPROT2` -> `PROT1:PROT2`).

Blockers:

Unhandled things: