MRG: when lingroups are provided, use them for csv_summary #3311

Merged
bluegenes merged 15 commits into latest from better-summarized-lingroups
Oct 24, 2024

Conversation

@bluegenes bluegenes commented Aug 26, 2024

Currently, when we generate a csv_summary with LINs, we get a summary row at every single LIN rank, which produces a lot of results, most of them not very helpful. LINgroups are our way of linking LINs (e.g. 14;1;0;0;0;0;0;0;0;0) to a known name or taxonomic group (e.g. "Phylotype I").

This PR changes the behavior of csv_summary when a lingroup file is provided, limiting summarized reporting to just the named lingroups. While the output is very similar to the lingroup output we already have, the most important difference is that the sample name is included in the output, meaning that we get intelligible results when running tax metagenome on more than one sample.
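The filtering described above can be illustrated with a minimal Python sketch. This is not the sourmash implementation: the column names (`query_name`, `lingroup`, `lin`, `fraction`), the sample names, the fractions, the second lingroup entry, and the helper functions are all hypothetical and chosen only for illustration. It assumes a summary row is kept when its LIN matches a named lingroup exactly.

```python
import csv
import io

# Hypothetical lingroup table: LIN -> human-readable name.
LINGROUPS = {
    "14;1;0;0;0;0;0;0;0;0": "Phylotype I",
    "14;2": "Phylotype II",  # made-up entry for illustration
}

# Hypothetical per-sample summaries: one row per (query, LIN rank).
SUMMARIZED = [
    {"query_name": "sampleA", "lin": "14", "fraction": 0.9},
    {"query_name": "sampleA", "lin": "14;1;0;0;0;0;0;0;0;0", "fraction": 0.6},
    {"query_name": "sampleA", "lin": "14;2", "fraction": 0.3},
    {"query_name": "sampleB", "lin": "14;1;0;0;0;0;0;0;0;0", "fraction": 0.2},
    {"query_name": "sampleB", "lin": "14;1;0;0;0;0;0;0;0;0;3", "fraction": 0.1},
]


def summarize_named_lingroups(rows, lingroups):
    """Keep only rows whose LIN exactly matches a named lingroup."""
    kept = []
    for row in rows:
        name = lingroups.get(row["lin"])
        if name is None:
            continue  # unnamed LIN rank: dropped from the summary
        kept.append(
            {
                "query_name": row["query_name"],  # keeps samples distinguishable
                "lingroup": name,
                "lin": row["lin"],
                "fraction": row["fraction"],
            }
        )
    return kept


def to_csv(rows):
    """Serialize the kept rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["query_name", "lingroup", "lin", "fraction"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


named_rows = summarize_named_lingroups(SUMMARIZED, LINGROUPS)
csv_text = to_csv(named_rows)
```

Because every row carries the query name, results from several samples can share one file without becoming ambiguous.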

Previously, tax metagenome always generated a lingroup output file when a lingroups file was provided. Here, I disable that for multiple queries, since the lingroup output does not include query information and the results wouldn't make sense. I do not replace it with another default, but I did add a recommendation to the help text and documentation.
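A rough sketch of that default-selection rule, under stated assumptions: the function name, its parameters, and the format strings below are hypothetical and do not come from the sourmash code base. It only models the automatic default; what happens when a user explicitly requests the lingroup format with multiple queries is not covered here.

```python
def default_output_formats(requested_formats, lingroup_file, n_queries):
    """Return the output formats to write (hypothetical helper).

    The lingroup format is added automatically only when a lingroups
    file is given and there is exactly one query. No other format is
    substituted when there are multiple queries.
    """
    formats = list(requested_formats)  # copy so the caller's list is untouched
    if lingroup_file is not None and n_queries == 1:
        if "lingroup" not in formats:
            formats.append("lingroup")
    return formats


single = default_output_formats(["csv_summary"], "lingroups.csv", 1)
multiple = default_output_formats(["csv_summary"], "lingroups.csv", 3)
no_lingroups = default_output_formats(["csv_summary"], None, 1)
```

With these inputs, only the single-query call gains the extra lingroup format; the multi-query call returns exactly what was requested.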

In the future, we could consider changing the default lingroup output to csv_summary, since it's actually useful for multiple files. Or, we could modify the lingroup output to include query information.

codecov bot commented Aug 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.40%. Comparing base (5acf698) to head (e3c3bb6).
Report is 1 commit behind head on latest.

Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3311      +/-   ##
==========================================
+ Coverage   86.45%   92.40%   +5.94%     
==========================================
  Files         137      104      -33     
  Lines       16070    12925    -3145     
  Branches     2211     2219       +8     
==========================================
- Hits        13894    11943    -1951     
+ Misses       1869      675    -1194     
  Partials      307      307              
| Flag | Coverage Δ |
|---|---|
| hypothesis-py | 25.43% <8.00%> (-0.04%) ⬇️ |
| python | 92.40% <100.00%> (+<0.01%) ⬆️ |
| rust | ? |

@bluegenes

@sourmash-bio/devs ready for review

@bluegenes bluegenes changed the title from "WIP: when lingroups are provided, use them for csv_summary" to "MRG: when lingroups are provided, use them for csv_summary" Sep 4, 2024

@ctb ctb left a comment

looks good - esp appreciate the documentation update.

there's some missing code coverage - is this just buggy codecov? I haven't dug in at all.

ctb commented Sep 18, 2024

oh! I wanted to suggest that you put the suggested changes to behavior in the PR description into new issues, too; I think they require a major version bump?

@bluegenes

> looks good - esp appreciate the documentation update.
>
> there's some missing code coverage - is this just buggy codecov? I haven't dug in at all.

looks like it was just buggy codecov!

@bluegenes

> oh! I wanted to suggest that you put the suggested changes to behavior in the PR description into new issues, too; I think they require a major version bump?

now in #3361

@bluegenes bluegenes merged commit 6ae9cd3 into latest Oct 24, 2024
43 of 44 checks passed
@bluegenes bluegenes deleted the better-summarized-lingroups branch October 24, 2024 20:09
@ctb ctb mentioned this pull request Dec 5, 2024
ctb added a commit that referenced this pull request Dec 5, 2024
Developer updates:

* build: move ORCID to metadata in pyproject.toml, fix pixi (#3416)
* build: simplify Rust release (#3392)
* fix: Avoid re-calculating md5sum on clone and conversion to KmerMinHashBTree (#3385)
* r0.15.1 release (#3304)
* update sourmash core to r0.17.0 (#3381)
* Added union method to HLL (#3293)
* Build: upgrade to newer maturin (#3366)
* CI: use supported ubuntu for codspeed (#3350)
* Fix clippy lints from 1.83 beta (#3357)
* Implement resumability for revindex (#3275)
* add `Manifest::intersect_manifest` to Rust core (#3305)
* bump sourmash core to r0.17.2 (#3399)
* change `sig_from_record` to use scaled from `Record` to downsample (#3387)
* derive Hash for `HashFunctions` (#3344)
* enforce a single scaled on a `CollectionSet` (#3397)
* fix formatting from #3306 (#3307)
* have ruff ignore ipynb so as to avoid triggering an error during CI (#3325)
* improve downsampling behavior on `KmerMinHash`; fix `RevIndex::gather` bug around `scaled`. (#3342)
* panic when `FSStorage::load_sig` encounters more than one `Signature` in a JSON record (#3333)
* propagate error from `RocksDB::open` on bad directory (#3306)
* refactor `calculate_gather_stats` to disallow repeated downsampling (#3352)
* release core r0.17.1 (#3388)
* release sourmash rust core r0.16.0 (#3356)
* standardize on u32 for scaled, and introduce `ScaledType` (#3364)
* update plugin documentation for users (#3286)
* update sourmash core to r0.15.2 (#3338)
* when lingroups are provided, use them for `csv_summary` (#3311)
* Misc Rust updates to core (#3297)
* Resolve issue for high precision MLE estimation (#3296)

Dependabot and pre-commit CI updates:

* Bump DeterminateSystems/magic-nix-cache-action from 7 to 8 (#3319)
* Bump DeterminateSystems/nix-installer-action from 13 to 14 (#3320)
* Bump DeterminateSystems/nix-installer-action from 14 to 15 (#3374)
* Bump DeterminateSystems/nix-installer-action from 15 to 16 (#3401)
* Bump camino from 1.1.7 to 1.1.9 (#3301)
* Bump codspeed-criterion-compat from 2.6.0 to 2.7.2 (#3324)
* Bump conda-incubator/setup-miniconda from 3.0.4 to 3.1.0 (#3373)
* Bump csv from 1.3.0 to 1.3.1 (#3390)
* Bump getset from 0.1.2 to 0.1.3 (#3328)
* Bump histogram from 0.11.0 to 0.11.1 (#3377)
* Bump js-sys from 0.3.72 to 0.3.74 (#3412)
* Bump memmap2 from 0.9.4 to 0.9.5 (#3326)
* Bump myst-parser from 3.0.1 to 4.0.0 (#3277)
* Bump needletail from 0.5.1 to 0.6.0 (#3376)
* Bump pypa/cibuildwheel from 2.19.2 to 2.20.0 (#3278)
* Bump pypa/cibuildwheel from 2.20.0 to 2.21.1 (#3332)
* Bump pypa/cibuildwheel from 2.21.1 to 2.21.2 (#3345)
* Bump pypa/cibuildwheel from 2.21.2 to 2.21.3 (#3353)
* Bump pypa/cibuildwheel from 2.21.3 to 2.22.0 (#3408)
* Bump roaring from 0.10.6 to 0.10.7 (#3413)
* Bump serde from 1.0.204 to 1.0.207 (#3289)
* Bump serde from 1.0.207 to 1.0.208 (#3298)
* Bump serde from 1.0.208 to 1.0.209 (#3310)
* Bump serde from 1.0.209 to 1.0.210 (#3318)
* Bump serde from 1.0.210 to 1.0.214 (#3368)
* Bump serde from 1.0.214 to 1.0.215 (#3403)
* Bump serde_json from 1.0.120 to 1.0.121 (#3267)
* Bump serde_json from 1.0.121 to 1.0.122 (#3280)
* Bump serde_json from 1.0.122 to 1.0.124 (#3288)
* Bump serde_json from 1.0.124 to 1.0.125 (#3302)
* Bump serde_json from 1.0.125 to 1.0.127 (#3309)
* Bump serde_json from 1.0.127 to 1.0.128 (#3316)
* Bump serde_json from 1.0.128 to 1.0.132 (#3358)
* Bump serde_json from 1.0.132 to 1.0.133 (#3402)
* Bump sphinx-design from 0.5.0 to 0.6.0 (#3268)
* Bump sphinx-design from 0.6.0 to 0.6.1 (#3276)
* Bump tempfile from 3.10.1 to 3.11.0 (#3279)
* Bump tempfile from 3.11.0 to 3.12.0 (#3287)
* Bump tempfile from 3.12.0 to 3.13.0 (#3340)
* Bump tempfile from 3.13.0 to 3.14.0 (#3391)
* Bump thiserror from 1.0.63 to 1.0.64 (#3335)
* Bump thiserror from 1.0.64 to 1.0.65 (#3367)
* Bump thiserror from 1.0.65 to 1.0.68 (#3379)
* Bump thiserror from 1.0.68 to 2.0.3 (#3389)
* Bump web-sys from 0.3.69 to 0.3.70 (#3299)
* Bump web-sys from 0.3.70 to 0.3.72 (#3354)
* Bump web-sys from 0.3.72 to 0.3.74 (#3411)
* Update pytest-cov requirement from <6.0,>=4 to >=4,<7.0 (#3375)
* Update sphinx requirement from <8,>=6 to >=6,<9 (#3269)
* Upgrade rocksdb to 0.22.0, bump MSRV to 1.66 (#3383)
* [pre-commit.ci] pre-commit autoupdate (#3281)
* [pre-commit.ci] pre-commit autoupdate (#3290)
* [pre-commit.ci] pre-commit autoupdate (#3312)
* [pre-commit.ci] pre-commit autoupdate (#3330)
* [pre-commit.ci] pre-commit autoupdate (#3336)
* [pre-commit.ci] pre-commit autoupdate (#3341)
* [pre-commit.ci] pre-commit autoupdate (#3346)
* [pre-commit.ci] pre-commit autoupdate (#3360)
* [pre-commit.ci] pre-commit autoupdate (#3369)
* [pre-commit.ci] pre-commit autoupdate (#3380)
* [pre-commit.ci] pre-commit autoupdate (#3393)
* [pre-commit.ci] pre-commit autoupdate (#3404)
* [pre-commit.ci] pre-commit autoupdate (#3409)
* [pre-commit.ci] pre-commit autoupdate (#3414)
Development

Successfully merging this pull request may close these issues.

csv_summary format not created for multiple queries (tax metagenome)
2 participants