-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Interpreting abundance #549
Comments
Hi @luizirber, I was wondering if you could help me with this problem as well? I was pointed to additionally, may I also ask if I could interpret the counts from |
so, |
@luizirber, thank you! that makes sense... I got 107323 from |
I realized I forgot the |
While I try to find time to write more docs, please also see my STAMPS 2018 presentation at https://osf.io/8d56n/ -- thx! |
@mytluo , @ctb
|
hi @Alok-Shekhar thank you for your questions! scaledre scaled, have you seen the explanations in Using sourmash: a practical guide and modulo hash details? re abundance, what specific question(s) do you want to answer? here's an example (from the test code) that might help: In the
here, the query reads were constructed to have 10 times as many reads from s10 as s11. And so that's what you see with p_query (91% of the query is genome s10, 9% of the query is genome s11). In terms of the average abundance ( For this example, there is no difference between p_query and avg_abund, because the s10 and s11 genomes are the same size. However, were s10 twice as big as s11, then the p_query should change while avg_abund should not. |
Thanks! @ctb for the explanation |
@ctb Abundance info just like we get from kraken run from "% of fragments covered " column Thanks! |
@luizirber is doing a bunch of work on this for his thesis, so we can finally lay this to rest with benchmarks and (lots of) details. soon! |
Hi @ctb @luizirber, since it has been a few months, I was wondering if you guys have any more details/documentation for interpreting abundances as mentioned above. I am struggling a bit myself, picking up these differences. For instance,
Thanks! |
hi @cesarcardonau et al., let me update you all a bit on some documentation and testing updates! first, our classifying signatures documentation has been improved. In particular the abundance weighting section is better, and the gather algorithm discussion at the bottom of the page should be informative. Second, we've updated lca summarize to respect abundance info and elaborated our documentation. I don't think either of these answers the questions above fully, but they might provide context. On to the meat of the question -- |
There's a lot of questions in this issue, so let me try to answer some of them first and then iterate as people ask for clarification! The gather-abund tests added in #886 are key. See this diff, but I interpret the important bits below: sourmash gather and abundancestl;dr Below is a discussion of a synthetic set of test cases using three randomly generated (fake) genomes of the same size, with two even read data set abundances of 2x each, and a third read data set with 20x. Data set prepFirst, we make some synthetic data sets:
then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance. test_gather_abund_1_1 in test_sourmash.pyWhen we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes with gather:
we get:
test_gather_abund_10_1When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes with gather:
we get:
The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed. |
I think the above should answer @mytluo first question. answering @mytluo second question,
Yes, that file is the list of hashes that were not classified / assigned to a genome. The percentages and so on referred to elsewhere in this issue should take that all into account, however, so you don't need to explicitly account for them yourself! |
more @mytluo answers -
Yeah, in partial response to this issue, I put in some tests to make sure I understood my own code :). Per this line, |
the answer to @Alok-Shekhar question,
is in #1022 and in the latest docs for lca summarize |
ok I think I'm done for now, thanks for all the questions and please ask more! |
This is very helpful, thanks so much @ctb |
thank you for sticking with & please do ask for clarification :). we have an
increasing number of good synthetic and real data sets that we can use to
explore and demonstrate this stuff.
|
@ctb This is very helpful in understanding |
my naive thinking is that |
closing for now - moved final unanswered question to #1079 for visibility. |
Updates the requirements on [pytest-cov](https://github.com/pytest-dev/pytest-cov) to permit the latest version. <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's changelog</a>.</em></p> <blockquote> <h2>4.0.0 (2022-09-28)</h2> <p><strong>Note that this release drops support for multiprocessing.</strong></p> <ul> <li> <p><code>--cov-fail-under</code> no longer causes <code>pytest --collect-only</code> to fail Contributed by Zac Hatfield-Dodds in <code>[#511](pytest-dev/pytest-cov#511) <https://github.com/pytest-dev/pytest-cov/pull/511></code>_.</p> </li> <li> <p>Dropped support for multiprocessing (mostly because <code>issue 82408 <https://github.com/python/cpython/issues/82408></code>_). This feature was mostly working but very broken in certain scenarios and made the test suite very flaky and slow.</p> <p>There is builtin multiprocessing support in coverage and you can migrate to that. All you need is this in your <code>.coveragerc</code>::</p> <p>[run] concurrency = multiprocessing parallel = true sigterm = true</p> </li> <li> <p>Fixed deprecation in <code>setup.py</code> by trying to import setuptools before distutils. Contributed by Ben Greiner in <code>[#545](pytest-dev/pytest-cov#545) <https://github.com/pytest-dev/pytest-cov/pull/545></code>_.</p> </li> <li> <p>Removed undesirable new lines that were displayed while reporting was disabled. Contributed by Delgan in <code>[#540](pytest-dev/pytest-cov#540) <https://github.com/pytest-dev/pytest-cov/pull/540></code>_.</p> </li> <li> <p>Documentation fixes. Contributed by Andre Brisco in <code>[#543](pytest-dev/pytest-cov#543) <https://github.com/pytest-dev/pytest-cov/pull/543></code>_ and Colin O'Dell in <code>[#525](pytest-dev/pytest-cov#525) <https://github.com/pytest-dev/pytest-cov/pull/525></code>_.</p> </li> <li> <p>Added support for LCOV output format via <code>--cov-report=lcov</code>. Only works with coverage 6.3+. Contributed by Christian Fetzer in <code>[#536](pytest-dev/pytest-cov#536) <https://github.com/pytest-dev/pytest-cov/issues/536></code>_.</p> </li> <li> <p>Modernized pytest hook implementation. Contributed by Bruno Oliveira in <code>[#549](pytest-dev/pytest-cov#549) <https://github.com/pytest-dev/pytest-cov/pull/549></code>_ and Ronny Pfannschmidt in <code>[#550](pytest-dev/pytest-cov#550) <https://github.com/pytest-dev/pytest-cov/pull/550></code>_.</p> </li> </ul> <h2>3.0.0 (2021-10-04)</h2> <p><strong>Note that this release drops support for Python 2.7 and Python 3.5.</strong></p> <ul> <li>Added support for Python 3.10 and updated various test dependencies. Contributed by Hugo van Kemenade in <code>[#500](pytest-dev/pytest-cov#500) <https://github.com/pytest-dev/pytest-cov/pull/500></code>_.</li> <li>Switched from Travis CI to GitHub Actions. Contributed by Hugo van Kemenade in <code>[#494](pytest-dev/pytest-cov#494) <https://github.com/pytest-dev/pytest-cov/pull/494></code>_ and <code>[#495](pytest-dev/pytest-cov#495) <https://github.com/pytest-dev/pytest-cov/pull/495></code>_.</li> <li>Add a <code>--cov-reset</code> CLI option. Contributed by Danilo Šegan in <code>[#459](pytest-dev/pytest-cov#459) <https://github.com/pytest-dev/pytest-cov/pull/459></code>_.</li> <li>Improved validation of <code>--cov-fail-under</code> CLI option. Contributed by ... Ronny Pfannschmidt's desire for skark in <code>[#480](pytest-dev/pytest-cov#480) <https://github.com/pytest-dev/pytest-cov/pull/480></code>_.</li> <li>Dropped Python 2.7 support.</li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/28db055bebbf3ee016a2144c8b69dd7b80b48cc5"><code>28db055</code></a> Bump version: 3.0.0 → 4.0.0</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/57e9354a86f658556fe6f15f07625c4b9a9ddf53"><code>57e9354</code></a> Really update the changelog.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/56b810b91c9ae15d1462633c6a8a1b522ebf8e65"><code>56b810b</code></a> Update chagelog.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/f7fced579e36b72b57e14768026467e4c4511a40"><code>f7fced5</code></a> Add support for LCOV output</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/1211d3134bb74abb7b00c3c2209091aaab440417"><code>1211d31</code></a> Fix flake8 error</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/b077753f5d9d200815fe500d0ef23e306784e65b"><code>b077753</code></a> Use modern approach to specify hook options</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/00713b3fec90cb8c98a9e4bfb3212e574c08e67b"><code>00713b3</code></a> removed incorrect docs on <code>data_file</code>.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/b3dda36fddd3ca75689bb3645cd320aa8392aaf3"><code>b3dda36</code></a> Improve workflow with a collecting status check. (<a href="https://github-redirect.dependabot.com/pytest-dev/pytest-cov/issues/548">#548</a>)</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/218419f665229d61356f1eea3ddc8e18aa21f87c"><code>218419f</code></a> Prevent undesirable new lines to be displayed when report is disabled</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/60b73ec673c60942a3cf052ee8a1fdc442840558"><code>60b73ec</code></a> migrate build command from distutils to setuptools</li> <li>Additional commits viewable in <a href="https://github.com/pytest-dev/pytest-cov/compare/v2.12.0...v4.0.0">compare view</a></li> </ul> </details> <br /> Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Hi @ctb,
I ran
--track-abundance
withsourmash gather
, and wasn't sure how to interpret the values. Should I be looking ataverage_abund
?On another note, for
--output-unassigned
, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum ofaverage_abund
(or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!The text was updated successfully, but these errors were encountered: