Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function for calculating niches #831

Open
wants to merge 87 commits into
base: main
Choose a base branch
from
Open

Add function for calculating niches #831

wants to merge 87 commits into from

Conversation

LLehner
Copy link
Member

@LLehner LLehner commented May 27, 2024

Description

Adds a function that calculates niches using different strategies. The initial function calculates niches based on neighborhood profiles similar to here.

This PR will get updated with methods discussed in #789.

@LLehner LLehner marked this pull request as draft May 27, 2024 16:24
@LLehner LLehner requested a review from timtreis May 27, 2024 16:43
@timtreis timtreis added the squidpy2.0 Everything releated to a Squidpy 2.0 release label May 29, 2024
@LLehner LLehner marked this pull request as ready for review October 10, 2024 12:44
@LLehner LLehner requested a review from giovp October 10, 2024 12:44
@LLehner
Copy link
Member Author

LLehner commented Oct 10, 2024

@giovp @timtreis PR is ready for review!

Some additional questions that came up, which you could have a look at:

  1. How should multiple slides be dealt with? All these methods should work with multiple slides, but i'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
  2. I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
  3. If you run sq.calculate_niche() more than one time with flavor="neighborhood" i noticed that the first function call takes as much time as you would expect but for subsequent runs it's much (~10x) faster (e.g. first you run the method with one cluster resolution then call the method again with some other cluster resolutions). Its almost like the following function runs don't do neighborhood calculations (referring to sc.pp.neighbors here) anymore, but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. Also the "issue" persists even if you change the count matrix shape (by e.g. masking data).
  4. Would you include some form of logging or verbosity such that user sees what the method is currently doing? Depending on the data and settings, it the function can take a while.

@timtreis
Copy link
Member

Hey @LLehner

How should multiple slides be dealt with? All these methods should work with multiple slides, but i'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.

You mean a scenario where the user is just storing multiple slides in the same sdata object, like the point8, point16, ´point24` dataset we have? These should be fully independent since they have no biological connection.

I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?

Do you have some rough numbers? We could give the user the option to define n_cores or sth and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.

If you run sq.calculate_niche() more than one time with flavor="neighborhood" i noticed that the first function call takes as much time as you would expect but for subsequent runs it's much (~10x) faster (e.g. first you run the method with one cluster resolution then call the method again with some other cluster resolutions). Its almost like the following function runs don't do neighborhood calculations (referring to sc.pp.neighbors here) anymore, but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. Also the "issue" persists even if you change the count matrix shape (by e.g. masking data).

Could it be that the OS has some of the compiled bytecode or data in cache? 🤔

Would you include some form of logging or verbosity such that user sees what the method is currently doing? Depending on the data and settings, it the function can take a while.

What runtime are we speaking here about? I tend to be a fan of some intermediate output (if one can also shut it up, f.e. with some verbosity level)

Copy link
Member

@giovp giovp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks great @LLehner , I think it would benefit from a refactoring to take out the single implementation, validate arguments, and then run the functions. We discussed about potentially even doing separate modules, e.g.

squidpy.gr.niches.utag
...

but I'm not sure it is necessary, what do you and @timtreis think?
Keeping it like this would be also fine, but then the arguments should be handled a bit better (e.g. check whether the passed arguments are fit for the single implementation/flavour required, and otherwise fail right away). Also how each argument maps to each implementation could be documented in the docs.

Also some work to be done in the docs more generally.

src/squidpy/gr/_niche.py Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Show resolved Hide resolved
Restrict niche calculation to a subset of the data.
table_key
Key in `spatialdata.tables` to specify an 'anndata' table. Only necessary if 'sdata' is passed.
mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the use case for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to make use of all existing connectivities but still want to ignore some cells during niche assignment. It's based on the 'mask' parameter here: https://monkeybread.readthedocs.io/en/latest/generated/monkeybread.calc.cellular_niches.html#monkeybread.calc.cellular_niches. The difference to subsetting is: if you subset you also cut the connectivities. With a mask you keep the connectivities but avoid labeling the masked cells.

src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
src/squidpy/gr/_niche.py Outdated Show resolved Hide resolved
@giovp
Copy link
Member

giovp commented Oct 28, 2024

also on this question

How should multiple slides be dealt with? All these methods should work with multiple slides, but i'm still not sure how the adjacency matrix changes when you run sq.gr.spatial_neighbors() on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.

yes, and yes. If neighbor calculation is run with multiple slides, then a block diagonal matrix of the spatial neighbor is returned, in fact treating the slides to be independent.

I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?

Do you have some rough numbers? We could give the user the option to define n_cores or sth and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.

yes agree! You can take a look at the Parallel module we have now and just reuse it

Would you include some form of logging or verbosity such that user sees what the method is currently doing? Depending on the data and settings, it the function can take a while.

absolutely please log everything you see fit!

@LLehner LLehner requested a review from giovp December 9, 2024 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
graph 🕸️ squidpy2.0 Everything releated to a Squidpy 2.0 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants