Add method to calculate embeddings for variable by distance aggregation #807

Open
wants to merge 35 commits into
base: main

LLehner
Member

@LLehner LLehner commented Mar 4, 2024

Description

Adds a method in tools to calculate embeddings of variables by their counts aggregated by distance.

Example usage

import squidpy as sq

Load the example data set:
adata = sq.datasets.seqfish()

Calculate distances of each observation to a specified anchor point (e.g. cell type or tissue location). Here we use cell type "Endothelium" in the annotation column "celltype_mapped_refined":
sq.tl.var_by_distance(adata, groups="Endothelium", cluster_key="celltype_mapped_refined")

The resulting distances are stored in adata.obsm["design_matrix"]. Now we can calculate the embeddings, which are returned as a new anndata object:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")

Note that by default the bin at distance 0, i.e. the counts that belong to the anchor point itself, is excluded. This can be changed by setting include_anchor=True in sq.tl.var_embeddings().

adata_new.X contains the aggregated var x distance_bin count matrix.
adata_new.obs contains the variables as a categorical matrix, which is required to highlight them in plots.
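
For intuition, the var x distance_bin aggregation can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation in this PR: the helper name aggregate_by_distance, the equal-width binning, the n_bins parameter, and the toy data are all made up for illustration. Distance 0 is treated as the anchor bin and dropped unless include_anchor=True, mirroring the default described above.

```python
from typing import Sequence


def aggregate_by_distance(
    counts: Sequence[Sequence[float]],  # obs x var count matrix
    distances: Sequence[float],         # distance of each obs to the anchor
    n_bins: int = 3,                    # hypothetical parameter
    include_anchor: bool = False,
) -> list[list[float]]:
    """Sum counts per variable within equal-width distance bins.

    Returns a var x distance_bin matrix. Bin 0 holds observations at
    distance 0 (the anchor itself) and is dropped unless include_anchor.
    """
    if len(counts) != len(distances):
        raise ValueError("counts and distances must have the same length")
    n_vars = len(counts[0])
    max_dist = max(distances)
    # Column 0 is the anchor bin; columns 1..n_bins cover (0, max_dist].
    agg = [[0.0] * (n_bins + 1) for _ in range(n_vars)]
    for row, dist in zip(counts, distances):
        if dist == 0:
            bin_idx = 0
        else:
            # Map (0, max_dist] onto 1..n_bins with equal-width bins.
            bin_idx = min(n_bins, int(dist / max_dist * n_bins) + 1)
        for var_idx, value in enumerate(row):
            agg[var_idx][bin_idx] += value
    if not include_anchor:
        agg = [row[1:] for row in agg]
    return agg


# Toy data (assumed): 5 observations x 2 variables.
counts = [[10, 0], [1, 5], [2, 4], [3, 3], [4, 2]]
distances = [0.0, 1.0, 3.0, 5.0, 6.0]

var_by_bin = aggregate_by_distance(counts, distances)
with_anchor = aggregate_by_distance(counts, distances, include_anchor=True)
```

Each row of var_by_bin is one variable's distance profile, which is the shape a downstream embedding step would consume: variables act as the observations and distance bins as the features.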

TODO

  • Add a plotting function so this doesn't need to be done manually.
  • Allow flexible embedding calculations

@LLehner LLehner requested a review from timtreis March 4, 2024 22:56
@codecov-commenter

codecov-commenter commented Mar 4, 2024

Codecov Report

Attention: Patch coverage is 33.33333%, with 24 lines in your changes missing coverage. Please review.

Project coverage is 69.75%. Comparing base (df8e042) to head (8ee07ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   69.99%   69.75%   -0.24%     
==========================================
  Files          39       40       +1     
  Lines        5525     5561      +36     
  Branches     1029     1037       +8     
==========================================
+ Hits         3867     3879      +12     
- Misses       1363     1387      +24     
  Partials      295      295              
Files Coverage Δ
src/squidpy/tl/_var_embeddings.py 33.33% <33.33%> (ø)

@giovp
Member

giovp commented Apr 22, 2024

Hi @LLehner, thank you for this. Would you mind elaborating a bit on when this would be used? Also, what if the embeddings are pre-calculated, or the user would like to use something other than UMAP: should that be an option? Finally, I think a test would be required before we get this in. Thanks!

@timtreis
Member

Hey @giovp, this feature came out of a discussion with @maiiashulman. We ran into a situation in which the "literature-curated" signature for hypoxia was either 20 or 4000 genes, the latter obviously being useless. So we wondered which other genes might show the same spatially variable pattern as a function of distance to a certain cell type (e.g. epithelial). This is essentially a graphical method to see whether a given set of genes (e.g. the 20-gene signature) even varies in a similar pattern.

But I agree with your points; if we see that it's actually doing something useful, we should make it a bit more flexible.
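
The comparison described above, checking whether a gene's counts follow the same pattern over distance as a reference signature, can also be approximated numerically. The following is a minimal sketch, assuming each gene is represented by its row of the var x distance_bin matrix and using Pearson correlation against the signature's mean profile as the similarity measure. The PR itself compares genes graphically in an embedding, so the correlation measure, the helper names, and the toy gene names and values are illustrative assumptions and not part of the PR.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length profiles (0.0 if one is flat)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_by_signature(
    profiles: dict[str, list[float]],  # gene -> counts per distance bin
    signature: list[str],              # genes forming the reference set
) -> list[tuple[str, float]]:
    """Rank non-signature genes by similarity to the mean signature profile."""
    n_bins = len(next(iter(profiles.values())))
    mean_profile = [
        sum(profiles[gene][i] for gene in signature) / len(signature)
        for i in range(n_bins)
    ]
    scores = [
        (gene, pearson(profile, mean_profile))
        for gene, profile in profiles.items()
        if gene not in signature
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical distance profiles (aggregated counts per distance bin).
profiles = {
    "sig_a": [1, 2, 3, 4],
    "sig_b": [2, 4, 6, 8],
    "gene_up": [10, 20, 30, 40],
    "gene_down": [4, 3, 2, 1],
    "gene_flat": [5, 5, 5, 5],
}
ranked = rank_by_signature(profiles, signature=["sig_a", "sig_b"])
```

Genes at the top of ranked follow the signature's trend over distance, while genes at the bottom run opposite to it. Correlation ignores magnitude, so a weakly expressed gene with the same shape scores as high as a strongly expressed one.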

@LLehner LLehner marked this pull request as draft April 22, 2024 22:05
@timtreis timtreis marked this pull request as ready for review July 9, 2024 21:02
@timtreis timtreis added the squidpy2.0 (Everything related to a Squidpy 2.0 release) and feature (PR introduces a new feature) labels Jul 9, 2024
@LLehner LLehner marked this pull request as draft August 8, 2024 10:22
@LLehner
Member Author

LLehner commented Aug 8, 2024

@timtreis this function now returns an anndata object, which I think simplifies further processing compared to storing the new count matrix somewhere in .varm or .uns. If we want to make use of the dimensionality reduction and clustering methods already implemented in scanpy, the count matrix needs to be in .X, and for visualization we need the variable names stored as categories in .obs. Doing all of this in the same anndata would just make things cluttered.

Additionally, the question is whether a spatialdata object should be required as input instead of an anndata one, because then a new table could be added directly instead of having multiple disconnected tables.

The function call would change from:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
to
sq.tl.var_embeddings(sdata, group="Endothelium", design_matrix_key="design_matrix")

@LLehner LLehner marked this pull request as ready for review October 10, 2024 13:06
Labels
feature (PR introduces a new feature), squidpy2.0 (Everything related to a Squidpy 2.0 release)
4 participants