Idea: __dataframe__ interchange protocol for anndata #1111
Very rough proof of concept:

import pandas as pd
from pandas.core.interchange.column import PandasColumn
from pandas.core.interchange.dataframe import PandasDataFrameXchg

import anndata as ad
import scanpy as sc


class ObsDF(pd.core.interchange.dataframe_protocol.DataFrame):
    def __init__(self, adata: ad.AnnData, layer: str | None = None, allow_copy: bool = True):
        self.adata = adata
        self.layer = layer
        self.allow_copy = allow_copy

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        return ObsDF(self.adata, self.layer, allow_copy=allow_copy)

    @property
    def metadata(self) -> dict[str, pd.Index]:
        # `index` isn't a regular column, and the protocol doesn't support row
        # labels - so we export it as Pandas-specific metadata here.
        return {"pandas.index": self.adata.obs_names}

    def get_chunks(self, n_chunks=None):
        if n_chunks and n_chunks > 1:
            # Rows are observations, so chunk over n_obs.
            size = self.num_rows()
            step = size // n_chunks
            if size % n_chunks != 0:
                step += 1
            for start in range(0, step * n_chunks, step):
                yield ObsDF(
                    self.adata[start : start + step, :],
                    layer=self.layer,
                    allow_copy=self.allow_copy,
                )
        else:
            yield self

    def get_columns(self):
        raise NotImplementedError()

    def column_names(self):
        # Columns are the obs annotations followed by every variable (gene) name.
        return list(self.adata.obs.columns) + list(self.adata.var_names)

    def num_chunks(self):
        return 1

    def get_column_by_name(self, name: str):
        return PandasColumn(
            pd.Series(self.adata.obs_vector(name, layer=self.layer), index=self.adata.obs_names)
        )

    def get_column(self, i: int):
        return self.get_column_by_name(self.column_names()[i])

    def num_columns(self) -> int:
        return len(self.column_names())

    def num_rows(self) -> int:
        return self.adata.n_obs

    def select_columns_by_name(self, names: list[str]):
        return PandasDataFrameXchg(sc.get.obs_df(self.adata, names, layer=self.layer))

    def select_columns(self, indices):
        all_names = self.column_names()
        return self.select_columns_by_name([all_names[i] for i in indices])

Looks like altair/data fusion currently don't support the protocol well enough for us to be able to use them.
Sadly, it looks like the same is true for seaborn. It just uses the interchange protocol to convert whatever type you pass into a pandas dataframe.
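On the chunking in the proof of concept: get_chunks uses a ceiling-divided step and iterates up to step * n_chunks, which can produce empty trailing slices when the row count is small relative to n_chunks (for example 4 rows split into 3 chunks gives slices starting at 0, 2 and 4). A minimal standalone sketch of the same arithmetic that stops at the row count instead might look like this. The helper name chunk_bounds is hypothetical and is not part of pandas or anndata.

```python
import math


def chunk_bounds(n_rows: int, n_chunks: int) -> list[tuple[int, int]]:
    """Return (start, stop) row ranges covering n_rows in at most n_chunks pieces.

    Hypothetical helper: mirrors the ceiling-division step used in get_chunks,
    but never emits an empty trailing range.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Ceiling division; keep the step at 1 or more so range() stays valid.
    step = max(1, math.ceil(n_rows / n_chunks))
    return [(start, min(start + step, n_rows)) for start in range(0, n_rows, step)]
```

The trade-off in this sketch is that it may return fewer chunks than requested rather than padding with empty ones. Whether consumers of the protocol would prefer one behaviour over the other is an open question here.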
Please describe your wishes and possible alternatives to achieve the desired result.
https://data-apis.org/dataframe-protocol/latest/index.html
It could be nice if AnnData supported the __dataframe__ interchange protocol, especially when used by libraries which will use the select_columns_by_name, get_column_by_name interfaces.

Use-case: plotting
The biggest use case is plotting. Both seaborn (mwaskom/seaborn#3369) and altair (vega/altair#2888) now support inputs in the dataframe protocol.
In scanpy we typically use the sc.get.obs_df method to create a dataframe for plotting. A major pain point for this in analysis code is that the user has to provide the keys they want to plot multiple times: once for creating the dataframe, and again to the plotting interface. Instead of having to do:

It could eventually be:
This should also work for plots of gene expression values, especially if the underlying plotting library selects columns through the dataframe interface and the matrix was stored as CSC or dense.
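To make the "keys only once" idea concrete, here is a toy, standard-library-only sketch of a consumer that pulls just the requested columns through a __dataframe__ object. ToyFrame and scatter_points are hypothetical stand-ins, not anndata, seaborn, or altair code, and ToyFrame implements only the handful of methods used here (a real protocol object returns Column objects rather than lists).

```python
from __future__ import annotations


class ToyFrame:
    """Toy stand-in for an interchange object (NOT a full protocol implementation)."""

    def __init__(self, columns: dict[str, list[float]], log: list[str] | None = None):
        self._columns = columns
        # Shared log of every column name a consumer asked for.
        self.requested = log if log is not None else []

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True) -> ToyFrame:
        return self

    def column_names(self) -> list[str]:
        return list(self._columns)

    def select_columns_by_name(self, names: list[str]) -> ToyFrame:
        # Only the named columns are materialized in the returned object.
        self.requested.extend(names)
        return ToyFrame({n: self._columns[n] for n in names}, self.requested)

    def get_column_by_name(self, name: str) -> list[float]:
        return self._columns[name]


def scatter_points(data, x: str, y: str) -> list[tuple[float, float]]:
    """Hypothetical plotting entry point that pulls only x and y from `data`."""
    frame = data.__dataframe__().select_columns_by_name([x, y])
    return list(zip(frame.get_column_by_name(x), frame.get_column_by_name(y)))
```

With this shape, a call such as scatter_points(toy, x="n_genes", y="gene_a") names each key once, and toy.requested records that only those two columns were touched.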
This could even be a nice interface to on-disk data, especially when X/layers is stored in CSC.

Some more detail
.obs.columns, var_names, keys like obsm/pca/0.
var_names
layer is being accessed

Implementation
I think it would make sense for this to start out as a POC outside of the main implementation. It may require pyarrow as a dependency to work. In theory pyarrow will be a dependency of pandas v3 early next year, so that may not be an issue.

cc: @ilan-gold
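For the column namespace mentioned under "Some more detail" (obs columns, var_names, and keys like obsm/pca/0), one possible lookup could be sketched as below. This is only an illustration under assumptions: resolve_key and its arguments (obs_columns, var_names, obsm_widths) are hypothetical names, and raising on ambiguous keys is a design choice of the sketch, not a statement about how anndata would do it.

```python
from __future__ import annotations


def resolve_key(
    key: str,
    obs_columns: list[str],
    var_names: list[str],
    obsm_widths: dict[str, int],
) -> tuple[str, str, int | None]:
    """Map a flat column name to (source, name, component).

    Hypothetical helper. `obsm_widths` maps each obsm key to its number of columns.
    """
    hits: list[tuple[str, str, int | None]] = []
    if key in obs_columns:
        hits.append(("obs", key, None))
    if key in var_names:
        hits.append(("var", key, None))
    # Keys shaped like "obsm/<name>/<index>" address one column of an obsm array.
    parts = key.split("/")
    if len(parts) == 3 and parts[0] == "obsm" and parts[2].isdigit():
        name, idx = parts[1], int(parts[2])
        if idx < obsm_widths.get(name, 0):
            hits.append(("obsm", name, idx))
    if not hits:
        raise KeyError(f"{key!r} not found")
    if len(hits) > 1:
        raise KeyError(f"{key!r} is ambiguous: {[h[0] for h in hits]}")
    return hits[0]
```

Returning a (source, name, component) triple keeps the decision about which layer to read separate from name resolution, so the layer could be supplied once when the interchange object is created.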