read_10x_h5 error in scanpy 1.9.1 #128
Comments
François! Hope all is well! So, I think I know what this is... any chance the input data was in CellRanger v2 format? CellBender outputs a file in "v2 format" if the input was v2 format. I guess I wanted to keep output like input, although this may have been a bad choice in hindsight, since v3 format from CellRanger makes a lot more sense in terms of its structure, and is better-maintained by writers of other tools. If that's what's going on, it is a scanpy bug, which I am attempting to fix here |
If you are seeing that with a CellRanger v3 formatted output file from CellBender, let me know! |
I am getting the same error. |
I'm getting this error with CellRanger v3 formatted output and scanpy v1.9.1.
|
I think it's scalars and strings in h5 fields, since scanpy uses h5py to read h5 files using
I guess I could make sure everything that gets written is wrapped up into a list... sort of strange to have to do that, but it seems faster than hoping
For now, for existing files, here is a workaround:
ptrepack CELLBENDER_OUTPUT.h5:/matrix MODIFIED_CELLBENDER_OUTPUT.h5:/matrix
This takes just the "matrix" part of the cellbender output file and puts it in its own new h5 file, without any of the cellbender metadata. That new file should be readable directly by scanpy. |
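For anyone who prefers to do the same thing from Python, here is a minimal sketch of the idea behind the ptrepack workaround, assuming h5py is installed. The function name `copy_matrix_group` is made up for illustration and is not part of cellbender or scanpy; it copies only the "matrix" group into a fresh file and leaves everything else behind.

```python
import h5py


def copy_matrix_group(src_path, dst_path, group="matrix"):
    """Copy a single group (default: "matrix") into a new h5 file.

    Hypothetical helper: it mirrors the ptrepack command above by
    dropping every dataset that lives outside the chosen group.
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Group.copy performs a recursive copy of the group and its datasets.
        src.copy(src[group], dst, name=group)
```

Usage would look like `copy_matrix_group('CELLBENDER_OUTPUT.h5', 'MODIFIED_CELLBENDER_OUTPUT.h5')`, with the file names taken from the command above.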
Another option is to use this function:
adata = anndata_from_h5(file='CELLBENDER_OUTPUT.h5')
This loads an AnnData object that can be used in scanpy as normal. It's just a drop-in replacement for scanpy's data loader. Function below:

import tables
import numpy as np
import scipy.sparse as sp
import anndata
from typing import Dict, Optional


def anndata_from_h5(file: str,
                    analyzed_barcodes_only: bool = True) -> 'anndata.AnnData':
    """Load an output h5 file into an AnnData object for downstream work.

    Args:
        file: The h5 file
        analyzed_barcodes_only: False to load all barcodes, so that the size of
            the AnnData object will match the size of the input raw count matrix.
            True to load a limited set of barcodes: only those analyzed by the
            algorithm. This allows relevant latent variables to be loaded
            properly into adata.obs and adata.obsm, rather than adata.uns.

    Returns:
        adata: The anndata object, populated with inferred latent variables
            and metadata.

    """

    d = dict_from_h5(file)
    X = sp.csc_matrix((d.pop('data'), d.pop('indices'), d.pop('indptr')),
                      shape=d.pop('shape')).transpose().tocsr()

    # check and see if we have barcode index annotations, and if the file is filtered
    barcode_key = [k for k in d.keys() if (('barcode' in k) and ('ind' in k))]
    if len(barcode_key) > 0:
        max_barcode_ind = d[barcode_key[0]].max()
        filtered_file = (max_barcode_ind >= X.shape[0])
    else:
        filtered_file = True

    if analyzed_barcodes_only:
        if filtered_file:
            # filtered file being read, so we don't need to subset
            print('Assuming we are loading a "filtered" file that contains only cells.')
            pass
        elif 'barcode_indices_for_latents' in d.keys():
            X = X[d['barcode_indices_for_latents'], :]
            d['barcodes'] = d['barcodes'][d['barcode_indices_for_latents']]
        elif 'barcodes_analyzed_inds' in d.keys():
            X = X[d['barcodes_analyzed_inds'], :]
            d['barcodes'] = d['barcodes'][d['barcodes_analyzed_inds']]
        else:
            print('Warning: analyzed_barcodes_only=True, but the key '
                  '"barcodes_analyzed_inds" or "barcode_indices_for_latents" '
                  'is missing from the h5 file. '
                  'Will output all barcodes, and proceed as if '
                  'analyzed_barcodes_only=False')

    # Construct the anndata object.
    adata = anndata.AnnData(X=X,
                            obs={'barcode': d.pop('barcodes').astype(str)},
                            var={'gene_name': (d.pop('gene_names') if 'gene_names' in d.keys()
                                               else d.pop('name')).astype(str)},
                            dtype=X.dtype)
    adata.obs.set_index('barcode', inplace=True)
    adata.var.set_index('gene_name', inplace=True)

    # For CellRanger v2 legacy format, "gene_ids" was called "genes"... rename this
    if 'genes' in d.keys():
        d['id'] = d.pop('genes')

    # For purely aesthetic purposes, rename "id" to "gene_id"
    if 'id' in d.keys():
        d['gene_id'] = d.pop('id')

    # If genomes are empty, try to guess them based on gene_id
    if 'genome' in d.keys():
        if np.array([s.decode() == '' for s in d['genome']]).all():
            if '_' in d['gene_id'][0].decode():
                print('Genome field blank, so attempting to guess genomes based on gene_id prefixes')
                d['genome'] = np.array([s.decode().split('_')[0] for s in d['gene_id']], dtype=str)

    # Add other information to the anndata object in the appropriate slot.
    _fill_adata_slots_automatically(adata, d)

    # Add a special additional field to .var if it exists.
    if 'features_analyzed_inds' in adata.uns.keys():
        adata.var['cellbender_analyzed'] = [True if (i in adata.uns['features_analyzed_inds'])
                                            else False for i in range(adata.shape[1])]

    if analyzed_barcodes_only:
        for col in adata.obs.columns[adata.obs.columns.str.startswith('barcodes_analyzed')
                                     | adata.obs.columns.str.startswith('barcode_indices')]:
            try:
                del adata.obs[col]
            except Exception:
                pass
    else:
        # Add a special additional field to .obs if all barcodes are included.
        if 'barcodes_analyzed_inds' in adata.uns.keys():
            adata.obs['cellbender_analyzed'] = [True if (i in adata.uns['barcodes_analyzed_inds'])
                                                else False for i in range(adata.shape[0])]

    return adata


def dict_from_h5(file: str) -> Dict[str, np.ndarray]:
    """Read in everything from an h5 file and put into a dictionary."""
    d = {}
    with tables.open_file(file) as f:
        # read in everything
        for array in f.walk_nodes("/", "Array"):
            d[array.name] = array.read()
    return d


def _fill_adata_slots_automatically(adata, d):
    """Add other information to the adata object in the appropriate slot."""

    for key, value in d.items():
        try:
            if value is None:
                continue
            value = np.asarray(value)
            if len(value.shape) == 0:
                adata.uns[key] = value
            elif value.shape[0] == adata.shape[0]:
                if (len(value.shape) < 2) or (value.shape[1] < 2):
                    adata.obs[key] = value
                else:
                    adata.obsm[key] = value
            elif value.shape[0] == adata.shape[1]:
                if value.dtype.name.startswith('bytes'):
                    adata.var[key] = value.astype(str)
                else:
                    adata.var[key] = value
            else:
                adata.uns[key] = value
        except Exception:
            print('Unable to load data into AnnData: ', key, value, type(value))
|
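To make the slot-filling rule in the loader easier to follow, here is a small standalone sketch of the same decision logic, using only numpy. The name `choose_slot` is hypothetical and exists only for this illustration; it returns the name of the AnnData slot that the loader above would pick for a value, given the number of barcodes (`n_obs`) and genes (`n_var`).

```python
import numpy as np


def choose_slot(value, n_obs, n_var):
    """Return 'obs', 'obsm', 'var' or 'uns' for a value of a given shape.

    Hypothetical helper that mirrors the branching of
    _fill_adata_slots_automatically without touching an AnnData object.
    """
    value = np.asarray(value)
    if value.ndim == 0:
        # Scalars carry no per-barcode or per-gene meaning.
        return "uns"
    if value.shape[0] == n_obs:
        # One entry per barcode: a column in obs, or a matrix in obsm.
        if value.ndim < 2 or value.shape[1] < 2:
            return "obs"
        return "obsm"
    if value.shape[0] == n_var:
        # One entry per gene.
        return "var"
    # Anything else is kept as unstructured metadata.
    return "uns"
```

Note that, as in the loader, the barcode check comes first, so a matrix with equal numbers of barcodes and genes would route ambiguous arrays to obs.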
FYI, I tried the |
@esrice Thanks for letting me know. That is confusing to me... there must be something going on that I'm not understanding. Is it possible for you to share that |
Sure, here is an h5 output by cellbender that is parsed fine by scanpy v1.8 but not v1.9. GitHub doesn't allow h5 uploads in comments, so here's a download link: https://oc1.rnet.missouri.edu/index.php/s/fqOxxkjaxkcizUE/download |
Okay thank you @esrice , now I see. Yes there are still some extra values packed into the h5 file in the cellbender output for v0.2. I will soon update to v0.3, where the The best thing to do is to use the above data loader code for the time being: the In v0.3.0, the data loading functions will be distributed as part of cellbender, and also the format of the h5 output files will be tweaked to be compatible with the |
thanks for that anndata_from_h5() script ! |
After @ivirshup's pytables PR (#2064) we started having issues with loading h5 files with scalar datasets, such as those created by CellBender (broadinstitute/CellBender#128). It is currently not an issue for 10X h5 files, since they don't contain any scalars. However, it would be good to handle scalars as well as arrays: (1) to fix the CellBender file loading problem, and (2) to avoid problems we might end up having if the 10X h5 format ever includes scalar datasets.
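As an illustration of what "handle scalars as well as arrays" can mean with h5py, here is a minimal sketch. It is not the code from the scanpy PR; the helper names `read_dataset` and `read_all_datasets` are made up for this example. With h5py, slicing a zero-dimensional dataset with `[:]` raises the ValueError reported in this thread, while indexing with `[()]` reads the scalar.

```python
import h5py


def read_dataset(dataset):
    """Read an h5py dataset, whether it is a scalar or an array."""
    if dataset.shape == ():
        # Scalar dataspace: `[:]` would raise ValueError, `[()]` works.
        return dataset[()]
    return dataset[:]


def read_all_datasets(path):
    """Collect every dataset in an h5 file into a dict keyed by its path."""
    out = {}

    def visit(name, node):
        # visititems walks groups and datasets; only datasets hold data.
        if isinstance(node, h5py.Dataset):
            out[name] = read_dataset(node)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return out
```

Keys in the returned dict are paths relative to the file root, for example `matrix/data`.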
@sjfleming you might like scverse/scanpy#2344 :) |
Woohoo, thanks @gokceneraslan ! |
* Handle scalar datasets too
* Add a scalar to the multiple_genomes.h5 test file
* Fixes #2203
Thanks @gokceneraslan ! Is this change available in a public version of Scanpy? I'm on 1.9.1 and still seeing this error. Name: scanpy
|
Closed by #238 |
Hi, I'm not sure if this is related to other h5 issues (and may be a scanpy bug), but scanpy 1.9.1 throws
ValueError: Illegal slicing argument for scalar dataspace
when opening *.cellbender_output_filtered.h5 files (this works fine with scanpy 1.8.2).