The command umi_tools extract results in read names being suffixed with the pattern _[cell barcode]_[UMI]. See the docs here for an example.
However, umi_tools count_tab expects read names suffixed with the pattern _[UMI]_[cell barcode]. See the docs here.
As a result, pipelines naively expecting to use the output of umi_tools extract for umi_tools count_tab (after e.g. a cut | sort manipulation) will have incorrect output.
This does not seem to be simply a documentation error. On this line, umi_tools count_tab counts the barcodes using sam_methods.get_gene_count_tab(), which by default uses the sam_methods.get_cell_umi_read_string() function. That function returns the tuple (read_id.split(sep)[-1].encode('utf-8'), read_id.split(sep)[-2].encode('utf-8')). For the output read names from extract, this corresponds to (UMI, cell barcode). But this output is then unpacked here as cell, umi = bc_getter(read_id), so the cell barcode and UMI are swapped.
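To make the swap concrete, here is a minimal sketch. It does not import umi_tools: the function below is a stand-in that only reproduces the tuple expression quoted above, and the read name, barcode and UMI values are made up for illustration.

```python
def get_cell_umi_read_string(read_id, sep="_"):
    """Stand-in for the behaviour described above (not the umi_tools source).

    Returns (last field, second-to-last field) of the read name.
    """
    fields = read_id.split(sep)
    return fields[-1].encode("utf-8"), fields[-2].encode("utf-8")


# Hypothetical read name in the layout written by extract: _[cell barcode]_[UMI]
read_id = "READ1_ACGTACGT_TTGCA"

# Unpacking as count_tab is described to do: the names end up swapped.
cell, umi = get_cell_umi_read_string(read_id)
print(cell, umi)  # cell holds the UMI, umi holds the cell barcode

# Unpacking in the order the tuple is actually returned gives the intended values.
umi_fixed, cell_fixed = get_cell_umi_read_string(read_id)
print(cell_fixed, umi_fixed)
```

Assuming the description above is accurate, either swapping the unpacking order or swapping the two indices in the getter would make count_tab agree with the extract layout. Which one is preferable depends on what other callers expect.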
Apologies if I've missed a step, and this behaviour is intended. I thought I should point it out to save others some trouble in future.
Thanks for this, it does indeed seem that you are correct. @TomSmithCGAT - any thoughts? Did we switch the order at some point and forget to propagate it through to count_tab?