All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Hotfix for issues such as #125 showing that Python 3.8 and above crashes on the modified loop.
- Fixed a bug that checked the filename length instead of the number of files.
- Support for multiple files as input. This means you no longer need to merge lanes before running CITE-seq-Count. Thanks to @arkal for the contribution on this! #79
- UMI correction is now skipped for clusters containing a single UMI, which improves performance.
- Cell barcode whitelist is now based on reads instead of UMIs. Hopefully fixing issues #35, #37, #48, #49, #69
- Added a low filter so that UMI correction does not try to correct unique UMIs. Small performance improvement.
- Corrected typo from `--no_umi_correctio` to `--no_umi_correction`. Thanks for the catch @ChristophH. #72
- Cell barcode correction based on whitelist instead of top cells. This will trigger automatically when a whitelist is provided.
- UMI correction for tags with more than 20'000 UMIs is too slow right now.
  Cells containing a TAG with more than 20'000 UMIs will be discarded. This means that more than 20'000
  antibody tags with different UMIs have been captured, which is odd. Such high numbers could mean
  that cells have been aggregating and can't be used for downstream analysis.
  The number of such cells is reported under `Bad cells` and the dense matrix of those cells is provided in `uncorrected_cells`. Fixing issue #32
- The option to not correct for UMIs with `--no_umi_correction`.
- Now prints how many reads will be processed.
- Checks that tags are indeed made of ATGC bases.
- Added a check for cell barcode correction and the collapsing threshold for a given whitelist.
- New option `--sliding-window` to allow for a sliding window when aligning TAGs. Use it when you have a variable sequence before your tags.
- Sparse matrix is now based on int32 instead of int16. Fixing issue #40
- Cell barcode correction without a whitelist no longer outputs any error.
  This is because it uses a hard threshold based on the `-cells` option if it can't find one. Fixing issues #29, #36
- Upgraded umi_tools to `1.0.0`.
- Fixed the error linked to the umi_tools version update (`getCellWhitelist`). #45
- Enabled parallelization using multiprocessing. You can choose how many
  cores/threads you want to use with the `-T`/`--threads` option. Takes the max number of CPUs by default. It will run on only one core if you run 1'000'000 reads or less.
- The output is now given in a gzipped mtx format. This is to ensure smooth usage for
  new/larger datasets such as data from NovaSeq sequencers. For those who still want
  to use the dense format, there is an option `--dense` which will add the dense output in tsv format. The sparse outputs are compatible with `Read10X` from the Seurat package.
- You get both UMI and read counts as output. This can help with overamplification estimation.
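
For reference, the gzipped sparse output can be read back in Python roughly like this (a minimal sketch; the exact file and folder names depend on your CITE-seq-Count version and options):

```python
import gzip

import scipy.io

# Hypothetical file names: adjust to the actual output folder layout.
matrix = scipy.io.mmread("umi_count/matrix.mtx.gz").tocsr()

with gzip.open("umi_count/features.tsv.gz", "rt") as handle:
    features = [line.strip() for line in handle]
with gzip.open("umi_count/barcodes.tsv.gz", "rt") as handle:
    barcodes = [line.strip() for line in handle]

# Rows are TAGs/features, columns are cell barcodes.
print(matrix.shape, len(features), len(barcodes))
```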
- UMI and cell barcodes are now corrected based on umi_tools.
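
As an illustration, umi_tools exposes a clustering API along these lines (a sketch only; how CITE-seq-Count uses it internally may differ):

```python
from umi_tools import UMIClusterer

# UMI counts for one cell/TAG pair; umi_tools expects bytes as keys.
umi_counts = {b"ATAT": 10, b"ATAA": 2, b"CCGG": 5}

clusterer = UMIClusterer(cluster_method="directional")
# Groups UMIs within one edit of a more abundant UMI,
# e.g. [[b"ATAT", b"ATAA"], [b"CCGG"]]
clusters = clusterer(umi_counts, threshold=1)
print(clusters)
```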
- A small report is now produced as `report.yaml`, giving some statistics about the run as well as which parameters have been used.
- Unit testing has been added using pytest. This should help prevent further unforeseen bugs.
- New option `--start-trim` for trimming the start of read2 sequences.
- The `--version` option now prints the version.
- CITE-seq-Count uses less memory and is faster.
- Changed how you select the unmapped tags. The option is now `-ut` instead of `-uc` and gives you the top unmapped tags.
- Expected cells, `-cells`, is now required and not mutually exclusive with whitelists.
- The Hamming distance option has been changed from `-hd` to `--max-error`. Please refer to the documentation for more information.
- The default dense output matrix is gone and replaced by the sparse equivalent. See the documentation on how to read it.
- Regex is gone. What doesn't map can be found via the unmapped file option.
- Stripping of numbers and dashes (-) from the whitelist barcodes so the 10X `barcodes.tsv` file can be used directly.
- When TAGs are too close based on the maximum Levenshtein distance allowed, print the offending pairs with their distance to identify them.
- When storing unmatched tags, added the possibility to specify a minimum count cutoff to avoid printing every unmatched possibility (option `-uc`).
- Added documentation to the functions.
- `Levenshtein.hamming` usage was replaced by `Levenshtein.distance` in order to keep sequences with INDELs matching against the TAGs (see the sketch below).
- The redundant classifications `no_match` and `bad_struct` were merged into one (`no_match`); now, sequences either match or they don't.
- Decreased the running time by compiling the regex.
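
A minimal sketch of why `Levenshtein.distance` matters here (the sequences are made up):

```python
import Levenshtein

tag = "ATGCATGC"
read = "ATGATGC"  # same TAG with one base deleted

# Levenshtein.hamming requires equal-length strings and would raise
# a ValueError here; Levenshtein.distance counts the deletion as one edit.
print(Levenshtein.distance(tag, read))  # 1
```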
- File handles are released once done reading Read1/Read2 files.
- The different length TAGs are now being matched: first, by the regex pattern, which will always pick the longest match within the maximum distance allowed; second, by looping through the TAGs sorted by decreasing length, and choosing the first match within the maximum distance allowed.
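
A hypothetical sketch of that second, length-sorted step (names and structure are illustrative only, not the actual implementation):

```python
import Levenshtein

def match_tag(read2_seq, tags, max_distance):
    # Try TAGs from longest to shortest and return the first one whose
    # prefix of Read2 is within the allowed Levenshtein distance.
    for tag in sorted(tags, key=len, reverse=True):
        if Levenshtein.distance(tag, read2_seq[:len(tag)]) <= max_distance:
            return tag
    return None  # treated as no_match
```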
- General code improvements and optimisations: releasing file handles sooner, compartmentalising code, etc.
- Removed the `ambiguous` classification; it's not a possibility because it's handled when checking the distance between TAGs. E.g. let's assume `max_hamming_distance = 2`; if any two TAGs are within 2 hamming distance of each other, the program aborts with a message; thus we could never find a sequence that is two hamming distance away from more than one TAG.
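
A sketch of that pairwise safety check, using `Levenshtein.distance` for generality (function and variable names are hypothetical; the real code may differ):

```python
from itertools import combinations

import Levenshtein

def check_tag_distances(tags, max_distance):
    # Abort when any two TAGs are within the allowed distance of each
    # other, so no read can ever be ambiguous between two TAGs.
    for tag_a, tag_b in combinations(tags, 2):
        dist = Levenshtein.distance(tag_a, tag_b)
        if dist <= max_distance:
            raise SystemExit(
                f"TAGs {tag_a} and {tag_b} are too close (distance {dist})"
            )
```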
- Reduced the memory usage by 1/3 by removing the `UMI_reduce` set; because it was based on the `unique_lines` set, where Read1 is already trimmed to Barcode+UMI length, it was providing no gain. (Issue #17)
- Hamming distance was not being properly applied because the equal sign was missing from the conditional (`if min_value >= args.hamming_thresh`).
- When applying the regular expression to filter patterns to keep, using `i` in the fuzzy logic was only accepting `insertions` as mismatches, not allowing proper mismatches or deletions. The `i` was changed to `s`, which means `substitution` in general (illustrated below).
- The `merge` technique for creating the common regex pattern for all the requested ABs/TAGs was replaced by a simple `in list` technique; the `merge` technique was faulty, matching sequences that are clearly different (`bad_struct`) and classifying them later as `no_match` instead. For example, with Tag1 `AAAAAA` and Tag2 `TTTTTT`, the merged pattern `[AT][AT][AT][AT][AT][AT]` matches a Read2 of `ATATAT...`, which is then classified as `no_match` instead of `bad_struct`.
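
For context, the fuzzy syntax in question comes from the third-party `regex` module; a minimal sketch (the TAG sequence is made up):

```python
import regex

tag = "ATGCATGC"
# {s<=2} allows up to two substitutions when matching the TAG;
# the old {i<=...} form only tolerated insertions.
pattern = regex.compile("(%s){s<=2}" % tag)

print(bool(pattern.match("ATGCATGC")))  # exact match -> True
print(bool(pattern.match("ATGAATGC")))  # one substitution -> True
print(bool(pattern.match("TTTTTTTT")))  # too many errors -> False
```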
- The version is now printed in the help.
- Added the possibility to store unmatched tags in a file using the option `-u`.
- New `-l`/`--legacy` option now available. This option changes the end of the regex used to find the TAG barcode in R2. Use `-l` if you have TAG barcodes ending with [TGC] + polyA tails.
- New warning that checks the R1 length in the fastq file against the cell and UMI barcode lengths given as input.
- TAG barcode sequences are now added to the name of the tag in the rows of the results. This will help reproducibility since the barcode sequence will already be in the count results.
- pandas dependency
- CITE-seq-Count is now a Python package!
- Compatibility with tags of different lengths
- Possibility to use a whitelist of barcodes to extract
- Now uses fuzzy regex for structure detection
- Regex is now optional.
- You can now use only one tag; the script won't crash
- Processing of CITE-seq data with antibody tags and/or HTO