Skip to content

gxiaolab/scAllele

Repository files navigation

scAllele


Github

About

scAllele is a versatile tool to detect and analyze nucleotide variants in scRNA-seq (Article).
scAllele makes use local reassembly via de-Bruijn graph to identify sequence differences and infers nucleotide variants at the read level.
The read level variant-call allows for the analysis of the role variants in the context of splicing. Using mutual information, scAllele identifies allelic linkage between nucleotide variants and splicing isoforms.

Table of contents


Outline

alt text
a. Illustration of the main algorithm of scAllele for variant calling. The reads and the reference genomic sequence overlapping a read cluster (RC) are decomposed into k-mers and are reasembled into a de Bruijn graph. b. Variants (green box in a) identified from the graph are then scored using a generalized linear model (GLM). The GLM was trained with different features (green box) to assign a confidence score to the variants. c. To identify allele-specific splicing (i.e., variant linkage), scAllele performs a mutual information calculation between nucleotide variants (SNVs, microindels) and intronic parts (where the ‘alleles’ are the different overlapping introns), to calculate allelic linkage of splicing isoforms.

Installation

scAllele is available through PyPi. To download simply type:

pip install scAllele

The download was tested with PyPi version >= 20.0.1.

Alternatively, you can clone this GitHub repository:

git clone git@github.com:gxiaolab/scAllele.git 
cd scAllele
python -m build
pip install .

Or

git clone https://github.com/gxiaolab/scAllele.git
cd scAllele
python -m build
pip install .

You can also download the Singularity container:

singularity pull library://giovas/collection/s5
# run
singularity exec s5_latest.sif scAllele

If succesful, the program is ready to use. The installation incorporates console script entrypoints to directly call scAllele:

scAllele

Usage: 
	scAllele -b <file.bam> -g <genome.fa> -o <output prefix>
A variant caller and variant analysis tool for scRNA-seq data.
Options:
  -h, --help            show this help message and exit
  -b INPUT_BAM, --input-bam=INPUT_BAM
                        [Required] Input bam file, (or comma-seprated list of
                        bam files) sorted and indexed
  -o OUTPUT_PREFIX, --output-vcf=OUTPUT_PREFIX
                        [Required] Prefix of the output files
  -g GENOME, --genome-file=GENOME
                        [Required] Reference genome file (fasta format)
  Filtering Options:
    --AB=MINRATIOVAR    Minimum allelic ratio for the variant allele. Default:
                        0.01
    --AC=MINCOUNTVAR    Minimum read depth supporting the variant allele.
                        Default: 2
    --DP=MINCOVERAGE    Minimum read depth at the position of the variant.
                        Default: 5
    --min-base_position=MINREADPOS
                        Minimum mean distance for the variant from the read
                        ends. Default = 7
    --min-base_quality=MINBASEQUAL
                        Minimum mean base quality score to consider SNPs.
                        Default = 20
  Run Mode Options:
    --run_mode=RUN_MODE
                        Select <Variant_Caller> <Full> or <Training> mode.
                        Default: Full
    --glm_clf_name=GLM_CLASSIFIER_NAME
                        Prefix of the GLM pickle objects with the GLM models
  Linkage Options:
    --link_min_count=LINK_MIN_COUNT
                        Minimum number of common reads for linkage analysis.
                        Default = 10
    --link_min_mi=LINK_MIN_MI
                        Minimum mutual information for linkage analysis.
                        Default = 0.52
  Advanced Options:
    -c SEARCH_REGION, --region=SEARCH_REGION
                        Limit search to this region (chrom:start-end or
                        chrom). Default: All
    -n NODES, --nodes=NODES
                        Number of threads for parallel execution. Default = 64
    --min-map_quality=MINMAPQ
                        Minimum mapping quality to keep a read. Default = 40
    --max_soft_clip=MAXSOFTCLIP
                        Maximum length of soft-clipping allow at each end of
                        the read. Default = 5
    --kmer-size=KMER    k-mer size for the de-Bruijn graph assembly. Default:
                        15
    --strandedness=STRANDEDNESS
                        Select from ['fr-firststrand', 'fr-secondstrand', 'fr-
                        unstrand']. Default: 'fr-unstrand'
    --maxSeqErrorRate=MAXSEQERRORRATE
                        Maximum estimate of sequencing error rate. Default:
                        0.01
    --Ploidy=PLOIDY     Maximum ploidy to be considered. Default: 2.

Usage

Basic usage

The minimum requirements to run scAllele are:

  1. A bam file (sorted and indexed) samtools sort file.bam file.sorted ; samtools index file.sorted.bam
  2. A reference genome fasta file (indexed) samtools faidx genome.fa
  3. A prefix for the output files.
scAllele -b file.sorted.bam -g genome.fa -o path/to/output_prefix

Using the provided test data:

scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/output_prefix 

Preprocessing

scAllele only requires a bam file and a reference genome fasta file, however, in order to get optimal results it is recommended to pre-process the data:

alt text
Recommended pipeline

Stranded data

If your scRNA-seq is strand-specific, then you can specify the strandedness of your data (default: non-strand specific).
Strand-specific data helps resolve ambiguous alignments on overlapping genes. It also helps detect more accurate ASAS events.
Most strand-specific libraries in RNA-Seq are fr-firststrand (second read pair is sense to the RNA). You can specify this in your command:

scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/output_prefix 
    --strandedness='fr-firststrand'

Alternatively, you can use the option --strandedness=fr-secondstrand if the first read pair is sense to the RNA.

Local variant call

By default, scAllele searches for variants in all the regions of the transcriptome covered by reads. If you wish to search for variants in a custom genomic interval, you can do so with the -c option.

## Only search chromosome 21
scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/output_prefix 
    -c chr21
## Only search within these coordinates
scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/output_prefix 
    -c chr21:34582111-34628004

scAllele will search for read clusters within these regions only. Bare in mind that it's possible to find no read clusters in the spcified region, and that, if a specified region does not contain the entirety of a gene, it may miss some ASAS events.

Filtering variants

Although it is recommended to filter variants downstream of your analysis (via bcftools or others), it's possible to filter variants from the start. If you wish, for example, to only report variants with 3 reads supporting the alternative allele (AC) and 5 reads overall, then you can run the following command:

scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/output_prefix 
    --AC=3 
    --DP=5

The default is AC=2 and DP=2.

Training a new classifier

scAllele offers the option to retrain the variant classifier. Sequencing data from different platforms or resulting from different library preparation protocols may have different error profiles. If you wish to retrain scAllele's classifier run it in training mode:

scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o path/to/new_clf 
    --run_mode='Training' 

This will return a feature file (.feature_matrix.tab) containing the variant calls and all the features used for the training of the classifier.
Then, run scAllele's training function. The supervised classifier will require a set of ground-truth variants to fit the model.

scAllele_train 
    -i path/to/new_clf.feature_matrix.tab 
    -v truth.vcf 
    -g testdata/hg38.chr21.fa

This will return 3 pickle objects:

  • path/to/new_clf.feature_matrix.tab.DELETION.glm.pickle
  • path/to/new_clf.feature_matrix.tab.INSERTION.glm.pickle
  • path/to/new_clf.feature_matrix.tab.SNP.glm.pickle

Finally, to use these new classifiers to call variants run:

scAllele 
    -b testdata/gm12878.chr21.bam 
    -g testdata/hg38.chr21.fa 
    -o new_path/to/output_prefix 
    --glm_clf_name path/to/new_clf.feature_matrix.tab  

Output

scAllele generates 4 files as output:

  • path/to/output_prefix.vcf
  • path/to/output_prefix.mi_summary.tab
  • path/to/output_prefix.read_cluster_info.tab
  • path/to/output_prefix.intronic_parts.bed

The first file (.vcf) is a standard vcf file reporting all the nucleotide variants found. The description of the tags and values are specified in the header.

The second file (.mi_summary.tab) reports all the linkage events between variants or between variant and intronic part (ASAS). The mutual information and number of common reads between pairs of variants are reported. NOTE: All the testable linkages are presented. It is recommended to filter linkage events based on the mutual information and number of common reads as explained in the main publication.

The third file (.read_cluster_info.tab) reports all the read clusters identified in the file.

The fourth file (.intronic_parts.bed) reports the intronic parts identified together with the introns that form them.

Debug

If the installed version of scAllele is not the latest one, please try:

pip install scAllele==0.0.9.3

If one or more dependencies fail to install, make sure you have the latest version of pip:

pip install --upgrade pip

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published