AmpBinner: An anchor-assisted demultiplexing tool for Oxford Nanopore long-read amplicon sequencing data.
- AmpBinner uses the sequence upstream/downstream of the barcode (i.e. anchor) to help locate barcode position and eliminates random matching due to sequencing error.
- AmpBinner does not require a pretrained model for demultiplexing and supports officially provided barcodes as well as custom-designed barcodes.
- AmpBinner is able to demultiplex Oxford Nanopore sequencing data generated from 10X Genomics Chromium Single Cell 3ʹ Gene Expression Libraries.
- Requirements
- Installation
- Usage
- Quick start
- Demultiplexing regular amplicons
- Case 1: The barcode is next to the forward primer
- Case 2: The barcode is next to the reverse primer
- Case 3: The barcodes are on both ends. One sample have the same barcodes on both ends. Only one barcode is required to bin the reads.
- Case 4: The barcodes are on both ends. One sample may or may not have the same barcodes on both ends. Two barcodes are required to bin the reads.
- Demultiplexing 10X Genomics Chromium Single Cell 3ʹ Gene Expression Libraries
AmpBinner calls minimap2
to do sequence alignment. If you don't have minimap2
in your system, you can install it following the instructions here.
If you are using Linux, you can acquire precompiled binaries using the following commands:
wget https://github.com/lh3/minimap2/releases/download/v2.17/minimap2-2.17_x64-linux.tar.bz2
tar -jxvf minimap2-2.17_x64-linux.tar.bz2
./minimap2-2.17_x64-linux/minimap2
Next, you can clone the repository of AmpBinner using the following command.
git clone https://github.com/WGLab/AmpBinner.git
The scripts in the ./AmpBinner
can run directly without additional compilation or installation.
There are two script files in the AmpBinner
directory. ampBinner_10X.py
is used to demultiplex Oxford Nanopore sequencing data derived from 10X Genomics Chromium single cell libraries. There are usually several thousands of barcodes per sample. ampBinner.py
is for regular barcoding methods, including barcoding kits provided by Oxford Nanopore Technologies and custom-designed barcodes.
# The barcode is beside forward primer
path/to/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --minimap2 path/to/minimap2
# The barcode is beside reverse primer
path/to/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --rev_barcode_fasta example_barcodes.fasta --minimap2 path/to/minimap2
# The barcodes are on both ends. One sample have the same barcodes on both ends. Only one barcode is required to bin the reads.
path/to/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --rev_barcode_fasta example_barcodes.fasta --minimap2 path/to/minimap2
# The barcodes are on both ends. One sample may or may not have the same barcodes on both ends. Two barcodes are required to bin the reads.
path/to/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --rev_barcode_fasta example_barcodes.fasta --require_two_barcodes --minimap2 path/to/minimap2
# Input DNA is from a 10X Genomics single cell library
/home/fangl/AmpBinner/ampBinner_10X.py --in_fq example.fastq.gz --barcode_list barcodes.txt --barcode_upstream_seq AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT --out_prefix testing --num_threads 8
$ ./ampBinner.py --help
usage: ampBinner.py [-h] [--in_fq FILE] [--in_fq_list FILE] --amp_seq_fasta
FILE --out_dir PATH --exp_name STRING
[--fwd_barcode_fasta FILE] [--rev_barcode_fasta FILE]
[--require_two_barcodes] [--num_threads INT]
[--minimap2 FILE] [--version]
A barcode demultiplexer for Oxford Nanopore long-read sequencing data
optional arguments:
-h, --help show this help message and exit
--in_fq FILE input sequencing reads in one FASTQ(.gz) file
--in_fq_list FILE a list file specifying all input FASTQ(.gz) files, one
file per line
--amp_seq_fasta FILE reference amplicon sequence in FASTA format
--out_dir PATH output directory
--exp_name STRING experimental name, used as prefix of output files
--fwd_barcode_fasta FILE
barcode sequences of the forward primer (in FASTA
format)
--rev_barcode_fasta FILE
barcode sequences of the reverse primer (in FASTA
format)
--require_two_barcodes
require matched barcodes on both ends (default:
False). Notice: this option is valid only if both '--
fwd_barcode_fasta' and '--rev_barcode_fasta' are
supplied.
--num_threads INT number of threads (default: 1)
--minimap2 FILE path to minimap2 (default: using environment default)
--version show program's version number and exit
If you have one single input fastq file, you can supply the input with --in_fq
.
If you have multiple fastq files, you can supply a list file with --in_fq_list
. The list file contains all input fastq files, one file per line.
--amp_seq_fasta
is the reference amplicon sequence in FASTA format. Sometimes the barcode sequence is not at the very begining of the long read. Sometimes the first a few bases of a read is truncated. Due to the sequencing error, the barcode matching is flexible and allows some mismatches. ampBinner.py
assumes the reference amplicon sequence is known and uses it to distinguish amplicon sequence and barcode sequence, thus eliminates random fuzzy matching inside the amplicon.
--fwd_barcode_fasta
and --rev_barcode_fasta
are barcode sequences in FASTA format. If you use the same barcodes on both ends, you can supply --fwd_barcode_fasta
and --rev_barcode_fasta
with the same file. An example of --fwd_barcode_fasta
is shown below. We supplied FASTA files of official barcodes in the AmpBinner/ONT_barcodes
folder.
>BC01
CACAAAGACACCGACAACTTTCTT
>BC02
ACAGACGACTACAAACGGAATCGA
>BC03
CCTGGTAACTGGGACACAAGACTC
>BC04
TAGGGAAACACGATAGAATCCGAA
>BC05
AAGGTTACACAAACCCTGGACAAG
ampBinner.py
supports different barcoding strategies.
In this case, the amplicon structure is shown below.
You can use the --fwd_barcode_fasta
argument to supply the barcode FASTA file and use the --amp_seq_fasta
argument to supply reference amplicon FASTA file. Please note that the --amp_seq_fasta
file should INCLUDE the primer sequence but EXCLUDE the barcode sequence. An example command is shown below:
/home/fangl/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --minimap2 /home/fangl/software/minimap2-2.8_x64-linux/minimap2
In this case, the amplicon structure is shown below.
Similar to case 1, you can use the --rev_barcode_fasta
argument to supply the barcode FASTA file and use the --amp_seq_fasta
argument to supply reference amplicon FASTA file. Please note that the --amp_seq_fasta
file should INCLUDE the primer sequence but EXCLUDE the barcode sequence. An example command is shown below:
/home/fangl/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --rev_barcode_fasta example_barcodes.fasta --minimap2 /home/fangl/software/minimap2-2.8_x64-linux/minimap2
Case 3. The barcodes are on both ends. One sample have the same barcodes on both ends. Only one barcode is required to bin the reads.
This might be the most common case. In this case, the amplicon structure is shown below.
You can supply --fwd_barcode_fasta
and --rev_barcode_fasta
with the same file, and use the --amp_seq_fasta
argument to supply reference amplicon FASTA file. Please note that the --amp_seq_fasta
file should INCLUDE the primer sequence but EXCLUDE the barcode sequence. An example command is shown below:
/home/fangl/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --rev_barcode_fasta example_barcodes.fasta --minimap2 /home/fangl/software/minimap2-2.8_x64-linux/minimap2
Case 4. The barcodes are on both ends. One sample may or may not have the same barcodes on both ends. Two barcodes are required to bin the reads.
This might be the most common case. In this case, the amplicon structure is shown below.
You can supply --fwd_barcode_fasta
and --rev_barcode_fasta
with the barcode FASTA file. The --fwd_barcode_fasta
and --rev_barcode_fasta
file may or may not be the same. You want to use the --require_two_barcodes
option to specify that two barcodes are required to bin the reads. You can use the --amp_seq_fasta
argument to supply reference amplicon FASTA file. Please note that the --amp_seq_fasta
file should INCLUDE the primer sequence but EXCLUDE the barcode sequence. An example command is shown below:
/home/fangl/AmpBinner/ampBinner.py --in_fq example_data.fastq.gz --amp_seq_fasta example_amplicon_seq.fasta --out_dir . --exp_name testing --num_threads 4 --fwd_barcode_fasta example_barcodes.fasta --rev_barcode_fasta example_barcodes.fasta --require_two_barcodes --minimap2 /home/fangl/software/minimap2-2.8_x64-linux/minimap2
A 10X Genomics Chromium Single Cell 3ʹ Gene Expression Library often has several thousands of cellular barcodes. AmpBinner uses the sequence upstream of the barcode to help locate barcode position and eliminates random matching due to sequencing error. We provided a separate script file ampBinner_10X.py
for 10X single cell libraries. The structure of the 10X Genomics single cell library is shown below.
$ ./ampBinner_10X.py --help
usage: ampBinner_10X.py [-h] [--in_fq FILE] [--in_fq_list FILE] --barcode_list
FILE --barcode_upstream_seq STRING --out_prefix PATH
[--num_threads INT] [--minimap2 FILE] [--version]
A barcode demultiplexer for Oxford Nanopore long-read sequencing data with 10X
Genomics Chromium barcodes
optional arguments:
-h, --help show this help message and exit
--in_fq FILE input sequencing reads in one FASTQ(.gz) file
--in_fq_list FILE a list file specifying all input FASTQ(.gz) files, one
file per line
--barcode_list FILE a list file of all barcode sequences, one barcode
sequence per line, no barcode name
--barcode_upstream_seq STRING
known upstream sequence of the barcode
--out_prefix PATH prefix of output files
--num_threads INT number of threads (default: 1)
--minimap2 FILE path to minimap2 (default: using environment default)
--version show program's version number and exit
If you have one single input fastq file, you can supply the input with --in_fq
.
If you have multiple fastq files, you can supply a list file with --in_fq_list
. The list file contains all input fastq files, one file per line.
The barcode upstream sequence can be supplied with the --barcode_upstream_seq
argument. It can be either the TruSeq Read1 sequence or the concatenation of the P5 sequence and the TruSeq Read1 sequence.
The barcode list file is supplied via the --barcode_list
argument. The barcode list file should contain all barcodes of the specific sample, one barcode sequence per line (No barcode name). An example of the barcode list file is shown below.
AAACCCACACATCATG
AAACCCACATCATTGG
AAACCCAGTAGTTCCA
AAACCCAGTCGTTATG
AAACCCAGTGCGGATA
AAACCCAGTTCTTAGG
AAACCCATCATGAGTC
AAACCCATCTACTCAT
AAACGAAAGGTAGTAT
AAACGAACAACCCTAA
An example command is:
/home/fangl/AmpBinner/ampBinner_10X.py --in_fq example.fastq.gz --barcode_list barcodes.txt --barcode_upstream_seq AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT --out_prefix testing --num_threads 8
ampBinner_10X.py
will generate 3 files: testing.demultiplexing.PASS.reads.txt
, testing.demultiplexing.statistics.txt
and testing.all_reads.txt
.
testing.demultiplexing.PASS.reads.txt
contains the barcodes of QC-passed reads.
testing.all_reads.txt
contains the barcodes of all reads (including QC-passed and QC-failed reads).
testing.demultiplexing.statistics.txt
is a summary file with number of reads per barcode.
ampBinner_10X.py
has been tested samples with less than 10,000 barcodes. You'd better have a short-read 10X Genomics sequencing data so that you can narrow down the barcode list to a few thousand. ampBinner_10X.py
will not work well on a large barcode list (e.g. the complete barcode list which has > 1 million barcodes).