“Con-hi” means “consensus-highlighter”.
Latest version is 3.3.c (2024-12-22 edition).
This program annotates low-coverage and high-coverage regions of sequences in fasta format using read mapping in BAM format.
- Target sequence(s) in fasta format.
- Read mapping in a sorted BAM file.
- Coverage threshold(s) for searching low-coverage and high-coverage regions.
- A GenBank file with annotated low-coverage and high-coverage regions.
- Python 3.6 or later.
- Biopython package.
- samtools 1.13 or later is recommended. Versions from 1.11 to 1.12 are acceptable, but might calculate coverage inaccurately.
You can install Biopython with the following command:
pip3 install biopython
You can install samtools by downloading the latest release from the samtools page on GitHub. Then follow the instructions in the downloaded INSTALL file.
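To check which of the samtools version ranges above applies to an installed copy, something like the following sketch may help. It is not part of con-hi: the function names are hypothetical, it assumes that the first line of `samtools --version` output looks like `samtools 1.13`, and the label "older" for versions before 1.11 is an assumption, since the requirements above do not mention them.

```python
import re
import subprocess


def parse_samtools_version(version_output):
    """Extract (major, minor) from the first line of `samtools --version` output."""
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("Empty version output")
    match = re.match(r"samtools (\d+)\.(\d+)", lines[0])
    if match is None:
        raise ValueError("Unrecognized version line: {!r}".format(lines[0]))
    return int(match.group(1)), int(match.group(2))


def classify_samtools_version(version):
    """Map a (major, minor) tuple onto the ranges listed in the requirements."""
    if version >= (1, 13):
        return "recommended"
    if version >= (1, 11):
        return "acceptable"  # coverage might be calculated inaccurately
    return "older"  # assumption: not covered by the requirements above


def installed_samtools_version():
    """Run `samtools --version` and parse its output (requires samtools in PATH)."""
    result = subprocess.run(
        ["samtools", "--version"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_samtools_version(result.stdout)
```

Calling `classify_samtools_version(installed_samtools_version())` on a machine with samtools installed returns one of the three labels.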
Basic usage is:
./con-hi.py -f <TARGET_FASTA> -b <MAPPING_BAM>
You can specify custom coverage threshold(s) by passing comma-separated lists of thresholds with the options -c and -C. For example, the following command will annotate:
- regions with coverage below 25 and regions with coverage below 55 (as well as zero-coverage regions);
- regions with coverage greater than 1.5×M and greater than 2.0×M, where M is the median coverage.
./con-hi.py \
-f my_sequence.fasta -b my_mapping.sorted.bam \
-c 25,55 -C 1.5,2.0
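The meaning of the -c and -C values in the command above can be illustrated with a short Python sketch. This is not the program's actual implementation: the function names are hypothetical, and the sketch assumes that "below" and "greater than" are strict comparisons, that M is the median over all positions of one sequence, and that coordinates are 1-based and inclusive.

```python
from statistics import median


def parse_thresholds(raw, cast):
    """Parse a comma-separated option value such as "25,55"; "off" gives an empty list."""
    if raw == "off":
        return []
    return [cast(item) for item in raw.split(",")]


def find_regions(coverages, predicate, min_len=5):
    """Return 1-based inclusive (start, end) runs where predicate(coverage) is true.

    Runs shorter than min_len are skipped (compare the -l option).
    """
    regions = []
    start = None
    for pos, cov in enumerate(coverages, start=1):
        if predicate(cov):
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos - 1))
            start = None
    # Close a run that extends to the end of the sequence
    if start is not None and len(coverages) - start + 1 >= min_len:
        regions.append((start, len(coverages)))
    return regions


# Toy per-position coverage values (made up for illustration)
coverages = [30] * 10 + [12] * 8 + [0] * 6 + [40] * 10 + [100] * 7
m = median(coverages)

for threshold in parse_thresholds("25,55", int):
    print("coverage <", threshold, find_regions(coverages, lambda c: c < threshold))

for coef in parse_thresholds("1.5,2.0", float):
    print("coverage >", coef * m, find_regions(coverages, lambda c: c > coef * m))
```

Lower thresholds are absolute coverage values, while upper thresholds scale with the median, so the same -C value adapts to sequences with different sequencing depth.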
-f or --target-fasta: *
File of target sequence(s) in fasta format.
-b or --bam: *
Sorted BAM file which contains mapping on target sequence(s).
-o or --outfile:
Output file.
Default value: 'highlighted_sequence.gbk'.
-r or --target-seq-ids:
Comma-separated list of target sequence IDs to process.
Examples: "seq_1" or "seq_1,seq_9,seq_12".
A fasta sequence ID is the part of its header before the first space.
Default: process all target sequences.
-c or --lower-coverage-thresholds:
Comma-separated list of lower coverage threshold(s).
Default: 10.
To disable it, specify "-c off", and low-coverage regions won't be annotated.
-n or --no-zero-output:
Disable annotation of zero-coverage regions.
This option is off by default, so zero-coverage regions are annotated.
-C or --upper-coverage-coefficients:
Comma-separated list of coverage coefficient(s).
To annotate regions with coverage > 1.7×M,
where M is median coverage, specify "-C 1.7".
Default: 2.0.
To disable it, specify "-C off", and high-coverage regions won't be annotated.
-l or --min-feature-len:
Minimum length of a feature to output. Must be int >= 0.
Default: 5 bp.
--circular:
Target sequence is circular. Affects only the corresponding GenBank field.
Disabled by default.
--organism:
Organism name. Affects only the corresponding GenBank field.
If it contains spaces, surround it with quotes (see Example 4).
Empty by default.
-k or --keep-temp-cov-file:
Don't delete the temporary TSV file "coverages.tsv" after the program finishes.
The program creates this file in the same directory as the "-o" file.
By default, this file is deleted afterwards.
* - mandatory option
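The rule for sequence IDs given under -r (the ID is the part of the fasta header before the first space) can be sketched as follows. The function names are hypothetical, and this is only an illustration of the rule, not the code con-hi uses to read fasta files.

```python
def fasta_seq_id(header_line):
    """Return the part of a fasta header before the first space, without '>'."""
    header = header_line.strip()
    if header.startswith(">"):
        header = header[1:]
    return header.split(" ")[0]


def select_seq_ids(fasta_lines, requested=None):
    """List IDs found in fasta_lines, optionally keeping only those named in a
    comma-separated string such as "seq_1,seq_9,seq_12"."""
    ids = [fasta_seq_id(line) for line in fasta_lines if line.startswith(">")]
    if requested is None:
        return ids  # default: process all target sequences
    wanted = set(requested.split(","))
    return [seq_id for seq_id in ids if seq_id in wanted]


# Made-up headers for illustration
lines = [
    ">seq_1 some description",
    "ACGTACGT",
    ">seq_9",
    "GGCCTTAA",
    ">seq_12 another description here",
    "TTTTAAAA",
]
print(select_seq_ids(lines))
print(select_seq_ids(lines, "seq_1,seq_12"))
```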
Annotate file my_sequence.fasta with default parameters according to mapping from file my_mapping.sorted.bam:
./con-hi.py -f my_sequence.fasta -b my_mapping.sorted.bam
Annotate regions with coverage below 25, regions with coverage below 50, and regions with zero coverage:
./con-hi.py -f my_sequence.fasta -b my_mapping.sorted.bam -c 25,50
Annotate regions with coverage below 25 and regions with coverage below 50. Disable annotation of zero-coverage regions:
./con-hi.py -f my_sequence.fasta -b my_mapping.sorted.bam -c 25,50 -n
Specify the organism name for the output file and mark the sequence as circular:
./con-hi.py -f my_sequence.fasta -b my_mapping.sorted.bam \
--circular --organism "Serratia marcescens"
Disable annotation of low-coverage regions (-c off). Annotate high-coverage regions with coverage above 1.7×M and above 2.4×M, where M is the median coverage:
./con-hi.py -f my_sequence.fasta -b my_mapping.sorted.bam \
-c off -C 1.7,2.4
Target file my_sequences.fasta contains the following sequences:
- a prokaryotic chromosome (sequence id chr);
- one high-copy plasmid (sequence id plasmid_H1);
- two low-copy plasmids (sequence ids plasmid_L1 and plasmid_L2).
One might expect that the more copies a replicon has, the higher its read coverage is. Use a coverage threshold of 20 for the chromosome, 50 for the high-copy plasmid, and 5 for the low-copy plasmids:
./con-hi.py -f my_sequences.fasta -b my_mapping.sorted.bam \
-r chr \
-c 20
./con-hi.py -f my_sequences.fasta -b my_mapping.sorted.bam \
-r plasmid_H1 \
-c 50
./con-hi.py -f my_sequences.fasta -b my_mapping.sorted.bam \
-r plasmid_L1,plasmid_L2 \
-c 5
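The three runs above could also be scripted. The sketch below builds the same commands in Python and adds a separate -o value to each one, on the assumption that runs sharing the default output name 'highlighted_sequence.gbk' would otherwise write to the same file. The output file names and the function names are hypothetical.

```python
import subprocess

# (sequence IDs, lower coverage threshold, output file); output names are made up
RUNS = [
    ("chr", 20, "chr.gbk"),
    ("plasmid_H1", 50, "plasmid_H1.gbk"),
    ("plasmid_L1,plasmid_L2", 5, "plasmids_low_copy.gbk"),
]


def build_command(seq_ids, threshold, outfile,
                  fasta="my_sequences.fasta", bam="my_mapping.sorted.bam"):
    """Build one con-hi command line as an argument list."""
    return [
        "./con-hi.py",
        "-f", fasta,
        "-b", bam,
        "-r", seq_ids,
        "-c", str(threshold),
        "-o", outfile,
    ]


def run_all(runs=RUNS):
    """Run con-hi once per replicon group, stopping at the first failure."""
    for seq_ids, threshold, outfile in runs:
        subprocess.run(build_command(seq_ids, threshold, outfile), check=True)
```

Calling `run_all()` from the directory that contains con-hi.py and the input files executes the three runs in order.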