A Python script to interconvert seq-ids in GFF3, GTF, BED and other files.
- Install using conda
conda install -c bioconda cthreepo
- Execute as follows:
## convert seq-ids in <input.gff3> from refseq format (NC_000001.11)
## to UCSC format (chr1) using the Human GRCh38 mapping dictionary
cthreepo -i <input.gff3> -if rs -it uc -f gff3 -m h38 -o <output.gff3>
NCBI RefSeq, UCSC and Ensembl use different identifiers for chromosomes in annotation and other files such as GFF3, GTF, etc. Users who combine files downloaded from different sources in a single pipeline may run into errors caused by mismatched seq-ids. This script converts seq-ids from one style to another so that the files become compatible with each other.
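To make the idea concrete, here is a minimal Python sketch of a seq-id conversion on GFF3 lines. It is not cthreepo's actual code: the function name `convert_gff3_line` and the two-entry `REFSEQ_TO_UCSC` dictionary are hypothetical and exist only for illustration. It assumes the seq-id is the first tab-separated column, as in GFF3, GTF and BED.

```python
# Hypothetical two-entry mapping for illustration only; a real mapping
# covers every sequence in the assembly.
REFSEQ_TO_UCSC = {
    "NC_000001.11": "chr1",
    "NC_000002.12": "chr2",
}


def convert_gff3_line(line, mapping):
    """Return `line` with its seq-id (column 1) translated via `mapping`.

    Comment/directive lines and blank lines are returned unchanged, and so
    are seq-ids that have no entry in the mapping.
    """
    if line.startswith("#") or not line.strip():
        return line
    # Split off the first column only; the rest of the line is untouched.
    seqid, sep, rest = line.partition("\t")
    return mapping.get(seqid, seqid) + sep + rest
```

This sketch deliberately leaves directive lines such as `##sequence-region` alone, even though they can also carry a seq-id; a complete converter would need to handle those as well.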
Python 3 is required for this script to work. With that requirement satisfied, you can install cthreepo with conda, with pip, or from source, as shown below:
conda install -c bioconda cthreepo
pip install cthreepo
First, download/clone the repository. Then run:
python3 setup.py install
## help
cthreepo --help
## usage
## convert seq-ids in <input.gff3> from refseq format (NC_000001.11)
## to UCSC format (chr1) using the Human GRCh38 mapping dictionary
cthreepo \
--infile <input.gff3> \
--id_from rs \
--id_to uc \
--format gff3 \
--mapfile h38 \
--outfile <output.gff3>
- GFF3 (default)
- GTF
- BedGraph
- BED
- SAM
- VCF
- WIG
- TSV
cthreepo needs a mapfile that it uses to figure out how seq-ids map from one style to the other.
- Use the built-in shortcuts h38, h37, m38 and m37 for GRCh38/hg38, GRCh37/hg19, GRCm38/mm10 and MGSCv37/mm9 respectively. I try to keep these files up-to-date, but if they don't work as expected, I suggest using the latest file by following one of the two options described below.
- Provide an NCBI assembly accession using the -a parameter. A complete, valid accession.version such as GCF_000001405.39 should be provided.
- Provide an NCBI assembly report file. For a given assembly, it can be downloaded from the NCBI Assembly website. If the 'Download' button is used, this file is called 'Assembly structure report'. On the NCBI Genomes FTP site, these files have the suffix assembly_report.txt.
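For readers curious about what an assembly report contains, the sketch below shows one way a seq-id mapping could be built from such a file. It is an illustration under stated assumptions, not cthreepo's implementation: the function `load_mapping` and its parameters are hypothetical. It relies on the layout of NCBI assembly reports, where comment lines start with `#`, the last comment line before the data holds the tab-separated column names (including `RefSeq-Accn` and `UCSC-style-name`), and missing values are written as `na`.

```python
def load_mapping(report_lines, id_from="RefSeq-Accn", id_to="UCSC-style-name"):
    """Build a {id_from: id_to} dict from the lines of an assembly report.

    `report_lines` is any iterable of strings, for example an open file.
    `id_from` and `id_to` are column names as they appear in the report
    header (hypothetical parameter names, not cthreepo options).
    """
    header = None
    mapping = {}
    for line in report_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#"):
            # Every comment line overwrites `header`, so once the data rows
            # start it holds the last comment line: the column names.
            header = line.lstrip("# ").split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the data rows")
        row = dict(zip(header, line.split("\t")))
        src = row.get(id_from, "na")
        dst = row.get(id_to, "na")
        # Skip sequences that lack an identifier in either style.
        if src != "na" and dst != "na":
            mapping[src] = dst
    return mapping
```

Calling `load_mapping(open(path))` on a downloaded report, where `path` points at a file ending in assembly_report.txt, would return a dictionary such as `{"NC_000001.11": "chr1", ...}` for GRCh38. Sequences without a UCSC-style name are left out, so a converter built on top of this would have to decide whether to keep or drop records that use them.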