tolkit/gff-stats

GFF(3) stats

Given a genome and a corresponding GFF3 file, calculate various statistics on the coding regions (or extract them).

Current output is a tsv, with bed-like first four columns (i.e. sequence ID, attribute Parent ID, start, end...).

GC percent, GC skew, AT percent, and AT skew are calculated for each:

  • raw CDS (or spliced CDS) (GC)
  • four(six)-fold degenerate sites from the CDS (GC4)
  • third codon position for each codon in the CDS (GC3)
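As a rough illustration of the three site classes listed above, the following Python sketch computes the same kinds of values from a single in-frame CDS string. It is not the gff-stats implementation: the function names are hypothetical, skew is taken as (G - C)/(G + C) and (A - T)/(A + T), which is the common definition but an assumption about this tool, and the four-fold set is derived from the standard genetic code. Six-fold handling is left out.

```python
from collections import Counter

# Codon prefixes (first two bases) whose third position is four-fold
# degenerate in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala,
# Arg, Gly). Assumption: gff-stats may define its site set differently.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def composition(seq):
    """Return GC/AT percent and GC/AT skew for a nucleotide string."""
    counts = Counter(seq.upper())
    g, c, a, t = (counts[base] for base in "GCAT")
    total = g + c + a + t  # ambiguous bases such as N are ignored

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "gc_percent": 100 * ratio(g + c, total),
        "at_percent": 100 * ratio(a + t, total),
        "gc_skew": ratio(g - c, g + c),
        "at_skew": ratio(a - t, a + t),
    }


def third_positions(cds):
    """Third base of every complete codon (the GC3 sites)."""
    return cds.upper()[2::3]


def fourfold_sites(cds):
    """Third base of every codon whose third position is four-fold degenerate."""
    cds = cds.upper()
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return "".join(codon[2] for codon in codons if codon[:2] in FOURFOLD_PREFIXES)


cds = "ATGGCCCTGTAA"  # ATG GCC CTG TAA
gc = composition(cds)
gc3 = composition(third_positions(cds))
gc4 = composition(fourfold_sites(cds))
```

Here `gc`, `gc3` and `gc4` hold the statistics for the raw CDS, the third codon positions and the four-fold degenerate sites respectively.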

gff-stats can also extract CDS or spliced CDS sequences as nucleotide or protein strings in fasta format (see the seq subcommand under Usage). gffread provides the same functionality (see the cross-testing commands under Usage) and may be faster, as it indexes the fasta for quick random access.

Note: gff-stats requires the combined length of the coding sequences of a given transcript to be divisible by three. Transcripts that violate this assumption can be filtered out before running gff-stats with the following script: https://github.com/charlottewright/genomics_tools/blob/main/gff3_handling/filter_non_divisible_by_three_transcripts.py
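To see which transcripts would break this requirement, a check along the following lines can be run on the GFF3 first. This is a minimal sketch, not the linked filtering script: the function names are hypothetical, it assumes tab-separated GFF3 with a single `Parent` value per CDS feature, and it only reports offending Parent IDs rather than rewriting the file.

```python
from collections import defaultdict


def cds_lengths_by_parent(gff_lines):
    """Sum CDS feature lengths per Parent ID from GFF3 lines."""
    lengths = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        parent = attributes.get("Parent")
        if parent is not None:
            # GFF3 coordinates are 1-based and inclusive.
            lengths[parent] += end - start + 1
    return dict(lengths)


def not_divisible_by_three(gff_lines):
    """Parent IDs whose total CDS length is not a multiple of three."""
    lengths = cds_lengths_by_parent(gff_lines)
    return sorted(parent for parent, n in lengths.items() if n % 3)


example = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t6\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t10\t12\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t20\t23\t.\t+\t0\tID=cds2;Parent=tx2",
    "chr1\tsrc\texon\t20\t23\t.\t+\t.\tID=e1;Parent=tx2",
]
offending = not_divisible_by_three(example)
```

With a real file, `not_divisible_by_three(open("annotation.gff"))` would work the same way, where `annotation.gff` is a placeholder path.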

Build

Building requires Rust.

git clone https://github.com/tolkit/gff-stats
cd gff-stats
cargo build --release
# ./target/release/gff-stats is the executable
# or
cargo install --path .
# to put gff-stats in your path

Usage

gff-stats -h

GFF(3) stats 0.2.2
Max Brown <mb39@sanger.ac.uk>
Extract GFF3 regions from a reference fasta and compute statistics on them.

USAGE:
    gff-stats [SUBCOMMAND]

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    help    Print this message or the help of the given subcommand(s)
    seq     Extract CDS regions to fasta format. Printed to stdout.
    stat    Compute statistics on CDS regions

gff-stats stat -h

gff-stats-stat 0.2.2
Compute statistics on CDS regions

USAGE:
    gff-stats stat [OPTIONS] --gff <gff> --fasta <fasta>

OPTIONS:
    -d, --degeneracy <degeneracy>    Calculate statistics on four-fold or six-fold (in addition to
                                     four-fold) degenerate codon sites. [default: fourfold]
                                     [possible values: fourfold, sixfold]
    -f, --fasta <fasta>              The reference fasta file.
    -g, --gff <gff>                  The input gff file.
    -h, --help                       Print help information
    -o, --output <output>            Output filename for the TSV (without extension). [default: gff-
                                     stat]
    -p, --spliced                    Compute stats on spliced CDS sequences?
    -V, --version                    Print version information

gff-stats seq -h

gff-stats-seq 0.2.2
Extract CDS regions to fasta format. Printed to stdout.

USAGE:
    gff-stats seq [OPTIONS] --gff <gff> --fasta <fasta>

OPTIONS:
    -f, --fasta <fasta>      The reference fasta file.
    -g, --gff <gff>          The input gff file.
    -h, --help               Print help information
    -o, --output <output>    Output filename for the fasta (without extension). [default: gff-stat]
    -p, --protein            Save the extracted CDS fasta sequences as a translated protein?
    -s, --spliced            Save the spliced extracted CDS fasta sequences?
    -V, --version            Print version information

Cross testing with gffread:

# -x outputs spliced fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -x ./tests/test_gffread_x.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -s
# -y outputs spliced protein fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -y ./tests/test_gffread_y.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -sp
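One way to check the gffread cross-test above is to compare the two fasta outputs record by record. The sketch below is an illustration only: the helper names are hypothetical, and it assumes both tools label records with the same ID as the first word of the header, which may not hold in practice. If the IDs differ, comparing the sets of sequences instead would be a reasonable fallback.

```python
def read_fasta(lines):
    """Parse fasta lines into {record ID: sequence}, ignoring line wrapping."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def differing_records(a, b):
    """IDs that are missing from one side or whose sequences differ."""
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


from_gffread = read_fasta([">tx1 some description", "ATG", "gcc", ">tx2", "TTT"])
from_gff_stats = read_fasta([">tx1", "ATGGCC", ">tx2", "TTA"])
mismatches = differing_records(from_gffread, from_gff_stats)
```

An empty `mismatches` list would indicate that the two outputs agree under these assumptions.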

The output of gff-stats stat can be passed to the Calculate_GC3_per_window.py script to calculate the average GC3 percent in non-overlapping windows of a user-defined size (e.g. 100 kb). Either mode of gff-stats stat can provide the input, but the 'spliced' option is the quickest.

Calculate_GC3_per_window.py -h

  -h, --help            show this help message and exit
  -i INDEX, --index INDEX
                        Index file for the genome
  -s STATS, --stats STATS
                        Stats file generated by gff-stats stat
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Window size (in bases)
  -o OUTPUT, --output OUTPUT
                        Output filename for the TSV (without extension)
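The windowing idea can be sketched as follows. This is not the Calculate_GC3_per_window.py code: the function name is hypothetical, the input is assumed to be already parsed into `(sequence ID, start, GC3 percent)` tuples because the exact column layout of the stats TSV is not described here, and each CDS is assigned to the window containing its start coordinate, which is one possible rule among several.

```python
from collections import defaultdict


def mean_gc3_per_window(records, window_size):
    """Average GC3 percent per (sequence ID, window start) bin.

    records: iterable of (sequence ID, start, GC3 percent) tuples.
    Windows are non-overlapping and start at 0, window_size, 2 * window_size, ...
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for seq_id, start, gc3 in records:
        window_start = (start // window_size) * window_size
        bucket = sums[(seq_id, window_start)]
        bucket[0] += gc3
        bucket[1] += 1
    return {key: total / n for key, (total, n) in sorted(sums.items())}


records = [
    ("chr1", 10, 40.0),
    ("chr1", 90_000, 60.0),
    ("chr1", 150_000, 70.0),
    ("chr2", 5, 30.0),
]
windows = mean_gc3_per_window(records, 100_000)
```

Windows that contain no CDS simply do not appear in the result; a genome index would be needed to emit them explicitly.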

Docs

API documentation

About

Calculate GC4 and GC3 from a gff and a fasta.
