GFF(3) stats

Given a genome and a corresponding GFF3 file, calculate various statistics on the coding regions (or extract them).

Current output is a tsv, with bed-like first four columns (i.e. sequence ID, attribute Parent ID, start, end...).

GC percent, GC skew, AT percent, and AT skew are calculated for each of the following:

- the raw CDS, or the spliced CDS (GC)
- four-fold (optionally also six-fold) degenerate sites from the CDS (GC4)
- the third codon position of each codon in the CDS (GC3)
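As a rough illustration of what these statistics mean, here is a minimal Python sketch. It assumes the conventional definitions (GC skew = (G - C) / (G + C), AT skew = (A - T) / (A + T)) and the standard genetic code for the four-fold degenerate families; gff-stats itself is written in Rust and may differ in details such as how ambiguous bases or six-fold sites are handled. The function names are hypothetical and not part of gff-stats.

```python
# Four-fold degenerate codon families in the standard genetic code,
# identified by their first two bases (any third base gives the same amino acid).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def base_stats(seq):
    """Return GC/AT percent and GC/AT skew for a nucleotide string."""
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    total = a + c + g + t
    if total == 0:
        return None
    return {
        "gc_percent": 100 * (g + c) / total,
        "at_percent": 100 * (a + t) / total,
        # A skew is undefined when its denominator is zero.
        "gc_skew": (g - c) / (g + c) if g + c else None,
        "at_skew": (a - t) / (a + t) if a + t else None,
    }


def third_positions(cds):
    """Third base of every complete codon (the GC3 sites)."""
    return cds.upper()[2::3]


def fourfold_sites(cds):
    """Third base of every codon in a four-fold degenerate family (the GC4 sites)."""
    cds = cds.upper()
    return "".join(
        cds[i + 2]
        for i in range(0, len(cds) - 2, 3)
        if cds[i:i + 2] in FOURFOLD_PREFIXES
    )


cds = "ATGGCTCCGTAA"  # codons: ATG GCT CCG TAA
print(base_stats(cds))                   # stats on the whole CDS (GC)
print(base_stats(third_positions(cds)))  # stats on third positions (GC3)
print(base_stats(fourfold_sites(cds)))   # stats on four-fold sites (GC4)
```

Each statistic is the same calculation applied to a different subset of bases, which is why the three site classes share one helper here.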

gff-stats can also extract the CDS or spliced CDS as nucleotide or protein sequences in fasta format (see `gff-stats seq -h` below). This functionality is also provided by gffread, which may be faster because it indexes the fasta for quick random access.

Note: gff-stats requires the length of coding sequences of a given transcript to add up to a value divisible by three. In case any transcripts violate this assumption, they can be filtered out with the following script before running gff-stats: https://github.com/charlottewright/genomics_tools/blob/main/gff3_handling/filter_non_divisible_by_three_transcripts.py
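To see which transcripts would violate this requirement before filtering, the check can be sketched in a few lines of Python. This is an illustrative sketch, not the linked filter script: it assumes a tab-separated GFF3 with nine columns, 1-based inclusive coordinates, and a single `Parent` attribute on each CDS feature. The function names are hypothetical.

```python
from collections import defaultdict


def cds_length_by_parent(gff_lines):
    """Sum the lengths of CDS features per Parent ID from GFF3 lines."""
    totals = defaultdict(int)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        start, end = int(fields[3]), int(fields[4])
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        parent = attributes.get("Parent")
        if parent is not None:
            # GFF3 coordinates are 1-based and inclusive.
            totals[parent] += end - start + 1
    return dict(totals)


def not_divisible_by_three(gff_lines):
    """Parent IDs whose total CDS length is not a multiple of three."""
    lengths = cds_length_by_parent(gff_lines)
    return sorted(parent for parent, n in lengths.items() if n % 3)


example = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t20\t25\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\tsrc\tCDS\t1\t10\t.\t+\t0\tID=cds2;Parent=tx2",
]
print(not_divisible_by_three(example))
```

Because the functions accept any iterable of lines, an open file handle for a GFF3 file can be passed directly in place of the example list.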

Build

Building requires Rust.

git clone https://github.com/tolkit/gff-stats
cd gff-stats
cargo build --release
# ./target/release/gff-stats is the executable
# or
cargo install --path .
# to put gff-stats in your path

Usage

`gff-stats -h`

GFF(3) stats 0.2.2
Max Brown <mb39@sanger.ac.uk>
Extract GFF3 regions from a reference fasta and compute statistics on them.

USAGE:
    gff-stats [SUBCOMMAND]

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information

SUBCOMMANDS:
    help    Print this message or the help of the given subcommand(s)
    seq     Extract CDS regions to fasta format. Printed to stdout.
    stat    Compute statistics on CDS regions

`gff-stats stat -h`

gff-stats-stat 0.2.2
Compute statistics on CDS regions

USAGE:
    gff-stats stat [OPTIONS] --gff <gff> --fasta <fasta>

OPTIONS:
    -d, --degeneracy <degeneracy>    Calculate statistics on four-fold or six-fold (in addition to
                                     four-fold) degenerate codon sites. [default: fourfold]
                                     [possible values: fourfold, sixfold]
    -f, --fasta <fasta>              The reference fasta file.
    -g, --gff <gff>                  The input gff file.
    -h, --help                       Print help information
    -o, --output <output>            Output filename for the TSV (without extension). [default: gff-
                                     stat]
    -p, --spliced                    Compute stats on spliced CDS sequences?
    -V, --version                    Print version information

`gff-stats seq -h`

Cross testing with gffread:

# -x outputs spliced fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -x ./tests/test_gffread_x.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -s
# -y outputs spliced protein fastas
gffread -g ./tests/test_fasta.fna ./tests/test_gff.gff -y ./tests/test_gffread_y.fa
# equivalent to:
gff-stats seq -f ./tests/test_fasta.fna -g ./tests/test_gff.gff -sp

gff-stats-seq 0.2.2
Extract CDS regions to fasta format. Printed to stdout.

USAGE:
    gff-stats seq [OPTIONS] --gff <gff> --fasta <fasta>

OPTIONS:
    -f, --fasta <fasta>      The reference fasta file.
    -g, --gff <gff>          The input gff file.
    -h, --help               Print help information
    -o, --output <output>    Output filename for the fasta (without extension). [default: gff-stat]
    -p, --protein            Save the extracted CDS fasta sequences as a translated protein?
    -s, --spliced            Save the spliced extracted CDS fasta sequences?
    -V, --version            Print version information
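For readers unfamiliar with what the `--spliced` and `--protein` options imply, the following Python sketch shows the general idea: the CDS segments of a transcript are joined in coordinate order, reverse-complemented if the feature is on the minus strand, and optionally translated. It assumes the standard genetic code and 1-based inclusive coordinates; how gff-stats treats phase, stop codons, or alternative codon tables is not covered here. All names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_cds(chromosome, segments, strand):
    """Join CDS segments, given as (start, end) pairs, into one coding sequence."""
    spliced = "".join(chromosome[start - 1:end] for start, end in sorted(segments))
    return reverse_complement(spliced) if strand == "-" else spliced


def translate(cds):
    """Translate complete codons; codons with unknown bases become 'X'."""
    cds = cds.upper()
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds) - 2, 3)
    )


chromosome = "ccATGGCTggggCCGTAAcc"
spliced = splice_cds(chromosome, [(3, 8), (13, 18)], "+")
print(spliced)             # nucleotide sequence of the spliced CDS
print(translate(spliced))  # protein sequence of the spliced CDS
```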

The output of `gff-stats stat` can be used to calculate the average GC3 percent in non-overlapping windows of a user-defined size (e.g. 100 kb) with the Python script described below. Either mode of `gff-stats stat` (spliced or unspliced) can provide the input, but the spliced option is the quickest.

`calculate_GC3_per_window.py -h`

  -h, --help            show this help message and exit
  -i INDEX, --index INDEX
                        Index file for the genome
  -s STATS, --stats STATS
                        Stats file generated by gff-stats stat
  -w WINDOW_SIZE, --window_size WINDOW_SIZE
                        Window size (in bases)
  -o OUTPUT, --output OUTPUT
                        Output filename for the TSV (without extension)
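The windowing idea can be sketched as follows. This is not the script's actual implementation: it assumes each record has already been reduced to a (sequence ID, start, GC3 percent) tuple, and that a CDS is assigned to the window containing its start coordinate. The function name and the record layout are hypothetical.

```python
from collections import defaultdict


def mean_gc3_per_window(records, window_size):
    """Average GC3 percent per non-overlapping window.

    records: iterable of (sequence_id, start, gc3_percent) tuples.
    Returns a dict keyed by (sequence_id, window_start).
    """
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per window
    for seq_id, start, gc3 in records:
        # Assumption: a record belongs to the window containing its start.
        window_start = (start // window_size) * window_size
        bucket = sums[(seq_id, window_start)]
        bucket[0] += gc3
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


records = [
    ("chr1", 10, 40.0),
    ("chr1", 90_000, 60.0),
    ("chr1", 150_000, 30.0),
    ("chr2", 5, 55.0),
]
for (seq_id, window_start), mean_gc3 in mean_gc3_per_window(records, 100_000).items():
    print(seq_id, window_start, mean_gc3)
```

Windows that contain no CDS are absent from the result rather than reported as zero, which keeps empty regions from dragging down the averages.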

Docs

API documentation
