
User Manual

Shenghui edited this page Sep 27, 2022 · 5 revisions

Introduction

ASTK is a command-line tool for comprehensive alternative splicing (AS) analysis, including AS event analysis, splice-site sequence feature extraction, AS gene function analysis, and analysis of potential AS regulatory mechanisms.


Installation

General installation

Creating a dedicated virtual environment for ASTK is recommended, although it is not required.

## create a new conda environment for astk and install python and R
$ conda create -n astk -c conda-forge -c bioconda r-base=4.1 python=3.8 -y
$ conda install -n astk -c bioconda bedtools meme -y
## activate conda environment
$ conda activate astk

Install ASTK using pip:

## install the development version from github
$ pip install git+https://github.com/huang-sh/astk.git@dev

After installing ASTK, install the R packages it depends on. If you already have an R installation, you can first point ASTK at the existing R executable, which saves much of the time and resources needed to install R packages.

For example, find the R path on your own machine:

$ conda activate R41
$ which R
~/software/anaconda/envs/R41/bin/R

Configure the R path.

astk config -R ~/software/anaconda/envs/R41/bin/R

Note that this step is not required if you do not have an existing R installation.

Then install the R packages with:

$ astk install -r 
...

Note: some R packages may fail to install; refer to the FAQs for solutions.

Docker installation

Using the Docker image saves the time otherwise needed to install the software and its dependencies.

$ docker pull huangshing/astk
 

astk docker usage

You can create a shortcut for the docker command with alias, which is convenient when running the Docker version of ASTK repeatedly.

$ alias astkdocker="docker run --rm -v /home/test/project:/project -e MY_USER=$(id -u) huangshing/astk"

Replace /home/test/project with your own project path. This directory should contain the reference files and all the files you want to analyze, for example:

$ ll -h | cut -d " " -f 5-

  446M Aug 12 21:08 ATAC.e16.5.fb.bigwig
    27 Aug  8 16:40 data
  1.4G Aug  8 22:03 gencode.v38.annotation.gtf
  847M Aug  8 17:35 gencode.vM25.annotation.gtf
  2.6G Aug  8 17:35 GRCm38.primary_assembly.genome.fa

Then run astk like this:

$ astkdocker astk meta -h
Usage: astk meta [OPTIONS]

  generate metadata template file

Options:
  -p1, --control PATH           file path for condtion 1  [required]
  -p2, --treatment PATH         file path for condtion 2  [required]
  -gn, --groupName TEXT         group name
  -repN, --replicate INTEGER    replicate, number
  -o, --output PATH             metadata output path  [required]
  -repN1, --replicate1 INTEGER  replicate1, number
  -repN2, --replicate2 INTEGER  replicate2, number
  --condition TEXT              condition name
  -fn, --filename               file name
  --split TEXT                  name split symbol and index
  -h, --help                    Show this message and exit.
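Because the alias above is defined in double quotes, the project path and `$(id -u)` are fixed when the alias is created. As a rough alternative sketch, a shell function evaluates them on every call. `ASTK_PROJECT` and `ASTK_DRYRUN` are hypothetical helper variables introduced here for illustration; they are not ASTK or Docker options.

```shell
# Sketch: a function alternative to the astkdocker alias.
# ASTK_PROJECT and ASTK_DRYRUN are hypothetical variables, not part of ASTK.
astkdocker() {
    # Default to the current directory when ASTK_PROJECT is unset.
    project_dir="${ASTK_PROJECT:-$PWD}"
    # With ASTK_DRYRUN set, print the docker command instead of running it.
    ${ASTK_DRYRUN:+echo} docker run --rm \
        -v "${project_dir}:/project" \
        -e MY_USER="$(id -u)" \
        huangshing/astk "$@"
}
```

With `ASTK_DRYRUN=1`, calling `astkdocker astk meta -h` only prints the command line, which makes it easy to check the mounted path before anything runs.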

Command

ASTK works with a command/subcommand structure:

astk subcommand options

ASTK provides several groups of sub-commands for comprehensive AS analysis:

AS differential splicing analysis

  • meta: generate metadata for the contrast groups of an AS differential splicing analysis; helpful when you have multiple conditions to analyze
  • generateEvent: generate AS events using genome GTF annotation
  • generatePsi: calculate PSI values of AS events
  • diffSplice: run AS differential splicing analysis
  • dsflow: wrapper of generateEvent, generatePsi and diffSplice

PSI/dPSI analysis and plot

  • sigFilter: select significant AS events
  • psiFilter: filter AS events by PSI value
  • pca: PSI value PCA plotting
  • heatmap: PSI value heatmap plotting
  • volcano: dPSI value volcano plotting
  • upset: AS event upset plotting

alternative exon/intron length analysis

  • lenCluster: AS event clustering based on alternative exon/intron length
  • lenDist: alternative exon/intron length distribution plotting
  • lenPick: select AS events with a specific exon/intron length

gene function enrichment analysis

  • enrich: AS gene function enrichment
  • enrichCompare: AS gene function comparison
  • gsea: gene set enrichment analysis
  • nease: AS events analysis using NEASE

motif analysis

  • motifEnrich: motif enrichment
  • motifFind: motif discovery
  • motifPlot: motif plot
  • motifMap: motif RNA map
  • getmeme: extract motif from a meme motif file
  • seqlogo: draw seqLogo figure

chromatin analysis

  • signalProfile: profile chromatin signal of splicing sites

Eukaryotic Linear Motif

  • elms: search for Eukaryotic Linear Motifs within the amino acid sequence encoded by the alternative exon DNA sequence.

AS sites coordinate extract

  • getcoor: extract AS site coordinates and generate BED and FASTA files; the upstream or downstream width around each AS site can also be set.

useful utilities

  • install: install other dependent software
  • list: list the OrgDb annotation packages of 20 organisms

Usage

Preparation

meta

meta is used to generate a contrast group metadata table for AS differential splicing analysis.

For example, we can generate multiple developmental stage contrast groups with e11.5 stage as control:

$ mkdir metadata -p
$ astk meta -o metadata/fb_e11_based -repN 2 \
    -c1 data/quant/fb_e11.5_rep*/quant.sf \
    -c2 data/quant/fb_e1[2-6].5_rep*/quant.sf  data/quant/fb_p0_rep*/quant.sf \
    -gn fb_e11_12 fb_e11_13 fb_e11_14 fb_e11_15 fb_e11_16 fb_e11_p0

meta arguments

  • -o: output file path
  • -repN: number of replicate samples
  • -c1: condition 1(ctrl) sample transcript quantification files path
  • -c2: condition 2(case) sample transcript quantification files path
  • -gn: group names

The output of meta is a CSV file and a JSON file. The CSV file is convenient for viewing in Excel, and the JSON file is used by other sub-commands.

fb_e11_meta.csv
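The meta example above relies on shell globs to collect the quantification files, so a pattern that matches too few files is easy to miss. Below is a minimal pre-flight sketch, assuming that -c2 should receive -repN files for each group name (six groups with two replicates would then give twelve files). `check_quant_count` is a hypothetical helper, not part of ASTK.

```shell
# check_quant_count N_GROUPS REPN FILE...
# Hypothetical helper: verify that the expanded file list contains
# N_GROUPS x REPN existing files before passing it to astk meta.
check_quant_count() {
    n_groups="$1"; repn="$2"; shift 2
    expected=$((n_groups * repn))
    found=0
    for f in "$@"; do
        # An unmatched glob stays literal and fails this test, so it is not counted.
        [ -f "$f" ] && found=$((found + 1))
    done
    if [ "$found" -ne "$expected" ]; then
        echo "expected $expected quant files, found $found" >&2
        return 1
    fi
    echo "ok: $found quant files"
}
```

For the example above this could be called as `check_quant_count 6 2 data/quant/fb_e1[2-6].5_rep*/quant.sf data/quant/fb_p0_rep*/quant.sf`.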

AS event

SUPPA2

generateEvent

generateEvent is used to infer AS events from genome GTF annotation file.

$ astk generateEvent -gtf gencode.vM25.annotation.gtf  -et SE \
    -o result/fb_e11_based/ref/gencode.vM25

generateEvent arguments:

  • -gtf: genome annotation GTF file
  • -et: AS event type
  • -o: output path

generatePsi

generatePsi is used to calculate the PSI of AS events.

$ astk generatePsi -o result/fb_e11_based/psi/fb_SE_e10.psi \
    -qf data/quant/fb_e10.5_rep*/quant.sf \
    -ioe result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe

$ astk generatePsi -o result/fb_e11_based/psi/fb_SE_e16.psi \
    -qf data/quant/fb_e16.5_rep*/quant.sf \
    -ioe result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe

$ head result/fb_e11_based/psi/fb_SE_e10.psi
event_id        fb_e10.5_rep1   fb_e10.5_rep2
ENSMUSG00000025900.13;SE:chr1:4293012-4311270:4311433-4351910:- 1.0     1.0
ENSMUSG00000025902.13;SE:chr1:4492668-4493100:4493466-4493772:- 0.50685587094156        0.49103663421745114
ENSMUSG00000025902.13;SE:chr1:4492668-4493100:4493490-4493772:- 0.7405788514940302      0.697045848727448
ENSMUSG00000025902.13;SE:chr1:4493863-4495136:4495942-4496291:- 0.20508467461294735     0.16355428896798765

## it also will extract TPM value from transcript quantification files
$ head result/fb_e11_based/psi/fb_SE_e10.tpm -n 3
Name    fb_e10.5_rep1   fb_e10.5_rep2
ENSMUST00000193812.1    0.0     0.0
ENSMUST00000082908.1    0.0     0.0

generatePsi arguments:

  • -o: output path
  • -qf: transcript quantification files
  • -ioe: AS event ioe file
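The PSI file shown above has one header line and one column per replicate. As a small sketch for inspecting such a file outside ASTK, the helper below prints the mean PSI of each event. `mean_psi` is a hypothetical name, and both the tab-separated layout and the choice to skip non-numeric values such as `nan` are assumptions.

```shell
# mean_psi FILE: print event_id and mean PSI across the replicate columns.
# Hypothetical helper; assumes a tab-separated file with one header line.
mean_psi() {
    awk -F'\t' 'NR > 1 {
        sum = 0; n = 0
        for (i = 2; i <= NF; i++) {
            # Count only numeric fields, so missing values (e.g. "nan") are skipped.
            if ($i ~ /^[0-9.eE+-]+$/) { sum += $i; n++ }
        }
        if (n > 0) printf "%s\t%.4f\n", $1, sum / n
    }' "$1"
}
```

For example, `mean_psi result/fb_e11_based/psi/fb_SE_e10.psi | head` gives a quick per-event summary.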

diffSplice

diffSplice is used to perform AS differential splicing analysis. The core algorithm is based on SUPPA2.

$ astk diffSplice -psi result/fb_e11_based/psi/fb_SE_e1*.psi \
    -exp result/fb_e11_based/psi/fb_SE_e1*.tpm \
    -ref result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe \
    -o result/fb_e11_based/dpsi/fb_SE_e10_p0.dpsi 

diffSplice arguments:

  • -psi: AS events PSI files
  • -exp: transcripts TPM expression files
  • -ref: ioe reference file
  • -o: output file

dsflow

dsflow is a re-implementation and wrapper of the SUPPA2 sub-command functions (generateEvent, generatePsi and diffSplice). It is used to simplify the differential splicing analysis workflow.

dsflow arguments:

  • -od: output directory
  • -md: meta data, the meta output json file
  • -gtf: genome annotation GTF file
  • -t: alternative splicing type, ALL is for all supported types.
  • -m: empirical or classical, the method to calculate the significance, default=empirical
  • -p: pval threshold value
  • -adpsi: absolute dPSI threshold value

The output of dsflow contains five directories:

  • ref is the directory containing AS event reference annotation files
  • tpm is the directory containing sample TPM files
  • psi is the directory containing AS event PSI files
  • dpsi is the directory containing differential splicing results
  • sig01 is the directory containing results filtered with pval < 0.05 and |dPSI| > 0.1
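The sig01 filter combines a p-value cutoff with an absolute dPSI cutoff. The sketch below shows that logic with awk, assuming a tab-separated dpsi file with one header line and the columns event ID, dPSI and p-value (the SUPPA2-style layout; the ASTK files may differ). `sig_filter` is a hypothetical helper for illustration; sigFilter remains the supported way to do this.

```shell
# sig_filter FILE [PVAL] [ADPSI]
# Hypothetical helper: keep the header and the rows with
# p-value < PVAL (default 0.05) and |dPSI| > ADPSI (default 0.1).
sig_filter() {
    awk -F'\t' -v p="${2:-0.05}" -v d="${3:-0.1}" '
        NR == 1 { print; next }
        {
            dpsi = $2 + 0; pval = $3 + 0
            if (dpsi < 0) dpsi = -dpsi      # absolute dPSI
            if ($3 != "nan" && pval < p && dpsi > d) print
        }' "$1"
}
```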

AS event process

sigFilter

sigFilter is used to filter significant differential splicing events according to dPSI and p-value. It generates the significant differential splicing events and their associated PSI files. sf is the short alias of sigFilter.

sf arguments

  • -i: input dpsi file
  • -od: output directory
  • -adpsi: absolute dPSI threshold value
  • -p: p-value threshold value, default=0.05
  • -dpsi: dpsi threshold value, default=0
  • -sep: split dpsi file into two files according to dpsi > 0 and dpsi < 0
  • -app: the program that generates event file, default='auto'

psiFilter

psiFilter is used to filter AS events by PSI value. The examples below use its short alias pf.

$ astk pf -i result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
    -psi 0.8 -o result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi
$ astk pf -i result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
    -psi -0.2 -o result/fb_e11_based/psi/fb_e11_p0_SE_c2_02.psi

psiFilter arguments

  • FILE: input psi file
  • -psi: PSI threshold; a positive value selects AS events with PSI > value, and a negative value selects AS events with PSI < abs(value)
  • -o: output file
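The sign convention of -psi can be illustrated with a short awk sketch. It assumes a tab-separated PSI file with one header line, and it compares each event by its mean PSI over all sample columns; how psiFilter itself combines replicates is not stated here, so that part is an assumption. `psi_select` is a hypothetical helper.

```shell
# psi_select FILE THRESHOLD
# Hypothetical helper mirroring the -psi sign convention:
#   THRESHOLD > 0 keeps events with PSI > THRESHOLD
#   THRESHOLD < 0 keeps events with PSI < |THRESHOLD|
psi_select() {
    awk -F'\t' -v t="$2" '
        NR == 1 { print; next }
        NF < 2  { next }
        {
            # Assumption: compare the mean PSI over all sample columns.
            sum = 0
            for (i = 2; i <= NF; i++) sum += $i
            m = sum / (NF - 1)
            if (t > 0 && m > t) print
            else if (t < 0 && m < -t) print
        }' "$1"
}
```

With this sketch, `psi_select in.psi 0.8` keeps highly included events and `psi_select in.psi -0.2` keeps lowly included ones, matching the two pf calls above.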

lenCluster

lenCluster clusters AS events based on alternative exon/intron length. lc is the short alias of lenCluster.

lc arguments:

  • -i: input dpsi files
  • -lr: length range
  • -od: output directory

AS event analysis

pca

The pca sub-command is used for PCA analysis of PSI values.

$ astk pca -i result/fb_e11_based/psi/fb_e11_12_SE_c1.psi \
    result/fb_e11_based/psi/fb_e11_1[2-6]_SE_c2.psi \
    result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
    -o img/fb_pca.png -fmt png --width 6 --height 4

pca arguments

  • -i : PSI files
  • -o: output figure

fb_pca.png

heatmap

heatmap is used for plotting a heatmap of PSI values. hm is the short alias of heatmap.

$ astk hm -i result/fb_e11_based/sig01/psi/fb_e11_12_SE_c1.sig.psi \
    result/fb_e11_based/sig01/psi/fb_e11_1*_SE_c2.sig.psi \
    -o img/fb_hm.png -fmt png

heatmap arguments

  • -i : PSI files
  • -o: output figure path

fb_hm.png

barplot

barplot is used to draw a bar plot showing the distribution of AS event counts across different conditions.

astk barplot -i result/fb_e11_based/sig01/fb_e11_p0_*.sig.dpsi \
    -o img/fb_e11_p0_bar.png -dg -xl A3 A5 AF AL MX RI SE

barplot arguments:

  • -i : input dpsi files, support multiple files
  • -o: output figure
  • -dg: AS events can be divided into two groups based on dPSI values (group +: dPSI > 0, group -: dPSI < 0)
  • -xl: x labels
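The two groups used by -dg can be counted directly from a dpsi file as a quick sanity check of the figure. This is a sketch only, assuming a tab-separated file with one header line and dPSI in the second column; `count_dpsi_groups` is a hypothetical helper.

```shell
# count_dpsi_groups FILE...
# Hypothetical helper: for each file print its name, the number of
# events with dPSI > 0 (group +) and the number with dPSI < 0 (group -).
count_dpsi_groups() {
    for f in "$@"; do
        awk -F'\t' -v name="$f" '
            FNR > 1 && $2 + 0 > 0 { up++ }
            FNR > 1 && $2 + 0 < 0 { down++ }
            END { printf "%s\t%d\t%d\n", name, up, down }' "$f"
    done
}
```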

fb_e11_p0_bar.png

volcano

volcano is used for dPSI volcano plotting. vol is the short alias of volcano.

$ astk volcano -i result/fb_e11_based/dpsi/fb_e11_p0_SE.dpsi \
    -o img/fb_e11_p0_SE.vol.png 

vol arguments:

  • -i : input dpsi files, support multiple files
  • -o: output directory

fb_hm.png

upset

upset is used for upset plotting of AS events from dPSI files.

$ astk upset -i result/fb_e11_based/sig01/fb_e11_12_SE.sig.dpsi \
    result/fb_e11_based/sig01/fb_e11_14_SE.sig.dpsi \
    result/fb_e11_based/sig01/fb_e11_16_SE.sig.dpsi \
    -o img/fb_upset.png -xl e11_12 e11_14 e11_16 

upset arguments:

  • -i : input dpsi files, support multiple files
  • -o: output file
  • -xl: x labels

fb_upset.png

Sequence feature

gcc

gcc computes the GC content of the region flanking each splice site.

gcc arguments:

  • -e: AS event file
  • -od : output directory
  • -fi: genome fasta file
  • -bs: bin size or slide window size
  • -ef: exon flank width
  • -if: intron flank width
  • --includeSS: include splice site flank region when computing GC content
  • -app: the program that generates event file, default="app"
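To make the role of the bin size concrete, the sketch below computes the GC fraction of a sequence in consecutive bins. It uses non-overlapping bins, which is an assumption; gcc may instead slide the window, and extracting the flanking sequence from the genome is not shown. `gc_bins` is a hypothetical helper.

```shell
# gc_bins SEQ BIN_SIZE
# Hypothetical helper: print the 1-based start and GC fraction of each
# consecutive, non-overlapping bin of SEQ (the last bin may be shorter).
gc_bins() {
    printf '%s\n' "$1" | awk -v bs="$2" '{
        if (bs < 1) exit 1
        seq = toupper($0)
        for (start = 1; start <= length(seq); start += bs) {
            bin = substr(seq, start, bs)
            len = length(bin)
            gc = gsub(/[GC]/, "", bin)    # gsub returns the number of G/C characters
            printf "%d\t%.3f\n", start, gc / len
        }
    }'
}
```

For example, `gc_bins GGCCATATGC 4` reports 1.000 for the bin starting at 1, 0.000 for the bin starting at 5, and 1.000 for the final two-base bin starting at 9.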

sss

sss computes 5'/3' splice site strength.

sss arguments:

  • -e: AS event file
  • -od : output directory
  • -fi: genome fasta file
  • -app: the program that generates event file, default="app"
  • -p: process number, default=4

elen

elen computes exon or intron length.

elen arguments:

  • -e: AS event file
  • -od : output directory
  • -log: log2 transformation
  • -app: the program that generates event file, default="app"
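For SE events, the length can also be read off the event ID. The sketch below assumes the SUPPA2 convention that an SE ID has the form gene;SE:chrom:e1-s2:e2-s3:strand, where the skipped exon spans s2 to e2 in 1-based inclusive coordinates. `se_exon_len` is a hypothetical helper and only handles SE events; it is not how elen is implemented.

```shell
# se_exon_len FILE
# Hypothetical helper: print event_id and skipped-exon length for SE events,
# reading IDs from the first column of a tab-separated file with a header line.
se_exon_len() {
    awk -F'\t' 'NR > 1 && $1 ~ /;SE:/ {
        split($1, part, ":")        # gene;SE, chrom, e1-s2, e2-s3, strand
        split(part[3], up, "-")     # up[2]   = s2, first base of the skipped exon
        split(part[4], down, "-")   # down[1] = e2, last base of the skipped exon
        printf "%s\t%d\n", $1, down[1] - up[2] + 1
    }' "$1"
}
```

Under this assumption, the event chr1:4293012-4311270:4311433-4351910 has a skipped exon of 4311433 - 4311270 + 1 = 164 bases.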

function enrichment

enrich

enrich is used for GO term enrichment analysis of AS genes. GO term enrichment map networks and enrichment clustering are also provided.

$ mkdir img/enrich -p
$ astk enrich -i result/fb_e11_based/sig01/fb_e11_13_SE.sig.dpsi \
    -ont BP -qval 0.1 -orgdb mm  -fmt png \
    -od img/enrich/fb_e11_13_SE 

enrich arguments:

  • -i: dpsi file
  • -ont: ontology
  • -od : output directory
  • -qval : q-value
  • -org : organism, for example, ‘hs’ for human, ‘mm’ for mouse

GO terms enrichment result and enrichment clustering have figure and text formats.

enrichCompare

enrichCompare is used for gene functional characteristics comparison of different AS genes, short alias: ecmp.

Comparison between the dpsi > 0.1 and dpsi < 0.1 events in group 7 (fb_16.5 vs fb_p0):

$ mkdir img/ecmp
$ astk ecmp -i result/fb_e11_based/lenc/*/fb_e11_12_SE.sig.dpsi \
     -ont BP -org mm  -fmt png \
     -od img/enrich/fb_e11_12_SE_lc

enrichCompare arguments:

  • -i : dpsi files
  • -od : output directory
  • -ont : ontology
  • -qval : q-value
  • -org : organism, for example, ‘hs’ for human, ‘mm’ for mouse

GO.cmp.BP.png

nease

nease is used for functional enrichment analysis of AS events using NEASE, against the databases listed under -db.

nease arguments:

  • -i: dpsi file
  • -od : output directory
  • -qval : q-value
  • -org : organism, only support Human
  • -db: enrichment databases supported by nease, [PharmGKB|HumanCyc|Wikipathways|Reactome|KEGG|SMPDB|Signalink|NetPath|EHMN|INOH|BioCarta|PID]

motif analysis

motifEnrich

motifEnrich is used to perform motif enrichment within the sequences flanking splice sites, using an RBP motif database. me is the short alias.

$ astk me -te result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
    -ce result/fb_e11_based/psi/fb_e11_p0_SE_c2_02.psi \
    -od img/motif/fb_e11_p0_SE_me -org mm \
    -fi GRCm38.primary_assembly.genome.fa

Arguments:

  • -te: input treatment event file
  • -ce: input control event file
  • -od: output directory
  • -org: organism
  • -fi: genome FASTA file; it must be indexed

motifFind

motifFind is used to perform motif discovery and compare the discovered motifs to known RBP motifs. mf is the short alias.

astk mf -te result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
    -od img/motif/fb_e11_p0_SE_mf -org mm \
    -fi GRCm38.primary_assembly.genome.fa

Arguments:

  • -te: input treatment event file
  • -od: output directory
  • -org: organism
  • -fi: genome FASTA file; it must be indexed

getmeme

getmeme is used for querying ASTK built-in motif data.

astk getmeme M316_0.6 M083_0.6 -db CISBP-RNA \
    -org mm -o img/motif/query.meme

Arguments:

  • MOTIFID...: input motif IDs
  • -db: motif database
  • -org: organism
  • -o: output file path

motifPlot

motifPlot is used to draw a motif figure from motif MEME data.

astk mp -mi M083_0.6 -db CISBP-RNA -org mm \
 -o img/motif/M083_0.6_plot.png -w 10

Arguments:

  • MOTIFID...: input motif IDs
  • -db: motif database
  • -org: organism
  • -o: output file path

motif_plot.png

mmap

mmap is used to generate a motif map showing the distribution of motifs.

astk mmap -e result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
    -n a1 a2 a3 a4 -c 150 150 150 150 \
    -mm img/motif/query.meme -od img/motif/motif_map \
    -fi GRCm38.primary_assembly.genome.fa

Arguments:

  • -fa: input fasta files
  • -n: fasta files names
  • -c: center positions
  • -mm: motif meme file
  • -od: output directory
  • -fi: genome FASTA file; it must be indexed

epigenetic analysis

signalProfile

signalProfile is used to profile the chromatin signal in the regions flanking splice sites.

astk pf -i result/fb_e11_based/psi/fb_e11_16_AF_c2.psi \
    -psi 0.8 -o result/fb_e11_based/psi/fb_16_AF_08.psi

astk signalProfile -o img/fb_16_AF_low_ATAC.png \
    -e result/fb_e11_based/psi/fb_16_AF_08.psi \
    -bw ATAC.e16.5.fb.bigwig \
    -ssl A1 A2 A3 A4 A5 -fmt png

Arguments:

  • -o: output file
  • -e: AS event file that includes AS event IDs
  • -bw: bigwig file
  • -ssl: splicing site labels
  • -fmt: figure format

fb_16_AF_low_ATAC.png

elms

elms is used to search for Eukaryotic Linear Motifs within the amino acid sequence encoded by the alternative exon sequence.

astk elms -i result/fb_e11_based/sig01/fb_e11_p0_SE.sig.dpsi -g mm10 -o img/elm.csv

Arguments:

  • -i: dpsi file path
  • -g: genome assembly
  • -o: output

FAQs

ERROR: compilation failed for package ‘magick’

$ astk install -r
ERROR: compilation failed for package ‘magick’
* removing ‘/home/user/software/anaconda/envs/astk/lib/R/library/magick’

You can install the package manually and then re-run astk install -r. If you use conda, you can do it like this:

$ conda install -c conda-forge r-magick -y
...
$ astk install -r

namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.0 is required

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.0 is required
Calls: source ... asNamespace -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted

You can manually upgrade rlang to a newer version. For example:

# R console
> install.packages("rlang")