User Manual
ASTK is a command-line tool for comprehensive alternative splicing (AS) analysis, including AS event analysis, splice-site sequence feature extraction, AS gene function analysis, and analysis of potential AS regulatory mechanisms.
Creating a dedicated virtual environment for ASTK is recommended but not required.
## create a new conda environment for astk and install python and R
$ conda create -n astk -c conda-forge -c bioconda r-base=4.1 python=3.8 -y
## install bedtools and meme into the same environment
$ conda install -n astk -c bioconda bedtools meme -y
## activate conda environment
$ conda activate astk
Install ASTK using pip:
## install the development version from github
$ pip install git+https://github.com/huang-sh/astk.git@dev
After installing astk, install the R packages it depends on. If you already have an R installation, you can point astk at that R executable, which saves the time and resources needed to install R packages from scratch.
The following is an example; look up the R path on your own machine.
$ conda activate R41
$ which R
~/software/anaconda/envs/R41/bin/R
Configure the R path.
$ astk config -R ~/software/anaconda/envs/R41/bin/R
Note that this step is not required if you do not have an existing R installation.
Then install the R packages with:
$ astk install -r
...
Note: some R packages may fail to install; see the FAQs for possible solutions.
Alternatively, using the Docker image saves much of the time otherwise spent installing software.
$ docker pull huangshing/astk
You can create a shortcut for the docker command with alias, which is convenient when running the Docker version of ASTK repeatedly.
$ alias astkdocker="docker run --rm -v /home/test/project:/project -e MY_USER=$(id -u) huangshing/astk"
Replace /home/test/project with your own path. This directory should contain the reference files and all the files you need to analyze.
$ ll -h | cut -d " " -f 5-
446M Aug 12 21:08 ATAC.e16.5.fb.bigwig
27 Aug 8 16:40 data
1.4G Aug 8 22:03 gencode.v38.annotation.gtf
847M Aug 8 17:35 gencode.vM25.annotation.gtf
2.6G Aug 8 17:35 GRCm38.primary_assembly.genome.fa
Then run astk like this:
$ astkdocker astk meta -h
Usage: astk meta [OPTIONS]
generate metadata template file
Options:
-p1, --control PATH file path for condtion 1 [required]
-p2, --treatment PATH file path for condtion 2 [required]
-gn, --groupName TEXT group name
-repN, --replicate INTEGER replicate, number
-o, --output PATH metadata output path [required]
-repN1, --replicate1 INTEGER replicate1, number
-repN2, --replicate2 INTEGER replicate2, number
--condition TEXT condition name
-fn, --filename file name
--split TEXT name split symbol and index
-h, --help Show this message and exit.
ASTK works with a command/subcommand structure:
astk subcommand options
ASTK provides several groups of sub-commands for comprehensive AS analysis:
AS differential splicing analysis
- meta: generate metadata for the contrast groups of an AS differential splicing analysis; helpful when you have multiple conditions to analyze
- generateEvent: generate AS events using genome GTF annotation
- generatePsi: calculates AS events PSI values
- diffSplice: run AS differential splicing analysis
- dsflow: wrapper of generateEvent, generatePsi and diffSplice
PSI/dPSI analysis and plot
- sigFilter: select significant AS event
- psiFilter: filter AS event with PSI value
- pca: PSI value PCA plotting
- heatmap: PSI value heatmap plotting
- volcano: dPSI value volcano plotting
- upset: AS event upset plotting
alternative exon/intron length analysis
- lenCluster: AS event clustering based on alternative exon/intron length
- lenDist: alternative exon/intron length distribution plotting
- lenPick: select AS events with a specific exon/intron length
gene function enrichment analysis
- enrich: AS gene function enrichment
- enrichCompare: AS gene function comparison
- gsea: gene set enrichment analysis
- nease: AS events analysis using NEASE
motif analysis
- motifEnrich: motif enrichment
- motifFind: motif discovery
- motifPlot: motif plot
- motifMap: motif RNA map
- getmeme: extract motif from a meme motif file
- seqlogo: draw seqLogo figure
chromatin analysis
- signalProfile: profile chromatin signal of splicing sites
Eukaryotic Linear Motif
- elms: search for Eukaryotic Linear Motifs within the amino acid sequences encoded by alternative exon DNA sequences
AS site coordinate extraction
- getcoor: extract AS site coordinates and generate BED and FASTA files; the upstream and downstream widths around each AS site can also be set
useful utilities
- install: install other dependent software
- list: list OrgDb annotations for 20 organisms
meta generates the contrast-group metadata table for AS differential splicing analysis.
For example, we can generate multiple developmental stage contrast groups with e11.5 stage as control:
$ mkdir metadata -p
$ astk meta -o metadata/fb_e11_based -repN 2 \
-c1 data/quant/fb_e11.5_rep*/quant.sf \
-c2 data/quant/fb_e1[2-6].5_rep*/quant.sf data/quant/fb_p0_rep*/quant.sf \
-gn fb_e11_12 fb_e11_13 fb_e11_14 fb_e11_15 fb_e11_16 fb_e11_p0
meta arguments:
- -o: output file path
- -repN: number of replicate samples
- -c1: condition 1(ctrl) sample transcript quantification files path
- -c2: condition 2(case) sample transcript quantification files path
- -gn: group names
The output of meta is a CSV file and a JSON file. The CSV file is convenient for viewing in Excel, and the JSON file is used by other sub-commands.
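The example above passes two control files, twelve treatment files, six group names and -repN 2. One plausible reading, sketched here in Python, is that the treatment files are split into consecutive chunks of repN files and each chunk is paired with the same control files. The function name build_contrasts and the returned layout are hypothetical; this illustrates the grouping idea only and is not ASTK's implementation or its JSON schema.

```python
from typing import Dict, List


def build_contrasts(
    control: List[str],
    treatment: List[str],
    group_names: List[str],
    rep_n: int,
) -> Dict[str, Dict[str, List[str]]]:
    """Pair the control files with consecutive chunks of rep_n treatment files.

    Assumption: every group reuses the same control files (hypothetical layout).
    """
    if rep_n <= 0:
        raise ValueError("rep_n must be positive")
    if len(treatment) != rep_n * len(group_names):
        raise ValueError("expected rep_n treatment files per group name")
    contrasts = {}
    for index, name in enumerate(group_names):
        # Shell globs expand in sorted order, so replicates of one stage are adjacent.
        chunk = treatment[index * rep_n:(index + 1) * rep_n]
        contrasts[name] = {"control": list(control), "treatment": chunk}
    return contrasts


control_files = ["fb_e11.5_rep1/quant.sf", "fb_e11.5_rep2/quant.sf"]
treatment_files = [
    "fb_e12.5_rep1/quant.sf", "fb_e12.5_rep2/quant.sf",
    "fb_e13.5_rep1/quant.sf", "fb_e13.5_rep2/quant.sf",
]
for name, files in build_contrasts(
    control_files, treatment_files, ["fb_e11_12", "fb_e11_13"], 2
).items():
    print(name, files["treatment"])
```

Under this reading, the order of the -gn names must match the order in which the treatment files are listed.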
generateEvent infers AS events from a genome GTF annotation file.
$ astk generateEvent -gtf gencode.vM25.annotation.gtf -et SE \
-o result/fb_e11_based/ref/gencode.vM25
generateEvent arguments:
- -gtf: genome annotation GTF file
- -et: AS event type
- -o: output path
generatePsi calculates the PSI of AS events.
$ astk generatePsi -o result/fb_e11_based/psi/fb_SE_e10.psi \
-qf data/quant/fb_e10.5_rep*/quant.sf \
-ioe result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe
$ astk generatePsi -o result/fb_e11_based/psi/fb_SE_e16.psi \
-qf data/quant/fb_e16.5_rep*/quant.sf \
-ioe result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe
$ head result/fb_e11_based/psi/fb_SE_e10.psi
event_id fb_e10.5_rep1 fb_e10.5_rep2
ENSMUSG00000025900.13;SE:chr1:4293012-4311270:4311433-4351910:- 1.0 1.0
ENSMUSG00000025902.13;SE:chr1:4492668-4493100:4493466-4493772:- 0.50685587094156 0.49103663421745114
ENSMUSG00000025902.13;SE:chr1:4492668-4493100:4493490-4493772:- 0.7405788514940302 0.697045848727448
ENSMUSG00000025902.13;SE:chr1:4493863-4495136:4495942-4496291:- 0.20508467461294735 0.16355428896798765
## it also extracts TPM values from the transcript quantification files
$ head result/fb_e11_based/psi/fb_SE_e10.tpm -n 3
Name fb_e10.5_rep1 fb_e10.5_rep2
ENSMUST00000193812.1 0.0 0.0
ENSMUST00000082908.1 0.0 0.0
generatePsi arguments:
- -o: output path
- -qf: transcript quantification files
- -ioe: AS event ioe file
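To make the PSI values above concrete, here is a minimal Python sketch of the SUPPA2-style definition: the summed TPM of the transcripts that define one form of the event, divided by the summed TPM of all transcripts listed for that event in the ioe file. The function name event_psi and the transcript IDs are hypothetical, and ASTK may apply additional filters that are not shown here.

```python
import math
from typing import Dict, Iterable


def event_psi(
    tpm: Dict[str, float],
    inclusion: Iterable[str],
    total: Iterable[str],
) -> float:
    """Return inclusion TPM divided by total TPM for one event in one sample."""
    total_tpm = sum(tpm.get(tx, 0.0) for tx in total)
    if total_tpm == 0:
        # PSI is undefined when none of the event's transcripts is expressed.
        return math.nan
    return sum(tpm.get(tx, 0.0) for tx in inclusion) / total_tpm


# Hypothetical TPM values for two transcripts of one gene.
sample_tpm = {"tx_with_exon": 3.0, "tx_without_exon": 1.0}
print(event_psi(sample_tpm, ["tx_with_exon"], ["tx_with_exon", "tx_without_exon"]))
```

Returning NaN for unexpressed events keeps them distinguishable from events with a true PSI of 0.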
diffSplice is used to perform AS differential splicing analysis. The core algorithm is based on SUPPA2.
$ astk diffSplice -psi result/fb_e11_based/psi/fb_SE_e1*.psi \
-exp result/fb_e11_based/psi/fb_SE_e1*.tpm \
-ref result/fb_e11_based/ref/gencode.vM25_SE_strict.ioe \
-o result/fb_e11_based/dpsi/fb_SE_e10_p0.dpsi
diffSplice arguments:
- -psi: AS events PSI files
- -exp: transcripts TPM expression files
- -ref: ioe reference file
- -o: output file
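The quantity reported in a dpsi file can be sketched as the difference between the mean PSI of the two conditions. The Python snippet below assumes the SUPPA2 convention (condition 2 minus condition 1) and skips NaN values; the function name delta_psi is hypothetical, and the significance calculation is not covered.

```python
import math
from statistics import mean
from typing import Sequence


def delta_psi(control: Sequence[float], treatment: Sequence[float]) -> float:
    """Mean PSI of the treatment replicates minus mean PSI of the control replicates."""
    ctrl = [value for value in control if not math.isnan(value)]
    case = [value for value in treatment if not math.isnan(value)]
    if not ctrl or not case:
        # No usable replicate in at least one condition.
        return math.nan
    return mean(case) - mean(ctrl)


# Control values taken from the PSI file shown earlier; treatment values are made up.
print(delta_psi([0.50685587094156, 0.49103663421745114], [0.80, 0.78]))
```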
dsflow is a re-implementation of, and wrapper around, the SUPPA2 sub-command functions (generateEvent, generatePsi and diffSplice). It simplifies the differential splicing analysis workflow.
dsflow arguments:
- -od: output directory
- -md: meta data, the meta output json file
- -gtf: genome annotation GTF file
- -t: alternative splicing type, ALL is for all supported types.
- -m: empirical or classical, the method to calculate the significance, default=empirical
- -p: p-value threshold
- -adpsi: absolute dPSI threshold
The output of dsflow contains five directories:
- ref contains the AS event reference annotation files
- tpm contains the sample TPM files
- psi contains the AS event PSI files
- dpsi contains the differential splicing results
- sig01 contains the results filtered with pval < 0.05 and |dPSI| > 0.1
sigFilter filters significant differential splicing events according to dPSI and p-value. It outputs the significant differential splicing events and their associated PSI files. sf is the short alias of sigFilter.
sf arguments:
- -i: input dpsi file
- -od: output directory
- -adpsi: absolute dPSI threshold value
- -p: p-value threshold value, default=0.05
- -dpsi: dpsi threshold value, default=0
- -sep: split dpsi file into two files according to dpsi > 0 and dpsi < 0
- -app: the program that generates event file, default='auto'
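The filtering rule described by -adpsi, -p and -sep can be sketched in a few lines of Python. The DpsiRow type, the sig_filter function and the "up"/"down"/"all" keys are hypothetical names for illustration; the sketch does not read real dpsi files and does not model the -dpsi option.

```python
from typing import Dict, Iterable, List, NamedTuple


class DpsiRow(NamedTuple):
    event_id: str
    dpsi: float
    pval: float


def sig_filter(
    rows: Iterable[DpsiRow],
    adpsi: float = 0.1,
    p: float = 0.05,
    sep: bool = False,
) -> Dict[str, List[DpsiRow]]:
    """Keep rows with pval < p and |dPSI| > adpsi, optionally split by dPSI sign."""
    significant = [row for row in rows if row.pval < p and abs(row.dpsi) > adpsi]
    if not sep:
        return {"all": significant}
    return {
        "up": [row for row in significant if row.dpsi > 0],
        "down": [row for row in significant if row.dpsi < 0],
    }


rows = [
    DpsiRow("ev1", 0.30, 0.01),
    DpsiRow("ev2", -0.25, 0.02),
    DpsiRow("ev3", 0.05, 0.001),  # dropped: |dPSI| too small
    DpsiRow("ev4", 0.40, 0.20),   # dropped: p-value too large
]
print(sig_filter(rows, sep=True))
```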
psiFilter filters AS events by PSI value. pf is its short alias.
$ astk pf -i result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
-psi 0.8 -o result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi
$ astk pf -i result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
-psi -0.2 -o result/fb_e11_based/psi/fb_e11_p0_SE_c2_02.psi
psiFilter arguments:
- -i: input psi file
- -psi: PSI threshold. A positive value selects AS events with PSI > value; a negative value selects AS events with PSI < abs(value)
- -o: output file
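The sign convention of -psi is easy to misread, so here is a small Python sketch of it. It assumes one PSI value per event; how ASTK reduces several replicate columns to a single value is not stated in this manual. The function names are hypothetical.

```python
from typing import Dict


def passes_psi_threshold(psi: float, threshold: float) -> bool:
    """Positive threshold: keep PSI > threshold. Negative: keep PSI < abs(threshold)."""
    if threshold > 0:
        return psi > threshold
    if threshold < 0:
        return psi < abs(threshold)
    # The manual does not define the behaviour for a threshold of 0.
    raise ValueError("threshold must be non-zero")


def psi_filter(psi_by_event: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Return only the events whose PSI passes the threshold."""
    return {
        event: psi
        for event, psi in psi_by_event.items()
        if passes_psi_threshold(psi, threshold)
    }


psi_values = {"ev1": 0.9, "ev2": 0.5, "ev3": 0.1}
print(psi_filter(psi_values, 0.8))   # high-inclusion events, as with -psi 0.8
print(psi_filter(psi_values, -0.2))  # low-inclusion events, as with -psi -0.2
```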
lenCluster clusters AS events based on alternative exon/intron length. lc is the short alias of lenCluster.
lc arguments:
- -i: input dpsi files
- -lr: length range
- -od: output directory
The pca sub-command performs PCA on PSI values.
$ astk pca -i result/fb_e11_based/psi/fb_e11_12_SE_c1.psi \
result/fb_e11_based/psi/fb_e11_1[2-6]_SE_c2.psi \
result/fb_e11_based/psi/fb_e11_p0_SE_c2.psi \
-o img/fb_pca.png -fmt png --width 6 --height 4
pca arguments:
- -i : PSI files
- -o: output figure
heatmap plots a heatmap of PSI values. hm is the short alias of heatmap.
$ astk hm -i result/fb_e11_based/sig01/psi/fb_e11_12_SE_c1.sig.psi \
result/fb_e11_based/sig01/psi/fb_e11_1*_SE_c2.sig.psi \
-o img/fb_hm.png -fmt png
heatmap arguments:
- -i: PSI files
- -o: output figure path
barplot draws a bar plot showing the distribution of AS event counts across conditions.
$ astk barplot -i result/fb_e11_based/sig01/fb_e11_p0_*.sig.dpsi \
-o img/fb_e11_p0_bar.png -dg -xl A3 A5 AF AL MX RI SE
barplot arguments:
- -i: input dpsi files; multiple files are supported
- -o: output figure
- -dg: AS events can be divided into two groups based on dPSI values (group +: dPSI > 0, group -: dPSI < 0)
- -xl: x labels
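The counts behind such a bar plot with -dg can be sketched as follows. The Python snippet takes the event type from the event ID (the text between ";" and the first ":", as in the PSI file shown earlier) and counts events per type and dPSI sign. The function name and the example IDs are hypothetical, and no plotting is done.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_by_type_and_sign(events: Iterable[Tuple[str, float]]) -> Counter:
    """Count events per (event type, '+' or '-') from (event_id, dPSI) pairs."""
    counts: Counter = Counter()
    for event_id, dpsi in events:
        # "GENE;SE:chr1:..." -> "SE"
        event_type = event_id.split(";", 1)[1].split(":", 1)[0]
        if dpsi > 0:
            counts[(event_type, "+")] += 1
        elif dpsi < 0:
            counts[(event_type, "-")] += 1
    return counts


events = [
    ("geneA;SE:chr1:100-200:300-400:+", 0.3),
    ("geneB;SE:chr1:500-600:700-800:-", -0.2),
    ("geneC;RI:chr2:100:200-300:400:+", 0.15),
]
print(count_by_type_and_sign(events))
```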
volcano draws a volcano plot of dPSI values. vol is the short alias of volcano.
$ astk volcano -i result/fb_e11_based/dpsi/fb_e11_p0_SE.dpsi \
-o img/fb_e11_p0_SE.vol.png
vol arguments:
- -i: input dpsi files; multiple files are supported
- -o: output figure
upset draws an UpSet plot of the AS events in dpsi files.
$ astk upset -i result/fb_e11_based/sig01/fb_e11_12_SE.sig.dpsi \
result/fb_e11_based/sig01/fb_e11_14_SE.sig.dpsi \
result/fb_e11_based/sig01/fb_e11_16_SE.sig.dpsi \
-o img/fb_upset.png -xl e11_12 e11_14 e11_16
upset arguments:
- -i: input dpsi files; multiple files are supported
- -o: output file
- -xl: x labels
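An UpSet plot shows, for each combination of input sets, how many items belong to exactly that combination. The Python sketch below computes those counts from sets of event IDs; the function name upset_counts and the event IDs are hypothetical, and it does not reproduce ASTK's plotting.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def upset_counts(event_sets: Dict[str, Set[str]]) -> "Counter[FrozenSet[str]]":
    """Count events per exact combination of the sets that contain them."""
    counts: Counter = Counter()
    all_events = set().union(*event_sets.values())
    for event in all_events:
        members = frozenset(
            label for label, events in event_sets.items() if event in events
        )
        counts[members] += 1
    return counts


event_sets = {
    "e11_12": {"ev_a", "ev_b"},
    "e11_14": {"ev_b", "ev_c"},
    "e11_16": {"ev_b"},
}
for combination, count in upset_counts(event_sets).items():
    print(sorted(combination), count)
```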
gcc computes the GC content of the flanking region of each splice site.
gcc arguments:
- -e: AS event file
- -od : output directory
- -fi: genome fasta file
- -bs: bin size or slide window size
- -ef: exon flank width
- -if: intron flank width
- --includeSS: include splice site flank region when computing GC content
- -app: the program that generates event file, default="app"
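The idea behind -bs can be sketched as a windowed GC computation over a flanking sequence. In this Python sketch, a step equal to the window size gives non-overlapping bins and a smaller step gives a sliding window; which of the two gcc uses for a given setting is not stated here, and the function names are hypothetical.

```python
from typing import List, Optional


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: Optional[int] = None) -> List[float]:
    """GC content of each full window; step defaults to window (non-overlapping bins)."""
    step = window if step is None else step
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


print(windowed_gc("GGCCAATT", 4))  # two non-overlapping bins
print(windowed_gc("GCAT", 2, 1))   # sliding window with step 1
```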
sss computes 5'/3' splice site strength.
sss arguments:
- -e: AS event file
- -od : output directory
- -fi: genome fasta file
- -app: the program that generates event file, default="app"
- -p: process number, default=4
elen computes exon or intron length.
elen arguments:
- -e: AS event file
- -od : output directory
- -log: log2 transformation
- -app: the program that generates event file, default="app"
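For SE events, the alternative exon length can be read straight from the event ID. The Python sketch below assumes the SUPPA2-style layout gene;SE:chrom:e1-s2:e2-s3:strand, in which s2 and e2 are the start and end of the skipped exon, and 1-based inclusive coordinates as in GTF. The names SkippedExon and parse_se_event are hypothetical, and other event types are not handled.

```python
import math
from typing import NamedTuple


class SkippedExon(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    end: int
    strand: str

    @property
    def length(self) -> int:
        # Assumes 1-based inclusive coordinates.
        return self.end - self.start + 1


def parse_se_event(event_id: str) -> SkippedExon:
    """Extract the skipped exon from an SE event ID."""
    gene_id, rest = event_id.split(";", 1)
    event_type, chrom, upstream, downstream, strand = rest.split(":")
    if event_type != "SE":
        raise ValueError(f"not an SE event: {event_id}")
    exon_start = int(upstream.split("-")[1])   # s2
    exon_end = int(downstream.split("-")[0])   # e2
    return SkippedExon(gene_id, chrom, exon_start, exon_end, strand)


exon = parse_se_event(
    "ENSMUSG00000025900.13;SE:chr1:4293012-4311270:4311433-4351910:-"
)
print(exon.length, math.log2(exon.length))  # raw and log2-transformed length
```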
enrich performs GO term enrichment analysis of genes. GO term enrichment map networks and enrichment clustering are provided.
$ mkdir img/enrich -p
$ astk enrich -i result/fb_e11_based/sig01/fb_e11_13_SE.sig.dpsi \
-ont BP -qval 0.1 -orgdb mm -fmt png \
-od img/enrich/fb_e11_13_SE
enrich arguments:
- -i: dpsi file
- -ont: ontology
- -od : output directory
- -qval : q-value
- -org : organism, for example, ‘hs’ for human, ‘mm’ for mouse
The GO term enrichment results and the enrichment clustering are output in both figure and text formats.
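As background for interpreting the enrichment output, GO over-representation is commonly assessed with a one-sided hypergeometric test. The Python sketch below shows that generic test; it is an illustration of the statistic, not a description of how ASTK computes its p-values or q-values, and the function and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(hits: int, gene_set_size: int, annotated: int, background: int) -> float:
    """P(X >= hits) when drawing gene_set_size genes from a background in which
    `annotated` genes carry the GO term (hypergeometric upper tail)."""
    upper = min(gene_set_size, annotated)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, gene_set_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(background, gene_set_size)


# 5 of 5 AS genes carry a term that annotates 5 of 10 background genes.
print(enrichment_pvalue(5, 5, 5, 10))
```

In practice the p-values of all tested terms are then adjusted for multiple testing, which is what a q-value cutoff such as -qval applies to.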
enrichCompare compares the functional characteristics of different AS gene sets; its short alias is ecmp.
Comparison between dpsi > 0.1 and dpsi < 0.1 in group 7 (fb_16.5 vs fb_p0):
$ mkdir img/ecmp
$ astk ecmp -i result/fb_e11_based/lenc/*/fb_e11_12_SE.sig.dpsi \
-ont BP -org mm -fmt png \
-od img/enrich/fb_e11_12_SE_lc
enrichCompare arguments:
- -i : dpsi files
- -od : output directory
- -ont : ontology
- -qval : q-value
- -org : organism, for example, ‘hs’ for human, ‘mm’ for mouse
nease performs enrichment analysis of AS events using NEASE.
nease arguments:
- -i: dpsi file
- -od : output directory
- -qval : q-value
- -org: organism; only human is supported
- -db: enrichment database supported by NEASE, one of [PharmGKB|HumanCyc|Wikipathways|Reactome|KEGG|SMPDB|Signalink|NetPath|EHMN|INOH|BioCarta|PID]
motifEnrich performs motif enrichment within the sequences flanking splice sites, using an RBP motif database. me is its short alias.
$ astk me -te result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
-ce result/fb_e11_based/psi/fb_e11_p0_SE_c2_02.psi \
-od img/motif/fb_e11_p0_SE_me -org mm \
-fi GRCm38.primary_assembly.genome.fa
Arguments:
- -te: input treatment event file
- -ce: input control event file
- -od: output directory
- -org: organism
- -fi: genome FASTA file; an index is required
motifFind performs motif discovery and compares the discovered motifs with known RBP motifs. mf is its short alias.
$ astk mf -te result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
-od img/motif/fb_e11_p0_SE_mf -org mm \
-fi GRCm38.primary_assembly.genome.fa
Arguments:
- -te: input treatment event file
- -od: output directory
- -org: organism
- -fi: genome FASTA file; an index is required
getmeme queries ASTK's built-in motif data.
$ astk getmeme M316_0.6 M083_0.6 -db CISBP-RNA \
-org mm -o img/motif/query.meme
Arguments:
- MOTIFID...: input motif IDs
- -db: motif database
- -org: organism
- -o: output file path
motifPlot draws motif figures from motif MEME data. mp is its short alias.
$ astk mp -mi M083_0.6 -db CISBP-RNA -org mm \
-o img/motif/M083_0.6_plot.png -w 10
Arguments:
- -mi: input motif IDs
- -db: motif database
- -org: organism
- -o: output file path
mmap generates a motif map showing how motifs are distributed.
$ astk mmap -e result/fb_e11_based/psi/fb_e11_p0_SE_c2_08.psi \
-n a1 a2 a3 a4 -c 150 150 150 150 \
-mm img/motif/query.meme -od img/motif/motif_map \
-fi GRCm38.primary_assembly.genome.fa
Arguments:
- -fa: input fasta files
- -n: fasta files names
- -c: center positions
- -mm: motif meme file
- -od: output directory
- -fi: genome FASTA file; an index is required
signalProfile profiles the chromatin signal in the regions flanking splice sites.
$ astk pf -i result/fb_e11_based/psi/fb_e11_16_AF_c2.psi \
-psi 0.8 -o result/fb_e11_based/psi/fb_16_AF_08.psi
$ astk signalProfile -o img/fb_16_AF_low_ATAC.png \
-e result/fb_e11_based/psi/fb_16_AF_08.psi \
-bw ATAC.e16.5.fb.bigwig \
-ssl A1 A2 A3 A4 A5 -fmt png
Arguments:
- -o: output file
- -e: AS event file containing AS event IDs
- -bw: bigwig file
- -ssl: splicing site labels
- -fmt: figure format
elms searches for Eukaryotic Linear Motifs within the amino acid sequences encoded by alternative exon sequences.
$ astk elms -i result/fb_e11_based/sig01/fb_e11_p0_SE.sig.dpsi -g mm10 -o img/elm.csv
Arguments:
- -i: dpsi file path
- -g: genome assembly
- -o: output
ERROR: compilation failed for package ‘magick’
$ astk install -r
ERROR: compilation failed for package ‘magick’
* removing ‘/home/user/software/anaconda/envs/astk/lib/R/library/magick’
You can install the package manually and then re-run astk install -r. If you use conda, you can do it like this:
$ conda install -c conda-forge r-magick -y
...
$ astk install -r
namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.0 is required
Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) :
namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.0 is required
Calls: source ... asNamespace -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
You can manually upgrade rlang to a newer version. For example:
# R console
> install.packages("rlang")