-
Notifications
You must be signed in to change notification settings - Fork 5
mandricigor/isoem2
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
isoem2/isoDE2 README 1. Installation: ------------ 1. Create a isoem2 directory and download the git repository using git clone https://github.com/mandricigor/isoem2.git 3. run the linux shell script 'install' provided in the git repository 4. [Optional] On windows you might want to add the isoem2/isoDE2 installation directory to the path, such that you can invoke isoem2 from any location. On linux, you can obtain a similar effect by creating a symbolic link to the isoem2 and isoDE2 executables in /usr/local/bin. 2. Testing your installation: -------------------------- To test the installation of isoem2 and isoDE2, download and unzip the following compressed archive and follow the instructions in the README file included in the archive. http://dna.engr.uconn.edu/~software/IsoEM/testdata/IsoEM2IsoDE2-MAQC-Sample.zip 3. Running isoem2: ----------------- isoem2 takes as input a set of known isoforms in GTF format, and a file with aligned reads in SAM format. The aligned reads MUST be sorted by read name. If not sure, run this command to sort the file: sort -k 1,1 aligned_reads.sam > aligned_reads_sorted.sam You can run isoem2 from the command line as follows: isoem2 [global options]* [library options]* <aligned_reads.sam> Or, if you run provide read alignments from the standard input: cat <aligned_reads.sam> | isoem2 [global options]* [library options]* Mandatory global options: ------------------------ -G, --GTF <GTF file> Known genes and isoforms in GTF format Mandatory library options: either -a or both -m and -d must be present: ------------------------- -m, --fragment-mean <Double> Fragment length mean -d, --fragment-std-dev <Double> Fragment length standard deviation -a, --auto-fragment-distrib Automatically detect fragment length distribution from uniquely mapping paired reads (DOES NOT WORK FOR SINGLE READS) Optional global options: ----------------------- -c, --gene-clusters <Cluster file> Override isoform to gene mapping defined in the GTF file with a mapping taken from the given file. The format of each line in the file is "isoform gene" -g <genome fasta file> Genome reference sequence (needed by some library options) -b Perform hexamer bias correction -h, --help Show help -r <Repeats GTF> Drop alignments falling within annotated repeats Optional library options: ------------------------ -s, --directional Dataset obtained by directed RNA-Seq (the strand of each read is deterministically chosen: for single reads, the read always comes from the coding strand; for paired reads, the first read always comes from the coding strand, the second from the opposite strand) --antisense Directional sequencing but the reads come from the antisense --mate-pairs Paired reads come from the same strand (as opposed to the default behavior where the two reads in a pair are assumed to come from opposite strands) --max-mismatches <Integer> Maximum number of mismatched allowed for a read. This requires the genome sequence to be specified (see -g). -q, --quality-scores Weigh the reads based on their quality scores. This requires the genome sequence to be specified (see -g). --repeat-threshold <nbases> Drop all reads that have more than this many bases inside annotated repeats. Default: 20. --polyA <nbases> Reads have been generated from mRNAs with polyA tails of approximately the given number of bases -o <file prefix> Output files prefix. It can include path. Default: same as sam file name -O <directory prefix> Output directory prefix. If read alignments are read from stdin, the default value is stdinSample -C <confidence interval (%)> Compute expression of genes/isoforms with specified confidence intervals. Provide an integer (default: 95, bootstraps: 200) --endseq Disable length normalization for data generated using 5' or 3' end-sequen- cing protocols, which generate a single fragment per cDNA molecule Output ------ isoem2 generates the following output files structure under a directory with the same name as the sam file, unless the -o is used <output_directory> | - output | | | - Isoforms | | | | | - iso_fpkm_estimates | | - iso_tpm_estimates | - Genes | | | - iso_fpkm_estimates | - iso_tpm_estimates - ConfidenceIntervals (Only if -C option is used) | | | - iso_fpkm_ci | - iso_tpm_ci | - gene_fpkm_ci | - gene_tpm_ci - boostrap.tar.gz Files under output/Isoforms and output/Genes are tab delimited files with the following two fields 1- Isoform/Gene ID 2- Isoform/Gene FPKM (Fragments Per Kilobase per Million reads) or TPM (Transcripts per Million reads) Files under output/ConfidenceIntervals are tab delimited files with the following three fields 1- Isoform/Gene ID 2- Lower-bound for the 95% confidence interval of the Isoform/Gene FPKM/TPM estimate determined by bootstrapping 3- Upper-bound for the 95% confidence interval of the Isoform/Gene FPKM/TPM estimate determined by bootstrapping boostrap.tar.gz is a compressed tar archive containing bootstrap samples used to determine confidence intervals. This archive can be used as input to the isoDE2 tool for computing differentially expressed isoforms/genes. Note: Read Alignment: --------------------- To align the reads you have one of two options: 1) Use spliced alignment directly on the genome 2) Use unspliced alignment to the transcriptome. If you have a transcriptome reference and no GTF (needed to run isoem2), you can use the fastaToGTF tool, included with the isoem2 suite, to generate a GTF. If you want to generate a transcriptome reference using a GTF, you can use the extract-isoform-sequences-from-genome tool, included with the isoem2 suite 4. Running isoDE2 ---------------- isoDE2 makes DE calls for gene/isoform FPKM and TPM estimated using the boostrapping output generated by isoem2 isoDE2 -c1 <List of boostraping path for condition 1> -c2 <List of boostraping path for condition 2> -pval <desired p value> -out <output-files-prefix> Mandatory parameters -------------------- -c1 List of bootstrapping compressed archives for condition 1 -c2 List of bootstrapping compressed archives for condition 2 -pval pval -out prefix for generated output files Output ------ 4 files with the prefix specifies as input and the following suffixes geneFPKM geneTPM isoFPKM isoTPM All four output files have the same structure, described below Description of isoDE2 Output file: --------------------------------- 1- Gene/isoform ID 2- Confident log2(FC): the base 2 logarithm of the largest condition 2 vs condition 1 fold change of gene/isoform FPKM/TPM estimates supported by the bootstrap samples at a significance level of 'pval'. Positive values represent over-expression in condition 2, negative values representing over-expression in condition 1, and zero values indicate that no significant change was detected. 3- Single run log2(FC): the base 2 logarithm of the ratio between expression levels estimated by isoem2 for condition 2 and condition 1 (or the mean estimates in case replicates are provided for the two conditions). 4- condition 1 FPKM (or TPM) based on isoem2 run without bootstrapping (mean value in case of replicates) 5- condition 2 FPKM (or TPM) based on isoem2 run without bootstrapping (mean value in case of replicates) Example ------- isoDE2 -c1 /data1/BRAIN_UHR_Test/BRAIN_Genome_DIR/ /DataSet1/Test1_DIR/ -c2 /data1/BRAIN_UHR_Test/UHR_Genome_DIR/ /DataSet1/Test2_DIR/ -pval 0.05 -out "output1.txt" isoDE2 -c1 ./BRAIN_Genome_DIR/ ./Test1_DIR/ -c2 ./UHR_Genome_DIR/ ./DataSet1/Test2_DIR/ -pval 0.05 -out "output2.txt" Source Code: ------------ The source code can be found in the src directory under the installation path. Revision history ---------------- Version 2.0.0 (1/20/16) - added TPM estimates for genes and isoforms - added option to compute confidence intervals (bootstrapping) - added option for reading alignments from standard input - integrated IsoDE with IsoEM - Added DE for isoform FPKMs and genes and isoforms TPMs - Removed the isoviz visualization tool. To be added to the isoem2 suite in the future Version 1.1.4 (12/18/15) - added --counts option to generate expected read counts and --endseq to handle data from end-sequencing protocols Version 1.1.3 (10/11/15) - bug fix in handling CIGAR with indels in convert-iso-to-genome-coords - bug fix related to hisat/hisat2 alignments Version 1.1.1 (11/5/12) - bug fix related to clipped read alignments (CIGAR with S field) Version 1.1.0 (4/24/12) - added support for alignments with insertions and deletions Version 1.0.6 (8/12/11) - extract-isoform-sequences-from-genome (see http://dna.engr.uconn.edu/software/IsoEM/README-SAMPLE.TXT) generates transcripts in a randomized order - isoviz generates a gtf with fpkm values - added output file name option Version 1.0.5 (5/08/11) - bugfix related to paired read data Version 1.0.4 (2/22/11) - added polyATail option - further memory and speed improvements Version 1.0.3 (8/30/10) - correct for annotated repeats Version 1.0.2 (8/05/10) - improved memory requirements for storing genome sequence - added hexamer bias correction option - added isoviz visualization tool Version 1.0.1 (6/25/10) - added support for mate pairs - added support for max number of mismatches - performance improvements Version 1.0.0 (6/16/10) - first public release Contact ------- For questions or suggestions regarding IsoEM2/IsoDE2 you can contact: Igor Mandric (imandric1@student.gsu.edu) Sahar Al Seesi (sahar@engr.uconn.edu) Ion Mandoiu (ion@engr.uconn.edu) Alex Zelikovsky (alexz@cs.gsu.edu)
About
IsoEM2: fast bootstrap-based estimation of gene and isoform expression using RNA-Seq data
Resources
Stars
Watchers
Forks
Packages 0
No packages published