Install required software packages according to requirements
Download the scripts:
git clone https://github.com/lulab/exSeek-dev.git
Processed files: /BioII/lulab_b/shared/genomes/hg38
Refer to the documentation for details.
snakemake --snakefile snakemake/prepare_genome.snakemake \
--configfile snakemake/config.yaml \
--rerun-incomplete -k
File name | Description |
---|---|
${input_dir}/fastq/${sample_id}.fastq |
Read files (single-end sequencing) |
${input_dir}/fastq/${sample_id}_1.fastq , ${input_dir}/fastq/${sample_id}_2.fastq |
Read files (paired-end sequencing) |
${input_dir}/sample_ids.txt |
A text file with one sample ID per line. |
${input_dir}/sample_classes.txt |
A tab-deliminated file (with header) with two columns: sample_id, label |
${input_dir}/batch_info.txt |
A comma-deliminated file (with header) with at least two columns: sample_id, batch1, batch2, ... |
${input_dir}/reference_genes.txt |
A text file with reference gene IDs. |
${input_dir}/compare_groups.yaml |
A YAML file defining positive and negative classes. |
compare_groups.yaml
Every key-value pairs defines a compare group and a negative-positive class pair:
Normal-CRC: ["Healthy Control", "Colorectal Cancer"]
All parameters are specified in a configuration file in YAML format.
An example configuration file is (snakemake/config.yaml).
The parameter values in the configuration file can also be overrided through the --config
option in snakemake.
The following parameters should be changed:
Parameter | Description | Example |
---|---|---|
genome_dir |
Directory for genome and annotation files | genome/hg38 |
data_dir |
Directory for input files | data/scirep |
temp_dir |
Temporary directory | tmp |
sample_id_file |
A text file containing sample IDs | data/scirep/sample_ids.txt |
output_dir |
Directory for all output files | output/scirep |
tools_dir |
Directory for third-party tools | |
aligner |
Mapping software | bowtie2 |
adaptor |
3' adaptor sequence for single-end RNA-seq | AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC |
python2 |
Path to Python 2 | /apps/anaconda2/bin/python |
python3 |
Path to Python 3 | /apps/anaconda2/bin/python |
Option | Description |
---|---|
--config | Additional configuration parameters |
-j | Number of parallel jobs |
--dryrun | Do not execute |
-k | Do not stop when an independent job fails |
Please refer the link for descriptions of cluster configuration file.
Configuration file: snakemake/cluster.yaml
Here is an example configuration:
__default__:
queue: Z-LU
name: {rule}.{wildcards}
stderr: logs/cluster/{rule}/{wildcards}.stderr
stdout: logs/cluster/{rule}/{wildcards}.stdout
threads: {threads}
resources: span[hosts=1]
Commonly used parameters
Parameter | Description |
---|---|
__default__ |
Rule name (__default__ ) for default configuration) |
queue |
Queue name |
name |
Job name |
stderr |
Log file for standard error |
stdout |
Log file for standard output |
threads |
Number of parallel threads for a job |
resources |
Resource requirements. span[hosts=1] prevents parallel jobs from being submitted to different nodes |
Run snakemake
snakemake --snakefile snakemake/${snakefile} \
--configfile snakemake/config.yaml \
--cluster 'bsub -q {cluster.queue} -J {cluster.name} -e {cluster.stderr} \
-o {cluster.stdout} -R {cluster.resources} -n {cluster.threads}' \
--cluster-config snakemake/cluster.yaml \
--rerun-incomplete -k -j40
Note: replace ${snakefile}
with a Snakefile.
snakemake --snakefile snakemake/quality_control.snakemake \
--configfile snakemake/config.yaml \
--rerun-incomplete -k
bin/generate_snakemake.py sequential_mapping --rna-types rRNA,miRNA,piRNA,Y_RNA,srpRNA,tRNA,snRNA,snoRNA,lncRNA,mRNA,tucpRNA \
-o snakemake/mapping_small/sequential_mapping.snakemake
snakemake --snakefile snakemake/mapping_small.snakemake \
--configfile snakemake/config.yaml \
--rerun-incomplete -k
File name | Descrpition |
---|---|
snakemake/sequential_mapping.snakemake |
Snakefile for sequential mapping. Required by snakemake/mapping_small.snakemake |
${output_dir}/cutadapt/${sample_id}.fastq |
Reads with adaptor trimmed |
${output_dir}/tbam/${sample_id}/${rna_type}.bam |
BAM files in transcript coordinates |
${output_dir}/gbam/${sample_id}/${rna_type}.bam |
BAM files in genome coordinates |
${output_dir}/unmapped/${sample_id}/${rna_type}.fa.gz |
Unmapped reads in each step |
${output_dir}/fastqc/${sample_id}_fastqc.html |
FastQC report file |
${output_dir}/summary/fastqc.html |
Summary report for FastQC (HTML) |
${output_dir}/summary/fastqc.txt |
Summary table for FastQC |
${output_dir}/summary/fastqc.ipynb |
Summary report for FastQC (Jupyter notebook) |
${output_dir}/summary/read_counts.txt |
Summary table for read counts |
${output_dir}/stats/mapped_read_length_by_sample/${sample_id} |
Length distribution of mapped reads |
snakemake --snakefile snakemake/expression_matrix.snakemake \
--configfile snakemake/config.yaml \
--rerun-incomplete -k
File name | Descrpition |
---|---|
${output_dir}/count_matrix/transcript.txt |
Count matrix of transcripts |
${output_dir}/count_matrix/htseq.txt |
Count matrix of genes generated using HTSeq-count |
${output_dir}/count_matrix/featurecounts.txt |
Count matrix of genes generated using featureCounts |
${output_dir}/counts_by_biotype/${count_method}/${sample_id}/${rna_type} |
Gene/transcript counts generated using a feature counting tool |
Count matrix
- File path:
${output_dir}/count_matrix/transcript.txt
- First row: sample IDs
- First column: feature names
- Feature name:
gene_id|gene_type|gene_name
snakemake --snakefile snakemake/call_domains_long.snakemake \
--configfile snakemake/config.yaml \
--rerun-incomplete -k
File name | Descrpition |
---|---|
${output_dir}/domain_counts/${bin_size}/${pvalue}/${sample_id}.bed |
Read counts in long RNA domains (BED format with read counts in Column 5 |
${output_dir}/count_matrix/domain_${pvalue}.txt |
Read count matrix of long RNA domains |
${output_dir}/domains/${bin_size}/${pvalue}.bed |
Long RNA domain locations |
${output_dir}/domains_recurrence/${bin_size}/${pvalue}.bed |
Recurrence of long RNA domains among samples (Column 5) |
Read count matrix
- File path:
${output_dir}/count_matrix/domain_long.txt
- First row: sample IDs
- First column: feature names
- Feature name:
gene_id|gene_type|gene_name|domain_id|transcript_id|start|end
| File name | Description |
| ${output_dir}/normalized_matrix/${normalization_method}.${imputation_method}.${batch_removal_method}.txt
|
| ${output_dir}/matrix_processing/normalization/${normalization_method}.txt
|
| ${output_dir}/matrix_processing/imputation/${normalization_method}.${imputation_method}.txt
|
| ${output_dir}/matrix_processing/batch_removal/${batch_removal_method}.${batch_index}.txt
|