Snakemake pipeline for calculating omic features required by F3UTER

sid-sethi/Generate-F3UTER-features

Pipeline for calculating omic features required by F3UTER

This Snakemake pipeline generates omic features for a set of input candidate regions. The features fall into two broad categories: genomic and transcriptomic. Features calculated from genomic data include poly(A) signal occurrence, DNA sequence conservation, mono-/di-nucleotide frequency, transposon occurrence and DNA structural properties. Features calculated from transcriptomic data include the entropy efficiency of the mapped reads (EE), a measure of the uniformity of read coverage over a genomic region, and the percentage difference between the reads mapped at the region boundaries (PD), a measure of the absolute difference in coverage between the two boundaries.
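To make the two transcriptomic features more concrete, here is a minimal sketch. It assumes that EE is the Shannon entropy of the per-base coverage divided by its maximum, log(N), and that PD is the standard percentage difference, the absolute difference divided by the mean. These formulas and the function names `entropy_efficiency` and `percent_difference` are illustrative assumptions, and the pipeline's own calculation may differ.

```shell
# Assumed definition: Shannon entropy of coverage divided by log(N).
# Arguments are per-base coverage values for one region.
entropy_efficiency() {
  printf '%s\n' "$@" | awk '
    { cov[NR] = $1; total += $1 }
    END {
      if (NR < 2 || total == 0) { print "NA"; exit }
      for (i = 1; i <= NR; i++) {
        if (cov[i] > 0) { p = cov[i] / total; h -= p * log(p) }
      }
      printf "%.3f\n", h / log(NR)
    }'
}

# Assumed definition: 100 * |a - b| / mean(a, b), where a and b are the
# coverage values at the two boundaries of the region.
percent_difference() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    if (a + b == 0) { print "NA"; exit }
    d = a - b; if (d < 0) d = -d
    printf "%.1f\n", 100 * d / ((a + b) / 2)
  }'
}
```

Under these assumptions, perfectly uniform coverage (for example `entropy_efficiency 5 5 5 5`) gives 1.000, and coverage concentrated on a single base gives 0.000.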

Getting Started

Input

  • Regions of interest: either ERs (the output of Generate ERs) or a standard BED file. A BED file must contain at least the following columns: seqnames, start, end, strand and ER_id. See /test_data for a demo file. Contig names in the BED file should be UCSC style: chr1, chr2, ..., chrM.
  • Aligned RNA-seq reads in bigwig format, used to calculate the expression features. Multiple RNA-seq replicates can be provided; with multiple bigwigs, the mean expression is calculated. Contig names in the bigwig should be Ensembl style: 1, 2, ..., MT.
  • Chromosome lengths: required by the code. A file containing chromosome lengths for hg38 is provided in /data and is used automatically by the pipeline.
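Because the BED file and the bigwig use different contig naming styles, a quick check of the BED file before running the pipeline can save a failed run. The sketch below is not part of the pipeline; the function name `non_ucsc_contigs` is hypothetical, and it assumes a whitespace- or tab-delimited file whose optional header row starts with `seqnames`.

```shell
# Print every distinct contig name in column 1 that is not UCSC style
# (that is, does not start with "chr"). No output means the file looks fine.
non_ucsc_contigs() {
  awk 'NR == 1 && $1 == "seqnames" { next }   # skip an optional header row
       $1 !~ /^chr/ { print $1 }' "$1" | sort -u
}
```

Calling `non_ucsc_contigs regions.bed` (with `regions.bed` standing in for your own file) lists any Ensembl-style names such as `2` or `MT` that would need renaming.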

External data resources:

Process the RepeatMasker file using the commands below. See /test_data for a demo file.

gunzip hg38.fa.out.gz
awk -F " " '{print $5" "$6" "$7" "$11}' hg38.fa.out | sed 1d > hg38.repeatMasker.mod.fa.out
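To see what the awk step keeps, the sketch below applies the same field selection to a single data line. In the RepeatMasker .out layout, fields 5, 6, 7 and 11 are the query sequence, the begin and end positions in the query, and the repeat class/family. The sample line is made up for illustration, and the wrapper name `extract_repeat_fields` is hypothetical.

```shell
# Same field selection as the awk command above, reading from stdin.
extract_repeat_fields() {
  awk -F " " '{print $5" "$6" "$7" "$11}'
}

# Illustrative RepeatMasker-style data line (not taken from a real file).
printf '%s\n' '  463   1.3  0.6  1.7  chr1  10001  10468 (248945954) +  (TAACCC)n  Simple_repeat  1  471  (0)  1' \
  | extract_repeat_fields
```

For this line the command prints `chr1 10001 10468 Simple_repeat`.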

Output

The main output files generated by the pipeline are:

  • F3UTER_features/<sample_name>_nt_freq.txt - mono- and di-nucleotide frequencies (n=20)
  • F3UTER_features/<sample_name>_phastcons.txt - sequence conservation (phastCons) scores (n=1)
  • F3UTER_features/<sample_name>_polyA_signal.txt - poly(A) signal occurrence (n=1; binary outcome)
  • F3UTER_features/<sample_name>_repeats.txt - transposon occurrence (n=1)
  • F3UTER_features/<sample_name>_exp_feat.txt - expression features (n=2)
  • F3UTER_features/<sample_name>_structural_feat.txt - DNA structural properties (n=16)
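After a run, it can be useful to confirm that all six feature files were written. The following sketch uses the file names listed above; the function name `missing_feature_files` is hypothetical, and it assumes that the F3UTER_features directory sits directly under the directory you pass as the first argument.

```shell
# List any expected feature file that is missing for a given sample.
# $1: directory containing F3UTER_features (assumed location), $2: sample name.
missing_feature_files() {
  out_dir=$1
  sample=$2
  for suffix in nt_freq phastcons polyA_signal repeats exp_feat structural_feat; do
    f="$out_dir/F3UTER_features/${sample}_${suffix}.txt"
    [ -f "$f" ] || echo "$f"
  done
}
```

An empty result means all six files are present; otherwise each missing path is printed on its own line.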

Dependencies

  • miniconda
  • snakemake - can be installed via conda (snakemake>=5.3)
  • The rest of the dependencies (R packages) are installed via conda.

Installation

Clone the pipeline:

git clone --recursive https://github.com/sid-sethi/Generate-F3UTER-features.git

Usage

Edit config.yml to set up the working directory and input files. The Snakemake command should be issued from within the pipeline directory.

cd Generate-F3UTER-features
snakemake --use-conda -j <num_cores> all

If you provide more than one core, independent Snakemake rules are processed simultaneously. The pipeline uses at most 6 cores. It is a good idea to do a dry run (using the -n parameter) to see what the pipeline would do before executing it.

snakemake --use-conda -n all

Snakemake can also be run to install only the required conda environments, without running the full workflow. Subsequent runs with --use-conda then use the local environments and do not require internet access, which makes it possible to run the pipeline offline.

snakemake --use-conda --conda-create-envs-only

Licence

Copyright 2020 Astex Therapeutics Ltd.

This repository is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the LICENSE file (GNU General Public License) for more details.
