getting started

DrYak edited this page Mar 12, 2021 · 4 revisions

Getting started

V-pipe uses Snakemake [1], a robust workflow management system written in Python. Snakemake (>= 3.9.0) is integrated with the Conda package manager, which allows the V-pipe workflow to be defined together with its software dependencies. To use this feature, we strongly advise installing Snakemake using Conda. Additionally, we advise installing Python 3.5 or later, as the required Snakemake versions depend on it.

Installation for UNIX-like operating systems

V-pipe integrates various open-source software packages. To avoid installation burdens and to overcome incompatibilities between software versions, we provide Conda environments for each rule.

Install conda

Conda is a cross-platform package management system, as well as an environment manager. In particular, conda allows you to install the workflow dependencies into an isolated environment. Before proceeding with the Snakemake installation, check whether conda is installed and in your PATH. Type conda -V in a terminal; if conda is installed, you should see something like the following:

conda 4.7.12

If you need to install conda, please refer to the documentation. We recommend installing Miniconda3 and making sure that the .bash_profile file includes the path to Miniconda3, so that it takes precedence over previous Conda (or Anaconda) installations.
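The check described above can also be scripted. The following is a minimal sketch, not part of V-pipe or Conda: check_tool is a hypothetical helper name, and it relies only on the standard command -v built-in to report where a program resolves in your PATH. The printed path also helps to spot an older Conda or Anaconda installation taking precedence over Miniconda3.

```shell
#!/bin/sh
# check_tool NAME: hypothetical helper (not part of V-pipe or Conda).
# Prints where NAME resolves in PATH, or fails if it cannot be found.
check_tool() {
    if tool_path=$(command -v "$1" 2>/dev/null); then
        echo "$1 found at $tool_path"
    else
        echo "$1 not found in PATH"
        return 1
    fi
}

# Report on conda; print a hint instead of aborting if it is missing.
check_tool conda || echo "Install Miniconda3 and add it to your PATH first."
```

If the reported path does not point into your Miniconda3 directory, adjust the PATH entry in .bash_profile accordingly.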

Install Python

Now, let’s proceed with the installation of Python (≥ 3.5). If Python is installed (and in your PATH), typing python --version in a terminal prints the version number. Also, the Python version should correspond to the conda distribution you have installed. If Python is not installed or the version is outdated (< 3.5), please install a newer Python version as follows:

conda install python=3.6
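If you prefer not to compare version numbers by eye, the interpreter can do the comparison itself. This is a small sketch under the assumption that the interpreter is called python (or python3) on your system; python_ok is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# python_ok INTERPRETER: hypothetical helper (not part of V-pipe).
# Succeeds only if INTERPRETER exists and reports a version >= 3.5.
python_ok() {
    "$1" -c 'import sys; sys.exit(0 if sys.version_info >= (3, 5) else 1)' 2>/dev/null
}

if python_ok python; then
    echo "Python is recent enough for Snakemake."
else
    echo "Python is missing or older than 3.5; consider: conda install python=3.6"
fi
```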

Install Snakemake

To install Snakemake, type the following:

conda install -c conda-forge -c bioconda snakemake

If Snakemake was successfully installed, we can have a look at the command-line help with snakemake --help. Note that if you wish to use the Conda integration, the command-line option --use-conda must be added to the execution command.

Here, we have used the bioconda channel. A Conda channel is a repository where Conda looks for packages. Many bioinformatics tools are available through the bioconda channel and, in particular, most of V-pipe’s software dependencies are included in this repository. Although Conda is a cross-platform tool, bioconda pre-compiled packages are not tested on Windows. For this reason, this installation mode is only recommended for Linux and macOS.

V-pipe

Finally, the pipeline (script, configuration file, conda environments and test data) can be retrieved from GitHub. In a terminal, change to the directory into which you wish to clone the repository and type:

# clone V-pipe into a directory of your choice
git clone https://github.com/cbg-ethz/V-pipe.git path/to/V-pipeDir

Input files

V-pipe is designed with hierarchically organized data in mind, and it expects the input files to be grouped by samples. In this context, samples can refer to, e.g., patient samples or biological replicates of an experiment. In order to distinguish different datasets belonging to the same sample, a second level is expected, which can refer to, e.g., sample dates. Below, we show how the working directory is expected to look:

working_directory
├─references
│   └───HXB2.fasta
└─samples
  ├── patient1
  │   ├── 20100113
  │   │   └──raw_data
  │   │      ├──patient1_20100113_R1.fastq
  │   │      └──patient1_20100113_R2.fastq
  │   └── 20110202
  │       └──raw_data
  │          ├──patient1_20110202_R1.fastq
  │          └──patient1_20110202_R2.fastq
  └── patient2
      └── 20081130
          └──raw_data
             ├──patient2_20081130_R1.fastq
             └──patient2_20081130_R2.fastq

As input, V-pipe requires the raw sequencing data in fastq format (compressed or uncompressed), and a reference sequence in fasta format. In particular, paired-end sequencing is supported, in which case two fastq files are expected as input. Additionally, a configuration file containing user-configurable options should be provided (see options).
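To illustrate the two-level hierarchy, the sketch below copies a pair of uncompressed fastq files into the expected location. It is only an example under stated assumptions: add_sample is a hypothetical helper, not a V-pipe command, and the file naming follows the convention used in the tree above (patient identifier and sample date as prefix).

```shell
#!/bin/sh
# add_sample ROOT PATIENT DATE R1 R2: hypothetical helper (not part of V-pipe).
# Copies two uncompressed paired-end fastq files into
# ROOT/samples/PATIENT/DATE/raw_data/, using PATIENT_DATE as file prefix.
add_sample() {
    root=$1
    patient=$2
    date=$3
    dest="$root/samples/$patient/$date/raw_data"
    mkdir -p "$dest" || return 1
    cp "$4" "$dest/${patient}_${date}_R1.fastq" || return 1
    cp "$5" "$dest/${patient}_${date}_R2.fastq" || return 1
}

# Example call (paths are placeholders):
# add_sample path/to/workdir patient1 20100113 reads_R1.fastq reads_R2.fastq
```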

NOTES:

  1. Patient identifiers and sample dates, or alternatives chosen for the two-level directory hierarchy, must not contain underscores. We also recommend omitting underscores in the reference's FASTA file.
  2. Input files named R1.fastq or R1.fastq.gz are not handled. A prefix is expected before the part of the name that specifies whether the file corresponds to the forward or the reverse set of reads. In the example above, we prepend the patient identifier and the sample date, but any other string will do.
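Both notes can be checked before running the pipeline. The following sketch is an illustration only: validate_samples is a hypothetical helper (V-pipe does not ship it), and it assumes the samples/<first level>/<second level>/raw_data layout shown above.

```shell
#!/bin/sh
# validate_samples DIR: hypothetical helper (not part of V-pipe).
# Fails if a first- or second-level directory name contains an underscore,
# or if a file in raw_data is named R1/R2.fastq(.gz) without any prefix.
validate_samples() {
    status=0
    for date_dir in "$1"/*/*/; do
        [ -d "$date_dir" ] || continue
        patient=$(basename "$(dirname "$date_dir")")
        date=$(basename "$date_dir")
        case "$patient$date" in
            *_*) echo "underscore in directory name: $patient/$date"; status=1 ;;
        esac
        for fq in "$date_dir"raw_data/*; do
            [ -e "$fq" ] || continue
            case "$(basename "$fq")" in
                R[12].fastq|R[12].fastq.gz) echo "missing prefix: $fq"; status=1 ;;
            esac
        done
    done
    return $status
}

if validate_samples samples; then
    echo "No naming problems found."
fi
```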

Reference sequences are well established for many virus populations. However, due to their high diversity, virus populations often diverge considerably from those reference sequences. In order to reduce biases introduced by reference-based read mapping, we use the ngshmmalign aligner (see repository). This aligner uses an initial reference (named cohort_consensus.fasta) to obtain an initial rough alignment, which, in turn, is used for building a profile Hidden Markov Model (profile-HMM). Thereby, features shared among related sequences are captured through position-specific scores. The fasta file for the initial alignment is expected to be stored in the references directory, e.g.,

references
├──HXB2.fasta
└──cohort_consensus.fasta

This initial consensus can be obtained from an independent cohort or study, from available reference sequences, or can be built de novo. If the cohort_consensus.fasta file is not provided, an initial consensus is generated by assembling reads de novo using the software Vicuna (for more information, see project website). If you want to use this mode, you need to install Vicuna yourself, as it is not part of bioconda.
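To find out in advance which of the two modes applies to your working directory, a check along the following lines may help. It is a sketch that mirrors the behaviour described above rather than V-pipe's actual implementation; consensus_mode is a hypothetical helper name.

```shell
#!/bin/sh
# consensus_mode REFERENCES_DIR: hypothetical helper (not part of V-pipe).
# Prints "provided" if a non-empty cohort_consensus.fasta exists in
# REFERENCES_DIR, and "de-novo" otherwise (the mode that needs Vicuna).
consensus_mode() {
    if [ -s "$1/cohort_consensus.fasta" ]; then
        echo "provided"
    else
        echo "de-novo"
    fi
}

if [ "$(consensus_mode references)" = "de-novo" ]; then
    echo "No cohort_consensus.fasta found: Vicuna must be installed separately."
fi
```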

Now, the reference sequence (here, HXB2.fasta) is used for pre-filtering reads before running Vicuna, as well as for the alignment “lift-over”. Sequencing reads from each sample are aligned against the profile-HMM. The “lift-over” allows the final alignments to be reported with respect to a reference sequence that complies with the position numbering conventionally used for downstream analyses. For instance, for HIV-1 the standard reference sequence is HXB2.

Also, we use the reference sequence (here, HXB2.fasta) for aligning sequencing reads when using V-pipe with bwa or bowtie2 read mappers.

Running V-pipe

First, open a terminal and change into the working directory where input files are stored, e.g. workdir. Then, run the initialization script,

cd path/to/workdir
path/to/V-pipeDir/init_project.sh

If it is the first time you attempt to run V-pipe, we advise checking whether output files can be created from the inputs, using the --dryrun option.

./vpipe --dryrun

The option --dryrun allows you to see the scheduling plan including interpretation of wildcards.

In order to execute the pipeline on a single node, type:

./vpipe --use-conda -p

Additionally, you can specify the maximum number of cores that can be used for the rules that support parallel execution, e.g.:

./vpipe --use-conda -p --cores 2

NOTE: Starting with Snakemake 5.11.0, the --cores option is mandatory. Depending on the Snakemake version you are using, you might need to always specify the number of cores.
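If you are unsure which case applies, the installed version can be compared against 5.11.0. The sketch below is one possible way to do this; version_ge is a hypothetical helper written for this example, and it assumes that snakemake --version prints a plain dotted version number.

```shell
#!/bin/sh
# version_ge A B: hypothetical helper (not part of V-pipe or Snakemake).
# Succeeds if dotted version A is greater than or equal to B,
# comparing the fields numerically from left to right.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        len = (n > m) ? n : m
        for (i = 1; i <= len; i++) {
            xi = x[i] + 0; yi = y[i] + 0   # missing fields count as 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

if command -v snakemake >/dev/null 2>&1; then
    if version_ge "$(snakemake --version)" 5.11.0; then
        echo "Always pass --cores, e.g.: ./vpipe --use-conda -p --cores 2"
    else
        echo "--cores is optional with this Snakemake version."
    fi
else
    echo "snakemake not found in PATH"
fi
```

Passing --cores explicitly works in both cases, so it is the simpler choice when scripts must run with different Snakemake versions.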

Optionally, you can redirect the standard output and standard error to a file log.txt, e.g.:

./vpipe --use-conda -p 2>&1 | tee log.txt

References

[1] Köster, J. and Rahmann, S. Snakemake - A scalable bioinformatics workflow engine. Bioinformatics 2012.
