An SNV calling workflow developed in the Applied Bioinformatics and Theoretical Bioinformatics groups at the DKFZ. An earlier version (pre-GitHub) of this workflow was used in the Pan-Cancer project. The workflow is only suited for human data (GRCh37/hg19; some later versions also hg38), because annotations play a central role in this workflow.
Your opinion matters! The development of this workflow is supported by the German Network for Bioinformatic Infrastructure (de.NBI). By completing this very short (30-60 seconds) survey you support our efforts to improve this tool.
The SNV workflow was used in the pan-cancer analysis of whole genomes (PCAWG) and can be cited with the following publication:
- Pan-cancer analysis of whole genomes.
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium.
Nature volume 578, pages 82–93 (2020).
DOI 10.1038/s41586-020-1969-6
Containers are available in Dockstore.
Furthermore, the workflow is regularly used at the DKFZ through the automation system One Touch Pipeline (OTP):
- OTP: An automatized system for managing and processing NGS data.
Eva Reisinger, Lena Genthner, Jules Kerssemakers, Philip Kensche, Stefan Borufka, Alke Jugold, Andreas Kling, Manuel Prinz, Ingrid Scholz, Gideon Zipprich, Roland Eils, Christian Lawerenz, Jürgen Eils.
Journal of Biotechnology, volume 261, pages 53-62 (2017).
DOI: 10.1016/j.jbiotec.2017.08.006
To run the workflow you first need to install a number of components and dependencies.
- You need a working Roddy installation. The version depends on the workflow version you want to use. You can find it in the buildinfo.txt under 'RoddyAPIVersion'. Please follow the instructions for the installation of Roddy itself and the PluginBase and the DefaultPlugin components. The main reference here is the Roddy documentation.
- Install the version you need -- either from the release tarballs or with git clone into your plugin directory.
Furthermore, you need a number of tools and, of course, reference data, such as a genome assembly and annotation databases.
The workflow contains a description of a Conda environment. A number of Conda packages from BioConda are required.
First install the BioConda channels:
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels bioconda-legacy
Then install the environment:
conda env create -n SNVCallingWorkflow -f $PATH_TO_PLUGIN_DIRECTORY/resources/analysisTools/snvPipeline/environments/conda.yml
The name of the Conda environment is arbitrary but needs to be consistent with the `condaEnvironmentName` variable. The default for that variable is set in `resources/configurationFiles/analysisSNVCalling.xml`.
Note that the Conda environment is not exactly the same as the software stack used for the Pan-Cancer project.
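If you created the environment under a different name, the default can be overridden in your project configuration. A minimal sketch of such an override, assuming Roddy's usual cvalue XML syntax:

```xml
<configurationvalues>
    <!-- Must match the name passed to `conda env create -n ...` -->
    <cvalue name="condaEnvironmentName" value="SNVCallingWorkflow" type="string"/>
</configurationvalues>
```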
The virtualenv does not provide a full software stack. This section is only relevant for the ODCF cluster and mostly for developers. In rare cases it may be necessary for other users to set the `tbiLsfVirtualEnvDir` variable.

This assumes that the virtualenv in `tbiLsfVirtualEnvDir` is installed centrally and accessible from all compute nodes. Furthermore, Python 2.7.9 in this example is assumed to be available via a module named "python/2.7.9".
module load python/2.7.9
virtualenv "$tbiLsfVirtualEnvDir"
source "$tbiLsfVirtualEnvDir/bin/activate"
pip install -r "$pluginInstallationDir/resources/analysisTools/snvPipeline/environments/requirements.txt"
Then, in your configuration files, you need to set the `tbiLsfVirtualEnvDir` variable.
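For example, the variable could be set as a cvalue in your project configuration XML (the path is a placeholder; the syntax follows Roddy's usual cvalue configuration format):

```xml
<cvalue name="tbiLsfVirtualEnvDir" value="/path/to/central/virtualenv" type="path"/>
```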
PyPy support is broken and deprecated since version 2.2.0.
TBD
The SNV workflow takes two merged BAM files as input (control and tumor sample).
It performs four steps:
- snv calling
- snv annotation
- snv deep annotation
- snv filter
Switch | Default | Description |
---|---|---|
bamfile_list | empty | Semicolon-separated list of BAM files, starting with the control's BAM. Each BAM file needs an index file with the same name as the BAM plus a ".bai" suffix |
sample_list | empty | Semicolon-separated list of sample names in the same order as bamfile_list |
possibleTumorSampleNamePrefixes | "( tumor )" | Bash-array of tumor sample name prefixes |
possibleControlSampleNamePrefixes | "( control )" | Bash-array of control sample name prefixes |
CHROMOSOME_INDICES | empty | Bash-array of chromosome names to which the analysis should be restricted |
CHROMOSOME_LENGTH_FILE | empty | Headerless TSV file with chromosome name, chromosome size columns |
CHR_SUFFIX | "" | Suffix added to the chromosome names |
CHR_PREFIX | "" | Prefix added to the chromosome names |
extractSamplesFromOutputFiles | true | Refer to the documentation of the COWorkflowBasePlugin for further information |
PYPY_OR_PYTHON_BINARY | python | Leave this set to "python". Some of the Python scripts used to work with PyPy (e.g. filter_PEoverlap.py that also uses hts-python instead of pysam), but this is deprecated. |
runSNVMetaCallingStep | false | Run a single job for SNV calling instead of one job per chromosome. The meta job is optimized with regard to runtime and CPU/memory utilization. On the other hand, it is a larger job which might be more difficult to schedule. |
runDeepAnnotation | true | Run the deep annotation step or stop the workflow before it. |
runFilter | true | Run the filter step or stop the workflow before it. |
runOnPancan | false | Run a special analysis type for pan-cancer-type projects. This will produce an additional VCF. |
NUMBER_OF_MISMATCHES_THRESHOLD | -1 | resources/analysisTools/snvPipeline/snvAnnotation.sh: Number of mismatches that are allowed per read in order to consider this read. |
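If you have a samtools faidx index (.fai) for your reference FASTA, the CHROMOSOME_LENGTH_FILE can be derived from its first two columns. A minimal sketch (file names are placeholders):

```shell
# A .fai index has five columns: name, length, offset, linebases, linewidth.
# CHROMOSOME_LENGTH_FILE needs exactly the first two (name and size) as headerless TSV.
samtools faidx hs37d5_PhiX.fa            # writes hs37d5_PhiX.fa.fai
cut -f1,2 hs37d5_PhiX.fa.fai > hs37d5_PhiX.chromSizes
```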
Please have a look at `resources/configurationFiles/analysisSNVCalling.xml` for a more complete list of parameters.
roddy.sh run projectConfigurationName@analysisName patientId \
--useconfig=/path/to/your/applicationProperties.ini \
--configurationDirectories=/path/to/your/projectConfigs \
--useiodir=/input/directory,/output/directory/snv \
--usePluginVersion=SNVCallingWorkflow:1.3.2 \
--cvalues="bamfile_list:/path/to/your/control.bam;/path/to/your/tumor.bam,sample_list:normal;tumor,possibleTumorSampleNamePrefixes:tumor,possibleControlSampleNamePrefixes:normal,REFERENCE_GENOME:/reference/data/hs37d5_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/hs37d5_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
- Set the parameter `isNoControlWorkflow` to `true`.
- In `bamfile_list`, provide the tumor BAM as the only BAM file. The same applies to `sample_list`.
- `possibleTumorSampleNamePrefixes` must still be set to the name of the tumor sample.
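Putting these settings together, a no-control invocation might look like the following sketch (all paths and names are placeholders, mirroring the example call above):

```shell
roddy.sh run projectConfigurationName@analysisName patientId \
  --useconfig=/path/to/your/applicationProperties.ini \
  --configurationDirectories=/path/to/your/projectConfigs \
  --useiodir=/input/directory,/output/directory/snv \
  --usePluginVersion=SNVCallingWorkflow:1.3.2 \
  --cvalues="isNoControlWorkflow:true,bamfile_list:/path/to/your/tumor.bam,sample_list:tumor,possibleTumorSampleNamePrefixes:tumor,REFERENCE_GENOME:/reference/data/hs37d5_PhiX.fa,CHROMOSOME_LENGTH_FILE:/reference/data/hs37d5_PhiX.chromSizes,extractSamplesFromOutputFiles:false"
```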
In coding regions, the expected proportion of synonymous mutations relative to the total number of mutations should be low. By contrast, a high proportion of synonymous mutations suggests cross-species contamination. Any value above 0.5 (i.e. at least 50% of mutations are synonymous) indicates a contamination. A value below 0.35 is considered to be OK. Values in the range of 0.35-0.5 are unclear.
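As a rough check, this fraction can be computed from an annotated VCF. The sketch below is not part of the workflow; it assumes ANNOVAR-style exonic classifications ("synonymous SNV" / "nonsynonymous SNV") somewhere in the annotation columns, and the file name is a placeholder:

```shell
# Fraction of synonymous among exonic SNVs (annotation strings are assumptions;
# adjust the patterns to your annotation columns).
awk -F'\t' '
    /^#/ { next }                          # skip VCF header lines
    /nonsynonymous SNV/ { total++; next }  # must be tested before "synonymous"
    /synonymous SNV/    { syn++; total++; next }
    /exonic/            { total++ }        # other exonic classes (stopgain etc.)
    END { if (total > 0) printf "synonymous fraction: %.2f\n", syn / total }
' snvs_annotated.vcf
```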
The `convertToStdVCF.py` script may be helpful for converting the VCFs produced by this workflow into standard VCF 4.2-conformant files.
There are, however, a few caveats you should be aware of:
- This is basically rescued old code that has never been extensively tested for 100% conformance in all cases. We advise you to check the resulting VCFs with a VCF validator.
- Not all columns of the DKFZ VCF are currently migrated into the standard-conform VCF. Some columns may be lost.
- The VCF must only contain a single sample column.
- Only uncompressed VCFs are processed and the input must not be a stream/pipe, because the script does two passes over the input.
convertToStdVCF.py <infile> <outfile> <sampleId> [<config.json>]
The optional configuration JSON file defaults to the `convertToStdVCF.json` residing beside the `convertToStdVCF.py` script. The default JSON contains some comments that may help you to customize the script output.
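A typical invocation followed by a validation pass might look like this sketch (file names are placeholders, and `vcf_validator` stands in for whatever VCF validator you use):

```shell
# Convert a workflow VCF with sample id "tumor" using the default JSON config
python convertToStdVCF.py snvs_raw.vcf snvs_std.vcf tumor

# Check conformance with an external validator of your choice (hypothetical tool name)
vcf_validator snvs_std.vcf
```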
- upcoming
  - patch: Remove all code related to PyPy and hts-python (including `copysam.py` and `PYPY_OR_PYTHON_BINARY`)
- 2.2.0
  - minor: Update virtualenv
    - Updated environment to matplotlib 1.5.3. Version 1.4.3 seems to be incompatible with numpy 1.11.3 (now; `import matplotlib.pyplot` failed).
    - Used pysam 0.16.0.1 from the original 2.1.1 environment. This breaks the PyPy support (`copysam.py` is not yet adapted) but is necessary for the changes done for workflow version 2.1.1.
    - Updated to BioPython 1.71. This was lost in workflow version 2.1.1 but is necessary because for some data `from Bio import bgzf` is needed (`vcfparser.py`).
  - minor: PyPy is deprecated. Configuration values related to it are (mostly) removed.
  - minor: Allow configuration of the virtualenv in `tbi-lsf-cluster.sh`
  - Lifted readme from ReleaseBranch_1.2.166
  - Note: Again the Conda environment is broken and cannot be rebuilt.
- 2.1.1
  - patch: Removed hardcoded path to ExAC database
- 2.1.0
- 2.0.0
  - Upgrading BCFtools/SAMtools/HTSlib to 1.9
  - Calling variants using bcftools mpileup and samtools mpileup for lookup in the control (similar to the old way)
  - PE overlaps are removed as before
  - Removing supplementary reads and leftover duplicate reads from variant calling
  - Deactivating the use of ExAC for no-control workflow filtering
  - Updating to the new local control and changing the max AF threshold to 0.01
- 1.4.1
- 1.4.0
- 1.3.2
- 1.3.1
- 1.3.0
- 1.2.166-3 (ReleaseBranch_1.2.166)
  - Long-term support version.
  - Note that this version depends on the legacy COWorkflows plugin.
- 1.2.166-3
- 1.2.166-2
- 1.2.166-1
  - This version derives from 1.0.166-2_R2.4 and was done in the context of the migration of our cluster from PBS to IBM LSF and the associated update of the workflow manager Roddy.
- 1.1.4-2
- 1.1.4-1
- 1.1.4
- 1.1.3
- 1.1.2
- 1.0.166-2_R2.4
- 1.0.166-1_priorLsfTransition
- 1.0.166-1
The versions listed below are absent from this repository and are only listed as legacy versions. At the time, the code was managed in an SVN repository and its history was not migrated into the current git repository.
- 1.0.166
- 1.0.164
- 1.0.163
- 1.0.162
- 1.0.158
- 1.0.156
- 1.0.155
  - Move SNV calling workflow to a custom plugin
- 1.0.132
- 1.0.131
  - Change workflow class to override another execute method. This makes the workflow a bit cleaner.
  - filterVcfForBias.py
  - filter_PEoverlap.py
- 1.0.114
- 1.0.109
  - Bugfix: Rainfall plots now also work if there are fewer than two SNVs on chromosome 1 or any other chromosome
- 1.0.105
  - Bugfix: Use CHROMOSOME_LENGTH_FILE instead of CHROM_SIZES_FILE in the SNV filter (filter_vcf.sh) script.
- 1.0.104
- 1.0.103
Have a look at the Contributors file.