Merge pull request #355 from maxibor/kraken
Adding Kraken2 metagenomics classifier
jfy133 authored Feb 21, 2020
2 parents 9077aab + ace02e1 commit 103a746
Showing 10 changed files with 347 additions and 25 deletions.
3 changes: 3 additions & 0 deletions .github/workflows/ci.yml
@@ -122,6 +122,9 @@ jobs:
      - name: MALTEXTRACT Basic with MALT plus MaltExtract
        run: |
          nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-maltextract" -profile test,docker --paired_end --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq' --run_metagenomic_screening --metagenomic_tool 'malt' --database "/home/runner/work/eager/eager/databases/malt" --run_maltextract --maltextract_ncbifiles "/home/runner/work/eager/eager/databases/maltextract/" --maltextract_taxon_list 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/testdata/Mammoth/maltextract/MaltExtract_list.txt'
      - name: METAGENOMIC Run the basic pipeline but with unmapped reads going into Kraken
        run: |
          nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-kraken" -profile test_kraken,docker ${{ matrix.endedness }} --run_bam_filtering --bam_discard_unmapped --bam_unmapped_type 'fastq'
      - name: SEXDETERMINATION Run the basic pipeline with the bam input profile, but don't convert BAM, skip everything but sex determination
        run: |
          nextflow run ${GITHUB_WORKSPACE} "$TOWER" -name "$RUN_NAME-sexdeterrmine" -profile test_humanbam,docker --bam --skip_fastqc --skip_adapterremoval --skip_mapping --skip_deduplication --skip_qualimap --single_end --run_sexdeterrmine
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -27,6 +27,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
* [#326](https://github.com/nf-core/eager/pull/326) - Add Biopython and [xopen](https://github.com/marcelm/xopen/) dependencies
* [#336](https://github.com/nf-core/eager/issues/336) - Change default Y-axis maximum value of DamageProfiler to 30% to match popular (but slower) mapDamage, and allow user to set their own value.
* [#352](https://github.com/nf-core/eager/pull/352) - Add social preview image
* [#355](https://github.com/nf-core/eager/pull/355) - Add Kraken2 metagenomics classifier

### `Fixed`

4 changes: 3 additions & 1 deletion README.md
@@ -67,7 +67,8 @@ Additional functionality contained by the pipeline currently includes:
#### Metagenomic Screening

* Taxonomic binner with alignment (`MALT`)
* aDNA characteristic screening of taxonomically binned data (`MaltExtract`)
* Taxonomic binner without alignment (`Kraken2`)
* aDNA characteristic screening of taxonomically binned data from MALT (`MaltExtract`)

## Quick Start

@@ -157,3 +158,4 @@ If you've contributed and you're missing in here, please let me know and I'll ad
* Vågene, Å.J. et al., 2018. Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature ecology & evolution, 2(3), pp.520–528. Available at: [http://dx.doi.org/10.1038/s41559-017-0446-6](http://dx.doi.org/10.1038/s41559-017-0446-6).
* Herbig, A. et al., 2016. MALT: Fast alignment and analysis of metagenomic DNA sequence data applied to the Tyrolean Iceman. bioRxiv, p.050559. Available at: [http://biorxiv.org/content/early/2016/04/27/050559](http://biorxiv.org/content/early/2016/04/27/050559).
* **MaltExtract** Huebler, R. et al., 2019. HOPS: Automated detection and authentication of pathogen DNA in archaeological remains. bioRxiv, p.534198. Available at: [https://www.biorxiv.org/content/10.1101/534198v1?rss=1](https://www.biorxiv.org/content/10.1101/534198v1?rss=1). Download: [https://github.com/rhuebler/MaltExtract](https://github.com/rhuebler/MaltExtract)
* **Kraken2** Wood, D.E. et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology, 20, p.257. Available at: [https://doi.org/10.1186/s13059-019-1891-0](https://doi.org/10.1186/s13059-019-1891-0). Download: [https://ccb.jhu.edu/software/kraken2/](https://ccb.jhu.edu/software/kraken2/)
78 changes: 78 additions & 0 deletions bin/kraken_parse.py
@@ -0,0 +1,78 @@
#!/usr/bin/env python


import argparse
import csv


def _get_args():
    '''Parse and return the command-line arguments.'''
    parser = argparse.ArgumentParser(
        prog='kraken_parse',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description='Parsing kraken')
    parser.add_argument('krakenReport', help="path to kraken report file")
    parser.add_argument(
        '-c',
        dest="count",
        default=50,
        help="Minimum number of hits on clade to report it. Default = 50")
    parser.add_argument(
        '-o',
        dest="output",
        default=None,
        help="Output file. Default = <basename>.kraken_parsed.csv")

    args = parser.parse_args()

    infile = args.krakenReport
    countlim = int(args.count)
    outfile = args.output

    return(infile, countlim, outfile)


def _get_basename(file_name):
    # Strip any directory part, then keep everything before the first dot
    if ("/") in file_name:
        basename = file_name.split("/")[-1].split(".")[0]
    else:
        basename = file_name.split(".")[0]
    return(basename)


def parse_kraken(infile, countlim):
    '''
    INPUT:
        infile (str): path to kraken report file
        countlim (int): lowest count threshold to report hit
    OUTPUT:
        resdict (dict): key=taxid, value=readCount
    '''
    with open(infile, 'r') as f:
        resdict = {}
        csvreader = csv.reader(f, delimiter='\t')
        for line in csvreader:
            # Second column: read count for the clade
            reads = int(line[1])
            if reads >= countlim:
                # Fifth column: taxonomy ID
                taxid = line[4]
                resdict[taxid] = reads
        return(resdict)


def write_output(resdict, infile, outfile):
    with open(outfile, 'w') as f:
        basename = _get_basename(infile)
        f.write(f"TAXID,{basename}\n")
        for akey in resdict.keys():
            f.write(f"{akey},{resdict[akey]}\n")


if __name__ == '__main__':
    INFILE, COUNTLIM, outfile = _get_args()

    if not outfile:
        outfile = _get_basename(INFILE)+".kraken_parsed.csv"

    tmp_dict = parse_kraken(infile=INFILE, countlim=COUNTLIM)
    write_output(resdict=tmp_dict, infile=INFILE, outfile=outfile)
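
As a rough illustration of what `parse_kraken` reads, the sketch below applies the same column logic to an in-memory report instead of a file. It assumes the standard six-column, tab-separated Kraken2 report layout (percentage, clade read count, direct read count, rank code, NCBI taxonomy ID, name). The sample rows and the helper name `parse_report_lines` are made up for this example and are not part of the commit.

```python
import csv
import io

# Hypothetical, hand-written report rows (not real pipeline output).
# Columns: percent, clade reads, direct reads, rank code, taxid, name
SAMPLE_REPORT = (
    "60.00\t600\t600\tU\t0\tunclassified\n"
    "40.00\t400\t5\tR\t1\troot\n"
    " 3.00\t30\t30\tS\t9606\t    Homo sapiens\n"
)


def parse_report_lines(handle, countlim):
    """Return {taxid: clade_read_count} for clades with >= countlim reads."""
    resdict = {}
    for row in csv.reader(handle, delimiter="\t"):
        reads = int(row[1])          # second column: reads in this clade
        if reads >= countlim:
            resdict[row[4]] = reads  # fifth column: taxonomy ID (kept as str)
    return resdict


counts = parse_report_lines(io.StringIO(SAMPLE_REPORT), countlim=50)
print(counts)
```

With the default threshold of 50, the 30-read clade is dropped and only the two larger clades remain. Taxonomy IDs stay as strings, which matches how the script writes them into the `TAXID` column.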
59 changes: 59 additions & 0 deletions bin/merge_kraken_res.py
@@ -0,0 +1,59 @@
#!/usr/bin/env python

import argparse
import os
import pandas as pd
import numpy as np


def _get_args():
    '''Parse and return the command-line arguments.'''
    parser = argparse.ArgumentParser(
        prog='merge_kraken_res',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        description='Merging csv count files in one table')
    parser.add_argument(
        '-o',
        dest="output",
        default="kraken_count_table.csv",
        help="Output file. Default = kraken_count_table.csv")

    args = parser.parse_args()

    outfile = args.output

    return(outfile)


def get_csv():
    # Collect every file in the working directory with ".csv" in its name
    tmp = [i for i in os.listdir() if ".csv" in i]
    return(tmp)


def _get_basename(file_name):
    if ("/") in file_name:
        basename = file_name.split("/")[-1].split(".")[0]
    else:
        basename = file_name.split(".")[0]
    return(basename)


def merge_csv(all_csv):
    # Outer-join all per-sample tables on TAXID; missing counts become 0
    df = pd.read_csv(all_csv[0], index_col=0)
    for i in range(1, len(all_csv)):
        df_tmp = pd.read_csv(all_csv[i], index_col=0)
        df = pd.merge(left=df, right=df_tmp, on='TAXID', how='outer')
    df.fillna(0, inplace=True)
    return(df)


def write_csv(pd_dataframe, outfile):
    pd_dataframe.to_csv(outfile)


if __name__ == "__main__":
    OUTFILE = _get_args()
    all_csv = get_csv()
    resdf = merge_csv(all_csv)
    write_csv(resdf, OUTFILE)
    print(resdf)
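
The merge step is an outer join on `TAXID` followed by zero-filling. The sketch below shows the same idea using only the standard library, so it is a stand-in for the pandas calls rather than the script's own method. The helper names `read_counts` and `outer_merge` and the sample data are hypothetical.

```python
import csv
import io


def read_counts(handle):
    """Read a two-column 'TAXID,<sample>' CSV into (sample, {taxid: count})."""
    reader = csv.reader(handle)
    header = next(reader)
    sample = header[1]
    return sample, {row[0]: int(row[1]) for row in reader}


def outer_merge(tables):
    """Outer-join per-sample counts on taxid, filling missing values with 0."""
    samples = [name for name, _ in tables]
    # Union of all taxids seen in any sample (sorted as strings)
    taxids = sorted({t for _, counts in tables for t in counts})
    merged = {t: [counts.get(t, 0) for _, counts in tables] for t in taxids}
    return samples, merged


# Hypothetical per-sample tables in the format kraken_parse.py writes
csv_a = io.StringIO("TAXID,sampleA\n1,400\n9606,120\n")
csv_b = io.StringIO("TAXID,sampleB\n1,250\n562,75\n")

samples, merged = outer_merge([read_counts(csv_a), read_counts(csv_b)])
print(samples)
for taxid, row in merged.items():
    print(taxid, row)
```

A taxon seen in only one sample gets a 0 in the other column, which is the role `fillna(0)` plays in the pandas version. One difference to keep in mind: with pandas, columns that held missing values before filling are typically stored as floats, so counts may appear as `75.0` rather than `75` in the written table.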
28 changes: 28 additions & 0 deletions conf/test_kraken.config
@@ -0,0 +1,28 @@
/*
 * -------------------------------------------------
 *  Nextflow config file for running tests
 * -------------------------------------------------
 * Defines bundled input files and everything required
 * to run a fast and simple test. Use as follows:
 *   nextflow run nf-core/eager -profile test_kraken,docker (or singularity, or conda)
 */

params {
  config_profile_name = 'Test profile kraken'
  config_profile_description = 'Minimal test dataset to check pipeline function with kraken metagenomic profiler'
  // Limit resources so that this can run on GitHub Actions CI
  max_cpus = 2
  max_memory = 6.GB
  max_time = 48.h
  genome = false
  // Input data
  single_end = false
  metagenomic_tool = 'kraken'
  run_metagenomic_screening = true
  readPaths = [['JK2782_TGGCCGATCAACGA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz','https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
               ['JK2802_AGAATAACCTACCA_L008', ['https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz','https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz']],
              ]
  // Genome references
  fasta = 'https://raw.githubusercontent.com/nf-core/test-datasets/eager/reference/Mammoth/Mammoth_MT_Krause.fasta'
  database = 'https://github.com/nf-core/test-datasets/raw/eager/databases/kraken/eager_test.tar.gz'
}
4 changes: 3 additions & 1 deletion docs/output.md
@@ -485,6 +485,8 @@ Each module has it's own output directory which sit alongside the `MultiQC/` dir
* `sex_determination/` this contains the output for the sex determination run. This is a single `.tsv` file that includes a table with the Sample Name, the Nr of Autosomal SNPs, Nr of SNPs on the X/Y chromosome, the Nr of reads mapping to the Autosomes, the Nr of reads mapping to the X/Y chromosome, the relative coverage on the X/Y chromosomes, and the standard error associated with the relative coverages. These measures are provided for each bam file, one row per bam. If the `sexdeterrmine_bedfile` option has not been provided, the error bars cannot be trusted, and runtime will be considerably longer.
* `nuclear_contamination/` this contains the output of the nuclear contamination processes. The directory contains one `*.X.contamination.out` file per individual, as well as `nuclear_contamination.txt`, which is a summary table of the results for all individuals. `nuclear_contamination.txt` contains a header, followed by one line per individual, comprising the Method of Moments (MOM) and Maximum Likelihood (ML) contamination estimates (with their respective standard errors) for both Method1 and Method2.
* `bedtools/` this contains two files as the output from bedtools coverage. One file contains the 'breadth' coverage (`*.breadth.gz`). This file will have the contents of your annotation file (e.g. BED/GFF), and the following subsequent columns: no. reads on feature, # bases at depth, length of feature, and % of feature. The second file (`*.depth.gz`), contains the contents of your annotation file (e.g. BED/GFF), and an additional column which is mean depth coverage (i.e. average number of reads covering each position).
* `metagenomic_classification/` This contains the output for a given metagenomic classifier (currently only for MALT). Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
* `metagenomic_classification/` This contains the output for a given metagenomic classifier.
  * Malt will contain RMA6 files that can be loaded into MEGAN6 or MaltExtract for phylogenetic visualisation of read taxonomic assignments and aDNA characteristics respectively. Additionally, a `malt.log` file is provided which gives additional information such as run-time, memory usage and per-sample statistics of numbers of alignments with taxonomic assignment etc.
  * Kraken will contain the Kraken output and report files, as well as a merged taxon count table.
* `MaltExtract/` this will contain a `results` directory which contains the output from MaltExtract - typically one folder for each filter type, an error file and a log file. The characteristics of each node (e.g. damage, read lengths, edit distances - each in different txt formats) can be seen in each sub-folder of the filter folders. Output can be visualised either with the [HOPS postprocessing script](https://github.com/rhuebler/HOPS) or [MEx-IPA](https://github.com/jfy133/MEx-IPA)
* `consensus_sequence` this contains three FASTA files from VCF2Genome, of a consensus sequence based on the reference FASTA with each sample's unique modifications. The main FASTA is a standard file with bases not passing the specified thresholds as Ns. The two other FASTAs (`_refmod.fasta.gz`) and (`_uncertainity.fasta.gz`) use IUPAC uncertainty codes (rather than Ns) and a special number-based uncertainty system used for other downstream tools, respectively.
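
For the merged Kraken taxon count table mentioned above, a small downstream sketch may help. It assumes the layout implied by `bin/merge_kraken_res.py`: a `TAXID` column followed by one count column per sample. The table contents and the helper `top_taxa` are hypothetical and only illustrate how such a table could be read.

```python
import csv
import io

# Hypothetical merged table; real sample names and counts depend on the run
MERGED = "TAXID,sampleA,sampleB\n1,400,250\n562,0,75\n9606,120,0\n"


def top_taxa(handle, sample, n=2):
    """Return the n taxids with the highest count in one sample column."""
    rows = list(csv.DictReader(handle))
    # float() tolerates counts written as '75' or '75.0'
    rows.sort(key=lambda r: float(r[sample]), reverse=True)
    return [r["TAXID"] for r in rows[:n]]


best = top_taxa(io.StringIO(MERGED), "sampleA")
print(best)
```

Parsing counts with `float` is a deliberate choice here, since zero-filled columns written by pandas may carry a decimal point.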