Pipeline detecting pathogen DNA in sequencing data, specifically designed to differentiate between host (mouse) and pathogen genomes.

alx-sch/microbiome_analysis

Detecting Pathogen DNA in Sequencing Data

System Specification

  • OS: Ubuntu 22.04.01
  • Virtualization Platform: VirtualBox 6.5.0
  • Architecture: x86_64 GNU/Linux
  • Disk Space: 100 GB
  • Memory (RAM): 5.3 GB

Reference Genomes

Downloaded .fna files using 'ncbi-datasets' tool:

  • datasets download genome accession GCF_000007905.1,GCF_000393015.1,GCF_003812925.1,GCF_000013425.1,GCF_000730685.1,GCF_010587025.1 --include genome
  • Combine pathogen data: cat GCF_000007905.1_ASM790v1_genomic.fna GCF_000393015.1_EntefaecT5V1_genomic.fna GCF_003812925.1_ASM381292v1_genomic.fna GCF_000013425.1_ASM1342v1_genomic.fna GCF_000730685.1_ASM73068v1_genomic.fna GCF_010587025.1_ASM1058702v1_genomic.fna > combined_pathogen_genomes.fna
  • Excursion: FASTA Files:
    • FASTA (.fna) files have the following structure:
      • Lines starting with '>' mark the beginning of a sequence record and carry its unique identifier.
      • The actual sequence data follows this header line and can span multiple lines.
      >NC_004917.1 Helicobacter hepaticus ATCC 51449, complete sequence
      CATTAAACCAAGTATAAAATCTATAAATTATCTTTA...
  • Indexing:
    • BWA indexing is the process of creating an efficient data structure from a reference genome, allowing the BWA algorithm to rapidly align short DNA sequences during high-throughput sequencing analysis.
    • bwa index -p pathogen_combined_index combined_pathogen_genomes.fna
    • bwa index -p mouse_index GCF_000001635.27_GRCm39_genomic.fna (~18 hours!)
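
Because indexing is slow, it can be worth checking the combined FASTA before running bwa index. The sketch below counts sequence records by counting header lines; the function name count_fasta_records is made up for illustration and is not part of the repository.

```shell
#!/bin/bash
# Hypothetical helper (not part of the repository): count the sequence
# records in one or more FASTA files by counting header lines ('>').
count_fasta_records() {
    cat "$@" | grep -c '^>'
}

# Example usage (file name as in the cat command above):
#   count_fasta_records combined_pathogen_genomes.fna
# The result should equal the sum of the counts of the individual .fna files.
```

If the combined count is lower than the sum of the individual counts, one of the input files was probably missing or misspelled in the cat command.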

Experimental Data

Paired-end FASTQ files are provided by the client.

Download DNA sequencing data from experiments, obtained from feces and ideally containing full genomes (microbial and host components):

  • https://www.ebi.ac.uk/ena/browser/view/SRX3198644
    • fasterq-dump --outdir . --gzip SRX3198644
    • Only 74.9 kB, i.e. 220 reads (wc -l SRX3198644.fastq | awk '{print $1/4}')
  • https://www.ncbi.nlm.nih.gov/sra/SRX22381836[accn]
    • fastq-dump --outdir . --gzip --split-files SRR26681942
    • Two files of 10.0 GB each (!), because the data was generated by paired-end sequencing
    • 22.4M reads per file, which is heavy on computation, so subsets are created for a general overview

  • Excursion: FASTQ Files:
    • Contains sequence data and quality information, as produced by sequencing machines

    • Each entry consists of four lines:

      • Sequence Identifier Line: Starts with '@' followed by a unique identifier for the read; may contain additional information such as sequencing instrument, sample details, or read length.
      • Sequence Line: Contains the actual nucleotide sequence of the DNA/RNA read.
      • Quality Score Identifier Line: Starts with '+' and usually mirrors the sequence identifier line.
      • Quality Score Line: Contains ASCII-encoded quality scores, representing the confidence of each base call.
      @SRX3198644.1 1 length=45
      TTGTTGAACTGGCTCTTTTTCGCAATCCCGCTGTAAGTACTGTCT
      +SRX3198644.1 1 length=45
      AAAAAEEEEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEE
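
Since every record spans four lines, the number of reads is the line count divided by four. The sketch below wraps that in a function that also handles gzipped files; the name count_fastq_reads is an assumption for illustration.

```shell
#!/bin/bash
# Hypothetical helper: number of reads in a FASTQ file (plain or gzipped),
# assuming the standard four-line record layout described above.
count_fastq_reads() {
    local file="$1" lines
    if [[ "$file" == *.gz ]]; then
        lines=$(gzip -dc "$file" | wc -l)
    else
        lines=$(wc -l < "$file")
    fi
    echo $(( lines / 4 ))
}

# Example usage:
#   count_fastq_reads 2_exp_fastq/F1_S4_R1.fastq.gz
```

For paired-end data, both files of a sample should report the same number.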

Generating subsets of ~1M reads each using 'seqtk':

  • seqtk sample -s 42 SRR26681942_1.fastq 0.05 > subset_SRR26681942_1.fastq
  • seqtk sample -s 42 SRR26681942_2.fastq 0.05 > subset_SRR26681942_2.fastq
  • '42' is a seed value that makes the random sampling reproducible; using the same seed for both files keeps the read pairs in sync. The new subsets contain 5% of the total reads.
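
The sampling fraction follows from the target subset size and the total read count. A minimal sketch of that arithmetic, with a made-up function name; the read counts are the ones quoted above:

```shell
#!/bin/bash
# Hypothetical helper: fraction to pass to 'seqtk sample' so that roughly
# <target> of <total> reads are kept. LC_ALL=C forces '.' as decimal separator.
sampling_fraction() {
    local target="$1" total="$2"
    LC_ALL=C awk -v t="$target" -v n="$total" 'BEGIN { printf "%.4f\n", t / n }'
}

sampling_fraction 1000000 22400000   # prints 0.0446
```

Rounding this up to 0.05, as in the commands above, yields about 1.12M reads per file.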

Trimming / Filtering

via 'fastp':

fastp -i F1_S4_R1.fastq.gz -I F1_S4_R2.fastq.gz -o F1_S4_TRIMMED_R1.fastq.gz -O F1_S4_TRIMMED_R2.fastq.gz -h F1_S4_fastp_report.html

The resulting HTML reports are stored in the TRIMMED folders.

  • F1_S4: The input has little adapter percentage (~0.075690%), probably it's trimmed before.
  • F11_S14: The input has little adapter percentage (~0.136754%), probably it's trimmed before.
  1. Before Filtering:

    • Total reads: 2,978,791 each for Read1 and Read2.
    • Total bases: Approximately 226 million each for Read1 and Read2.
    • High-quality bases: Over 95% of bases have a quality score of at least 20 (Q20), and over 92% have a quality score of at least 30 (Q30).
  2. After Filtering:

    • Total reads: 2,913,522 each for Read1 and Read2.
    • Total bases: Approximately 221 million each for Read1 and Read2.
    • Quality improvement: The percentage of high-quality bases (Q20 and Q30) has increased after filtering, indicating improved data quality.
  3. Filtering Result:

    • Reads passed filter: 5,827,044 reads (Read1 and Read2 combined) passed the filtering criteria.
    • Failed reads: A small proportion of reads failed due to low quality, too many ambiguous bases (N), or being too short after filtering.
    • Adapter trimming: 18,330 reads had adapter sequences detected and trimmed.
    • Bases trimmed: A total of 346,344 bases were trimmed due to adapter sequences.
  4. Quality Metrics:

    • Duplication rate: Approximately 42.75% of reads were identified as duplicates after filtering.
    • Insert size peak: No distinct peak was detected for the estimated insert size based on paired-end reads.
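
The figures above can be cross-checked with a little arithmetic: 2 × 2,978,791 reads went in and 5,827,044 passed the filter. The sketch below copies these values from the report; the percent helper is a made-up name.

```shell
#!/bin/bash
# Cross-check of the fastp figures quoted above (values copied from the report).
reads_before=$(( 2 * 2978791 ))   # Read1 + Read2 before filtering
reads_after=5827044               # reads that passed the filter

# percent <part> <whole>: hypothetical helper; LC_ALL=C keeps '.' as decimal separator
percent() {
    LC_ALL=C awk -v p="$1" -v w="$2" 'BEGIN { printf "%.2f\n", p / w * 100 }'
}

echo "reads before filtering: $reads_before"                                  # 5957582
echo "passed: $(percent "$reads_after" "$reads_before")%"                     # 97.81%
echo "failed: $(percent $(( reads_before - reads_after )) "$reads_before")%"  # 2.19%
```

The input total of 5,957,582 equals the total read count reported for F1_S4 in the Results section, which suggests these fastp figures belong to that sample.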

Aligning Seq Data to Genomes

Suggested Directory Structure

  • microbiome_analysis/
    • run.sh
    • utils/
      • quack/
      • quack_qc.sh
      • pathogen_mapping.sh
    • 0_ref_genome/
      • mouse_genome.fna
      • pathogen_1_genome.fna
      • pathogen_2_genome.fna
      • ...
      • pathogen_combined_genome.fna
    • 1_index/
      • mouse_index.amb
      • mouse_index.ann
      • ...
      • pathogen_combined_index.amb
      • pathogen_combined_index.ann
      • ...
    • 2_exp_fastq/
      • F1_S4_R1.fastq.gz
      • F1_S4_R2.fastq.gz
      • ...

Show the number of properly paired read pairs mapped to pathogens (uses the variables defined in run.sh):

samtools view -f 2 "$output_folder/${exp}_pathogen_mapped.bam" | awk '{print $1}' | sort -u | wc -l
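
The -f 2 above, and the -F 2304 used in run.sh, refer to bits in the SAM FLAG field: 2 means "properly paired", 256 "secondary alignment", and 2048 "supplementary alignment" (256 + 2048 = 2304). The following pure-bash sketch mimics how such a filter decides; the function name and the example FLAG values are illustrative only.

```shell
#!/bin/bash
# Illustrative only: mimic how 'samtools view -f REQ -F EXCL' selects a read
# from its SAM FLAG value, using plain bash bit arithmetic.
passes_filter() {
    local flag="$1" required="$2" excluded="$3"
    # -f: all required bits must be set; -F: none of the excluded bits may be set
    (( (flag & required) == required && (flag & excluded) == 0 ))
}

# 99   = paired, proper pair, mate reverse, first in pair
# 77   = paired, read unmapped, mate unmapped, first in pair
# 355  = 99 + 256 (secondary), 2147 = 99 + 2048 (supplementary)
for flag in 99 77 355 2147; do
    if passes_filter "$flag" 2 2304; then
        echo "FLAG $flag: kept (-f 2 -F 2304)"
    else
        echo "FLAG $flag: not selected (goes to the -U file)"
    fi
done
```

Only FLAG 99 is kept here; the other three would end up in the file given to -U.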

Pipeline

  • Mapping: Maps genomic sequence identifiers (at times several per pathogen) to pathogen names.
  • FASTQ QC: Performs quality control using quack (https://github.com/IGBB/quack).
  • Reporting: Generates a summary report on host/pathogen mapped/unmapped reads.

For one sample (set via exp), the script aligns the paired FASTQ files to the mouse genome, separates the reads that did not align as proper pairs, converts them back to FASTQ format, aligns them to the combined pathogen index, and then sorts and indexes the result. Finally, it calculates and reports alignment statistics: the percentage of reads mapped to the mouse genome, to pathogens, and to neither. Alignments and statistics are saved in the output and stats folders (3_alignment/ and 4_stats/).

run.sh (paired-end reads):

#!/bin/bash

exp="P11_S25"

R1="${exp}_R1"
R2="${exp}_R2"

mouse_index="1_index/mouse_index"
pathogen_index="1_index/pathogen_combined_index"
input_folder="2_exp_fastq"
output_folder="3_alignment/$exp"
stats_folder="4_stats/$exp"
input_reads1="$input_folder/${R1}.fastq.gz"
input_reads2="$input_folder/${R2}.fastq.gz"

mkdir -p "$output_folder"
mkdir -p "$stats_folder"

# Check if mapping file exists. If not, create mapping file
if [ ! -f "4_stats/pathogen_mapping.txt" ]; then
    bash "utils/pathogen_mapping.sh"
fi

# Execute quack FASTQ QC
bash utils/quack_qc.sh "${exp}"

# Align against the mouse genome
echo -e "-----\n${exp} / Step 1: Aligning reads against the mouse genome...\n-----"
bwa mem -t 4 "$mouse_index" "$input_reads1" "$input_reads2" | samtools view -b -f 2 -F 2304 -U "$output_folder/${exp}_mouse_unmapped.bam" - > "$output_folder/${exp}_mouse_mapped.bam"
echo -e "-----\nStep 1 completed.\n-----"

# Align unmapped-to-mouse reads against pathogens
echo -e "-----\n${exp} / Step 2: Aligning unmapped-to-mouse reads against pathogens...\n-----"
# Create R1 and R2 FASTQ files out of mouse_unmapped.bam
samtools fastq -@ 4 -1 R1.fastq -2 R2.fastq "$output_folder/${exp}_mouse_unmapped.bam"

bwa mem -t 4 "$pathogen_index" R1.fastq R2.fastq | samtools view -b -f 2 -F 2304 -U "$output_folder/${exp}_pathogen_unmapped.bam" - > "$output_folder/${exp}_pathogen_mapped.bam"

rm R1.fastq R2.fastq

echo -e "-----\nStep 2 completed.\n-----"

# Sorting mapped pathogen reads and indexing
echo -e "-----\n${exp} / Step 3: Sorting mapped pathogen reads and indexing...\n-----"
samtools sort -o "$output_folder/${exp}_pathogen_mapped_sorted.bam" "$output_folder/${exp}_pathogen_mapped.bam"
samtools index "$output_folder/${exp}_pathogen_mapped_sorted.bam"
echo -e "-----\nStep 3 completed.\n-----"

# Generating index statistics for mapped pathogen reads
echo -e "-----\n${exp} / Step 4: Generating index statistics for mapped pathogen reads...\n-----"
samtools idxstats "$output_folder/${exp}_pathogen_mapped_sorted.bam" > "$stats_folder/${exp}_idxstats.txt"

####
####

### Get number of reads
# Get number of reads (not) mapped to mouse (excl. supplementary (flag 2048) and secondary mapped reads (flag 256)).
reads_mouse_mapped=$(samtools view -c "$output_folder/${exp}_mouse_mapped.bam")
reads_mouse_unmapped=$(samtools view -c -F 2304 "$output_folder/${exp}_mouse_unmapped.bam")

# Calculate total reads
reads_total=$(( $reads_mouse_mapped + $reads_mouse_unmapped ))

# Get number of non-mouse reads (not) mapped to pathogens (excl. supplementary (flag 2048) and secondary mapped reads (flag 256)).
reads_pathogen_mapped=$(samtools view -c "$output_folder/${exp}_pathogen_mapped.bam")
reads_pathogen_unmapped=$(samtools view -c -F 2304 "$output_folder/${exp}_pathogen_unmapped.bam")

# Mouse and pathogens
reads_total_mapped=$((reads_mouse_mapped + reads_pathogen_mapped))

### Calculate percentages
percentage_mouse_mapped=$(awk "BEGIN {printf \"%.2f\", ($reads_mouse_mapped/$reads_total)*100}")
percentage_mouse_unmapped=$(awk "BEGIN {printf \"%.2f\", ($reads_mouse_unmapped/$reads_total)*100}")
percentage_pathogen_mapped=$(awk "BEGIN {printf \"%.2f\", ($reads_pathogen_mapped/$reads_total)*100}")
percentage_pathogen_unmapped=$(awk "BEGIN {printf \"%.2f\", ($reads_pathogen_unmapped/$reads_total)*100}")
percentage_total_mapped=$(awk "BEGIN {printf \"%.2f\", (($reads_total_mapped/$reads_total)*100)}")

####
####

### Translate reference genome identifiers to pathogen names and sum up reads
# Associative array to store the mapping
declare -A mapping

# Read mapping file and populate the dictionary
while IFS= read -r line; do
    identifier=$(echo "$line" | awk '{print $1}')
    name=$(echo "$line" | awk '{$1=""; print $0}' | xargs)
    mapping["$identifier"]="$name"
done < "4_stats/pathogen_mapping.txt"

# Associative array to store the mapped reads for each pathogen
declare -A pathogen_reads

while IFS=$'\t' read -r identifier length mapped_reads unmapped_reads; do
    # Get the translated pathogen name from the dictionary 
    translated="${mapping[$identifier]}"
    # If the translation exists, use it; otherwise, keep the original identifier
    if [ -n "$translated" ]; then
        (( pathogen_reads["$translated"] += mapped_reads ))
    else
        pathogen_reads["$identifier"]="$mapped_reads"   
    fi
done < "$stats_folder/${exp}_idxstats.txt"

echo -e "-----\nStep 4 completed.\n-----\n"

pathogen_sorted=("Enterococcus_faecalis" "Helicobacter_hepaticus" "Klebsiella_oxytoca" "Rodentibacter_heylii" "Rodentibacter_pneumotropicus" "Staphylococcus_aureus" "*")

# Write 'translation' and summed up reads into new file
echo -e "Total reads for ${exp}: $reads_total (100%)\n-----" > "$stats_folder/${exp}_summary.txt"
echo -e "Mouse:\t\t\tMapped: $reads_mouse_mapped ($percentage_mouse_mapped%)\t| Unmapped: $reads_mouse_unmapped ($percentage_mouse_unmapped%)" >> "$stats_folder/${exp}_summary.txt"
echo -e "Pathogen in non-mouse:\tMapped: $reads_pathogen_mapped ($percentage_pathogen_mapped%)\t| Unmapped: $reads_pathogen_unmapped ($percentage_pathogen_unmapped%)" >> "$stats_folder/${exp}_summary.txt"
echo -e "Total:\t\t\tMapped: $reads_total_mapped ($percentage_total_mapped%)\t| Unmapped: $reads_pathogen_unmapped ($percentage_pathogen_unmapped%)" >> "$stats_folder/${exp}_summary.txt"
echo -e "\n###########\n"  >> "$stats_folder/${exp}_summary.txt"
echo -e "Pathogen name\t\tMapped reads\t(% pathogen | % non-mouse) \n-----" >> "$stats_folder/${exp}_summary.txt"

for pathogen in "${pathogen_sorted[@]}"; do
    # Default to 0 for pathogens without an idxstats entry; the awk ternary guards against division by zero
    count="${pathogen_reads[$pathogen]:-0}"
    percentage_mapped=$(awk -v n="$count" -v d="$reads_pathogen_mapped" 'BEGIN {printf "%.2f", (d > 0 ? n / d * 100 : 0)}')
    percentage_unmapped=$(awk -v n="$count" -v d="$reads_mouse_unmapped" 'BEGIN {printf "%.2f", (d > 0 ? n / d * 100 : 0)}')
    echo -e "$pathogen\t\t$count\t(${percentage_mapped}%\t| ${percentage_unmapped}%)" >> "$stats_folder/${exp}_summary.txt"
done
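
run.sh as shown processes one hard-coded sample (exp="P11_S25"). To run it over every sample in 2_exp_fastq/, the sample names could be derived from the file names. The sketch below assumes the _R1.fastq.gz / _R2.fastq.gz naming used above; list_samples is a made-up helper, not part of the repository.

```shell
#!/bin/bash
# Hypothetical helper: derive sample names (e.g. F1_S4) from the R1 files in a
# FASTQ folder, keeping only samples whose R2 mate file exists as well.
list_samples() {
    local dir="$1" r1 sample
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # glob matched nothing
        sample=$(basename "$r1" _R1.fastq.gz)
        [ -e "$dir/${sample}_R2.fastq.gz" ] && echo "$sample"
    done
    return 0
}

# Possible use, assuming run.sh were changed to read the sample name from "$1"
# instead of the hard-coded exp="P11_S25":
#   for exp in $(list_samples 2_exp_fastq); do bash run.sh "$exp"; done
```

Samples with a missing R2 file are skipped silently, which avoids starting a long alignment that would fail partway through.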

Tools

BWA (Burrows-Wheeler Aligner):

  • Purpose:
    • BWA is a software package used for aligning short DNA sequences against a large reference genome.
    • It employs the Burrows-Wheeler Transform (BWT) algorithm to efficiently align short reads to a reference sequence.
  • Usage in script:
    • BWA is used to align the experimental short DNA sequences (reads) against two different reference genomes: the mouse genome and a combined index for various pathogens.
    • The bwa mem command runs the BWA-MEM algorithm, which is intended for reads of about 70 bp and longer.
    • bwa mem writes its alignments in SAM (Sequence Alignment/Map) format to standard output; the script pipes them directly into samtools view, which stores them as BAM files.

Samtools:

  • Purpose:
    • Samtools is a suite of programs for interacting with high-throughput sequencing data in SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) formats.
    • It provides various utilities for manipulating, viewing, and analyzing sequence alignment data.
  • Usage in script:
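    • Separating properly paired alignments (-f 2, excluding secondary and supplementary alignments via -F 2304) from all remaining reads (-U).
    • Converting SAM to BAM format, and the remaining reads back to FASTQ.
    • Sorting and indexing the BAM file of pathogen-mapped reads.
    • Generating index statistics (idxstats) for the sorted BAM file.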
    • Calculating the number of reads mapped to different categories (mouse, pathogens, neither).

Mapping to Pathogen Names

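The idxstats report looks like this. Its columns are reference sequence name, sequence length, number of mapped read segments, and number of unmapped read segments. It lists 18 reference sequences (chromosomes, plasmids, and contigs) that belong to 6 pathogens: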

NC_004917.1	1799146	0	0
NC_007795.1	2821361	0	0
NZ_KB944666.1	2806553	0	0
NZ_KB944667.1	4841	0	0
NZ_KB944668.1	58987	0	0
NZ_BBIX01000009.1	671026	0	0
NZ_BBIX01000008.1	854302	0	0
NZ_BBIX01000007.1	21295	0	0
NZ_BBIX01000006.1	814831	0	0
NZ_BBIX01000005.1	18018	0	0
NZ_BBIX01000004.1	14392	0	0
NZ_BBIX01000003.1	14977	0	0
NZ_BBIX01000002.1	11561	0	0
NZ_BBIX01000001.1	14144	0	0
NZ_CP033844.1	5864574	1	0
NZ_CP033845.1	6078	0	0
NZ_CP033846.1	8424	0	0
NZ_CP040863.1	2607636	0	0
*	0	0	0

pathogen_mapping.sh

#!/bin/bash

input_file="0_ref_genome/combined_pathogen_genome.fna"
output_file="4_stats/pathogen_mapping.txt"

mkdir -p "4_stats"

# Declare a dictionary to store the mapping
declare -A reference_to_pathogen

# Extract reference genome identifiers and corresponding pathogen names
while IFS= read -r line; do
    identifier=$(echo "$line" | awk -F' ' '{print $1}')
    pathogen=$(echo "$line" | awk -F' ' '{print $2 "_" $3}')
    
    # Store the mapping in the dictionary
    reference_to_pathogen["$identifier"]="$pathogen"
done < <(grep '^>' "$input_file")

# Write the mapping to output file (strip the leading '>' from each identifier)
echo "Reference Genome Identifier	Pathogen Name" > "$output_file"
for key in "${!reference_to_pathogen[@]}"; do
    cleaned_key="${key#>}"
    printf "%-30s %s\n" "$cleaned_key" "${reference_to_pathogen[$key]}" >> "$output_file"
done
echo -e "Mapping file created..."

Dictionary: pathogen_mapping.txt

Reference Genome Identifier	Pathogen Name
NZ_BBIX01000001.1              Rodentibacter_pneumotropicus
NZ_BBIX01000007.1              Rodentibacter_pneumotropicus
NZ_CP033846.1                  Klebsiella_oxytoca
NZ_BBIX01000005.1              Rodentibacter_pneumotropicus
NZ_CP040863.1                  Rodentibacter_heylii
NZ_BBIX01000004.1              Rodentibacter_pneumotropicus
NZ_KB944667.1                  Enterococcus_faecalis
NZ_CP033844.1                  Klebsiella_oxytoca
NZ_KB944666.1                  Enterococcus_faecalis
NZ_BBIX01000008.1              Rodentibacter_pneumotropicus
NC_007795.1                    Staphylococcus_aureus
NZ_CP033845.1                  Klebsiella_oxytoca
NZ_BBIX01000009.1              Rodentibacter_pneumotropicus
NZ_KB944668.1                  Enterococcus_faecalis
NZ_BBIX01000003.1              Rodentibacter_pneumotropicus
NZ_BBIX01000006.1              Rodentibacter_pneumotropicus
NZ_BBIX01000002.1              Rodentibacter_pneumotropicus
NC_004917.1                    Helicobacter_hepaticus
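
The per-pathogen aggregation that run.sh does with bash associative arrays could also be written as a single awk pass over the mapping file and the idxstats report. This is only a sketch: the function name is made up, and it assumes the mapping file has a header line, the identifier in column 1, and the pathogen name in column 2.

```shell
#!/bin/bash
# Sketch: sum idxstats mapped-read counts per pathogen in one awk pass.
# Assumes a mapping file with a header line, identifier in column 1 and
# pathogen name in column 2; the function name is hypothetical.
sum_reads_per_pathogen() {
    local mapping_file="$1" idxstats_file="$2"
    awk '
        NR == FNR {                       # first file: the mapping
            if (FNR > 1) {                # skip the header line
                id = $1
                sub(/^>/, "", id)         # tolerate identifiers that still carry ">"
                name[id] = $2
            }
            next
        }
        {
            key = ($1 in name) ? name[$1] : $1   # fall back to the raw identifier
            reads[key] += $3                     # column 3: mapped read segments
        }
        END { for (k in reads) printf "%s\t%d\n", k, reads[k] }
    ' "$mapping_file" "$idxstats_file" | sort
}

# Example usage (paths as used in run.sh):
#   sum_reads_per_pathogen 4_stats/pathogen_mapping.txt "4_stats/$exp/${exp}_idxstats.txt"
```

Identifiers without a mapping entry, such as the '*' line for unplaced reads, keep their raw name in the output.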

quack_qc.sh

#!/bin/bash

SAMPLE_NAME="$1"

INPUT_DIR="2_exp_fastq"
OUTPUT_DIR="4_stats"
OUTPUT_SAMPLE_DIR="$OUTPUT_DIR/$SAMPLE_NAME/"

# add directory to PATH environment
export PATH="utils/quack/:$PATH"

# Ensure the output directory exists
mkdir -p "$OUTPUT_SAMPLE_DIR"

# Run Quack on the FastQ files
quack -1 "$INPUT_DIR/${SAMPLE_NAME}_R1.fastq.gz" -2 "$INPUT_DIR/${SAMPLE_NAME}_R2.fastq.gz" -n "$SAMPLE_NAME" > "${OUTPUT_SAMPLE_DIR}${SAMPLE_NAME}_quack_QC.svg"

echo -e "FASTQ QC for ${SAMPLE_NAME} done..."

Results

F1_S4

No Trimming:

Total reads for F1_S4: 5957582 (100%)
-----
Mouse:			Mapped: 152609 (2,56%)	| Unmapped: 5804973 (97,44%)
Pathogen in non-mouse:	Mapped: 4739137 (79,55%)	| Unmapped: 1065836 (17,89%)
Total:			Mapped: 4891746 (82,11%)	| Unmapped: 1065836 (17,89%)

###########

Pathogen name		Mapped reads	(% pathogen | % non-mouse) 
-----
Enterococcus_faecalis		811	(0,02%	| 0,01%)
Helicobacter_hepaticus		81	(0,00%	| 0,00%)
Klebsiella_oxytoca		2355	(0,05%	| 0,04%)
Rodentibacter_heylii		54258	(1,14%	| 0,93%)
Rodentibacter_pneumotropicus		4681031	(98,77%	| 80,64%)
Staphylococcus_aureus		648	(0,01%	| 0,01%)
*		0	(0,00%	| 0,00%)

Trimmed:

Total reads for F1_S4_TRIMMED: 5827044 (100%)
-----
Mouse:			Mapped: 131518 (2,26%)	| Unmapped: 5695526 (97,74%)
Pathogen in non-mouse:	Mapped: 4685620 (80,41%)	| Unmapped: 1009906 (17,33%)
Total:			Mapped: 4817138 (82,67%)	| Unmapped: 1009906 (17,33%)

###########

Pathogen name		Mapped reads	(% pathogen | % non-mouse) 
-----
Enterococcus_faecalis		810	(0,02%	| 0,01%)
Helicobacter_hepaticus		81	(0,00%	| 0,00%)
Klebsiella_oxytoca		2294	(0,05%	| 0,04%)
Rodentibacter_heylii		53616	(1,14%	| 0,94%)
Rodentibacter_pneumotropicus		4628239	(98,78%	| 81,26%)
Staphylococcus_aureus		627	(0,01%	| 0,01%)
*		0	(0,00%	| 0,00%)

No Trimming: Screenshot 2024-02-29 at 16 46 54

Trimmed: Screenshot 2024-02-29 at 16 47 39

F11_S14

No Trimming:

Total reads for F11_S14: 5661678 (100%)
-----
Mouse:			Mapped: 385450 (6,81%)	| Unmapped: 5276228 (93,19%)
Pathogen in non-mouse:	Mapped: 19516 (0,34%)	| Unmapped: 5256712 (92,85%)
Total:			Mapped: 404966 (7,15%)	| Unmapped: 5256712 (92,85%)

###########

Pathogen name		Mapped reads	(% pathogen | % non-mouse) 
-----
Enterococcus_faecalis		3134	(16,06%	| 0,06%)
Helicobacter_hepaticus		542	(2,78%	| 0,01%)
Klebsiella_oxytoca		4473	(22,92%	| 0,08%)
Rodentibacter_heylii		1109	(5,68%	| 0,02%)
Rodentibacter_pneumotropicus		1001	(5,13%	| 0,02%)
Staphylococcus_aureus		9259	(47,44%	| 0,18%)
*		0	(0,00%	| 0,00%)

Trimmed:

Total reads for F11_S14_TRIMMED: 5499580 (100%)
-----
Mouse:			Mapped: 364859 (6,63%)	| Unmapped: 5134721 (93,37%)
Pathogen in non-mouse:	Mapped: 19283 (0,35%)	| Unmapped: 5115438 (93,02%)
Total:			Mapped: 384142 (6,98%)	| Unmapped: 5115438 (93,02%)

###########

Pathogen name		Mapped reads	(% pathogen | % non-mouse) 
-----
Enterococcus_faecalis		3077	(15,96%	| 0,06%)
Helicobacter_hepaticus		540	(2,80%	| 0,01%)
Klebsiella_oxytoca		4402	(22,83%	| 0,09%)
Rodentibacter_heylii		1077	(5,59%	| 0,02%)
Rodentibacter_pneumotropicus		1023	(5,31%	| 0,02%)
Staphylococcus_aureus		9165	(47,53%	| 0,18%)
*		0	(0,00%	| 0,00%)

No Trimming: Screenshot 2024-02-29 at 16 51 14

Trimmed: Screenshot 2024-02-29 at 16 50 05
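
As a quick plausibility check, the three read categories in each summary (mapped to mouse, mapped to pathogens, mapped to neither) should add up to the total read count. A minimal sketch with values copied from the untrimmed summaries above; check_summary is a made-up name.

```shell
#!/bin/bash
# Consistency check on a summary: mouse-mapped + pathogen-mapped + unmapped
# reads should add up to the total. Function name is illustrative.
check_summary() {
    local total="$1" mouse="$2" pathogen="$3" unmapped="$4"
    (( mouse + pathogen + unmapped == total ))
}

# Values copied from the untrimmed summaries above
check_summary 5957582 152609 4739137 1065836 && echo "F1_S4: consistent"
check_summary 5661678 385450 19516 5256712 && echo "F11_S14: consistent"
```

Both checks succeed for the numbers reported above.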

  • Depending on the experimental setup, skipping the alignment against the mouse genome could significantly reduce processing time (this step took ~80 min for around 1.1 M reads). Host-derived reads account for only about 1% of the total reads in SRR26681942, so omitting the step would leave only a small share of host reads in the data aligned against the pathogens.
