For each sample, sequencing adaptors and low-quality bases were removed using cutadapt v1.16 (Martin 2011) and sickle v1.33 (Joshi and Fass 2011). Read quality was checked before and after processing using FastQC v0.11.5 (Andrews et al. 2010).
Trimmed reads were mapped against the mm9 mouse assembly using bwa-meth v0.2.0 (Pedersen et al. 2014) running bwa v0.7.12-r1039 (Li 2013). Mapping quality was evaluated with qualimap v2.2.1 (Garcia-Alcalde et al. 2012). Duplicates were removed with Picard MarkDuplicates (picard-tools v1.96) (Broad Institute 2020) with the REMOVE_DUPLICATES=TRUE and REMOVE_SEQUENCING_DUPLICATES=TRUE flags.
Alignments with MAPQ > 40 were used as input for methylation calling with methyldackel v0.3.0-3-g084d926 (Ryan 2020) (using HTSlib v1.2.1) in cytosine_report mode, covering all cytosines in the genome in both CpG and CpH contexts.
For each sample, we retrieved genome-wide per-cytosine reports. For each cytosine, these list its coordinate, mapping strand, number of methylated reads and number of unmethylated reads. Next, we added the strand-aware 8-mer sequence context using the slop and getfasta commands of the bedtools v2.27.1 suite (Quinlan et al. 2010). Thus, for each cytosine we stored its sequence context and its binary methylation status: methylated (at least one methylated read) or unmethylated (only unmethylated reads). The procedure is available as the script extract_motifs_frequency_from_bam_binary.sh (see below).
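The binarization and the strand-aware context extraction described above can be sketched in Python. This is only an illustration, not the repository's script (which relies on bedtools slop and getfasta). The function names, the in-memory genome dictionary and, in particular, the position of the cytosine within the 8-mer (three bases upstream, four downstream) are assumptions made for the example.

```python
from typing import Dict, Optional

# Complement table used to report minus-strand contexts 5'->3'.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def binary_status(n_meth: int, n_unmeth: int) -> Optional[str]:
    """Methylated if at least one methylated read, unmethylated if only
    unmethylated reads, None if the cytosine has no coverage."""
    if n_meth >= 1:
        return "methylated"
    if n_unmeth >= 1:
        return "unmethylated"
    return None


def strand_aware_context(
    genome: Dict[str, str],
    chrom: str,
    pos: int,          # 1-based coordinate of the cytosine
    strand: str,       # "+" or "-"
    upstream: int = 3,     # ASSUMPTION: window layout is hypothetical
    downstream: int = 4,   # ASSUMPTION: 3 + 1 + 4 = 8 bases in total
) -> Optional[str]:
    """Return the 8-mer around a cytosine, read on the cytosine's own strand.

    Windows running off the chromosome ends yield None.
    """
    seq = genome[chrom]
    i = pos - 1  # 0-based index of the cytosine (a G in the reference on "-")
    if strand == "+":
        start, end = i - upstream, i + downstream + 1
    else:
        # On the minus strand, upstream and downstream swap in reference space.
        start, end = i - downstream, i + upstream + 1
    if start < 0 or end > len(seq):
        return None
    window = seq[start:end].upper()
    return window if strand == "+" else reverse_complement(window)
```

With this layout the cytosine always sits at index 3 of the returned 8-mer and the base defining the CpG or CpH context at index 4, whichever strand the read mapped to.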
To aggregate the data by sequence, we generated count tables with the number of methylated and unmethylated instances. The CpG count table contained 4^6 = 4096 possible sequences (since two positions, CG, are fixed), whereas the CpH count table reported 4^6 * 3 = 12288 possible sequences (since the C position is fixed and the H can only be [A,C,T]).
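As a quick check of these table sizes, the possible contexts can be enumerated and used to initialize a count table. The sketch below is illustrative only: the function names are hypothetical, and the placement of the fixed dinucleotide inside the 8-mer (three free bases on each side) is an assumption. The totals do not depend on that placement.

```python
from itertools import product

BASES = "ACGT"


def enumerate_contexts(second_bases: str, n_flank: int = 6, n_upstream: int = 3):
    """List every 8-mer with a fixed C followed by one of `second_bases`.

    ASSUMPTION: `n_upstream` free bases precede the C and the remaining
    free bases follow the dinucleotide; the real layout may differ.
    """
    contexts = []
    for flank in product(BASES, repeat=n_flank):
        up = "".join(flank[:n_upstream])
        down = "".join(flank[n_upstream:])
        for base in second_bases:
            contexts.append(up + "C" + base + down)
    return contexts


cpg_contexts = enumerate_contexts("G")    # CG fixed, 6 free positions
cph_contexts = enumerate_contexts("ACT")  # C fixed, H in {A, C, T}


def count_table(observations, contexts):
    """Tally methylated and unmethylated instances per context.

    `observations` is an iterable of (kmer, status) pairs, with status being
    "methylated" or "unmethylated". Unknown k-mers (for instance those
    containing N) and unknown statuses are ignored.
    """
    table = {k: {"methylated": 0, "unmethylated": 0} for k in contexts}
    for kmer, status in observations:
        if kmer in table and status in table[kmer]:
            table[kmer][status] += 1
    return table


print(len(cpg_contexts), len(cph_contexts))
```

Pre-filling the table with every possible context keeps sequences that were never observed in the table, with zero counts, rather than silently dropping them.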
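To visualize DNA methylation per 8-mer sequence, we developed a methylation score that represents the proportion of loci with detected methylation relative to the total, i.e. score = methylated / (methylated + unmethylated), where methylated and unmethylated are the numbers of instances of the 8-mer in each state.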
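A minimal sketch of such a score, read directly from the prose definition (methylated instances over all instances of a given 8-mer), could look as follows. The function name and the handling of unobserved 8-mers are assumptions; the reports in this repository may compute the score differently.

```python
from typing import Optional


def methylation_score(n_methylated: int, n_unmethylated: int) -> Optional[float]:
    """Proportion of instances of an 8-mer with detected methylation.

    Returns None for 8-mers with no observed instance (ASSUMPTION: the
    source does not say how such cases are handled).
    """
    total = n_methylated + n_unmethylated
    if total == 0:
        return None
    return n_methylated / total
```

For example, an 8-mer seen methylated at 3 loci and unmethylated at 1 locus would score 0.75.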
For each sample, we next combined all 8-mer information to generate position weight matrices (PWMs) depicting DNA methylation preferences. To account for representation biases, we integrated the count tables of both methylated and unmethylated 8-mers. First, we calculated the nucleotide frequency per position for the methylated and unmethylated 8-mer count tables separately. Second, for each nucleotide and position, we divided the methylated frequency by the unmethylated frequency and log2-transformed the result. Hence, the sign of the score gives the direction of the enrichment: positive values indicate a preference for methylation, whereas negative values indicate a trend towards the unmethylated state.
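The two steps above (per-position nucleotide frequencies, then the log2 ratio of methylated over unmethylated frequencies) can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the reports: the function names are hypothetical, k-mers are assumed to contain only A, C, G and T, and the pseudocount is an addition of this example, introduced to avoid dividing by zero at positions where a nucleotide never occurs (such as the fixed cytosine).

```python
import math

BASES = "ACGT"


def position_frequencies(counts, pseudocount=1.0):
    """Per-position nucleotide frequencies from a {kmer: n_instances} table.

    ASSUMPTION: `pseudocount` is not described in the source; it is added
    to every base at every position before normalizing.
    """
    length = len(next(iter(counts)))
    totals = [{b: pseudocount for b in BASES} for _ in range(length)]
    for kmer, n in counts.items():
        for i, base in enumerate(kmer):
            totals[i][base] += n
    freqs = []
    for column in totals:
        column_sum = sum(column.values())
        freqs.append({b: column[b] / column_sum for b in BASES})
    return freqs


def log2_ratio_pwm(meth_counts, unmeth_counts, pseudocount=1.0):
    """log2(methylated frequency / unmethylated frequency) per position and base.

    Positive values point to a methylation preference, negative values to
    a trend towards the unmethylated state.
    """
    f_meth = position_frequencies(meth_counts, pseudocount)
    f_unmeth = position_frequencies(unmeth_counts, pseudocount)
    return [
        {b: math.log2(m[b] / u[b]) for b in BASES}
        for m, u in zip(f_meth, f_unmeth)
    ]
```

Dividing frequencies rather than raw counts is what corrects for representation: a nucleotide that is simply common at a position in both tables ends up with a ratio near 1 and a score near 0.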
- extract_motifs_frequency_from_bam_binary.sh, bash script to generate the methylation count tables
- extract_motifs_frequency_from_bam.sh, bash script to generate discretized count tables
- mapping, bash scripts to retrieve, QC, map (bismark/bwa-meth) and call methylation on the single-end and paired-end reads
- reports_associated_to_mapping, mainly for discovery and QC
- accessory, accessory scripts used during the discovery phase
- discarded, discarded approaches
- cytosine_report, first prototype
- media, SVG and PNG flowchart
- rmd_reports, PWM and other plots
- 01_motif_extract_run_nov_2019_postproc, report (with coverage filtering)
- 02_motif_extract_run_nov_2019_no_coverage_filtering_postproc, report (no coverage filtering)
- 03_stats_assessment, attempt to evaluate significance I
- 04_stats_assessment, attempt to evaluate significance II (structFDR)
- 05_ma_plots, visualization
- data/counts_nested_list.RData, used by several reports
- tuncay.baubec ta uzh tod ch
- izaskun.mallona ta gmail tod com
Izaskun Mallona, Ioana Mariuca Ilie, Ino Dominiek Karemaker, Stefan Butz, Massimiliano Manzo, Amedeo Caflisch, Tuncay Baubec (2020) Flanking sequence preference modulates de novo DNA methylation in the mouse genome