Software that detects PMDs (Partially Methylated Domains) from Illumina (EPIC) Infinium Methylation Assays. Based on MethylseekR, adapted to support EPIC assays, using KNN datapoint selection and alpha-value smoothing.
- PMDs are regions of intermediate methylation, that are highly disordered between CpGs. (in contrast the background methylation is polarized: highly methylated >=70% or depleted ) link
- PMDs are celltype-specific link
- PMDs can be related to closed chromatin compartments link
The local methylation level distribution is approximated using a beta-distribution. Its α,β parameter determines the shape. Here both parameters are set equal, meaning α=β. Values that fullfill α > 1, result in a probability density function shape that favors intermediate methylation levels (PMDs). Values that fullfill α < 1, will result in a probability density function shape that favors the polarized methylation levels (background)
Each covered CpG is assigned to an estimated α value by using its neighboring CpGs to determin the most likely value. Those estimated alpha values are used as input to an 2-state Hidden Markov Model (HMM) that predictes whenever a change between PMD/BG is likely. A solution (segmentation) to the HMM is found using the Viterbi algorithm.
This method is adapted from MethylseekR (MSR):
"Identification of active regulatory regions from DNA methylation data" Lukas Burger and Dimos Gaidatzis and Dirk Schubeler and Michael B. Stadler 2013 Nucleic Acids Research
Also RnBeads is used for EPIC data processing and representation
"Compehensive Analysis of DNA Methylation Data with RnBeads" "Yassen Assenov and Fabian Mueller and Pavlo Lutsik and Joern Walter and Thomas Lengauer and Christoph Bock" 2014 Nature Methods 11
MSR was developed with WGBS (Whole genome bisulfite sequencing) data in mind. Therefore it operates on read counts, comparing reads of methylated CpGs to the total number of reads. EPIC assays use red/green light intensities to determin the methylation level. The total number of reads is calculated by the sum of light intensities, light intensities covering methylated CpGs are passed as methylated reads.
In comparison to WGBS, EPIC assays only cover a small subset of CpGs from the human genome. Therefore size constrains (distance cutoff) and nearest neighbor selection using KNN ensure that only data points in reasonable distance are used for alpha value estimation.
Per default different values for k (neighboorhood size) and distance cutoff are evaluated to estimate alpha. Those alpha value estimations are averaged to diminish the effect of individual parameter setups to alpha.
Some areas have too few data points to reliably assign segments. Those are marked accordingly
load the library
library(epicPMDdetect)
generate a csv sampleSheet that describes your data, herefore refer to the instructions in RnBeads Manual.
Now load the data using rnbeads by calling readEPIC_idats. The sample.col.name musst be the column name of the csv holding the sample names
data = readEPIC_idats(sample.sheet,idat.dir,sample.col.name,preprocess=T)
Specific which samples should be processed (default all) by providing an vector of names (optional a settings object can be passed that controls further parameters.)
segmentRnbSet(data,outputFolder,samples=NULL)
The outputFolder contains the segmentations for each specified sample. They are named:
[sampleName].seg.bed.gz