title | author | date | output | vignette | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MeDeCom: Methylome Decomposition via Constrained Matrix Factorization |
Pavlo Lutsik, Martin Slawski, Gilles Gasparoni, Nikita Vedeneev, Matthias Hein and Joern Walter |
2020-03-19 |
|
%\VignetteIndexEntry{MeDeCom} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc}
|
MeDeCom is an R-package for reference-free decomposition of heterogeneous DNA methylation profiles.
It uses matrix factorization enhanced by constraints and a specially tailored regularization.
MeDeCom represents an input
MeDeCom starts with a set of related DNA methylation profiles, e.g. a series of Infinium microarray measurements from a population cohort, or several bisulfite sequencing-based methylomes. The key requirement is that the input data represents absolute DNA methylation measurements in a population of cells and contains values between 0 and 1. MeDeCom implements an alternating scheme which iteratively updates randomly initialized factor matrices until convergence or until the maximum number of iterations has been reached. This is repeated for multiple random initializations and the best solution is returned.
MeDeCom features two tunable parameters. The first one is the number of LMCs
MeDeCom can be installed directly from github using the package devtools
:
devtools::install_github("lutsik/MeDeCom")
The master branch of the GitHub repository only compiles on nix-like platforms with a C++11-compatible compiler are supported. Users on Mac OS might run into compilation errors as OpenMP, which is required for compilation, is disabled by default. To enable OpenMP, the following tutorial can be followed https://mac.r-project.org/openmp/ We created a separate branch for installation of MeDeCom on Windows machines. However, parallel processing options are currently not supported for Windows due to some incompatabilites of the R/Windows connection. Thus, executing MeDeCom on a Windows machine takes sustantially longer. Additionally, we creared a Docker image with MeDeCom, which can be used in case of further installation issues https://hub.docker.com/r/mscherer/medecom.
devtools::install_github("lutsik/MeDeCom",ref="windows")
MeDeCom uses stack model for memory to accelerate factorization for smaller ranks. This requires certain preparation during compilation, therefore, please, note the extended compilation time (15 to 20 minutes).
MeDeCom accepts DNA methylation data in several forms. Preferably the user may load and preprocess the data
using a general-purpose DNA methylation analysis package RnBeads. A resulting RnBSet object
can be directly supplied to MeDeCom. Alternatively, MeDeCom runs on any matrix of type numeric
with valid methylation values.
MeDeCom comes with a small example data set obtained by mixing reference profiles of blood cell methylomes in silico. The example data set can be loaded in a usual way:
## load the package
suppressPackageStartupMessages(library(MeDeCom))
## load the example data sets
data(example.dataset, package="MeDeCom")
## you should get objects D, Tref and Aref
## in your global R environment
ls()
## [1] "Aref" "D" "lmcs" "medecom.result"
## [5] "perm" "prop" "sample.group" "sge.setup"
## [9] "Tref"
Loaded numeric matrix D
contains 100 in silico mixtures and serves as an example input. Columns of matrix Tref
contains the methylomes
of 5 blood cell types used to generate the mixtures, while matrix Aref
provides the mixing proportions.
## matrix D has dimension 10000x100
str(D)
## num [1:10000, 1:100] 0.0354 0.0491 0.3411 0.873 0.1781 ...
## matrix Tref has dimension 10000x5
str(Tref)
## num [1:10000, 1:5] 0.1 0.0847 0.3167 0.834 0.0455 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:5] "Neutrophils" "CD4+ T-cells" "CD14+ Monocytes" "CD8+ T-cells" ...
## matrix Aref has dimension 5x100
str(Aref)
## num [1:5, 1:100] 6.54e-01 3.80e-02 3.06e-01 1.56e-13 2.45e-03 ...
MeDeCom can be run directly on matrix D
.
It is crucial to select the values of parameters
medecom.result<-runMeDeCom(D, Ks=2:10, lambdas=c(0,10^(-5:-1)))
MeDeCom is based upon an alternating optimization heuristic and requires a lot of computation. The processing of the data matrix can take several hours. One can speed up the run by decreasing the number of cross-validation folds and random initializations, and increasing the number of computational cores.
medecom.result<-runMeDeCom(D, 2:10, c(0,10^(-5:-1)), NINIT=10, NFOLDS=10, ITERMAX=300, NCORES=9)
##
## [Main:] checking inputs
## [Main:] preparing data
## [Main:] preparing jobs
## [Main:] 3114 factorization runs in total
## [Main:] runs 2755 to 2788 complete
## [Main:] runs 2789 to 2822 complete
## [Main:] runs 2823 to 2856 complete
## [Main:] runs 2857 to 2890 complete
## ......
## [Main:] finished all jobs. Creating the object
This can, however, lead to decomposition slightly different from the one presented given below.
The results of a decomposition experiment are saved to an object of class MeDeComSet
.
The contents of an object can be conveniently displayed using the print
functionality.
medecom.result
## An object of class MeDeComSet
## Input data set:
## 10000 CpGs
## 100 methylomes
## Experimental parameters:
## k values: 2, 3, 4, 5, 6, 7, 8, 9, 10
## lambda values: 0, 1e-05, 1e-04, 0.001, 0.01, 0.1
The first key step is parameter selection. It is important to carefully explore the obtained results and make a decision about the most feasible parameter values, or about extending the parameter value grids to be tested in refinement experiments.
MeDeCom provides a cross-validation error (CVE) for each tested parameter combination.
plotParameters(medecom.result)
A lineplot helping to select parameter
plotParameters(medecom.result, K=5, lambdaScale="log")
Cross-validation error has a minimum at
A matrix of LMCs can be extracted using getLMCs
:
lmcs<-getLMCs(medecom.result, K=5, lambda=0.01)
str(lmcs)
## num [1:10000, 1:5] 0 0.0184 0.2898 0.7856 0 ...
LMCs can be seen as measured methylation profiles of purified cell populations.
MeDeCom provides for several visualization methods for LMCs using the function plotLMCs
which operates directly on MeDeComSet
objects.
For instance, standard hierarchical clustering can be visualized using:
plotLMCs(medecom.result, K=5, lambda=0.01, type="dendrogram")
A two-dimensional embedding with MDS is also obtainable:
plotLMCs(medecom.result, K=5, lambda=0.01, type="MDS")
Input data can be included into the MDS plot to enhance the interpretation.
plotLMCs(medecom.result, K=5, lambda=0.01, type="MDS", D=D)
In many cases reference methylomes exists, which are relevant for the data set in question.
For our example analysis matrix Tref
contains the reference type
profiles which were in silico mixed. MeDeCom offers several ways to visualize the
resulting LMCs together with the reference methylation profiles. The reference methylomes
can be included into a joint clustering analysis:
plotLMCs(medecom.result, K=5, lambda=0.01, type="dendrogram", Tref=Tref, center=TRUE)
Furthermore, a similarity matrix of LMCs vs reference profiles can be visualized as a heatmap.
plotLMCs(medecom.result, K=5, lambda=0.01, type="heatmap", Tref=Tref)
Correlation coefficient values and asterisks aid the interpretation. The values are displayed in the cells which contain maximal values column-wise. The asterisks mark cells which have the highest correlation value in the respective rows. Thus, a value with asterisk corresponds to a mutual match, i.e. LMC unambiguously matching a reference profile.
In this example analysis each LMC uniquely matches one of the reference profiles. The matching of
Function matchLMCs
offers several methods for
matching LMCs to reference profiles.
perm<-matchLMCs(lmcs, Tref)
MeDeCom provides functions to perform enrichment analysis on the sites that are particularly hypo-/hypermethylated in an LMC. These sites can then be used for GO and LOLA enrichment analysis. Importantly, genomic annotations of the LMC sites is required to be specified. We thus recommend to use the DecompPipeline package for processing, but the annotation can also be specified manually using a data.frame
that looks as follows:
Chromosome Start End Strand CpG GC CGI Relation SNPs
30365 chr1 1036375 1036376 + 2 59 Shelf <NA>
42681 chr1 1184537 1184538 + 2 58 Open Sea <NA>
45091 chr1 1218625 1218626 + 5 66 Island <NA>
51615 chr1 1292773 1292774 + 3 64 Island <NA>
52001 chr1 1295504 1295505 + 9 65 Shore <NA>
52003 chr1 1295507 1295508 + 9 65 Shore <NA>
The required columns are Chromosome
, Start
, End
, and Strand
. Using this data.frame
(called df
in the following), enrichment analysis can be performed using:
lmc.lola.enrichment(medecom.result,anno.data=df,K=5,lambda=0.001,diff.threshold = 0.5, region.type = "tiling")
Please note that df
needs to have the same number of rows than the methylation matrix used as input to MeDeCom. CpGs are first aggregated over the region.type
specified, then the regions are selected that have a difference larger than diff.threshold
. The list of available region types is published here https://rnbeads.org/regions.html.
A matrix of mixing proportions is obtained using getProportions
:
prop<-getProportions(medecom.result, K=5, lambda=0.001)
str(prop)
## num [1:5, 1:100] 0.3411 0 0.065 0.0213 0.5726 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:5] "LMC1" "LMC2" "LMC3" "LMC4" ...
## ..$ : NULL
A complete matrix of propotions can be visualized as a stacked barplot:
plotProportions(medecom.result, K=5, lambda=0.01, type="barplot")
or a heatmap:
plotProportions(medecom.result, K=5, lambda=0.01, type="heatmap")
The heatmap can be enhanced by clustering the columns:
plotProportions(medecom.result, K=5, lambda=0.01, type="heatmap", heatmap.clusterCols=TRUE)
or adding color code for the samples:
sample.group<-c("Case", "Control")[1+sample.int(ncol(D))%%2]
plotProportions(medecom.result, K=5, lambda=0.01, type="heatmap", sample.characteristic=sample.group)
plotProportions(medecom.result, K=5, lambda=0.01, type="lineplot", lmc=2, Aref=Aref, ref.profile=2)
MeDeCom experiments require a lot of computational time. On the other hand most of the factorization runs are independent and, therefore, can be run in parallel. Thus, a significant speedup can be achieved when running MeDeCom in an HPC environment. MeDeCom can be easily adapted to most of the popular schedulers. There are, however, several prerequisites:
- the scheduler provides the standard utilities
qsub
for the submission of the cluster jobs andqstat
for obtaining the job statistics; - the cluster does not have a low limit on the number of submitted jobs;
- the R installation (location of the R binary and the package library) is consistent across the execution nodes.
The example below is for the cluster operated by Son of Grid Engine (SoGE). To be able to run on a SoGE cluster MeDeCom needs to know:
- location of the R executable (directory);
- an operating memory limit per each factorization job;
- a pattern for the names of cluster nodes to run the jobs on.
These settings should be stored in a list
object:
sge.setup<-list(
R_bin_dir="/usr/bin",
host_pattern="*",
mem_limit="5G"
)
This object should be supplied to MeDeCom as the argument cluster.settings
. It is also important to specify a valid temporary
directory, which is available to all execution nodes.
medecom.result<-runMeDeCom(D, Ks=2:10, lambdas=c(0,10^(-5:-1)), N_COMP_LAMBDA=1, NFOLDS=5, NINIT=10,
temp.dir="/cluster_fs/medecom_temp",
cluster.settings=sge.setup)
MeDeCom will start the jobs and will periodically monitor the number of remaining ones.
##
## [Main:] checking inputs
## [Main:] preparing data
## [Main:] preparing jobs
## [Main:] 3114 factorization runs in total
## [Main:] 3114 jobs remaining
## ....
## [Main:] finished all jobs. Creating the object
Here is the output of sessionInfo()
on the system on which this document was compiled:
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats4 parallel stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] MeDeCom_0.3.3
## [2] RnBeads_2.4.0
## [3] plyr_1.8.5
## [4] methylumi_2.32.0
## [5] minfi_1.32.0
## [6] bumphunter_1.28.0
## [7] locfit_1.5-9.1
## [8] iterators_1.0.12
## [9] foreach_1.4.7
## [10] Biostrings_2.54.0
## [11] XVector_0.26.0
## [12] SummarizedExperiment_1.16.1
## [13] DelayedArray_0.12.2
## [14] BiocParallel_1.20.1
## [15] FDb.InfiniumMethylation.hg19_2.2.0
## [16] org.Hs.eg.db_3.10.0
## [17] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [18] GenomicFeatures_1.38.1
## [19] AnnotationDbi_1.48.0
## [20] reshape2_1.4.3
## [21] scales_1.1.0
## [22] Biobase_2.46.0
## [23] illuminaio_0.28.0
## [24] matrixStats_0.55.0
## [25] limma_3.42.1
## [26] gridExtra_2.3
## [27] ggplot2_3.2.1
## [28] fields_10.2
## [29] maps_3.3.0
## [30] spam_2.5-1
## [31] dotCall64_1.0-0
## [32] ff_2.2-14
## [33] bit_1.1-15.1
## [34] cluster_2.1.0
## [35] MASS_7.3-51.5
## [36] GenomicRanges_1.38.0
## [37] GenomeInfoDb_1.22.0
## [38] IRanges_2.20.2
## [39] S4Vectors_0.24.3
## [40] BiocGenerics_0.32.0
## [41] RUnit_0.4.32
## [42] gplots_3.0.1.1
## [43] gtools_3.8.1
## [44] pracma_2.2.9
## [45] Rcpp_1.0.3
## [46] knitr_1.27
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.4-1 siggenes_1.60.0 mclust_5.4.5
## [4] base64_2.0 bit64_0.9-7 xml2_1.2.2
## [7] splines_3.6.3 codetools_0.2-16 scrime_1.3.5
## [10] Rsamtools_2.2.1 annotate_1.64.0 dbplyr_1.4.2
## [13] HDF5Array_1.14.2 readr_1.3.1 compiler_3.6.3
## [16] httr_1.4.1 assertthat_0.2.1 Matrix_1.2-18
## [19] lazyeval_0.2.2 prettyunits_1.1.1 tools_3.6.3
## [22] gtable_0.3.0 glue_1.3.1 GenomeInfoDbData_1.2.2
## [25] dplyr_0.8.4 rappdirs_0.3.1 doRNG_1.8.2
## [28] vctrs_0.2.2 multtest_2.42.0 nlme_3.1-144
## [31] preprocessCore_1.48.0 gdata_2.18.0 rtracklayer_1.46.0
## [34] DelayedMatrixStats_1.8.0 xfun_0.12 stringr_1.4.0
## [37] lifecycle_0.1.0 rngtools_1.5 XML_3.99-0.3
## [40] beanplot_1.2 zlibbioc_1.32.0 hms_0.5.3
## [43] GEOquery_2.54.1 rhdf5_2.30.1 RColorBrewer_1.1-2
## [46] curl_4.3 memoise_1.1.0 biomaRt_2.42.0
## [49] reshape_0.8.8 stringi_1.4.5 RSQLite_2.2.0
## [52] highr_0.8 genefilter_1.68.0 caTools_1.17.1.1
## [55] rlang_0.4.4 pkgconfig_2.0.3 bitops_1.0-6
## [58] nor1mix_1.3-0 evaluate_0.14 lattice_0.20-40
## [61] purrr_0.3.3 Rhdf5lib_1.8.0 GenomicAlignments_1.22.1
## [64] tidyselect_1.0.0 magrittr_1.5 R6_2.4.1
## [67] DBI_1.1.0 pillar_1.4.3 withr_2.1.2
## [70] survival_3.1-8 RCurl_1.98-1.1 tibble_2.1.3
## [73] crayon_1.3.4 KernSmooth_2.23-16 BiocFileCache_1.10.2
## [76] progress_1.2.2 data.table_1.12.8 blob_1.2.1
## [79] digest_0.6.23 xtable_1.8-4 tidyr_1.0.2
## [82] openssl_1.4.1 munsell_0.5.0 quadprog_1.5-8
## [85] askpass_1.1