This script contains an implementation in R of a weighted average procedure to generate consensus locus-specific LD matrices from multiple single-cohort correlation files. Together with FINEMAP, it has been used in the generation of the PGC3-SCZ fine-mapping results.
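
As a rough illustration of what "weighted average" could mean here, the block below writes an element-wise weighted mean of cohort correlations. This is an assumption rather than a description of the script's internals: the symbols $r_{ij}^{(k)}$ (correlation between SNPs $i$ and $j$ in cohort $k$), $w_k$ (the cohort weight, i.e. its effective sample size) and $K_{ij}$ (the set of cohorts in which both SNPs are available) are introduced only for this sketch.

```latex
\bar{r}_{ij} \;=\; \frac{\sum_{k \in K_{ij}} w_k \, r_{ij}^{(k)}}{\sum_{k \in K_{ij}} w_k}
```

Under this reading, a cohort contributes to a given pair of SNPs in proportion to its effective sample size.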
LDmergeFM requires the readr, dplyr, purrr, reshape2 and Matrix packages to be available in your local R installation. It is designed to be run from the command-line as:
Rscript --vanilla LDmergeFM.R $LOCUS $COR_FORMAT $ESS_FORMULA
Where the argument $LOCUS is the locus identifier for the LD matrix being calculated, which must be present in the names of all of the input files:
Filename | Contents |
---|---|
$LOCUS.ref | Two-column whitespace-delimited file. Column 1: SNP name. Column 2: Effect allele. Equivalent to columns 1 and 4 of a FINEMAP .z file. No header. |
$COHORT_$LOCUS.fam | 1x PLINK v1.07+ .fam file for each cohort being analysed (with case/control phenotype). Individuals with missing phenotypes not used to compute the pairwise correlations should be excluded from this file. |
$COHORT_$LOCUS.cor.gz | 1x LDSTORE v1.1 .cor file (compressed output of --table flag) for each cohort being analysed. Output of the PLINK v1.9+ --r inter-chr gz flag ($COHORT_$LOCUS.ld.gz) is also acceptable if the $COR_FORMAT argument is changed as described below. |
Single-cohort names ($COHORT) should be unique but can contain any non-whitespace characters. The underscore ("_") separating $COHORT from $LOCUS is mandatory.
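
Before running the script it can be useful to confirm that every cohort has both of its input files. The sketch below is not part of LDmergeFM; `list_cohorts` is a hypothetical helper that assumes all inputs sit in the current directory and follow the `$COHORT_$LOCUS` naming convention described above.

```shell
# Hypothetical helper (not part of LDmergeFM): list cohorts that have both
# a .fam file and a correlation file for the given locus.
# Usage: list_cohorts LOCUS [EXTENSION]   (EXTENSION defaults to cor.gz)
list_cohorts() {
  locus=$1
  ext=${2:-cor.gz}                      # use "ld.gz" for PLINK correlations
  for fam in *_"$locus".fam; do
    [ -e "$fam" ] || continue           # no match: the glob stays literal
    cohort=${fam%_"$locus".fam}         # strip the "_LOCUS.fam" suffix
    if [ -e "${cohort}_${locus}.${ext}" ]; then
      echo "$cohort"
    else
      echo "missing ${cohort}_${locus}.${ext}" >&2
    fi
  done
}
```

Complete cohorts are printed to standard output, one per line, and cohorts lacking a correlation file are reported on standard error.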
The other two arguments of the script are optional:
$COR_FORMAT indicates whether the input correlations have been computed with “LDSTORE” or “PLINK”, allowing the script to correctly process these files. Defaults to “LDSTORE” if not specified.
$ESS_FORMULA indicates how to compute the effective sample size used as the weight of each LD matrix. Options are “METAL” for the formula used in Willer et al. 2010 or “NCP” for the definition of Matti Pirinen and Vukcevic et al. 2011. Defaults to “METAL” if not specified.
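
To get a feel for the weights, the sketch below computes the effective sample size of one cohort from its .fam file using the formula given in Willer et al. 2010, N_eff = 4 / (1/N_cases + 1/N_controls). It is an illustration only and may differ from how LDmergeFM computes it; `ess_metal` is a hypothetical helper, and it assumes the standard PLINK phenotype coding in column 6 (1 = control, 2 = case).

```shell
# Hypothetical helper (not part of LDmergeFM): METAL-style effective sample
# size from a PLINK .fam file. Column 6 holds the phenotype:
# 1 = control, 2 = case; any other value is ignored.
ess_metal() {
  awk '$6 == 2 { cases++ }
       $6 == 1 { controls++ }
       END {
         if (cases > 0 && controls > 0)
           printf "%.2f\n", 4 / (1 / cases + 1 / controls)
         else
           print "NA"                   # undefined without both groups
       }' "$1"
}
```

For a balanced cohort the result equals the total sample size, and it shrinks as the case/control ratio becomes more uneven.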
Note that if these last two arguments are used, they must be given in the order above. This implies that to change $ESS_FORMULA one needs to state the value of $COR_FORMAT explicitly as well (but the converse is not true).
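
The positional rules above can be sketched as follows. `ldmergefm_cmd` is a hypothetical wrapper that only prints the command line it would run, and `locus1` is a placeholder locus identifier rather than a file shipped with the script.

```shell
# Hypothetical wrapper (not part of LDmergeFM): print the command line for
# one to three positional arguments, in the order the script expects.
ldmergefm_cmd() {
  if [ "$#" -lt 1 ] || [ "$#" -gt 3 ]; then
    echo "usage: ldmergefm_cmd LOCUS [COR_FORMAT [ESS_FORMULA]]" >&2
    return 1
  fi
  echo "Rscript --vanilla LDmergeFM.R $*"
}

ldmergefm_cmd locus1               # defaults: LDSTORE input, METAL weights
ldmergefm_cmd locus1 PLINK         # PLINK input, METAL weights
ldmergefm_cmd locus1 LDSTORE NCP   # NCP weights need COR_FORMAT spelled out
```

Dropping the `echo` from the helper turns the dry run into an actual call, assuming `LDmergeFM.R` is in the working directory.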
The script generates the following output files:
Filename | Contents |
---|---|
$LOCUS.ld | Square consensus LD matrix. SNPs are given in the same order as $LOCUS.ref. |
$LOCUS.snps.log | SNPs used in the computation of the consensus LD matrix. Should match those in $LOCUS.ref. |
$LOCUS.samples.log | Cohorts used in the computation of the consensus LD matrix. Should match all of those provided as input files. |
$LOCUS.heatmap.png | Basic illustration of the consensus LD structure at the locus. Intended for troubleshooting or to identify regions that could be problematic for fine-mapping. Only generated if the R installation has PNG capability. |
The ./test/ folder contains some input/output files that can be used to conduct a reproducible run. For illustration purposes, these files include the region around exon 12 of the EDAR gene, which contains some very strong linkage as previously discussed by Sabeti et al. 2007. Genotypes were derived from polymorphic SNPs from four subpopulations (Europeans, Sub-Saharan Africans, East Asians and Native Americans) of the public HGDP dataset. Please reference Bergström et al. 2020 if you find this data useful for other purposes.
LDmergeFM has been designed with fine-mapping in a meta-analytic case-control GWAS setting in mind, so one of its implicit requirements (in line with FINEMAP) is that the reference allele for the correlations is the same in all cohorts. Given that inconsistent criteria are currently used to decide effect/reference alleles, it can help to set these explicitly using the PLINK --a1-allele/--ref-allele flag, which in fact can accept the format of the $LOCUS.ref file.
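
As an illustration, the sketch below assembles a PLINK v1.9 command that fixes the A1 alleles from $LOCUS.ref and writes the pairwise correlations in one step. `plink_ld_cmd` is a hypothetical helper that only prints the command, and it assumes a binary fileset (.bed/.bim/.fam) whose prefix is `$COHORT_$LOCUS`. The `2 1` after the filename tells --a1-allele to read the allele from column 2 and the variant ID from column 1.

```shell
# Hypothetical helper (not part of LDmergeFM): print a PLINK v1.9 command
# that sets A1 alleles from LOCUS.ref and computes signed correlations.
# Assumes a binary fileset named COHORT_LOCUS.{bed,bim,fam}.
plink_ld_cmd() {
  cohort=$1
  locus=$2
  prefix="${cohort}_${locus}"
  echo "plink --bfile ${prefix} --a1-allele ${locus}.ref 2 1 --r inter-chr gz --out ${prefix}"
}

plink_ld_cmd cohortA locus1
```

The resulting `$COHORT_$LOCUS.ld.gz` file is in PLINK format, so $COR_FORMAT would have to be set to “PLINK” when running LDmergeFM on it.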
For a similar reason, LDmergeFM expects that all SNPs in the $LOCUS.ref file will be uniquely named and that each of them can be found in the correlation file of at least one cohort. Duplicated or missing SNPs might cause the script to fail silently, returning erroneous output, so please ensure these are not present.
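
Duplicated names are easy to screen for before running the script. The sketch below is one way to do it with standard tools; `dup_snps` is a hypothetical helper, not something provided by LDmergeFM.

```shell
# Hypothetical helper (not part of LDmergeFM): print every SNP name that
# appears more than once in column 1 of a LOCUS.ref file.
dup_snps() {
  awk '{ print $1 }' "$1" | sort | uniq -d
}
```

Empty output means that all SNP names in the file are unique.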
LDmergeFM has not been tested with correlation table files from LDSTORE v2+; please conduct a test run before using those in important analyses.
LDmergeFM can work with an arbitrary number of input matrices but in its current state is not optimised to take advantage of multicore environments or matrix sparsity, and thus can be potentially resource-hungry. If working on systems with resource quotas, please check .log files to make sure the computation of the consensus LD matrix has used all available data.
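
One way to automate part of this check is to compare $LOCUS.snps.log against $LOCUS.ref. The sketch below rests on an assumption that should be verified against a real log: that $LOCUS.snps.log holds one SNP name per line in its first field. `snps_not_used` is a hypothetical helper.

```shell
# Hypothetical helper (not part of LDmergeFM): print SNPs listed in
# LOCUS.ref that do not appear in LOCUS.snps.log.
# ASSUMPTION: the log has one SNP name per line, in the first field.
snps_not_used() {
  locus=$1
  awk 'FILENAME == ARGV[1] { used[$1]; next }   # first file: SNPs logged
       !($1 in used)       { print $1 }         # second file: not logged
      ' "${locus}.snps.log" "${locus}.ref"
}
```

Empty output suggests that every SNP in the reference file made it into the consensus matrix.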
LDmergeFM is not ancestry-aware. If ancestry-specific consensus matrices are needed (e.g. for trans-ancestry fine-mapping purposes) you should run the script separately for each group of single-ancestry inputs.
2021-03-09 => Added some internal checks for better error reporting. Introduced arguments to accommodate other correlation file formats and change the calculation for effective sample size weights if desired. New basic heatmap output.
2020-11-13 => Upload of initial version with essential functionality.
FINEMAP/LDSTORE: http://www.christianbenner.com/
PLINK: https://www.cog-genomics.org/plink/1.9/
If this script is helpful for your work, just reference the main PGC3-SCZ paper. If it ends up being very helpful, please let me know so I can keep fighting impostor syndrome one day at a time 😌.
Please submit suggestions and bug-reports at https://github.com/pintaius/LDmergeFM/issues.