-
Notifications
You must be signed in to change notification settings - Fork 348
FAQ
Q. Is it cool if I fork ldsc
?
A. Of course!
Q. Should I parallelize over chromosomes when estimating LD Score?
A. Yes. In addition to saving time, LD across chromosome boundaries is almost always spurious LD due to finite sample noise or sample structure.
Q. How much time/memory does it take to estimate LD Scores?
A. Using a 1cM window, it takes about 1 hour and 1GB RAM on my 1.7GHz Macbook Air to compute LD Scores for chromosome 1 using the N=378 1000 Genomes Europeans. The time and memory complexity are both linear in the number of samples and the number of SNPs. Computing partitioned LD Scores with a large number of partitions increases the memory usage (~8GB for 50 categories) but has only minimal impact on the runtime.
Q. Why are some LD Scores negative?
A. The naive estimator of r2 is biased upwards by approximately 1/N where N is sample size. The amount of bias per pair of SNPs is small, but when summing over thousands of r2 estimate, the bias becomes substantial. LDSC therefore uses an unbiased estimator of r2; which necessarily has the property of returning negative r2 estimates occasionally. It is normal for a small number of LD Scores to be negative (e.g., a few percent of all SNPs). This has minimal impact on the regression. If more than a few percent of all SNPs have negative LD Scores, this probably means that something is wrong, either that the sample size used for estimating LD Scores is too small or the window size used is too large.
Q. Why am I getting the error message
The .annot file must contain the same SNPs in the same order as the
.bim
file.
A. Because your .annot
file does not contain the same SNPs in the same order as your .bim
file. Merging files on SNP ID's (rs numbers) or chromosomal coordinates is a frequent source of bugs, so LDSC tries to avoid doing this whenever possible.
Q. Why am I getting the error message
Some SNPs have no annotation in
--cts-bin
. This is a bug!
A. This error message should never appear. Email me.
Q. Why am I getting the error message
Do you really want to compute whole-chomosome LD Score? If so, set the
--yes-really flag
(warning: it will use a lot of time / memory)"
A. Your --ld-wind-[snp | kb | cm]
option told ldsc
to compute whole-chromosome LD Scores, i.e., sum r2 with the sum taken over all the SNPs in your .bim file. This is more often a time-consuming mistake than the real intent of the user (whole-chromosome LD Score contains more information about population structure in the sample than anything else, and this information can be obtained much more rapidly via e.g., PCA). If you actually want to compute sum r2 with the sum taken over all SNPs in your .bim
file, set the --yes-really flag
, but be warned that if you .bed file is large, this could take a very
long time to finish running.
Q. Why are the entries in my l2.M
or .l2.M_5_50
file non-integers?
A. If you set any of the flags in the set {--per-allele
, --pq-exp
, --maf-exp
} or you passed a .annot
file with non-integer entries, then this is to be expected. For example, --per-allele
computes LD Scores as sumk pjk(1-pjk)r2k, where pk denotes the MAF of SNP k. The entry in the .M
file will be M = sumk pk(1-pk). ldsc
does not normalize by M in order to facilitate parallelization over chromosomes.
Q. Why am I getting the error message
WARNING: ill-conditioned LD Score Matrix!
A. This only applies to partitioned LD Scores. This error message means that some of the columns of the LD Score matrix are linearly dependent or are sufficiently close to linearly dependent (condition number > 100,000) that ldsc
will run into numerical difficulties when attempting to perform LD Score regression with these LD Scores. There are several reasons why this might occur: if the number of SNPs is smaller than the number of LD Scores, if the annot matrix has linearly dependent columns or if the number of categories is very large. Sometimes the LD Score matrix for one of the smaller chromosomes will be ill-conditioned, but the whole-genome LD Score matrix will be fine. Increasing the sample size can also help.
Q. What sample size do I need for LD Score regression?
A. As a rule of thumb, LD Score regression tends to yield very noisy results when applied to datasets with fewer than ~5k samples, even for univariate h2 estimation. One needs even larger sample sizes for asking more complicated questions, e.g., partitioned h2. If your sample size is so small that standard error is the limiting factor, but you happen to have individual genotype data, consider using REML (e.g., as implemented in GCTA).
Q. Why am I getting the error message
WARNING: One of the h2's was out of bounds.
A. This means that the slope of the LD Score regression was < 0. Since rg has sqrt(h2) in the denominator, LDSC cannot estimate rg if one of the heritabilities is nonnegative, because the square root of a negative number is not a real number. Negative heritabilites are not physically meaningful, but sometimes LD Score regression will return a negative slope, especially if the true value of h2 is low or the sample size is small.
Q. Why am I getting the error message
WARNING: rg was out of bounds.
A. ldsc
returns this error message whenever the rg estimate is above 1.25 or below -1.25. This usually means that one of the h2 estimates was very close to zero. Since h2 is in the denominator of rg, if h2 is close to zero, rg estimates can blow up. This can also occur if you have (incorrectly) constrained the gencov intercept. You can override this message with the --return-silly-things
flag
Q. Why is the rg standard error so high?
A. The usual culprits are low sample size and low h2 (since h2 is in the denominator of rg, low h2 makes rg hard to estimate). You may also run into problems if the number of regression SNPs is small. ldsc
will give a warning message if the number of regression SNPs is below 200,000, because this usually indicates some sort of merge error, but as a rule of thumb, the standard error may become large with fewer than ~600,000 SNP.
Q. How long does it take to compute rg and h2 with ldsc
?
A. The regression itself takes a few seconds. Reading files into memory and merging them can take a few minutes, especially if you are using a large number of regression SNPs and many partitioned LD Scores.
Q. Why is my total h2 estimate negative?
A. Negative h2 is of course not meaningful, but negative h2estimates can occur. This usually means that the true h2 is close to zero, and sampling error pushed the estimate below zero.
Check whether mean chi2 is above ~1.02. If no, this means there is very little polygenic signal for LDSC to work with.
If your h2 estimate is significantly negative and the mean chi2 is reasonable large, then something is wrong, either model specification or data processing. Check for warning messages and data munging errors.
Q. Why is my h2 estimate for a disease trait different from previous reports?
A. By default, ldsc
reports h2 on the observed scale. Most publications report h2 on the liability scale, since this allows for comparison across studies of the same disease with different proportions of cases and controls. You can use the --samp-prev
and --pop-prev
flags to convert to liability scale h2 and genetic covariance.
Q. Why is my h2 estimate above one?
A. If you are estimating h2 from an ascertained study of a highly heritable rare polygenic disease, then it is possible for the observed scale h2 to be above one. Quantitative trat h2 and liability scale h2 should never be above one. The most likely culprits are errors in your .M_5_50
file or the N
(sample size) column of your .sumstats
file. For example, if the entries in you N
column are half what they should be, then you will over-estimate h2 by a factor of two.
Q. Why is one of my partitioned h2 estimates negative?
A. If your partitioned h2 estimate is non-significantly negative, then this may just be sampling noise. If your partitioned h2 estimate is significantly negative, this may indicate model misspecification. Fitting well-specified models of partitioned heritability is hard. I hate to link to a paper in the middle of an FAQ, but this was the main technical challenge in the paper we wrote about partitioned heritability, so I don't know of a better resource. See Finucane, Bulik-Sullivan et al, 2015.
Q. I estimated partitioned LD Score regression using an annot file with overlapping categories. What do the h2 estimates mean?
A. You need to use the --overlap-annot
flag in order to obtain meaningful h2 estimates in this case. Consult the tutorial on partitioned h2 estimation in the wiki.
Q. Why is my single-trait LD Score regression intercept below one?
A. If the LD Score regression intercept is non-significantly less than one, If your summary statistics were generated from GC corrected data (even data that were only single GC corrected), the intercept should be less than one (see the supp note of Bulik-Sullivan et al., Nature Genetics, 2015). If your summary statistcs contain chi2 statistics for SNPs with low minor allele count, then the chi2 statistics for these SNPs may be deflated relative to the asymptotic distribution. Try filtering out low MAF SNPs. If mean chi2. is below one, ldsc
will not work properly.
Q. When should I constrain the intercept?
A. If you can rule out QC problems, such as inflation from population stratification. For genetic correlation, you should constrain the intercept if you can quantify sample overlap (or you are sure that there is no sample overlap).
Q. Why is my h2 estimate so low?
A. Double- or single-GC correction both bias the h2 estimate downwards. If you constrain the intercept to 1 with GC corrected data, this will increase the bias, because the intercept should be < 1 for GC corrected data.
Check to make sure that you have the right sample size in the N
column of your .sumstats
file. Most GWAS provide summary statistics from their discovery sample, but mention their replication sample size in the abstract, since the replication sample size is always larger and more impressive-sounding. If you mistakenly use the replication sample size in your chisq file, this will bias your h2 estimate downwards.
If you are comparing an h2 estimate from GWAS data to an h2 estimate from a twin study, you should be aware that these are different quantities. The h2 parameter estimated by twin studiesincludes the contributions of all genetic variation, not just common SNPs, and the twin study estimator of h2 is probably biased upwards (see Zuk, et al., PNAS 2012). Almost all h2 estimates from GWAS data have been much lower than h2 estimates from twin data.
Q. Why is the standard error for my h2 estimate so high?
A. The common culprits are small sample size, small number of regression SNPs, too many partitioned LD Scores. Try constraining the regression intercept.
Q. Why am I getting the warning message
WARNING: number of SNPs less than 200k; this is almost always bad.
A. Having fewer than 200k regression SNPs is often a sign that something went wrong when merging summary statistics and LD Score. With only 200k SNPs, the standard error tends to be very high.
Q. I used [immuno | metabo | psych | exome]-chip data and got funky results. What gives?
A. LD Score regression doesn't work well with targeted genotyping arrays. Figuring out how to deal with this is on our to-do list. For now, we recommend removing individuals who were genotyped on a specialty array from your dataset OR restricting to SNPs in the GWAS backbone of your specialty chip.
Q. I am estimating rg between the same trait using the same set of summary statistics twice, and ldsc
is not returning 1. What is wrong?
A. When study1 = study2, z1*z1 = chi2 and genetic covariance = 2, the true value of g is exactly one. The estimate from ldsc
will probably not be exactly one, because ldsc
uses a slightly different initial guess for fitting the single-trait regressions than for the cross-trait regressions (ldsc
fits the regression via iteratively reweighted least squares).
Q. I am estimating rg between different studies of the same trait, and ldsc
is not returning 1. What is wrong?
A. Differences in ancestry between studies could push the rg estimate down slighly. rg between different definitions of the same phenotype might not be one (which could be an interesting result). rg between a trait and a non-linear transformation of that trait (e.g., height and log(height) ) will generally not be 1.
Q. How come ldsc
is so fast even though you're estimating standard errors with a block jackknife? I thought jackknives were slow?
A. It turns out that certain statistics (including WLS estimates) admit fast block jackknife algorithms. Instead of deleting a block of k SNPs and computing the WLS estimate using the remaining M-k SNPs (which requires around (M/k)*(M-k)*k = M(M-k) operations), we compute a statistic on the block of k SNPs (which requires around k operations), then aggregate these block-wise statistics to obtain the delete-a-block statistics. Code is clearer than words, so check out ldscore.jackknife.py
.