FAQ
Welcome to the MungeSumstats wiki! Here we list frequently asked questions about formatting summary statistics with MungeSumstats, to help you run it in the way best suited to your data and needs.
MungeSumstats can be used for the standardisation and quality control of most summary statistics from GWAS (and from xQTL studies). To run MungeSumstats, download the latest release version from Bioconductor (BiocManager::install("MungeSumstats")) and run MungeSumstats::format_sumstats(path_to_sumstats). This makes MungeSumstats standardise the column headers using a mapping file, map to a different genome build if required, and perform many checks, such as whether each SNP's reference allele is found on the reference genome. More information on all quality control checks and their defaults is given in our getting started vignette.
Although MungeSumstats is designed to work by default for most users, there are cases where certain changes to these defaults may be better for particular users. Moreover, it can be daunting to get through all the documentation when you have a specific question about a use case. As such, we have collated below the FAQs which have come up most often since we released MungeSumstats.
1. Do I need to reformat my column headers before running MungeSumstats?
The answer, for most cases, is no. MungeSumstats can handle many column name formats without issue (it is case insensitive and has mappings for all common synonyms, e.g. BP, Pos, Position and Base Pair Position all represent the genomic location of a SNP). However, if you run MungeSumstats and it does not recognise a column header which corresponds to one of
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ,
you have two options. You can update the mapping file data("sumstatsColHeaders") yourself, following the approach in the data.R file to add the necessary mapping, and then open a Pull Request on GitHub so we can incorporate the change into the package. Or you can create an issue explaining the missing mapping with a small, executable example and we will add it for you.
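To make the idea of a header mapping concrete, here is a minimal base R sketch. It does not use the package's own table: toy_mapping and standardise_headers() are hypothetical stand-ins, and the Uncorrected/Corrected column names are an assumption about how such a table could be laid out.

```r
# Toy stand-in for a header mapping table (hypothetical rows, not the shipped data)
toy_mapping <- data.frame(
  Uncorrected = c("BP", "POS", "POSITION", "BASE PAIR POSITION", "RSID", "PVAL"),
  Corrected   = c("BP", "BP",  "BP",       "BP",                 "SNP",  "P"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map raw headers to standard names, ignoring case.
# Headers without a mapping are only converted to uppercase.
standardise_headers <- function(headers, mapping) {
  upper <- toupper(headers)
  idx <- match(upper, mapping$Uncorrected)
  ifelse(is.na(idx), upper, mapping$Corrected[idx])
}

standardised <- standardise_headers(c("rsid", "Pos", "pval", "my_col"), toy_mapping)
print(standardised)
```

A header that falls through unmapped, like my_col here, is the kind of case worth reporting if it really corresponds to one of the standard columns.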
One caveat: it may be best to reformat the column headers if your summary statistics file contains the columns A1 and A2 for the effect and non-effect alleles. The issue with this naming is that different groups interpret A1/A2 differently: for one group A2 may be the effect allele, and for another it may be A1. MungeSumstats has checks for this, discussed in more detail in Q.13, but if you know which column is which, it is best to rename them to effect_allele/non-effect_allele so no misinterpretation can happen. When MungeSumstats formats your file, A2 will be the effect allele and A1 will be the non-effect allele.
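As a rough illustration of the renaming, the base R sketch below assumes a hypothetical study whose documentation states that A1 is the effect allele. The data values are made up for the example.

```r
# Hypothetical input: the study documentation says A1 is the effect allele
sumstats <- data.frame(
  SNP  = c("rs1", "rs2"),
  A1   = c("A", "C"),
  A2   = c("G", "T"),
  BETA = c(0.12, -0.05),
  stringsAsFactors = FALSE
)

# Rename the ambiguous headers to unambiguous ones before formatting
names(sumstats)[names(sumstats) == "A1"] <- "effect_allele"
names(sumstats)[names(sumstats) == "A2"] <- "non-effect_allele"

print(names(sumstats))
```

The renamed table can then be saved to file and its path passed to MungeSumstats::format_sumstats().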
2. Does MungeSumstats use a set convention for A1 and A2 in its output?
Yes! MungeSumstats uses the same allele naming convention as GWAS VCF and IEU GWAS, where A2 is the effect allele and A1 is the non-effect allele (usually, but not always, the reference allele). See more about this in our publication. We advise sticking to this output format to enable some level of cohesion across GWAS summary statistics.
3. MungeSumstats removed a lot of SNPs from my summary statistics. What should I do?
If this happens to you, look into what caused the SNPs to be removed. MungeSumstats reports exactly how many SNPs are removed after each quality control step and, at the end, the overall proportion of SNPs remaining. Some default checks can remove large proportions of SNPs, such as the INFO quality filter at 0.9 (MungeSumstats::format_sumstats(..., INFO_filter = 0.9)) or the removal of non-bi-allelic SNPs (MungeSumstats::format_sumstats(..., bi_allelic_filter = TRUE)). More on non-bi-allelic SNPs and what to do with them in Q.8.
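Before changing a default, it can help to estimate how many SNPs a threshold would cost you. The base R sketch below does this for an INFO threshold on made-up data; it is only a pre-check under the assumption that SNPs with INFO below the threshold are the ones removed, not a reproduction of the package's filter.

```r
# Made-up summary statistics with an imputation INFO score per SNP
sumstats <- data.frame(
  SNP  = paste0("rs", 1:5),
  INFO = c(0.99, 0.95, 0.85, 0.40, 0.91)
)

info_threshold <- 0.9

# SNPs below the threshold would be lost at this step
n_removed <- sum(sumstats$INFO < info_threshold, na.rm = TRUE)
prop_remaining <- 1 - n_removed / nrow(sumstats)

message(n_removed, " SNPs fall below INFO ", info_threshold,
        "; proportion remaining: ", prop_remaining)
```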
4. Can I use MungeSumstats to have my summary statistics formatted for use with LDSC?
Yes! MungeSumstats standardises summary statistics, which makes them usable for a whole host of downstream analyses. If your goal is to run LDSC or one of its variants like s-LDSC, just set MungeSumstats::format_sumstats(..., save_format="LDSC"). This updates the parameters for the run so that the output is compatible with the LDSC format. One of the changes under this setting is to the A1 and A2 columns on output: A1 will now be the effect allele and A2 the non-effect allele. Usually MungeSumstats reports these the opposite way round; see Q.2 for more on MungeSumstats' usual A1/A2 output.
5. Which genome builds and dbSNP versions does MungeSumstats support?
MungeSumstats works for GRCh37 and GRCh38 data and uses dbSNP to validate SNPs. The dbSNP versions available are 144 and 155. To change the version used, set MungeSumstats::format_sumstats(..., dbSNP = 155). Note that we plan to let users input their own dbSNP build, but this has not yet been implemented. Keep up to date on this here.
6. Is the FRQ column the minor allele frequency (MAF) or the effect allele frequency (EAF)?
MungeSumstats represents the effect allele frequency (EAF) as FRQ in its output. In some cases EAF and the minor allele frequency (MAF) are the same: this happens when none of the alleles being tested is the one most commonly found in the population at its position (MAF = EAF). However, some SNPs in a GWAS may test the most frequently found allele (the reference/major allele). MungeSumstats allows for this: it does not assume MAF = EAF and imposes no hard constraints on FRQ. It will, however, print a message to let the user know how many SNPs have FRQ values that likely relate to the major allele, for example:
427,945 SNPs (4.6%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
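The count behind a message like this can be checked by hand. The base R sketch below uses made-up FRQ values and only illustrates the arithmetic; it is not the package's own code.

```r
# Made-up effect allele frequencies
sumstats <- data.frame(
  SNP = paste0("rs", 1:8),
  FRQ = c(0.02, 0.10, 0.45, 0.51, 0.73, 0.30, 0.05, 0.50)
)

# SNPs whose FRQ is strictly above 0.5 likely describe the major allele
n_major <- sum(sumstats$FRQ > 0.5, na.rm = TRUE)
pct_major <- round(100 * n_major / nrow(sumstats), 1)

message(sprintf("%d SNPs (%.1f%%) have FRQ values > 0.5.", n_major, pct_major))
```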
7. How does MungeSumstats handle allele discrepancies? What if one SNP in my file has mislabeled alleles? Does it flip alleles?
Once MungeSumstats has inferred the effect and non-effect alleles, it performs a SNP-level check to ensure every SNP's reference allele matches the reference genome. For SNPs whose reference allele does not match but whose effect allele does (for example, where the allele being tested is the major allele), MungeSumstats flips the alleles and their effect columns to match the other SNPs (MungeSumstats::format_sumstats(..., allele_flip_check = TRUE)).
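The flipping logic can be sketched in base R on a toy table. This is a simplified illustration, not the package's implementation: the REF column is a hypothetical stand-in for a reference genome lookup, and only BETA is flipped here, whereas other effect columns would need the same treatment.

```r
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "T", "C"),   # non-effect allele, expected to match the reference
  A2   = c("G", "C", "T"),   # effect allele
  BETA = c(0.10, -0.20, 0.05),
  REF  = c("A", "C", "C"),   # hypothetical reference base at each position
  stringsAsFactors = FALSE
)

# A SNP needs flipping when A1 misses the reference but A2 matches it
needs_flip <- sumstats$A1 != sumstats$REF & sumstats$A2 == sumstats$REF

# Swap the alleles and reverse the sign of the effect for those SNPs
old_a1 <- sumstats$A1
sumstats$A1[needs_flip] <- sumstats$A2[needs_flip]
sumstats$A2[needs_flip] <- old_a1[needs_flip]
sumstats$BETA[needs_flip] <- -sumstats$BETA[needs_flip]

print(sumstats)
```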
8. What are non-bi-allelic SNPs, and should I remove them?
Non-bi-allelic SNPs are SNPs at a location of the genome where more than one alternative base has been seen in a population. Unsurprisingly, as we sequence more people from diverse populations, more and more non-bi-allelic SNPs are found, so later versions of dbSNP contain more of them. MungeSumstats by default uses a very recent version of dbSNP, 155. By default, MungeSumstats removes non-bi-allelic SNPs (MungeSumstats::format_sumstats(..., bi_allelic_filter = TRUE)) as downstream analysis tools often require this. However, given the growing number of these SNPs, I don't believe this is a sensible choice if your downstream analysis does not require it; have a look at the MungeSumstats output to see just how many SNPs you lose at this step. I would keep these SNPs unless it is completely necessary to remove them. (A lot) more on this here.
9. Can I use MungeSumstats with summary statistics from non-European populations?
Yes! Overall, MungeSumstats is suitable for use with GWAS summary statistics from any ancestry/population. The only place an issue could arise is in the reference datasets it uses. These are:
- Reference genomes GRCh37/GRCh38, for checks such as whether the SNP is found on the reference genome (reference allele) and for inferring the genome build
- dbSNP (differing versions), for checks such as whether the SNP is present in the dbSNP version of interest (based on RS ID/position and reference/alternative allele) and for imputing any missing data for the SNP
In practice this will not cause an issue, as both the reference genome and the dbSNP databases account for non-European populations. If you want to be especially careful, use the latest version of dbSNP we have, v155 (the default), and read the MungeSumstats output messages to make sure a lot of SNPs aren't removed because they aren't found.
See more on this here and here.
10. Can I lift over my summary statistics to a different genome build?
Yes! You can do this while standardising and performing quality control checks with MungeSumstats::format_sumstats(..., convert_ref_genome='GRCh38'), which will convert your data to GRCh38 if it is not already GRCh38. Or you can just lift over your data without performing any other checks:
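# Load the example summary statistics shipped with MungeSumstats
sumstats_dt <- MungeSumstats::formatted_example()
# Lift over from hg19 to hg38 without running any other checks
sumstats_dt_hg38 <- MungeSumstats::liftover(sumstats_dt = sumstats_dt,
                                            ref_genome = "hg19",
                                            convert_ref_genome = "hg38")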
Note that MungeSumstats uses chain files from either UCSC or Ensembl to perform liftover. These require an internet connection to be downloaded and cached the first time. The source is set with MungeSumstats::format_sumstats(..., chain_source='ucsc') ("ucsc" or "ensembl"). The UCSC chain files require a license for commercial use, so the Ensembl chain is used by default. Local chain files can also be used, specifying the path with MungeSumstats::format_sumstats(..., local_chain='some_path').
11. Can MungeSumstats infer the genome build of my summary statistics?
Yes! If you leave MungeSumstats::format_sumstats(..., ref_genome = NULL), MungeSumstats will infer the build from your data.
12. How does MungeSumstats flip allele frequencies, including for multi-allelic SNPs?
In Q.7 we explained how MungeSumstats can flip alleles and effect columns for SNPs if necessary. It also tries to flip the frequency values. This works fine for bi-allelic SNPs, where FRQ_new = 1 - FRQ_old. However, this is not possible for multi-allelic SNPs, so by default MungeSumstats does not flip their frequencies. This can be annoying, especially where the other alternative allele is very rare. Thus, we added the parameter MungeSumstats::format_sumstats(..., flip_frq_as_biallelic=TRUE), which flips multi-allelic SNPs as if they were bi-allelic (FRQ_new = 1 - FRQ_old). See more on this here.
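The frequency rule can be sketched in base R as follows. The flipped and multi_allelic columns and the flip_frq() helper are hypothetical, introduced only for this illustration. The sketch simply leaves the frequency of multi-allelic SNPs untouched by default; how the package itself treats those rows is not modelled here.

```r
sumstats <- data.frame(
  SNP           = c("rs1", "rs2", "rs3"),
  FRQ           = c(0.20, 0.70, 0.10),
  flipped       = c(TRUE, FALSE, TRUE),   # hypothetical: alleles were flipped
  multi_allelic = c(FALSE, FALSE, TRUE),  # hypothetical: more than one alternative allele
  stringsAsFactors = FALSE
)

# Hypothetical helper applying FRQ_new = 1 - FRQ_old to flipped SNPs.
# Multi-allelic SNPs are only included when flip_frq_as_biallelic is TRUE.
flip_frq <- function(dt, flip_frq_as_biallelic = FALSE) {
  to_flip <- dt$flipped & (!dt$multi_allelic | flip_frq_as_biallelic)
  dt$FRQ[to_flip] <- 1 - dt$FRQ[to_flip]
  dt
}

default_out <- flip_frq(sumstats)
forced_out  <- flip_frq(sumstats, flip_frq_as_biallelic = TRUE)

print(default_out$FRQ)
print(forced_out$FRQ)
```

Treating a multi-allelic SNP as bi-allelic is only a good approximation when the remaining alternative alleles are rare, which is why it is opt-in.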
13. How does MungeSumstats infer the effect allele when the allele columns are named ambiguously (e.g. A1/A2)?
In Q.1, we mentioned that it is better to rename the allele columns from A1/A2 if the effect and non-effect columns are known, as this is the safest way to ensure correct labelling. However, MungeSumstats does have a fairly in-depth approach to inferring these columns even in the face of ambiguous naming like this.
Firstly, this approach will only be used if ambiguous naming is used (A0, A1, A2 for alleles). Otherwise, the user-defined summary statistics column headers will be used to infer the alleles and thus could be incorrect. Note that this can be changed so that the inference is enforced all the time, if you are certain that for the majority of SNPs in your GWAS sumstats the minor alleles are the effect alleles (see Q.6). If so, you can set MungeSumstats::format_sumstats(..., eff_on_minor_alleles=TRUE). The benefit of this is that you can catch cases where a summary statistics file has simply had its effect and non-effect columns mislabelled (mixed up). See more on this here.
MungeSumstats makes three checks to infer which allele the effect/frequency information relates to:
- Check 1: Check if ambiguous naming conventions are used (i.e. allele 0, 1 and 2 or equivalent). If none are, exit; otherwise continue to the next checks. This can be checked using the mapping file, by splitting the A1/A2 mappings into those that contain 0, 1 or 2 (ambiguous) and those that don't, e.g. effect, tested (unambiguous, so fine for MungeSumstats to handle as is).
- Check 2: Look for an effect or frequency column where A0/A1/A2 is explicitly mentioned. If one is found, we know the direction and update the A0/A1/A2 naming so that A2 is the effect column. Such columns can be found by trying every combination of A0/A1/A2 naming and effect/frq naming.
- Check 3: If nothing is found in check 2, a final check is made against the reference genome: whichever of A0, A1 and A2 matches the reference genome more often is taken as not the effect allele. This relies on an assumption, but it is still better than guessing from the ambiguous allele naming.
Also, if eff_on_minor_alleles=TRUE, check 3 will be used in all cases and checks 1 and 2 will be skipped. This assumes that the effects are mostly measured on the minor alleles, which won't be appropriate in all cases, so use it with caution. The benefit is that, if we know the majority of SNPs have their effects based on the minor alleles, we can catch cases where the allele columns have been mislabelled.
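The three checks can be sketched in base R as below. This is a deliberately simplified illustration, not the package's implementation: all three helper functions are hypothetical, the STAT_ALLELE header pattern (e.g. BETA_A1) is an assumed naming scheme, and the reference bases are passed in as a plain vector rather than looked up on a genome.

```r
# Check 1 (simplified): an allele header is ambiguous if it contains 0, 1 or 2
is_ambiguous <- function(headers) grepl("[012]", headers)

# Check 2 (simplified): look for effect/frequency headers that name an allele,
# by building every combination of statistic and allele name
find_allele_in_effect_cols <- function(headers,
                                       alleles = c("A0", "A1", "A2"),
                                       stats = c("BETA", "FRQ", "OR", "Z")) {
  combos <- expand.grid(stat = stats, allele = alleles, stringsAsFactors = FALSE)
  candidates <- paste(combos$stat, combos$allele, sep = "_")
  hit <- combos$allele[match(toupper(headers), candidates)]
  unique(hit[!is.na(hit)])
}

# Check 3 (simplified): the allele column that matches the reference more often
# is taken as the non-effect allele
non_effect_by_reference <- function(a1, a2, ref) {
  if (sum(a1 == ref) >= sum(a2 == ref)) "A1" else "A2"
}

ambiguous <- is_ambiguous(c("A1", "A2", "ALLELE0", "EFFECT_ALLELE", "TESTED_ALLELE"))
named_allele <- find_allele_in_effect_cols(c("SNP", "A1", "A2", "beta_A1", "P"))
non_effect <- non_effect_by_reference(
  a1  = c("A", "C", "G"),
  a2  = c("G", "T", "A"),
  ref = c("A", "C", "A")
)

print(ambiguous)
print(named_allele)
print(non_effect)
```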
14. What happens to extra columns that MungeSumstats does not standardise?
Extra columns that are not listed in the columns to standardise (data("sumstatsColHeaders")) will not affect any processing of the GWAS summary statistics. These columns are simply ignored and returned on output, although their names are converted to uppercase, which is standard for MungeSumstats (MSS) column processing. Note that MSS will also not drop rows with missing values in these extra columns. If you want this functionality, you can add the column names to the vector passed to drop_na_cols in the format_sumstats() call, or just set drop_na_cols=NULL, which will check every input column for missing values. Note though that drop_na_cols=NULL may not be sensible in a lot of cases.
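The effect of widening the set of columns checked for missing values can be seen in a small base R sketch. The drop_na_rows() helper and the MY_EXTRA column are hypothetical and only mimic the behaviour described above; they are not the package's code.

```r
sumstats <- data.frame(
  SNP      = c("rs1", "rs2", "rs3"),
  BETA     = c(0.1, NA, 0.3),
  MY_EXTRA = c("x", "y", NA),   # hypothetical extra column
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows with missing values in the chosen columns.
# cols = NULL means every column is checked.
drop_na_rows <- function(dt, cols = NULL) {
  if (is.null(cols)) cols <- names(dt)
  dt[stats::complete.cases(dt[, cols, drop = FALSE]), , drop = FALSE]
}

core_only  <- drop_na_rows(sumstats, cols = c("SNP", "BETA"))
with_extra <- drop_na_rows(sumstats, cols = c("SNP", "BETA", "MY_EXTRA"))
all_cols   <- drop_na_rows(sumstats)

print(core_only$SNP)
print(with_extra$SNP)
print(all_cols$SNP)
```

Checking every column removes any SNP with a gap in an annotation column that has no bearing on the analysis, which is why the broad setting is often not what you want.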
If you use MungeSumstats, please cite the original authors of the GWAS as well as:
Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics. Bioinformatics, btab665, https://doi.org/10.1093/bioinformatics/btab665