-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add field to signify which allele is the reference #4
Comments
I very much agree with @jdhayhurst on this point. This might seem at first as a minor issue, but it really seems like a limitation of the GWAS-SSF standard in my opinion. When looking at the high coverage 1000 Genomes reference panel, there are actually >700k indel pairs with reference and alternate alleles swapped, corresponding to the two possible type of indel variants, insertions and deletions. This represents a typical scenario at homopolymer runs and STR loci and, as our ability to correctly assess these loci increases, this is going to become an increasingly larger problem As early as 2014 I had noticed that this unordered way to represents the two alleles at a locus like PLINK1 does (see here) was flawed and I believed that the VCF standard was partly driven by having to address these scenarios. Indeed, PLINK2 has now a framework to remember which allele is the reference allele I would strongly encourage to recommend the genetic community to adopt a summary statistics standard built on top of VCF. The GWAS-VCF standard seemed a much better option in my opinion It has several advantages over GWAS-SSF:
Once a format like GWAS-VCF is agreed and adopted, I think developers of software tools such as PLINK2 or REGENIE could be convinced to allow generation of summary statistics in such a format |
I agree with the points raised by @freeseek and not because I am one of the proponents of the GWAS-VCF format ;) I would be ecstatic with the wide adoption of any sumstats format but, in my opinion, one of the major difficulties with the absence of such standardized format has been the ambiguity regarding what is ref and what is alt (or min vs major) in combination with what is effect vs non-effect allele. |
I also generally agree that this should be built on top of the VCF spec rather than kind-of duplicate it, but differently. VCF has well-supported tooling and has already solved most of the variant representation issues. This spec could aim to constrain some aspects of the VCF spec to fit with the existing proposal e.g. requiring the use of contig names 1-25 instead of the slightly more variable (chr)[1-22,X,Y,M,MT] 'traditional' names or requiring splitting multiple alternate alleles onto separate lines along with defining specific keys required/recommended for the What are the benefits of this over GWAS-VCF? This is a genuine question, as I have no vested interest (or in-depth knowledge) in either this or GWAS-VCF! |
Thanks for the discussion and helpful contributions. We have now amended the spec to include variant ID in the format chr_bp_ref_alt, and added an additional column to specify whether the effect or other allele is REF. We haven’t mandated that submitters fix their data with EA=ALT. This is because our main goal in GWAS-SSF is to encourage data sharing (bearing in mind that still only around 30% of GWAS publications share data at all). We know that many data generators (rightly or wrongly) prefer to use EA=minor allele or other allele orders, and we wanted the format to be flexible enough to handle legacy data without presenting further overheads/barriers to data sharing. However in the GWAS Catalog’s submission documentation accompanying GWAS-SSF we now strongly recommend that EA is fixed to ALT at the point of data generation. Of note, the GWAS Catalog also provides data in a harmonised format which includes harmonisation to the reference allele. @julesjacobsen The current state of the field is that many summary statistics files are lacking key information (particularly effect allele, EAF) which hinder downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and encourage data sharing. We believe at this point in time, the community will benefit from definition of these data fields which can be applied to the simple tsv format described here (which is the GWAS Catalog’s implementation), or GWAS-VCF, or any other file format. We are updating the manuscript to focus on the data content and make this clearer. |
@ljwh2 thank you for the update. I would like to share another example of why I think not requiring the reference allele in the file format is a bad idea for a file standard. Take the latest Alzheimer disease GWAS. The summary statistics are shared as a GWAS-SSF and the most strongly associated variant is the following:
Now, if you look at the 1000 Genomes project high coverage reference panel you can find two variants at position
Which variant is the one including the effect allele? Both variants have rsID rs747519137 in dbSNP. It's impossible to tell and this problem is just going to get worse as users switch to running GWAS using the latest imputation reference panels that have numerous cases of variants like these. Now, I can sympathize with the fact that the problem originates at the point of data generation, with software like PLINK largely to blame, and so the GWAS Catalog has limited influence over this. But allowing a file format to be adopted that does not mandate this will inevitably lower the pressure for the authors of the softwares from which the summary statistics originate to fix the problem. Ideally, you would want a file format to be adopted by software like PLINK, but if recording which one is the reference allele is not mandated, the problem is likely to persist. What would be the incentives for users to do what they can to keep this additional information and do what they can to make sure it is correct? Also, what does it mean that the GWAS Catalog provides data in a harmonised format which includes harmonisation to the reference allele? How would the reference allele be determined in situations like the example presented here? |
Note that PLINK itself will be updated; this should be taken into account when analyzing the relative merits of these options. |
I guess I meant the original PLINK software and the BIM file format, rather than the current implementation of PLINK, which did not require to specify which allele is the reference and which allele is the alternate. This is likely due to the fact that the first GWASs were restricted to SNPs and did not have to deal with indels. |
Note that PLINK 1.x also did not include effect-allele-frequency in its output, which is a required field. So anyone who wants to submit a GWAS-SSF file based on an old PLINK 1.x analysis will need to do some additional work. A primary goal with my current PLINK 2.0 update is to make it easy for scientists in this boat to submit the most useful thing they realistically can. There are two scenarios to balance here: (i) the scientist has easy access to a VCF file which spells out the REF/ALT alleles for these otherwise-ambiguous indels, and (ii) they do not. In the second scenario, it is better for the scientist to submit 99.99%-unambiguous results than to submit nothing at all. But we want to nudge as many scientists as possible in the first scenario to generate results with correctly-annotated REF/ALT alleles, rather than to just use the workflow designed to allow scenario-2 scientists to submit anything at all. To address this, I will modify plink2 --gwas-ssf so that, if an ambiguous indel is present, the command errors out unless the user manually overrides. |
Implementation: chrchang/plink-ng@c961b6a Note that there is still an ambiguous case that comes up for multiallelic variants where REF is neither 'effect' nor 'other'. |
100% agree with this @chrchang
@freeseek I think we can help to address this with messaging and education. Minor allele can make more intuitive sense to a biologist, especially if they're used to working with SNVs only, or looking more broadly for loci for followup study rather than specific variants. It hasn't occurred to them that there is an issue with not using ref/alt. We'll address this with recommendations in the paper and the documentation. I also propose we review in a year or so to see how effective that is, and we can consider mandating it in future. |
@chrchang thank you for addressing this issue in the |
@jdhayhurst just tagging you for the harmonisation pipeline question |
Messaging, education and recommendations (i.e., relying on good will without strong incentives) clearly does not work. Much messaging, education, and recommendations has been done for a decade or longer on NHST, reproducibility, etc. and not much has changed. Enforcement of unambiguous and robust standards for submission is the only way to go in my opinion. I totally agree with @freeseek on this. |
I think we are in agreement on the problem, but differ on the best method to achieve the goal. In our experience, the path of least resistance is for data generators to not share the data at all (Nature & PloS journals have data sharing mandates, but other journals do not - less than 30% of GWAS papers share data at the point of publication) or to stick it on an internal website or FigShare in a non standard format, rather than submit to a repository. Thus we think the best approach is to start with messaging and work towards a mandate. The GWAS Catalog provides harmonised files alongside the submitted version which have been oriented to the reference strand & mapped to the same genome build (these are the files that are ingested by OpenGWAS, for example). This does not solve the indel problem, of course - if the alleles cannot be oriented, the variant is not harmonised - but does take some of the strain for bioinformaticians. Harmoniser repo is here: https://github.com/EBISPOT/gwas-sumstats-harmoniser, and useful info here: https://github.com/opentargets/genetics-sumstat-harmoniser/blob/master/flowchart_v3.svg |
We need to be able to determine which is the reference allele.
For example if you have a ref sequence of ATGTGA and at position 3 you have an EA of GT and an OA of G, you have no way of knowing which is ref. The trouble with that is that the variant is ambiguous.
If the EA is the ref: the OA is a deletion wrt ref seq --> effect sequence: ATGTGA
If the OA is the ref: the EA is an insertion wrt ref seq. --> effect sequence: ATGTTGA
Since you have no way of knowing if the EA sequence is ATGTGA or ATGTTGA, it's ambiguous. Yes, you know with respect to the other allele that there is an additional T, but without knowing which is the reference, you cannot actually know the variant.
Suggestions:
The text was updated successfully, but these errors were encountered: