Large number of non-biallelic SNPs #111
Comments
@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020? |
Sure! |
I have checked the MSS code and it seems to work as I expect it to. It appears that there are now just a lot more alternative alleles in Bioconductor's dbSNP 155 than in 144. See the example below, where dbSNP 144 finds no non-biallelic SNPs but 155 finds 48 out of a total of 93:
There is no GitHub page for Bioconductor's dbSNP 155, so I'll just message Herve to ask his opinion on what's going on. |
Ok, well that at least narrows things down. Thanks!
That would be great, thanks! |
I had a look at all the RS IDs from dbSNP 144 to check the proportion that are non-biallelic in dbSNP 155 versus 144. I used an approach separate from MSS to ensure the issue isn't there. This approach was given by @hpages (see below). It appears there are simply substantially more non-biallelic SNPs in 155. This doesn't seem to be an MSS issue and just reflects the changes between the versions. I'll post about this in https://github.com/hpages/SNPlocsForge in case anyone else has further insight, but I believe the field's focus on biallelic SNPs may need to change as the number of non-biallelic SNPs increases. ~24% is a large proportion of SNPs to exclude. I'll leave this issue open for now for others to see.
|
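For readers who want to reproduce this kind of check on their own data, here is a minimal Python sketch of the counting step only. It does not reproduce the approach given by @hpages, and the input structure (a mapping from RS ID to its list of alternative alleles) and the IDs in it are hypothetical, made up for illustration.

```python
def non_biallelic_fraction(alts_by_rsid):
    """Return the fraction of RS IDs that have more than one ALT allele.

    alts_by_rsid: dict mapping an RS ID to a list of its alternative
    alleles (hypothetical input shape, not a dbSNP or Bioconductor format).
    """
    if not alts_by_rsid:
        return 0.0
    n_multi = sum(1 for alts in alts_by_rsid.values() if len(alts) > 1)
    return n_multi / len(alts_by_rsid)


# Toy data with made-up IDs: one of four records has two ALT alleles.
toy_reference = {
    "rs_example_1": ["C"],
    "rs_example_2": ["A", "T"],
    "rs_example_3": ["G"],
    "rs_example_4": ["T"],
}
fraction = non_biallelic_fraction(toy_reference)
```

Running the same function on the allele lists from two reference versions would give the two proportions being compared.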
Posted on Biostars about this issue and got some replies too - https://www.biostars.org/p/9534249/#9534468 Generally, seems like people agree that the observed numbers are true |
I'm not convinced that dropping anything marked as multi-allelic in dbSNP is a sensible idea. As I understand it, the reasons for removing multi-allelic SNPs could be the following:
For point 1, it could well be valid to simply split the row into two rows, e.g.:
(Although in some cases having duplicated genomic location could be an issue.) For point 2, bi-allelic vs multi-allelic should only matter within the dataset. If a study sample includes only two alleles at a genomic location, then it doesn't matter that dbSNP has a third allele on record from a different dataset, as this doesn't affect anything within the study. |
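The row-splitting idea could be sketched roughly as follows. This is an illustrative Python sketch, not MSS code: the column names (`SNP`, `CHR`, `BP`, `REF`, `ALT`), the made-up IDs, and the assumption that multiple alternative alleles arrive as a comma-separated string are all hypothetical.

```python
def split_multiallelic(rows):
    """Expand each row whose ALT holds several comma-separated alleles
    into one row per alternative allele; other columns are copied as-is."""
    out = []
    for row in rows:
        for alt in row["ALT"].split(","):
            new_row = dict(row)   # shallow copy so the input is not mutated
            new_row["ALT"] = alt
            out.append(new_row)
    return out


# Toy rows: the first has two alternative alleles at the same position.
rows = [
    {"SNP": "rs_example_1", "CHR": "1", "BP": 1000, "REF": "A", "ALT": "C,T"},
    {"SNP": "rs_example_2", "CHR": "1", "BP": 2000, "REF": "G", "ALT": "A"},
]
split_rows = split_multiallelic(rows)
```

Note that the two rows produced from `rs_example_1` share the same genomic location, which is the duplication caveat mentioned above.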
I think those 2 points highly depend on the downstream analysis use case. I do agree though, I feel dropping non-bi-allelic SNPs is no longer a sensible option given the increasing numbers. I believe the options of splitting multi-allelic SNPs where the ALT contains more than one alternative allele to multiple rows and removing multi-allelic SNPs where more than one of the alternative alleles are in the sumstats itself are both very sensible. If you would like to have a go at adding functionality to do these to MSS as alternative multi-allelic approaches and submit a pull request that would be great! I think it would really benefit the wider community (I'd be happy to add you to the list of contributors too). |
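The second option, removing only SNPs where more than one alternative allele actually appears in the sumstats, might look something like this. Again this is a hedged Python sketch rather than anything implemented in MSS, and the column names (`CHR`, `BP`, `ALT`) and toy values are assumptions made for illustration.

```python
from collections import defaultdict


def drop_multiallelic_within_sumstats(rows):
    """Keep only rows at positions where the sumstats themselves contain
    a single distinct ALT allele, regardless of what a reference lists."""
    alts_at = defaultdict(set)
    for row in rows:
        alts_at[(row["CHR"], row["BP"])].add(row["ALT"])
    return [r for r in rows if len(alts_at[(r["CHR"], r["BP"])]) == 1]


# Toy sumstats: position 1000 carries two different ALT alleles in the
# data, so both of its rows are dropped; position 2000 is kept.
sumstats = [
    {"CHR": "1", "BP": 1000, "REF": "A", "ALT": "C"},
    {"CHR": "1", "BP": 1000, "REF": "A", "ALT": "T"},
    {"CHR": "1", "BP": 2000, "REF": "G", "ALT": "A"},
]
kept = drop_multiallelic_within_sumstats(sumstats)
```

The design choice here is that the reference database is never consulted: a position is treated as multi-allelic only if the study data say so.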
OpenGWAS report: using dbSNP155
Here's a sampling of 10,863 of the most well-powered GWAS in OpenGWAS.
Absolute number
The number of non-biallelic SNPs dropped ranged from 177 to 8.6 million (with a mean of 1.1 million) for a single GWAS.
Percent
As a percentage of the total SNPs in each GWAS to begin with: The average percentage of non-biallelic SNPs was 47%, ranging from ~0 to 63%.
Session info
R Under development (unstable) (2022-02-25 r81808)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
locale:
attached base packages:
loaded via a namespace (and not attached): |
Hi Murphy! I seem to be facing the same situation, where a huge number of SNPs get removed for being non-biallelic. I was wondering if there is now an argument I can specify to keep the non-biallelic SNPs. Thanks in advance! |
Sorry for my ignorance, non-biallelic SNPs can be kept by specifying the |
Hi! I'm sorry for writing in this old issue, but I am looking for a solution to this. I saw the conversation in the SNPlocsForge repo, which was concluded some time ago, but I failed to understand the solution. I have a high number of non-bi-allelic SNPs dropped when using MungeSumstats, with all up-to-date packages from Bioconductor 3.18. Is the explanation that this is simply due to higher coverage in more up-to-date dbSNP versions, and would the way forward be to drop the few individuals with a third (or further) allele instead of removing the whole SNP? Thanks for your help and the great work on this package! Cheers! |
Hey,
Yes, this is the latest release version of MSS.
That's one solution, another one would be to keep non-bi-allelic SNPs (set Cheers, |
Closing as I believe we have looked into this enough at this stage and have a work around in place (set |
I'm munging a bunch of OpenGWAS data with MSS, and I noticed that the number of variants dropped due to being non-biallelic was quite high (hundreds of thousands to millions). In some cases, up to half the original SNPs are being dropped because they’re non-biallelic.
Reprex
Example 1: ieu-a-8
In this case, a GWAS with 2.4M variants had 1.2M+ non-biallelic variants removed:
mss.txt
Example 2: Vuckovic2020
Here’s another where 5M/44M variants were considered non-biallelic:
mss_vuckovic.txt
Literature
Possible explanations
To do