Large number of non-biallelic SNPs #111

bschilder · 2022-08-04T17:15:10Z

I'm munging a bunch of OpenGWAS data with MSS, and i noticed that the number of variants dropped due to being non-biallelic was quite high (hundreds of thousands-millions). In some cases, up to half the original SNPs are being dropped because they’re non-biallelic.

Reprex

gwas_paths <- MungeSumstats::import_sumstats(
  ids = "ieu-a-8", 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"), 
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  )

Example 1: ieu-a-8

In this case, a GWAS with 2.4M variants had 1.2M+ non-bilallelic variants removed:
mss.txt

Example 2: Vuckovic2020

Here’s another where 5M/44M variant were considered non-biallelic:
mss_vuckovic.txt

Literature

Campbell2016: Notes that even at the higher estimates we’d only expect multi-allelic sites in 3% of SNPs.
Phillips2020: Found only 0.3% of SNPs in 1KG Phase 3 were tri-allelic.
Phillips2016: Found only 961 SNPs in 1KG Phase 3 were tetra-allelic.

Possible explanations

MSS is over-attributing multi-allelic status to variants somehow,
The updates to Bioc's dbSNP ref contains lots more reported multi-allelic sites that in previous versions.
Something went awry when making the latest Bioc dbSNP dataset.
This is the true number of multi-allelic variants and it is far above the estimates from single-cohort studies (e.g. 1KGp3) bc it incorporates info from across many different datasets.

To do

Double check relevant MSS code. @Al-Murphy
Compare number of multi-allelic SNPs in old vs new Bioc dbSNP datasets. @Al-Murphy
Check the %/# non-ballelic variants that were filtered in previous versions of MSS. @bschilder
Check the %/# non-ballelic variants that were filtered using the latest version of MSS. @bschilder

Al-Murphy · 2022-08-04T19:00:07Z

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

bschilder · 2022-08-04T19:23:21Z

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

Sure!

Al-Murphy · 2022-08-04T19:27:56Z

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

#MSS showing a lot is non-biallelic dbSNP 155 vs 144
sumstats_dt <- MungeSumstats::formatted_example()
rsids <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                              ref_genome = "GRCH38",
                                              dbSNP=144)
rsids155 <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                                 ref_genome = "GRCH38",
                                                 dbSNP=155)

# get chars for SNPs not bi/tri allelic
# or strand ambig from IUPAC_CODE_MAP
nonambig_IUPAC_CODE_MAP <-
  names(Biostrings::IUPAC_CODE_MAP[nchar(
    Biostrings::IUPAC_CODE_MAP
  ) < 3])

data.table::setkey(rsids, SNP)
data.table::setkey(rsids155, SNP)
bad_ids <- rsids[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids))
bad_ids155 <- rsids155[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids155))
print(bad_ids155)

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

bschilder · 2022-08-04T19:34:41Z

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

Ok, well that at least narrows thing down. Thanks!

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

That would be great, thanks!

Al-Murphy · 2022-08-04T20:40:10Z

I randomly picked 3 SNPs from my example code above and checked them in NCBI dbSNP online portal (dbSNP 155) and they all appear to be non-biallelic as described:

rs9824539
rs9556958
rs7854982

I know this is by no means definitive but does seem like it is doing things correctly

Al-Murphy · 2022-08-08T13:06:35Z

Had a look into all the RS ID's from dbSNP 144 to check the proportion that are non-biallelic across the dbSNP 155 database and 144. I used an approach separate to MSS to ensure the issue isn't there. This approach was given by @hpages (see below).

It appears there is just a substantial number more non-biallelic SNPs in 155. This doesn't seem to be a MSS issue and just reflects the changes in the versions.

I'll post about this in https://github.com/hpages/SNPlocsForge in case anyone else has further insight but I believe the field's focus on biallelic SNPs may need to change as the number of non-biallelic SNPs increase. ~24% is a large proportion of SNPs to exclude.

I'll leave this issue open for now for others to see.

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

Al-Murphy · 2022-08-12T07:24:45Z

Posted on Biostars about this issue and got some replies too - https://www.biostars.org/p/9534249/#9534468

Generally, seems like people agree that the observed numbers are true

ilevantis · 2022-09-02T15:38:43Z

I'm not conviced that dropping anything marked as multi-allelic in dbSNP is a sensible idea. As I understand it, the reasons for removing multi-allelic SNPs could be the following:

The downstream tool cannot handle a table row defining a SNP with multiple alternate alleles.
The downstream tool assumes all input is bi-allelic and calculates things such as allele frequency as 1 - effect_allele_frq or uses a statistical model that would have assumptions violated by multi-allelic variants.

For point 1, it could well be valid to simply split the row into two rows f.e.g:

POS   REF   ALT
123   A     G
123   A     T

(Although in some cases having duplicated genomic location could be an issue.)

For point 2, bi-allelic vs multi-allelic should only matter within the dataset. If a study sample includes only two alleles at a genomic location, then it doesn't matter that dbSNP has a third allele on record from a different dataset, as this doesn't affect anything within the study.

Al-Murphy · 2022-09-02T16:00:12Z

I think those 2 points highly depend on the downstream analysis use case. I do agree though, I feel dropping non-bi-allelic SNPs is no longer a sensible option given the increasing numbers.

I believe the options of splitting multi-allelic SNPs where the ALT contains more than one alternative allele to multiple rows and removing multi-allelic SNPs where more than one of the alternative alleles are in the sumstats itself are both very sensible. If you would like to have a go at adding functionality to do these to MSS as alternative multi-allelic approaches and submit a pull request that would be great! I think it would really benefit the wider community (I'd be happy to add you to the list of contributors too).

bschilder · 2022-09-17T10:47:29Z

OpenGWAS report: using dbSNP155

Here's a sampling of 10,863 of the most well-powered GWAS in OpenGWAS.

Absolute number

The number of non-biallelic SNPs dropped ranged from 177 to 8.6 million (with a mean of 1.1 million) for a single GWAS.

Percent

As a percentage of the total SNPs in each GWAS to begin with:

The average percentage of non-biallelic SNPS was 47%, ranging from ~0 to 63%.

Session info

R Under development (unstable) (2022-02-25 r81808) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a [1] rappdirs_0.3.3 [4] R.methodsS3_1.8.2 [7] ggplot2_3.3.6 [10] irlba_2.3.5 [13] data.table_1.14.2 [16] KEGGREST_1.37.3 [19] BiocGenerics_0.43.1 [22] cowplot_1.1.1 [25] proxy_0.4-27 [28] webshot_0.5.3 [31] httpuv_1.6.5 [34] orthogene_1.3.2 [37] scNLP_0.1.0 [40] TSP_1.2-1 [43] promises_1.2.0.1 [46] progress_1.2.2 [49] caTools_1.18.2 [52] DBI_1.1.3 [55] spatstat.geom_2.4-0 [58] ellipsis_0.3.2 [61] backports_1.4.1 [64] biomaRt_2.53.2 [67] MungeSumstats_1.5.13 [70] Biobase_2.57.1 [73] abind_1.4-5 [76] progressr_0.11.0 [79] treeio_1.21.2 [82] cluster_2.1.4 [85] ape_5.6-2 [88] pkgconfig_2.0.3 [91] nlme_3.1-159 [94] globals_0.16.1 [97] registry_0.5-1 [100] AnnotationHub_3.5.0 [103] polyclip_1.10-0 [106] Matrix_1.4-1 [109] boot_1.3-28 [112] png_0.1-7 [115] rootSolve_1.8.2.3 [118] R.oo_1.25.0 [121] blob_1.2.3 [124] spatstat.random_2.2-0 [127] gridGraphics_0.5-1 [130] scales_1.2.1 [133] plyr_1.8.7 [136] zlibbioc_1.43.0 [139] RColorBrewer_1.1-3 [142] homologene_1.4.68.19.3.27 [145] XVector_0.37.1 [148] pbapply_1.5-0 [151] tidyselect_1.1.2 [154] ggrepel_0.9.1 [157] VariantAnnotation_1.43.3 [160] future.apply_1.9.0 [163] RNOmni_1.0.1 [166] gridExtra_2.3 [169] HGNChelper_0.8.1 [172] rgeos_0.5-9 [175] GenomicRanges_1.49.1 [178] BiocVersion_3.16.0 [181] httr_1.4.4 [184] Rdpack_2.4 [187] fs_1.5.2 [190] IRanges_2.31.2 [193] yulab.utils_0.0.5 [196] spatstat.utils_2.3-1 [199] ggplotify_0.1.0 [202] jsonlite_1.8.0 [205] heatmaply_1.3.0 [208] EWCE_1.5.7 [211] mime_0.12 [214] minqa_1.2.4 [217] interactiveDisp [220] utf8_1.2.2 [223] tibble_3.1.8 [226] DescTools_0.99.46 [229] survival_3.4-0 [232] googleAuthR_2.0.0 [235] GenomeInfoDbData_1.2.8 [238] reshape2_1.4.4 [241] spatstat.core_2.4-4 namespace (and not attached):
rtracklayer_1.57.0 scattermore_0.8
SeuratObject_4.1.1 tidyr_1.2.0
bit64_4.0.5 knitr_1.40
DelayedArray_0.23.1 R.utils_2.12.0
rpart_4.1.16 doParallel_1.0.17
RCurl_1.98-1.8 generics_0.1.3
GenomicFeatures_1.49.6 RhpcBLASctl_0.21-247.1
RSQLite_2.2.17 RANN_2.6.1
future_1.28.0 bit_4.0.4
spatstat.data_2.2-0 xml2_1.3.3
SummarizedExperiment_1.27.2 assertthat_0.2.1
viridis_0.6.2 gargle_1.2.1
xfun_0.32 hms_1.1.2
babelgene_22.3 evaluate_0.16
fansi_1.0.3 restfulr_0.0.15
readxl_1.4.1 dendextend_1.16.0
dbplyr_2.2.1 igraph_1.3.4
htmlwidgets_1.5.4 sparsesvd_0.2-1
stats4_4.2.0 purrr_0.3.4
dplyr_1.0.10 ggpubr_0.4.0
gprofiler2_0.2.1 aod_1.3.2
deldir_1.0-6 MatrixGenerics_1.9.1
vctrs_0.4.1 SingleCellExperiment_1.19.0
here_1.0.1 ROCR_1.0-11
cachem_1.0.6 BSgenome_1.65.2
sctransform_0.3.4 GenomicAlignments_1.33.1
prettyunits_1.1.1 goftest_1.2-3
ExperimentHub_2.5.0 ontologyIndex_2.10
lazyeval_0.2.2 crayon_1.5.1
GenomeInfoDb_1.33.5 seriation_1.3.6
ewceData_1.5.0 rlang_1.0.5
lifecycle_1.0.2 miniUI_0.1.1.1
filelock_1.0.2 BiocFileCache_2.5.0
cellranger_1.1.0 rprojroot_2.0.3
matrixStats_0.62.0 lmtest_0.9-40
aplot_0.1.6 carData_3.0-5
zoo_1.8-10 ggridges_0.5.3
viridisLite_0.4.1 rjson_0.2.21
phenomix_0.99.4 bitops_1.0-7
KernSmooth_2.23-20 Biostrings_2.65.3
stringr_1.4.1 parallelly_1.32.1
rstatix_0.7.0 MAGMA.Celltyping_2.0.6
S4Vectors_0.35.3 ggsignif_0.6.3
memoise_2.0.1 magrittr_2.0.3
ica_1.0-3 gplots_3.1.3
compiler_4.2.0 BiocIO_1.7.1
lme4_1.1-30 fitdistrplus_1.1-8
Rsamtools_2.13.4 cli_3.4.0
listenv_0.8.0 patchwork_1.1.2
mgcv_1.8-40 MASS_7.3-58.1
stringi_1.7.8 yaml_2.3.5
GeneOverlap_1.33.0 grid_4.2.0
lmom_2.9 tools_4.2.0
parallel_4.2.0 rstudioapi_0.14
piggyback_0.1.4 foreach_1.5.2
gld_2.6.5 Rtsne_0.16
digest_0.6.29 BiocManager_1.30.18
shiny_1.7.2 Rcpp_1.0.9
car_3.1-0 broom_1.0.1
later_1.3.0 RcppAnnoy_0.0.19
ggdendro_0.1.23 AnnotationDbi_1.59.1
colorspace_2.0-3 XML_3.99-0.10
tensor_1.5 reticulate_1.26
splines_4.2.0 uwot_0.1.14
expm_0.999-6 tidytree_0.4.0
sp_1.5-0 Exact_3.1
plotly_4.10.0 xtable_1.8-4
nloptr_2.0.3 ggtree_3.5.3
ggfun_0.0.7 R6_2.5.1
pillar_1.8.1 htmltools_0.5.3
glue_1.6.2 fastmap_1.1.0
BiocParallel_1.31.12 class_7.3-20
layBase_1.35.0 codetools_0.2-18 mvtnorm_1.1-3
lattice_0.20-45 spatstat.sparse_2.1-1
pbkrtest_0.5.1 curl_4.3.2
leiden_0.4.2 gtools_3.9.3
limma_3.53.6 rmarkdown_2.16
munsell_0.5.0 e1071_1.7-11
iterators_1.0.14 variancePartition_1.27.3
gtable_0.3.1 rbibutils_2.2.9
Seurat_4.1.1

Lloyd-LiuSiyi · 2022-11-23T05:31:32Z

Hi Murphy! I seem to face the same situation that a huge number of SNPs get removed for being non-biallelic. I was wondering if there is now an argument I can specify to keep the non-biallelic SNPs. Thanks in advance!

Lloyd-LiuSiyi · 2022-11-23T05:44:17Z

Sorry for my ignorance, non-biallelic SNPs can be kept by specifying the bi_allelic_filter argument.

cfbeuchel · 2024-02-02T14:17:40Z

Hi!

I'm, sorry for writing in this old issue, but I am looking for a solution for this and saw the conversation in the SNPlocsForge Repo, which was concluded some time ago, but I failed to understand the solution. I have a high number of non bi-allelic SNP dropped when using MungeSumtats, with all up-to-date packages from Bioconductor 3.18? Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

Thanks for your help and the great work on this package!

Cheers!

Al-Murphy · 2024-02-02T14:49:37Z

Hey,

using MungeSumtats, with all up-to-date packages from Bioconductor 3.18

Yes, this is the latest release version of MSS.

Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

That's one solution, another one would be to keep non-bi-allelic SNPs (set bi_allelic_filter=FALSE when running format_sumstats()) - note this affects FRQ calculations if they need to be flipped etc. It really depends on what you plan on doing with the sumstats, some downstream tools like LDSC for example, will remove non-bi-allelic SNPs anyway (leaving just HapMap3 SNPs). Personally, I think you should keep non-bi-allelic SNPs unless you have a good reason to remove them.

Cheers,
Alan.

Al-Murphy · 2024-08-15T13:16:40Z

Closing as I believe we have looked into this enough at this stage and have a work around in place (set bi_allelic_filter=FALSE in format_sumstats())

bschilder changed the title ~~Large number o f non-biallelic SNPs~~ Large number of non-biallelic SNPs Aug 4, 2022

bschilder assigned bschilder and Al-Murphy Aug 4, 2022

Al-Murphy mentioned this issue Aug 8, 2022

High proportion of non-biallelic SNPs in dbSNP 155 vs 144 hpages/SNPlocsForge#1

Closed

AMCalejandro mentioned this issue Sep 19, 2022

Many non biallelic variants get removed #124

Closed

rootze mentioned this issue Dec 13, 2022

Problem in running MungeSumstats::format_sumstats with bi_allelic_filter = FALSE #137

Closed

jksull mentioned this issue Jan 16, 2023

SNPs not found on reference when using (default)flip_check even though they do exist #140

Closed

bschilder mentioned this issue May 15, 2023

Assess whether MAGMA can take multi-allelic SNPs neurogenomics/MAGMA_Celltyping#143

Closed

Al-Murphy mentioned this issue Aug 18, 2023

Allele frequency cant be flipped for multi-allelic variants. #165

Closed

Al-Murphy closed this as completed Aug 15, 2024

Al-Murphy mentioned this issue Sep 16, 2024

Add FAQ of common queries to documentation #194

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large number of non-biallelic SNPs #111

Large number of non-biallelic SNPs #111

bschilder commented Aug 4, 2022 •

edited

Loading

Al-Murphy commented Aug 4, 2022

bschilder commented Aug 4, 2022

Al-Murphy commented Aug 4, 2022

bschilder commented Aug 4, 2022

Al-Murphy commented Aug 4, 2022

Al-Murphy commented Aug 8, 2022

Al-Murphy commented Aug 12, 2022

ilevantis commented Sep 2, 2022

Al-Murphy commented Sep 2, 2022

bschilder commented Sep 17, 2022 •

edited

Loading

Lloyd-LiuSiyi commented Nov 23, 2022

Lloyd-LiuSiyi commented Nov 23, 2022

cfbeuchel commented Feb 2, 2024

Al-Murphy commented Feb 2, 2024 •

edited

Loading

Al-Murphy commented Aug 15, 2024

Large number of non-biallelic SNPs #111

Large number of non-biallelic SNPs #111

Comments

bschilder commented Aug 4, 2022 • edited Loading

Reprex

Example 1: ieu-a-8

Example 2: Vuckovic2020

Literature

Possible explanations

To do

Al-Murphy commented Aug 4, 2022

bschilder commented Aug 4, 2022

Al-Murphy commented Aug 4, 2022

bschilder commented Aug 4, 2022

Al-Murphy commented Aug 4, 2022

Al-Murphy commented Aug 8, 2022

Al-Murphy commented Aug 12, 2022

ilevantis commented Sep 2, 2022

Al-Murphy commented Sep 2, 2022

bschilder commented Sep 17, 2022 • edited Loading

OpenGWAS report: using dbSNP155

Absolute number

Percent

Session info

Lloyd-LiuSiyi commented Nov 23, 2022

Lloyd-LiuSiyi commented Nov 23, 2022

cfbeuchel commented Feb 2, 2024

Al-Murphy commented Feb 2, 2024 • edited Loading

Al-Murphy commented Aug 15, 2024

bschilder commented Aug 4, 2022 •

edited

Loading

bschilder commented Sep 17, 2022 •

edited

Loading

Al-Murphy commented Feb 2, 2024 •

edited

Loading