Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large number of non-biallelic SNPs #111

Closed
3 of 4 tasks
bschilder opened this issue Aug 4, 2022 · 15 comments
Closed
3 of 4 tasks

Large number of non-biallelic SNPs #111

bschilder opened this issue Aug 4, 2022 · 15 comments
Assignees

Comments

@bschilder
Copy link
Collaborator

bschilder commented Aug 4, 2022

I'm munging a bunch of OpenGWAS data with MSS, and i noticed that the number of variants dropped due to being non-biallelic was quite high (hundreds of thousands-millions). In some cases, up to half the original SNPs are being dropped because they’re non-biallelic.

Reprex

gwas_paths <- MungeSumstats::import_sumstats(
  ids = "ieu-a-8", 
  save_dir = here::here("data/GWAS_sumstats"), 
  nThread = 30, # >30 causes issues with read_vcf_parallel
  parallel_across_ids = FALSE, 
  force_new_vcf = FALSE,
  force_new = FALSE,
  vcf_download = TRUE,
  vcf_dir = here::here("data/VCFs"), 
  #### Record logs
  log_folder_ind = TRUE,
  log_mungesumstats_msgs = TRUE,
  ) 

Example 1: ieu-a-8

In this case, a GWAS with 2.4M variants had 1.2M+ non-bilallelic variants removed:
mss.txt

Example 2: Vuckovic2020

Here’s another where 5M/44M variant were considered non-biallelic:
mss_vuckovic.txt

Literature

  • Campbell2016: Notes that even at the higher estimates we’d only expect multi-allelic sites in 3% of SNPs.
  • Phillips2020: Found only 0.3% of SNPs in 1KG Phase 3 were tri-allelic.
  • Phillips2016: Found only 961 SNPs in 1KG Phase 3 were tetra-allelic.

Possible explanations

  1. MSS is over-attributing multi-allelic status to variants somehow,
  2. The updates to Bioc's dbSNP ref contains lots more reported multi-allelic sites that in previous versions.
  3. Something went awry when making the latest Bioc dbSNP dataset.
  4. This is the true number of multi-allelic variants and it is far above the estimates from single-cohort studies (e.g. 1KGp3) bc it incorporates info from across many different datasets.

To do

  • Double check relevant MSS code. @Al-Murphy
  • Compare number of multi-allelic SNPs in old vs new Bioc dbSNP datasets. @Al-Murphy
  • Check the %/# non-ballelic variants that were filtered in previous versions of MSS. @bschilder
  • Check the %/# non-ballelic variants that were filtered using the latest version of MSS. @bschilder
@bschilder bschilder changed the title Large number o f non-biallelic SNPs Large number of non-biallelic SNPs Aug 4, 2022
@Al-Murphy
Copy link
Owner

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

@bschilder
Copy link
Collaborator Author

@bschilder can you add the top 10 or so rows of the log file of the dropped non-biallelic SNPs from Example 1: ieu-a-8 or Example 2: Vuckovic2020?

Sure!

@Al-Murphy
Copy link
Owner

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

#MSS showing a lot is non-biallelic dbSNP 155 vs 144
sumstats_dt <- MungeSumstats::formatted_example()
rsids <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                              ref_genome = "GRCH38",
                                              dbSNP=144)
rsids155 <- MungeSumstats:::load_ref_genome_data(snps = sumstats_dt$SNP,
                                                 ref_genome = "GRCH38",
                                                 dbSNP=155)

# get chars for SNPs not bi/tri allelic
# or strand ambig from IUPAC_CODE_MAP
nonambig_IUPAC_CODE_MAP <-
  names(Biostrings::IUPAC_CODE_MAP[nchar(
    Biostrings::IUPAC_CODE_MAP
  ) < 3])

data.table::setkey(rsids, SNP)
data.table::setkey(rsids155, SNP)
bad_ids <- rsids[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids))
bad_ids155 <- rsids155[!alleles_as_ambig %in% nonambig_IUPAC_CODE_MAP]
print(nrow(bad_ids155))
print(bad_ids155)

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

@bschilder
Copy link
Collaborator Author

I have checked MSS code and it seems to work as I suspect it to, it appears that there are now just a lot more alternative alleles in bioconductor's dbSNP 155 vs 144. See the example below where dbSNP 144 finds no non-biallelic but 155 finds 48 from a total of 93:

Ok, well that at least narrows thing down. Thanks!

There is no github page for bioconductor's dbSNP 155 so I'll just message Herve to ask his opinion on what's going on

That would be great, thanks!

@Al-Murphy
Copy link
Owner

I randomly picked 3 SNPs from my example code above and checked them in NCBI dbSNP online portal (dbSNP 155) and they all appear to be non-biallelic as described:

rs9824539
rs9556958
rs7854982

I know this is by no means definitive but does seem like it is doing things correctly

@Al-Murphy
Copy link
Owner

Had a look into all the RS ID's from dbSNP 144 to check the proportion that are non-biallelic across the dbSNP 155 database and 144. I used an approach separate to MSS to ensure the issue isn't there. This approach was given by @hpages (see below).

It appears there is just a substantial number more non-biallelic SNPs in 155. This doesn't seem to be a MSS issue and just reflects the changes in the versions.

I'll post about this in https://github.com/hpages/SNPlocsForge in case anyone else has further insight but I believe the field's focus on biallelic SNPs may need to change as the number of non-biallelic SNPs increase. ~24% is a large proportion of SNPs to exclude.

I'll leave this issue open for now for others to see.

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

@Al-Murphy
Copy link
Owner

Posted on Biostars about this issue and got some replies too - https://www.biostars.org/p/9534249/#9534468

Generally, seems like people agree that the observed numbers are true

@ilevantis
Copy link

I'm not conviced that dropping anything marked as multi-allelic in dbSNP is a sensible idea. As I understand it, the reasons for removing multi-allelic SNPs could be the following:

  1. The downstream tool cannot handle a table row defining a SNP with multiple alternate alleles.
  2. The downstream tool assumes all input is bi-allelic and calculates things such as allele frequency as 1 - effect_allele_frq or uses a statistical model that would have assumptions violated by multi-allelic variants.

For point 1, it could well be valid to simply split the row into two rows f.e.g:

POS   REF   ALT
123   A     G
123   A     T

(Although in some cases having duplicated genomic location could be an issue.)

For point 2, bi-allelic vs multi-allelic should only matter within the dataset. If a study sample includes only two alleles at a genomic location, then it doesn't matter that dbSNP has a third allele on record from a different dataset, as this doesn't affect anything within the study.

@Al-Murphy
Copy link
Owner

I think those 2 points highly depend on the downstream analysis use case. I do agree though, I feel dropping non-bi-allelic SNPs is no longer a sensible option given the increasing numbers.

I believe the options of splitting multi-allelic SNPs where the ALT contains more than one alternative allele to multiple rows and removing multi-allelic SNPs where more than one of the alternative alleles are in the sumstats itself are both very sensible. If you would like to have a go at adding functionality to do these to MSS as alternative multi-allelic approaches and submit a pull request that would be great! I think it would really benefit the wider community (I'd be happy to add you to the list of contributors too).

@bschilder
Copy link
Collaborator Author

bschilder commented Sep 17, 2022

OpenGWAS report: using dbSNP155

Here's a sampling of 10,863 of the most well-powered GWAS in OpenGWAS.

Absolute number

nonbiallelic

The number of non-biallelic SNPs dropped ranged from 177 to 8.6 million (with a mean of 1.1 million) for a single GWAS.

Screenshot 2022-09-17 at 11 42 33

Percent

As a percentage of the total SNPs in each GWAS to begin with:

plot_zoom_png
Screenshot 2022-09-17 at 11 45 54

The average percentage of non-biallelic SNPS was 47%, ranging from ~0 to 63%.

Session info

R Under development (unstable) (2022-02-25 r81808) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
[1] C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] rappdirs_0.3.3 rtracklayer_1.57.0 scattermore_0.8
[4] R.methodsS3_1.8.2 SeuratObject_4.1.1 tidyr_1.2.0
[7] ggplot2_3.3.6 bit64_4.0.5 knitr_1.40
[10] irlba_2.3.5 DelayedArray_0.23.1 R.utils_2.12.0
[13] data.table_1.14.2 rpart_4.1.16 doParallel_1.0.17
[16] KEGGREST_1.37.3 RCurl_1.98-1.8 generics_0.1.3
[19] BiocGenerics_0.43.1 GenomicFeatures_1.49.6 RhpcBLASctl_0.21-247.1
[22] cowplot_1.1.1 RSQLite_2.2.17 RANN_2.6.1
[25] proxy_0.4-27 future_1.28.0 bit_4.0.4
[28] webshot_0.5.3 spatstat.data_2.2-0 xml2_1.3.3
[31] httpuv_1.6.5 SummarizedExperiment_1.27.2 assertthat_0.2.1
[34] orthogene_1.3.2 viridis_0.6.2 gargle_1.2.1
[37] scNLP_0.1.0 xfun_0.32 hms_1.1.2
[40] TSP_1.2-1 babelgene_22.3 evaluate_0.16
[43] promises_1.2.0.1 fansi_1.0.3 restfulr_0.0.15
[46] progress_1.2.2 readxl_1.4.1 dendextend_1.16.0
[49] caTools_1.18.2 dbplyr_2.2.1 igraph_1.3.4
[52] DBI_1.1.3 htmlwidgets_1.5.4 sparsesvd_0.2-1
[55] spatstat.geom_2.4-0 stats4_4.2.0 purrr_0.3.4
[58] ellipsis_0.3.2 dplyr_1.0.10 ggpubr_0.4.0
[61] backports_1.4.1 gprofiler2_0.2.1 aod_1.3.2
[64] biomaRt_2.53.2 deldir_1.0-6 MatrixGenerics_1.9.1
[67] MungeSumstats_1.5.13 vctrs_0.4.1 SingleCellExperiment_1.19.0
[70] Biobase_2.57.1 here_1.0.1 ROCR_1.0-11
[73] abind_1.4-5 cachem_1.0.6 BSgenome_1.65.2
[76] progressr_0.11.0 sctransform_0.3.4 GenomicAlignments_1.33.1
[79] treeio_1.21.2 prettyunits_1.1.1 goftest_1.2-3
[82] cluster_2.1.4 ExperimentHub_2.5.0 ontologyIndex_2.10
[85] ape_5.6-2 lazyeval_0.2.2 crayon_1.5.1
[88] pkgconfig_2.0.3 GenomeInfoDb_1.33.5 seriation_1.3.6
[91] nlme_3.1-159 ewceData_1.5.0 rlang_1.0.5
[94] globals_0.16.1 lifecycle_1.0.2 miniUI_0.1.1.1
[97] registry_0.5-1 filelock_1.0.2 BiocFileCache_2.5.0
[100] AnnotationHub_3.5.0 cellranger_1.1.0 rprojroot_2.0.3
[103] polyclip_1.10-0 matrixStats_0.62.0 lmtest_0.9-40
[106] Matrix_1.4-1 aplot_0.1.6 carData_3.0-5
[109] boot_1.3-28 zoo_1.8-10 ggridges_0.5.3
[112] png_0.1-7 viridisLite_0.4.1 rjson_0.2.21
[115] rootSolve_1.8.2.3 phenomix_0.99.4 bitops_1.0-7
[118] R.oo_1.25.0 KernSmooth_2.23-20 Biostrings_2.65.3
[121] blob_1.2.3 stringr_1.4.1 parallelly_1.32.1
[124] spatstat.random_2.2-0 rstatix_0.7.0 MAGMA.Celltyping_2.0.6
[127] gridGraphics_0.5-1 S4Vectors_0.35.3 ggsignif_0.6.3
[130] scales_1.2.1 memoise_2.0.1 magrittr_2.0.3
[133] plyr_1.8.7 ica_1.0-3 gplots_3.1.3
[136] zlibbioc_1.43.0 compiler_4.2.0 BiocIO_1.7.1
[139] RColorBrewer_1.1-3 lme4_1.1-30 fitdistrplus_1.1-8
[142] homologene_1.4.68.19.3.27 Rsamtools_2.13.4 cli_3.4.0
[145] XVector_0.37.1 listenv_0.8.0 patchwork_1.1.2
[148] pbapply_1.5-0 mgcv_1.8-40 MASS_7.3-58.1
[151] tidyselect_1.1.2 stringi_1.7.8 yaml_2.3.5
[154] ggrepel_0.9.1 GeneOverlap_1.33.0 grid_4.2.0
[157] VariantAnnotation_1.43.3 lmom_2.9 tools_4.2.0
[160] future.apply_1.9.0 parallel_4.2.0 rstudioapi_0.14
[163] RNOmni_1.0.1 piggyback_0.1.4 foreach_1.5.2
[166] gridExtra_2.3 gld_2.6.5 Rtsne_0.16
[169] HGNChelper_0.8.1 digest_0.6.29 BiocManager_1.30.18
[172] rgeos_0.5-9 shiny_1.7.2 Rcpp_1.0.9
[175] GenomicRanges_1.49.1 car_3.1-0 broom_1.0.1
[178] BiocVersion_3.16.0 later_1.3.0 RcppAnnoy_0.0.19
[181] httr_1.4.4 ggdendro_0.1.23 AnnotationDbi_1.59.1
[184] Rdpack_2.4 colorspace_2.0-3 XML_3.99-0.10
[187] fs_1.5.2 tensor_1.5 reticulate_1.26
[190] IRanges_2.31.2 splines_4.2.0 uwot_0.1.14
[193] yulab.utils_0.0.5 expm_0.999-6 tidytree_0.4.0
[196] spatstat.utils_2.3-1 sp_1.5-0 Exact_3.1
[199] ggplotify_0.1.0 plotly_4.10.0 xtable_1.8-4
[202] jsonlite_1.8.0 nloptr_2.0.3 ggtree_3.5.3
[205] heatmaply_1.3.0 ggfun_0.0.7 R6_2.5.1
[208] EWCE_1.5.7 pillar_1.8.1 htmltools_0.5.3
[211] mime_0.12 glue_1.6.2 fastmap_1.1.0
[214] minqa_1.2.4 BiocParallel_1.31.12 class_7.3-20
[217] interactiveDisplayBase_1.35.0 codetools_0.2-18 mvtnorm_1.1-3
[220] utf8_1.2.2 lattice_0.20-45 spatstat.sparse_2.1-1
[223] tibble_3.1.8 pbkrtest_0.5.1 curl_4.3.2
[226] DescTools_0.99.46 leiden_0.4.2 gtools_3.9.3
[229] survival_3.4-0 limma_3.53.6 rmarkdown_2.16
[232] googleAuthR_2.0.0 munsell_0.5.0 e1071_1.7-11
[235] GenomeInfoDbData_1.2.8 iterators_1.0.14 variancePartition_1.27.3
[238] reshape2_1.4.4 gtable_0.3.1 rbibutils_2.2.9
[241] spatstat.core_2.4-4 Seurat_4.1.1

@Lloyd-LiuSiyi
Copy link

Hi Murphy! I seem to face the same situation that a huge number of SNPs get removed for being non-biallelic. I was wondering if there is now an argument I can specify to keep the non-biallelic SNPs. Thanks in advance!

@Lloyd-LiuSiyi
Copy link

Sorry for my ignorance, non-biallelic SNPs can be kept by specifying the bi_allelic_filter argument.

@cfbeuchel
Copy link
Contributor

Hi!

I'm, sorry for writing in this old issue, but I am looking for a solution for this and saw the conversation in the SNPlocsForge Repo, which was concluded some time ago, but I failed to understand the solution. I have a high number of non bi-allelic SNP dropped when using MungeSumtats, with all up-to-date packages from Bioconductor 3.18? Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

Thanks for your help and the great work on this package!

Cheers!

@Al-Murphy
Copy link
Owner

Al-Murphy commented Feb 2, 2024

Hey,

using MungeSumtats, with all up-to-date packages from Bioconductor 3.18

Yes, this is the latest release version of MSS.

Is the solution that this is simply due to higher coverage in more up-to-date dbSNP versions and the way forward would be to drop the few individuals with the 3+rd allele instead of removing the whole SNP?

That's one solution, another one would be to keep non-bi-allelic SNPs (set bi_allelic_filter=FALSE when running format_sumstats()) - note this affects FRQ calculations if they need to be flipped etc. It really depends on what you plan on doing with the sumstats, some downstream tools like LDSC for example, will remove non-bi-allelic SNPs anyway (leaving just HapMap3 SNPs). Personally, I think you should keep non-bi-allelic SNPs unless you have a good reason to remove them.

Cheers,
Alan.

@Al-Murphy
Copy link
Owner

Closing as I believe we have looked into this enough at this stage and have a work around in place (set bi_allelic_filter=FALSE in format_sumstats())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants