Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

High proportion of non-biallelic SNPs in dbSNP 155 vs 144 #1

Closed
Al-Murphy opened this issue Aug 8, 2022 · 2 comments
Closed

High proportion of non-biallelic SNPs in dbSNP 155 vs 144 #1

Al-Murphy opened this issue Aug 8, 2022 · 2 comments

Comments

@Al-Murphy
Copy link

Al-Murphy commented Aug 8, 2022

We have noted a substantial increase the proportion of non-biallelic SNPs across dbSNP 144 and 155 (increased from ~1% to 24%). This is explained in detail here and summarised in the code below (fuicntion written by @hpages).

Note that this is based on the dbSNP versions in Bioconductor (SNPlocs.Hsapiens.dbSNP) so we would like to confirm this aligns with dbSNP versions outside of Biocnductor but there doesn't seem to be any numbers available online about it. The only piece of work about this is quite old now and doesn't give specific numbers SNPs that come in threes. It would be great if anyone has any insight on this.

#wget "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/snp144.txt.gz"
#gunzip snp144.txt.gz
snps <- limma::read.columns("~/Downloads/snp144.txt", 
                            required.col=c("rs775809821"))$rs775809821
#missing header so add in SNP
snps <- c("rs775809821",snps)

library(BSgenome)
extractAlleles <- function(rsids, dbSNP, ref_genome="GRCh38")
{
  pkgname <- paste0("SNPlocs.Hsapiens.dbSNP", dbSNP, ".", ref_genome)
  pkgenv <- loadNamespace(pkgname)
  snplocs <- get(pkgname, envir=pkgenv, inherits=FALSE)
  snps <- snpsById(snplocs, rsids, ifnotfound="drop")
  unname(setNames(IUPAC_CODE_MAP[mcols(snps)$alleles_as_ambig],
                  mcols(snps)$RefSNP_id)[rsids])
}

dbSNP144_alleles <- extractAlleles(snps, 144)
dbSNP155_alleles <- extractAlleles(snps, 155)
alleles <- data.frame(rsid=snps, dbSNP144_alleles, dbSNP155_alleles)
alleles <- data.table::setDT(alleles)
#remove where alleles missing for either
alleles <- alleles[!(is.na(dbSNP144_alleles) | is.na(dbSNP155_alleles))]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)==2,
        match:="both biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)>2,
        match:="both non-biallelic"]
alleles[nchar(dbSNP144_alleles)>2 & nchar(dbSNP155_alleles)==2,
        match:="dbSNP 144 non-biallelic"]
alleles[nchar(dbSNP144_alleles)==2 & nchar(dbSNP155_alleles)>2,
        match:="dbSNP 155 non-biallelic"]
summ <- alleles[,.N,by=match]
summ[,prop:=N/sum(summ$N)]
#> summ
#match         N         prop
#1:          both biallelic 100605258 0.7427438640
#2: dbSNP 155 non-biallelic  32266661 0.2382168183
#3:      both non-biallelic   2565775 0.0189424855
#4: dbSNP 144 non-biallelic     13116 0.0000968322

And just to visually represent them:

library(ggplot2)
library(cowplot)
pastel_cols <- c("#9A8822","#F5CDB4","#F8AFA8",
                 "#FDDDA0","#74A089","#85D4E3",
                 #added extra to make 7
                 '#78A2CC')
ggplot(summ, aes(x = factor(match,levels=rev(c("both biallelic",
                                           "both non-biallelic",
                                          "dbSNP 155 non-biallelic",
                                          "dbSNP 144 non-biallelic"))), y = N))+
  geom_bar(stat = "identity", fill = pastel_cols[c(2,3,2,2)])+
  geom_text(label = with(summ, paste(prettyNum(N,big.mark=",",scientific=FALSE), 
                                     paste0('\n (', round(prop*100,2), '%)'))), 
            hjust = -.3)+
  ylim(0,max(summ$N)+20000000)+
  coord_flip() + 
  theme_cowplot()+
  theme(axis.title.y=element_blank())
  

image

@Al-Murphy Al-Murphy changed the title Hig proportion of non-biallelic SNPs in dbSNP 155 vs 144 High proportion of non-biallelic SNPs in dbSNP 155 vs 144 Aug 9, 2022
@hpages
Copy link
Owner

hpages commented Oct 20, 2023

Is it ok to close this @Al-Murphy? IIUC this doesn't seem to be an issue with the latest SNPlocs packages.

@hpages hpages closed this as completed Oct 23, 2023
@Al-Murphy
Copy link
Author

that's all good with me Herve, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants