Assess whether MAGMA can take multi-allelic SNPs #143
Without multi-allelic SNPs
Munge
#### Run SNP-to-gene mapping ####
path_formatted1 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
save_dir = "~/Downloads/biallelic",
bi_allelic_filter = TRUE)
Map
genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted1[[1]],
genome_build = "hg19")
With multi-allelic SNPs
Munge
path_formatted2 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
save_dir = "~/Downloads/multiallelic",
bi_allelic_filter = FALSE)
Processing 1 datasets from Open GWAS.
========== Processing dataset : ieu-a-8 ==========
Downloading VCF ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz'
Content type 'application/gzip' length 85994291 bytes (82.0 MB)
==================================================
downloaded 82.0 MB
Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi'
Content type 'application/gzip' length 1449998 bytes (1.4 MB)
==================================================
downloaded 1.4 MB
Formatted summary statistics will be saved to ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.6 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file:
SNP chr BP end REF ALT FILTER AF ES SE LP SS P
Summary statistics report:
- 2,417,476 rows
- 2,417,476 unique variants
- 160 genome-wide significant variants (P<5e-8)
- 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 130 seconds.
196 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,319 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 118 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicated rows.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
5 SNPs are on chromosomes X, Y, MT and will be removed
N already exists within sumstats_dt.
718,866 SNPs (29.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/ieu-a-8.tsv.gz
Summary statistics report:
- 2,417,314 rows (100% of original 2,417,476 rows)
- 2,417,314 unique variants
- 160 genome-wide significant variants (P<5e-8)
- 22 chromosomes
Done munging in 8.078 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
SNP CHR BP A1 A2 END FILTER FRQ BETA SE LP
1: rs12565286 1 721290 G C 721290 PASS 0.0538073 0.1282010 0.0695073 1.1862800
2: rs11804171 1 723819 T A 723819 PASS 0.0540315 0.1297870 0.0698828 1.1987200
3: rs3094315 1 752566 G A 752566 PASS 0.8249000 -0.0016609 0.0291187 0.0202171
4: rs3131968 1 754192 A G 754192 PASS 0.7698700 -0.0289617 0.0314537 0.4471260
N P
1: 27124 0.06512084
2: 26259 0.06328197
3: 36534 0.95451531
4: 22817 0.35716920
Returning path to saved data.
ieu-a-8 : Done in 9.101 minutes.
Done with all processing in 9.1 minutes.
Map
genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted2[[1]],
genome_build = "hg19")
Session info
```
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
locale:
attached base packages:
other attached packages:
loaded via a namespace (and not attached):
```
|
Comparison
In this GWAS, 1,260,488 / 2,417,476 (52%) SNPs are non-biallelic. Shockingly, this had 0 impact on the resulting MAGMA files. Like... the genes.out files are identical. I thought for sure I must have made a mistake somewhere, but I reran it multiple times and made sure I was importing the right files, and the result held.
GWAS sumstats
Indeed, the multi-allelic version has 2x as many SNPs.
## biallelic
dt1 <- data.table::fread(path_formatted1[[1]])
nrow(dt1)
# 1156826
dt2 <- data.table::fread( path_formatted2[[1]])
nrow(dt2)
# 2417314
MAGMA files
...and yet,
g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2)
# TRUE
I think this may have to do with a standard reference being used in both (1000 Genomes, EUR population) during the mapping of SNPs to genes by MAGMA. Variants that don't exist in this reference are not considered during the mapping. While you could argue this throws away a lot of info, it also means that we can input multi-allelic/bi-allelic versions of the same dataset with little to no impact on the resulting MAGMA gene files.
In conclusion, this means we can go ahead and munge GWAS using dbSNP155 + |
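One caveat when relying on all.equal() for this kind of comparison: it returns TRUE on a match but a character vector describing the differences otherwise, so wrapping it in isTRUE() is the safer test, and it is worth confirming that the two paths really point at different files. A minimal sketch with toy tables follows. The column names and values are only illustrative, and compare_genes_out() is a hypothetical helper, not part of MAGMA.Celltyping.

```r
# Hypothetical helper: compare two genes.out-style tables.
compare_genes_out <- function(g1, g2, path1 = NULL, path2 = NULL) {
  # Guard against accidentally reading the same file twice.
  same_file <- !is.null(path1) && !is.null(path2) &&
    normalizePath(path1, mustWork = FALSE) ==
      normalizePath(path2, mustWork = FALSE)
  list(
    same_file = same_file,
    # all.equal() returns a character vector on mismatch, hence isTRUE().
    identical_tables = isTRUE(all.equal(g1, g2))
  )
}

# Toy data (made-up values, illustrative column names).
g1 <- data.frame(GENE = c("1", "2"), NSNPS = c(10L, 25L), ZSTAT = c(0.5, 1.2))
g2 <- g1
g3 <- transform(g1, NSNPS = c(10L, 30L))

res_same <- compare_genes_out(g1, g2,
                              "biallelic/x.genes.out",
                              "multiallelic/x.genes.out")
res_diff <- compare_genes_out(g1, g3)
print(res_same)
print(res_diff)
```

If same_file ever comes back TRUE, the identical result says nothing about the multi-allelic SNPs.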
Probably worth checking that MAGMA doesn't itself filter out non-bi-allelic SNPs. I would assume some of the SNPs that are bi-allelic according to dbSNP 144 in the 1000 Genomes EUR population would have additional alternative alleles in dbSNP 155, so there should be differences. |
Very good point @Al-Murphy ! From the MAGMA docs (v1.10):
Since I think it counts "duplicates" as SNPs that occur in more than one row, this would explain why the genes.out results are identical with/without multi-allelic SNPs. Let me try switching the flag to "first". While arbitrary, this would at least prevent the multi-allelic SNPs from being dropped entirely. |
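To make the difference between dropping duplicates and keeping the first occurrence concrete, here is a toy base-R sketch. It only mimics the behaviour described above on a made-up table. It is not how MAGMA implements the option, and handle_duplicates() is a hypothetical helper.

```r
# Hypothetical helper mimicking two ways of handling repeated SNP IDs.
handle_duplicates <- function(dt, mode = c("drop", "first")) {
  mode <- match.arg(mode)
  if (mode == "drop") {
    # Remove every row whose SNP ID occurs more than once.
    dup_ids <- unique(dt$SNP[duplicated(dt$SNP)])
    dt[!dt$SNP %in% dup_ids, , drop = FALSE]
  } else {
    # Keep only the first row for each SNP ID.
    dt[!duplicated(dt$SNP), , drop = FALSE]
  }
}

# Toy sumstats: rs2 appears on two rows (two alternative alleles).
toy <- data.frame(
  SNP = c("rs1", "rs2", "rs2", "rs3"),
  A2  = c("A", "C", "T", "G"),
  P   = c(0.5, 0.01, 0.2, 0.9),
  stringsAsFactors = FALSE
)

dropped <- handle_duplicates(toy, "drop")
first   <- handle_duplicates(toy, "first")
print(dropped$SNP)
print(first$SNP)
```

With "drop", rs2 disappears entirely. With "first", it survives, but only with whichever allele happens to be listed first.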
Just reran with the new argument
#### Bi-allelic vs. Multi-allelic GWAS sumstats ####
### Run SNP-to-gene mapping ####
path_formatted1 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
save_dir = "~/Downloads/biallelic",
bi_allelic_filter = TRUE)
## Map
dt1 <- data.table::fread( path_formatted1[[1]])
genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted1[[1]],
force_new = TRUE,
duplicate = "first",
genome_build = "hg19")
#### With multi-allelic SNPs
## Munge
path_formatted2 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
save_dir = "~/Downloads/mutliallelic",
bi_allelic_filter = FALSE)
dt2 <- data.table::fread( path_formatted2[[1]])
## Map
genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted2[[1]],
force_new = TRUE,
duplicate = "first",
genome_build = "hg19")
g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2)
# TRUE
I'll check to see if there's an additional step somewhere in MAGMA that deals specifically with non-biallelic SNPs. |
Another arg to consider,
I'll add this flag as well. |
Repeated using
#### Bi-allelic vs. Multi-allelic GWAS sumstats ####
### Run SNP-to-gene mapping ####
path_formatted1 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
dbSNP = 155,
save_dir = "~/Downloads/biallelic",
bi_allelic_filter = TRUE)
## Map
dt1 <- data.table::fread( path_formatted1[[1]])
genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted1[[1]],
force_new = TRUE,
duplicate = "first",
synonym_dup = "skip-dup",
genome_build = "hg19")
#### With multi-allelic SNPs
## Munge
path_formatted2 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
dbSNP = 155,
save_dir = "~/Downloads/mutliallelic",
bi_allelic_filter = FALSE)
dt2 <- data.table::fread( path_formatted2[[1]])
## Map
genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted2[[1]],
force_new = TRUE,
duplicate = "first",
synonym_dup = "skip-dup",
genome_build = "hg19")
g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2)
# TRUE |
I also looked through the docs for the annotation pre-step, but there aren't many options, and none of them would relate to dropping multi-allelic SNPs since we're not using the |
dbSNP 144
Rerunning with dbSNP v144.
Without multi-allelic SNPs
### dbSNP version
dbSNP <- 144 # 155
Munge
path_formatted1 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
dbSNP = dbSNP,
save_dir = "~/Downloads/biallelic",
bi_allelic_filter = TRUE)
Map
genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted1[[1]],
force_new = TRUE,
duplicate = "first",
synonym_dup = "skip-dup",
genome_build = "hg19")
With multi-allelic SNPs
Munge
path_formatted2 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
dbSNP = dbSNP,
save_dir = "~/Downloads/mutliallelic",
bi_allelic_filter = FALSE)
Map
genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted2[[1]],
force_new = TRUE,
duplicate = "first",
synonym_dup = "skip-dup",
genome_build = "hg19")
Interestingly, this returns an error when using dbSNP 144, but didn't cause an error using dbSNP 155. Is this difference expected @Al-Murphy ?
Map (attempt 2)
Rerunning with
path_formatted2 <- MungeSumstats::import_sumstats(
ids = "ieu-a-8",
ref_genome = "GRCh37",
dbSNP = dbSNP,
save_dir = "~/Downloads/mutliallelic",
allele_flip_frq = FALSE,
bi_allelic_filter = FALSE)
Map
genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
path_formatted = path_formatted2[[1]],
force_new = TRUE,
duplicate = "first",
synonym_dup = "skip-dup",
genome_build = "hg19")
Compare GWAS files
The sumstats without multi-allelic SNPs have fewer rows, as expected. Also, the number of rows removed using dbSNP144 is far smaller than when using dbSNP155. This is also expected because the latter annotates far more SNPs as multi-allelic.
dt1 <- data.table::fread(path_formatted1[[1]])
dt2 <- data.table::fread( path_formatted2[[1]])
testthat::expect_lt(nrow(dt1), nrow(dt2))
# TRUE
# (2345520 rows vs. 2408215 rows)
Check for duplicated RSIDs
Neither version of the GWAS sumstats (with/without multi-allelic SNPs) has any RSIDs that appear in >1 row! This confirms the theory @Al-Murphy and I discussed yesterday that there is a difference between two different ways of thinking about multi-allelic SNPs:
testthat::expect_equal(sum(duplicated(dt1$SNP)),0)
# TRUE
testthat::expect_equal(sum(duplicated(dt2$SNP)),0)
# TRUE
Check for 1KG SNPs
MAGMA uses 1KG Phase 3 as a reference and only takes RSIDs found within that reference (or synonyms of those RSIDs) for further analysis. Reading in the MAGMA-distributed 1KG Phase 3 ref files shows that the number of SNPs found in the reference differs between the two GWAS versions (with/without multi-allelic SNPs) by 45k+.
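The overlap check described here can be sketched roughly as follows. This assumes the reference is available as a PLINK .bim file, where the variant ID sits in the second column. The toy text and count_ref_overlap() are illustrative only and not part of MAGMA.Celltyping, and synonym RSIDs are ignored in this sketch.

```r
# Hypothetical helper: count GWAS SNPs whose RSID is in a reference .bim table.
count_ref_overlap <- function(gwas_snps, bim) {
  # In PLINK .bim files the second column holds the variant ID.
  ref_ids <- bim[[2]]
  sum(unique(gwas_snps) %in% ref_ids)
}

# Toy .bim content: chr, id, cM, bp, allele1, allele2.
bim_txt <- "1 rs1 0 1000 A G
1 rs2 0 2000 C T
1 rs4 0 4000 G A"
bim <- read.table(
  text = bim_txt, header = FALSE,
  # Explicit classes stop alleles such as "T" being parsed as logical TRUE.
  colClasses = c("character", "character", "numeric",
                 "integer", "character", "character")
)

# Made-up RSID sets standing in for dt1$SNP and dt2$SNP.
snps_biallelic    <- c("rs1", "rs3")
snps_multiallelic <- c("rs1", "rs2", "rs3")

n1 <- count_ref_overlap(snps_biallelic, bim)
n2 <- count_ref_overlap(snps_multiallelic, bim)
print(c(biallelic = n1, multiallelic = n2))
```

On real data, a difference between the two counts would mean that the two sumstats versions hand different SNP sets to MAGMA, even if the gene-level output ends up the same.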
Compare MAGMA files
As before, the MAGMA files are identical, with or without multi-allelic SNPs included.
g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2) |
Which bit throws an error? Can't really follow the above (sorry)! |
This bit:
|
Ah okay so that's because certain SNP ref and alt didn't match reference database:
The error is that we can't flip FRQ unless only bi-allelic SNPs are considered, because otherwise 1 minus the current FRQ wouldn't work. So I would guess there aren't any SNPs that need flipping when you use the other dbSNP version. You can choose not to flip the FRQ column ( |
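The reason the 1 - FRQ flip only works for bi-allelic SNPs can be shown with made-up frequencies: with two alleles the frequencies sum to 1, so the other allele's frequency is 1 - FRQ, but with three alleles that identity no longer holds. A toy sketch (all values are illustrative only):

```r
# Bi-allelic site: the two allele frequencies sum to 1.
bi <- c(ref = 0.7, alt = 0.3)
flipped_bi <- 1 - bi[["alt"]]        # intended: frequency of the ref allele
ok_bi <- isTRUE(all.equal(flipped_bi, bi[["ref"]]))

# Tri-allelic site: the three allele frequencies sum to 1.
tri <- c(ref = 0.6, alt1 = 0.3, alt2 = 0.1)
flipped_tri <- 1 - tri[["alt1"]]     # equals ref + alt2, not the ref frequency
ok_tri <- isTRUE(all.equal(flipped_tri, tri[["ref"]]))

print(c(bi_allelic = ok_bi, tri_allelic = ok_tri))
```

For the tri-allelic site, 1 - FRQ lumps the reference allele together with the second alternative allele, so the flipped value no longer describes a single allele.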
Conclusion
After checking the 1KG Phase 3 paper, I found that they used dbSNP141.
Given that MAGMA relies on 1KG Phase 3 as a reference, I think it makes sense to align the GWAS sumstats to the dbSNP version that is closest to this. In the case of the
Another bonus that @Al-Murphy reminded me of is that using dbSNP144 makes
To summarise, here is a procedure I would currently recommend when preparing GWAS for input to
Recommendations
|
This will help us to determine whether we can use dbSNP155 withOUT filtering out multi-allelic SNPs, which can cause a large percentage of the data to be dropped (47% on average):
https://github.com/neurogenomics/MungeSumstats/issues/111
If MAGMA does NOT accept multi-allelic SNPs, we will have to go back to using dbSNP144, which is very outdated but, as a consequence, allows us to keep more SNPs because fewer of them were known to be multi-allelic at the time.
Conclusions summary
This has now been assessed, see here for the full explanation.
To summarise, MAGMA/MAGMA.Celltyping can indeed take GWAS with multi-allelic SNPs and/or duplicated RSIDs across rows in the GWAS sumstats, though by default these SNPs are dropped internally (see the map_snps_to_genes(duplicate=) arg documentation for alternative behaviours).
However, we currently recommend removing the non-biallelic SNPs, but ONLY if you're using MungeSumstats with the arg dbSNP=144, as using dbSNP=155 will drop a huge proportion of your SNPs.