Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assess whether MAGMA can take multi-allelic SNPs #143

Closed
bschilder opened this issue May 15, 2023 · 13 comments
Closed

Assess whether MAGMA can take multi-allelic SNPs #143

bschilder opened this issue May 15, 2023 · 13 comments
Assignees
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments

@bschilder
Copy link
Collaborator

bschilder commented May 15, 2023

This will help us to determine whether we can use dbSNP155 withOUT filtering out multi-biallelic SNPs, which can cause a large percentage of the data to be dropped (47% on average):
https://github.com/neurogenomics/MungeSumstats/issues/111

If MAGMA does NOT accept multi-allelic SNPs, we will have to go back to using dbSNP144, which is very outdated but as a consequence allows us to keep more SNPs bc less of them were known to be multi-allelic at the time.

Conclusions summary

This has now been assessed, see here for the full explanation.

To summarise, MAGMA/MAGMA.Celltyping can indeed take GWAS with multi-allelic SNPs and/or duplicated RSIDs across rows in the GWAS sumstats. Though by default these SNPs are dropped internally (see the map_snps_to_genes(duplicate=) arg documentation for alternative behaviours).

However, we currently recommend removing the non-biallelic SNPs but ONLY if you're using MungeSumstats with the arg dbSNP=144 as using dbSNP=155 will drop a huge proportion of your SNPs.

@bschilder bschilder self-assigned this May 15, 2023
@bschilder bschilder added enhancement New feature or request documentation Improvements or additions to documentation labels May 15, 2023
@bschilder
Copy link
Collaborator Author

Without multi-allelic SNPs

Munge

#### Run SNP-to-gene mapping #### 
path_formatted1 <- MungeSumstats::import_sumstats(
  ids = "ieu-a-8", 
  ref_genome = "GRCh37",
  save_dir = "~/Downloads/biallelic",
  bi_allelic_filter = TRUE)  
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Using previously downloaded VCF.
Formatted summary statistics will be saved to ==>  /Users/schilder/Downloads//ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.3 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.8 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 140 seconds.
196 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,319 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 128 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
5 SNPs are on chromosomes X, Y, MT and will be removed
Checking for bi-allelic SNPs.
1,260,488 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
191,353 SNPs (16.5%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /Users/schilder/Downloads//ieu-a-8/ieu-a-8.tsv.gz
Summary statistics report:                                                                                                           
   - 1,156,826 rows (47.9% of original 2,417,476 rows)
   - 1,156,826 unique variants
   - 65 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 8.597 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    END FILTER       FRQ       BETA        SE        LP
1: rs12565286   1 721290  G  C 721290   PASS 0.0538073  0.1282010 0.0695073 1.1862800
2: rs11804171   1 723819  T  A 723819   PASS 0.0540315  0.1297870 0.0698828 1.1987200
3:  rs4040617   1 779322  A  G 779322   PASS 0.1365300 -0.0017451 0.0268361 0.0231225
4: rs11240777   1 798959  G  A 798959   PASS 0.2061260  0.0089750 0.0302654 0.1153090
       N          P
1: 27124 0.06512084
2: 26259 0.06328197
3: 47791 0.94815098
4: 30919 0.76681571
Returning path to saved data.

ieu-a-8 : Done in 8.597 minutes.

Done with all processing in 8.6 minutes.

Map

genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
  path_formatted = path_formatted1[[1]],
  genome_build = "hg19")  
Installed MAGMA version: v1.10
Skipping MAGMA installation.
The desired_version of MAGMA is currently installed: v1.10
Using: magma_v1.10_mac
Using existing genome_ref found in storage_dir.
Saving decompressed copy of path_formatted ==>  /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv

==== MAGMA Step 1: Generate genes.annot file ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--annotate window=35,10
	--snp-loc /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv
	--gene-loc /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc
	--out /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 22:26:28, Monday 15 May 2023

Starting annotation...
Reading gene locations from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc... 
	adding window: 35000bp (before), 10000bp (after)
	19161 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 750 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 genes
	chromosome 19: 1389 genes
	chromosome 20: 527 genes
	chromosome 21: 215 genes
	chromosome 22: 442 genes
	chromosome  X: 805 genes
	chromosome  Y: 47 genes
Reading SNP locations from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv... 
	WARNING: on line 1, chromosome code 'CHR' not recognised; skipping SNP (ID = SNP)
	2417314 SNP locations read from file                                                            
	of those, 1287640 (53.27%) mapped to at least one gene
Writing annotation to file /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	for chromosome  1, 95 genes are empty (out of 2016)
	for chromosome  2, 16 genes are empty (out of 1226)
	for chromosome  3, 3 genes are empty (out of 1050)
	for chromosome  4, 23 genes are empty (out of 745)
	for chromosome  5, 12 genes are empty (out of 856)
	for chromosome  6, 1 gene is empty (out of 750)
	for chromosome  7, 20 genes are empty (out of 906)
	for chromosome  8, 39 genes are empty (out of 669)
	for chromosome  9, 43 genes are empty (out of 775)
	for chromosome 10, 16 genes are empty (out of 723)
	for chromosome 11, 4 genes are empty (out of 1275)
	for chromosome 12, 2 genes are empty (out of 1009)
	for chromosome 14, 11 genes are empty (out of 595)
	for chromosome 15, 28 genes are empty (out of 586)
	for chromosome 16, 27 genes are empty (out of 817)
	for chromosome 17, 32 genes are empty (out of 1147)
	for chromosome 19, 1 gene is empty (out of 1389)
	for chromosome 21, 3 genes are empty (out of 215)
	for chromosome 22, 12 genes are empty (out of 442)
	for chromosome  X, 805 genes are empty (out of 805)
	for chromosome  Y, 47 genes are empty (out of 47)
	at least one SNP mapped to each of a total of 17921 genes (out of 19161)


End time is 22:26:33, Monday 15 May 2023 (elapsed: 00:00:05)

==== MAGMA Step 2: Generate genes.out ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--bfile /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur
	--pval /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv ncol=N
	--gene-annot /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	--out /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 22:26:33, Monday 15 May 2023

Loading PLINK-format data...
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.fam... 503 individuals read
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bim... 22665064 SNPs read
Preparing file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bed... 

Reading SNP synonyms from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         skipped all synonym entries involved, synonymous SNPs are kept in analysis
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv... 
	detected 13 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	using variable: N (sample size; discarding SNPs with N < 50)
	read 2417315 lines from file, containing valid SNP p-values for 2387239 SNPs in data (98.76% of lines, 10.53% of SNPs in data)
Loading gene annotation from file /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot... 
	17921 gene definitions read from file
	found 17909 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.out
	writing intermediate output to file /Users/schilder/Downloads//ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.raw


End time is 22:31:31, Monday 15 May 2023 (elapsed: 00:04:58)
Warning message:
! "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc"
  already exists, skipping download.
Set `overwrite = TRUE` to overwrite files. 

With multi-allelic SNPs

Munge

 path_formatted2 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8", 
          ref_genome = "GRCh37",
          save_dir = "~/Downloads/multiallelic",
          bi_allelic_filter = FALSE) 
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Downloading VCF ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz'
Content type 'application/gzip' length 85994291 bytes (82.0 MB)
==================================================
downloaded 82.0 MB

Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi'
Content type 'application/gzip' length 1449998 bytes (1.4 MB)
==================================================
downloaded 1.4 MB

Formatted summary statistics will be saved to ==>  /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.6 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 130 seconds.
196 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,319 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 118 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicated rows.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
5 SNPs are on chromosomes X, Y, MT and will be removed
N already exists within sumstats_dt.
718,866 SNPs (29.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/ieu-a-8.tsv.gz
Summary statistics report:                                                                                                           
   - 2,417,314 rows (100% of original 2,417,476 rows)
   - 2,417,314 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 8.078 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    END FILTER       FRQ       BETA        SE        LP
1: rs12565286   1 721290  G  C 721290   PASS 0.0538073  0.1282010 0.0695073 1.1862800
2: rs11804171   1 723819  T  A 723819   PASS 0.0540315  0.1297870 0.0698828 1.1987200
3:  rs3094315   1 752566  G  A 752566   PASS 0.8249000 -0.0016609 0.0291187 0.0202171
4:  rs3131968   1 754192  A  G 754192   PASS 0.7698700 -0.0289617 0.0314537 0.4471260
       N          P
1: 27124 0.06512084
2: 26259 0.06328197
3: 36534 0.95451531
4: 22817 0.35716920
Returning path to saved data.

ieu-a-8 : Done in 9.101 minutes.

Done with all processing in 9.1 minutes.

Map

genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted2[[1]],
          genome_build = "hg19"
Installed MAGMA version: v1.10
Skipping MAGMA installation.
The desired_version of MAGMA is currently installed: v1.10
Using: magma_v1.10_mac
Using existing genome_ref found in storage_dir.
Saving decompressed copy of path_formatted ==>  /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8.tsv
Loading required namespace: piggyback

==== MAGMA Step 1: Generate genes.annot file ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--annotate window=35,10
	--snp-loc /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpRhRBhm/ieu-a-8.tsv
	--gene-loc /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc
	--out /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 22:08:55, Monday 15 May 2023

Starting annotation...
Reading gene locations from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc... 
	adding window: 35000bp (before), 10000bp (after)
	19161 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 750 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 genes
	chromosome 19: 1389 genes
	chromosome 20: 527 genes
	chromosome 21: 215 genes
	chromosome 22: 442 genes
	chromosome  X: 805 genes
	chromosome  Y: 47 genes
Reading SNP locations from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpRhRBhm/ieu-a-8.tsv... 
	WARNING: on line 1, chromosome code 'CHR' not recognised; skipping SNP (ID = SNP)
	2417314 SNP locations read from file                                                            
	of those, 1287640 (53.27%) mapped to at least one gene
Writing annotation to file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	for chromosome  1, 95 genes are empty (out of 2016)
	for chromosome  2, 16 genes are empty (out of 1226)
	for chromosome  3, 3 genes are empty (out of 1050)
	for chromosome  4, 23 genes are empty (out of 745)
	for chromosome  5, 12 genes are empty (out of 856)
	for chromosome  6, 1 gene is empty (out of 750)
	for chromosome  7, 20 genes are empty (out of 906)
	for chromosome  8, 39 genes are empty (out of 669)
	for chromosome  9, 43 genes are empty (out of 775)
	for chromosome 10, 16 genes are empty (out of 723)
	for chromosome 11, 4 genes are empty (out of 1275)
	for chromosome 12, 2 genes are empty (out of 1009)
	for chromosome 14, 11 genes are empty (out of 595)
	for chromosome 15, 28 genes are empty (out of 586)
	for chromosome 16, 27 genes are empty (out of 817)
	for chromosome 17, 32 genes are empty (out of 1147)
	for chromosome 19, 1 gene is empty (out of 1389)
	for chromosome 21, 3 genes are empty (out of 215)
	for chromosome 22, 12 genes are empty (out of 442)
	for chromosome  X, 805 genes are empty (out of 805)
	for chromosome  Y, 47 genes are empty (out of 47)
	at least one SNP mapped to each of a total of 17921 genes (out of 19161)


End time is 22:09:01, Monday 15 May 2023 (elapsed: 00:00:06)

==== MAGMA Step 2: Generate genes.out ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--bfile /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur
	--pval /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpRhRBhm/ieu-a-8.tsv ncol=N
	--gene-annot /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	--out /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 22:09:01, Monday 15 May 2023

Loading PLINK-format data...
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.fam... 503 individuals read
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bim... 22665064 SNPs read
Preparing file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bed... 

Reading SNP synonyms from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         skipped all synonym entries involved, synonymous SNPs are kept in analysis
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpRhRBhm/ieu-a-8.tsv... 
	detected 13 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	using variable: N (sample size; discarding SNPs with N < 50)
	read 2417315 lines from file, containing valid SNP p-values for 2387239 SNPs in data (98.76% of lines, 10.53% of SNPs in data)
Loading gene annotation from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot... 
	17921 gene definitions read from file
	found 17909 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.out
	writing intermediate output to file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpRhRBhm/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.raw


End time is 22:13:57, Monday 15 May 2023 (elapsed: 00:04:56)
Warning message:
! "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc"
  already exists, skipping download.
Set `overwrite = TRUE` to overwrite files. 

Session info

``` R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Ventura 13.2.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0

loaded via a namespace (and not attached):
[1] backports_1.4.1
[2] AnnotationHub_3.6.0
[3] BiocFileCache_2.6.1
[4] plyr_1.8.8
[5] googleAuthR_2.0.1
[6] lazyeval_0.2.2
[7] splines_4.2.1
[8] orthogene_1.5.3
[9] ewceData_1.7.1
[10] BiocParallel_1.32.6
[11] ggplot2_3.4.2
[12] digest_0.6.31
[13] yulab.utils_0.0.6
[14] htmltools_0.5.5
[15] rsconnect_0.8.29
[16] RNOmni_1.0.1
[17] fansi_1.0.4
[18] magrittr_2.0.3
[19] memoise_2.0.1
[20] BSgenome_1.66.3
[21] limma_3.54.2
[22] Biostrings_2.66.0
[23] matrixStats_0.63.0
[24] R.utils_2.12.2
[25] timechange_0.2.0
[26] prettyunits_1.1.1
[27] colorspace_2.1-0
[28] blob_1.2.4
[29] rappdirs_0.3.3
[30] gitcreds_0.1.2
[31] dplyr_1.1.2
[32] crayon_1.5.2
[33] RCurl_1.98-1.12
[34] jsonlite_1.8.4
[35] lme4_1.1-33
[36] VariantAnnotation_1.44.1
[37] MAGMA.Celltyping_2.0.8
[38] ape_5.7-1
[39] glue_1.6.2
[40] gargle_1.4.0
[41] gtable_0.3.3
[42] zlibbioc_1.44.0
[43] XVector_0.38.0
[44] HGNChelper_0.8.1
[45] DelayedArray_0.24.0
[46] car_3.1-2
[47] SingleCellExperiment_1.20.1
[48] abind_1.4-5
[49] scales_1.2.1
[50] DBI_1.1.3
[51] rstatix_0.7.2
[52] Rcpp_1.0.10
[53] viridisLite_0.4.2
[54] xtable_1.8-4
[55] progress_1.2.2
[56] gridGraphics_0.5-1
[57] tidytree_0.4.2
[58] bit_4.0.5
[59] SNPlocs.Hsapiens.dbSNP155.GRCh37_0.99.23
[60] htmlwidgets_1.6.2
[61] httr_1.4.5
[62] ellipsis_0.3.2
[63] pkgconfig_2.0.3
[64] XML_3.99-0.14
[65] R.methodsS3_1.8.2
[66] dbplyr_2.3.2
[67] httr2_0.2.2
[68] utf8_1.2.3
[69] ggplotify_0.1.0
[70] tidyselect_1.2.0
[71] rlang_1.1.1
[72] reshape2_1.4.4
[73] later_1.3.1
[74] AnnotationDbi_1.60.2
[75] munsell_0.5.0
[76] BiocVersion_3.16.0
[77] tools_4.2.1
[78] cachem_1.0.8
[79] cli_3.6.1
[80] generics_0.1.3
[81] RSQLite_2.3.1
[82] ExperimentHub_2.6.0
[83] MungeSumstats_1.6.0
[84] broom_1.0.4
[85] ggdendro_0.1.23
[86] stringr_1.5.0
[87] fastmap_1.1.1
[88] yaml_2.3.7
[89] ggtree_3.6.2
[90] fs_1.6.2
[91] babelgene_22.9
[92] bit64_4.0.5
[93] purrr_1.0.1
[94] gh_1.4.0
[95] KEGGREST_1.38.0
[96] gprofiler2_0.2.1
[97] nlme_3.1-162
[98] mime_0.12
[99] R.oo_1.25.0
[100] grr_0.9.5
[101] aplot_0.1.10
[102] xml2_1.3.4
[103] biomaRt_2.54.1
[104] brio_1.1.3
[105] compiler_4.2.1
[106] rstudioapi_0.14
[107] plotly_4.10.1
[108] filelock_1.0.2
[109] curl_5.0.0
[110] png_0.1-8
[111] interactiveDisplayBase_1.36.0
[112] testthat_3.1.8
[113] ggsignif_0.6.4
[114] treeio_1.23.1
[115] tibble_3.2.1
[116] EWCE_1.9.0
[117] homologene_1.4.68.19.3.27
[118] stringi_1.7.12
[119] GenomicFeatures_1.50.4
[120] GenomicFiles_1.34.0
[121] lattice_0.21-8
[122] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1
[123] Matrix_1.5-4
[124] nloptr_2.0.3
[125] vctrs_0.6.2
[126] pillar_1.9.0
[127] lifecycle_1.0.3
[128] BiocManager_1.30.20
[129] data.table_1.14.8
[130] bitops_1.0-7
[131] httpuv_1.6.9
[132] patchwork_1.1.2
[133] rtracklayer_1.58.0
[134] GenomicRanges_1.50.2
[135] R6_2.5.1
[136] BiocIO_1.8.0
[137] promises_1.2.0.1
[138] codetools_0.2-19
[139] boot_1.3-28.1
[140] MASS_7.3-60
[141] assertthat_0.2.1
[142] SummarizedExperiment_1.28.0
[143] rjson_0.2.21
[144] GenomicAlignments_1.34.1
[145] Rsamtools_2.14.0
[146] GenomeInfoDbData_1.2.9
[147] parallel_4.2.1
[148] hms_1.1.3
[149] grid_4.2.1
[150] ggfun_0.0.9
[151] minqa_1.2.5
[152] tidyr_1.3.0
[153] MatrixGenerics_1.10.0
[154] carData_3.0-5
[155] ggpubr_0.6.0
[156] piggyback_0.1.4
[157] lubridate_1.9.2
[158] Biobase_2.58.0
[159] shiny_1.7.4
[160] restfulr_0.0.15

</details>

@bschilder
Copy link
Collaborator Author

bschilder commented May 16, 2023

Comparison

@NathanSkene @Al-Murphy

In this GWAS, 1,260,488 / 2,417,476 (52%) SNPs are non-biallelic.

Shockingly, this had 0 impact on the resulting MAGMA files. Like... the genes.out files are identical. I thought for sure i must have made a mistake somewhere but reran it multiple times and made sure i was importing the right files...and the result held.

GWAS sumstats

Indeed, the multialellic version have 2x as many SNPs.

## biallelic
dt1 <- data.table::fread( path_formatted1[[1]])
nrow(dt1)
# 1156826

dt2 <- data.table::fread( path_formatted2[[1]])
nrow(dt2)
# 2417314

MAGMA files

...and yet,

g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2)
# TRUE

I think this may have to do with a standard reference being used in both (1000 Genomes, EUR population) during the mapping of SNPS to genes by MAGMA. Variants that don't exist in this reference are not considered during the mapping. While you could argue this throws away a lot of info, it also means that we can input multi-allelic/bi-allelic versions of the same dataset with little to no impact on the resulting MAGMA gene files.

In conclusion, this means we can go ahead and munge GWAS using dbSNP155 + bi_allelic_filter = FALSE without worrying about compatibility with MAGMA. Which is good news, because we can use the most updated dbSNP version and preserve multi-allelic variants in the munged GWAS.

@Al-Murphy
Copy link
Collaborator

Probably worth checking MAGMA doesn't do any filtering for non-bi-allelic SNPs itself specifically. I would assume some of the bi-allelic SNPs (according to dbSNP 144) in the 1000 genomes EUR pop would now have found alternative alleles in dbSNP 155 so therefore there would be differences

@bschilder
Copy link
Collaborator Author

bschilder commented May 16, 2023

Probably worth checking MAGMA doesn't do any filtering for non-bi-allelic SNPs itself specifically. I would assume some of the bi-allelic SNPs (according to dbSNP 144) in the 1000 genomes EUR pop would now have found alternative alleles in dbSNP 155 so therefore there would be differences

Very good point @Al-Murphy !

From the MAGMA docs (v1.10):

The duplicate modifier can be used to specify the desired behaviour for dealing with duplicate SNPs in the file, and can be set to one of four values: ‘drop’, ‘first’, ‘last’, and ‘error’. When set to ‘drop’, the corresponding SNP is removed from the analysis entirely. When set to ‘first’ or ‘last’, either the first or the last entry for that SNPs in the file is used. When set to ‘error’, the program terminates if encountering any duplicate SNPs. The default mode is ‘duplicate=drop’. Note that SNPs are only checked for duplication if they are present in the genotype data, and if they have a non-missing pvalue (and sample size, if ncol is set). When synonymous SNP IDs have been loaded, different SNP IDs referring to the same SNP are considered duplicates as well. Unless duplicate is set to ‘error’, a list of duplicate SNPs will be written to the supplementary log file.

Since I think it counts "duplicates" as SNPs that occur in more than one row, this makes sense why the genes.out results would be identical with/without multi-allelic SNPs.

Let me try switching the flag to "first". While arbitrary, this would at least prevent the multi-allelic SNPs from being dropped entirely.

@bschilder bschilder reopened this May 16, 2023
@bschilder
Copy link
Collaborator Author

Just reran with the new argument duplicate=first and the results are still identical:

 #### Bi-allelic vs. Multi-allelic GWAS sumstats ####
        ### Run SNP-to-gene mapping ####
        path_formatted1 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          save_dir = "~/Downloads/biallelic",
          bi_allelic_filter = TRUE)
        ## Map
        dt1 <- data.table::fread( path_formatted1[[1]])
        genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted1[[1]],
          force_new = TRUE,
          duplicate = "first",
          genome_build = "hg19")

        #### With multi-allelic SNPs
        ## Munge
        path_formatted2 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          save_dir = "~/Downloads/mutliallelic",
          bi_allelic_filter = FALSE)
        dt2 <- data.table::fread( path_formatted2[[1]])
        ## Map
        genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted2[[1]],
          force_new = TRUE,
          duplicate = "first",
          genome_build = "hg19")


        g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
        g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
        all.equal(g1,g2)
# TRUE

I'll check to see if there's an additional step somewhere in MAGMA that deals specifically with non-biallelic SNPs.

@bschilder
Copy link
Collaborator Author

bschilder commented May 16, 2023

Another arg to consider, synonym-dup:

When loading SNP ID synonyms, MAGMA may detect SNP IDs in the genotype data that are synonyms of each other. The synonym-dup modifier for the --bfile flag can be used to specify the desired behaviour for dealing with such SNPs. This modifier can be set to one of four values: ‘drop’, ‘drop-dup’, ‘skip’, ‘skip-dup’ and ‘error’. When set to ‘drop’, SNPs that have multiple synonyms in the data are removed from the analysis. Conversely, when set to ‘skip’ the SNPs are left in the data and the synonym entry in the synonym file is skipped. When set to ‘drop-dup’, for each synonym entry only the first listed in the synonym file is retained; for subsequent SNP IDs in the same entry that are found in the data are removed, and their IDs are mapped as synonyms to the first SNP. When set to ‘skipdup’ the genotype data for all synonymous SNPs is retained; SNP IDs not found in the data are mapped to the first SNP in the synonym entry that is. Finally, when set to ‘error’, the program will simply terminate when encountering synonymous SNPs in the data. The default mode is ‘synonym-dup=skip’. Unless synonym-dup is set to error, a list of synonymous SNPs in the data will be written to the supplementary log file.

I'll add this flag as well.

@bschilder
Copy link
Collaborator Author

Repeated using synonym_dup = "skip-dup" as well. Results are still identical.

 #### Bi-allelic vs. Multi-allelic GWAS sumstats ####
        ### Run SNP-to-gene mapping ####
        path_formatted1 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          dbSNP = 155,
          save_dir = "~/Downloads/biallelic",
          bi_allelic_filter = TRUE)
        ## Map
        dt1 <- data.table::fread( path_formatted1[[1]])
        genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted1[[1]],
          force_new = TRUE,
          duplicate = "first",
          synonym_dup = "skip-dup",
          genome_build = "hg19")

        #### With multi-allelic SNPs
        ## Munge
        path_formatted2 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          dbSNP = 155,
          save_dir = "~/Downloads/mutliallelic",
          bi_allelic_filter = FALSE)
        dt2 <- data.table::fread( path_formatted2[[1]])
        ## Map
        genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted2[[1]],
          force_new = TRUE,
          duplicate = "first",
          synonym_dup = "skip-dup",
          genome_build = "hg19")


        g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
        g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
        all.equal(g1,g2)
# TRUE

@bschilder
Copy link
Collaborator Author

I also look through the docs for the annotation pre-step, but there's not many options and none of them would relate to dropping multi-allelic SNPs since we're note using the filter flag anywhere (e.g. --annotate filter=example.bim).

@bschilder
Copy link
Collaborator Author

bschilder commented May 26, 2023

dbSNP 144

Rerunning with dbSNP v144.

Without multi-allelic SNPs

### dbSNP version
dbSNP <- 144 # 155 

Munge

        path_formatted1 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          dbSNP = dbSNP,
          save_dir = "~/Downloads/biallelic",
          bi_allelic_filter = TRUE) 
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Downloading VCF ==> /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.vcf.gz
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz'
Content type 'application/gzip' length 85994291 bytes (82.0 MB)
==================================================
downloaded 82.0 MB

Downloading VCF index ==> https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi
Downloading with download.file.
trying URL 'https://gwas.mrcieu.ac.uk/files/ieu-a-8/ieu-a-8.vcf.gz.tbi'
Content type 'application/gzip' length 1449998 bytes (1.4 MB)
==================================================
downloaded 1.4 MB

Formatted summary statistics will be saved to ==>  /Users/schilder/Downloads/biallelic/ieu-a-8/ieu-a-8.tsv.gz
Loading required namespace: GenomicFiles
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.8 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
Loading required package: BiocGenerics

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl,
    intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste,
    pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames,
    sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max,
    which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: ‘S4Vectors’

The following objects are masked from ‘package:base’:

    expand.grid, I, unname

BSgenome::snpsById done in 60 seconds.
9,366 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,408,241 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 35 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 20 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicate SNPs from SNP ID.
Checking for SNPs with duplicated base-pair positions.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
4 SNPs are on chromosomes X, Y, MT and will be removed
Checking for bi-allelic SNPs.
62,695 SNPs are non-biallelic. These will be removed.
N already exists within sumstats_dt.
697,365 SNPs (29.7%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /Users/schilder/Downloads/biallelic/ieu-a-8/ieu-a-8.tsv.gz
Summary statistics report:                                                                                                           
   - 2,345,520 rows (97% of original 2,417,476 rows)
   - 2,345,520 unique variants
   - 154 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 5.914 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    END FILTER       FRQ       BETA        SE        LP
1: rs12565286   1 721290  G  C 721290   PASS 0.0538073  0.1282010 0.0695073 1.1862800
2: rs11804171   1 723819  T  A 723819   PASS 0.0540315  0.1297870 0.0698828 1.1987200
3:  rs3094315   1 752566  G  A 752566   PASS 0.8249000 -0.0016609 0.0291187 0.0202171
4:  rs3131968   1 754192  A  G 754192   PASS 0.7698700 -0.0289617 0.0314537 0.4471260
       N          P
1: 27124 0.06512084
2: 26259 0.06328197
3: 36534 0.95451531
4: 22817 0.35716920
Returning path to saved data.

ieu-a-8 : Done in 6.113 minutes.

Done with all processing in 6.11 minutes.
Warning messages:
1: package ‘S4Vectors’ was built under R version 4.2.2 
2: package ‘GenomeInfoDb’ was built under R version 4.2.2

Map

        genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted1[[1]],
          force_new = TRUE,
          duplicate = "first",
          synonym_dup = "skip-dup",
          genome_build = "hg19")
Installed MAGMA version: v1.10
Skipping MAGMA installation.
The desired_version of MAGMA is currently installed: v1.10
Using: magma_v1.10_mac
Using existing genome_ref found in storage_dir.
Saving decompressed copy of path_formatted ==>  /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv
Loading required namespace: piggyback

==== MAGMA Step 1: Generate genes.annot file ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--annotate window=35,10
	--snp-loc /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpY4Jqxc/ieu-a-8.tsv
	--gene-loc /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc
	--out /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 13:21:17, Friday 26 May 2023

Starting annotation...
Reading gene locations from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc... 
	adding window: 35000bp (before), 10000bp (after)
	19161 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 750 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 genes
	chromosome 19: 1389 genes
	chromosome 20: 527 genes
	chromosome 21: 215 genes
	chromosome 22: 442 genes
	chromosome  X: 805 genes
	chromosome  Y: 47 genes
Reading SNP locations from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpY4Jqxc/ieu-a-8.tsv... 
	WARNING: on line 1, chromosome code 'CHR' not recognised; skipping SNP (ID = SNP)
	2345520 SNP locations read from file                                                            
	of those, 1251310 (53.35%) mapped to at least one gene
Writing annotation to file /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	for chromosome  1, 98 genes are empty (out of 2016)
	for chromosome  2, 16 genes are empty (out of 1226)
	for chromosome  3, 6 genes are empty (out of 1050)
	for chromosome  4, 23 genes are empty (out of 745)
	for chromosome  5, 12 genes are empty (out of 856)
	for chromosome  6, 1 gene is empty (out of 750)
	for chromosome  7, 20 genes are empty (out of 906)
	for chromosome  8, 39 genes are empty (out of 669)
	for chromosome  9, 43 genes are empty (out of 775)
	for chromosome 10, 16 genes are empty (out of 723)
	for chromosome 11, 4 genes are empty (out of 1275)
	for chromosome 12, 15 genes are empty (out of 1009)
	for chromosome 14, 11 genes are empty (out of 595)
	for chromosome 15, 28 genes are empty (out of 586)
	for chromosome 16, 27 genes are empty (out of 817)
	for chromosome 17, 32 genes are empty (out of 1147)
	for chromosome 19, 9 genes are empty (out of 1389)
	for chromosome 21, 3 genes are empty (out of 215)
	for chromosome 22, 12 genes are empty (out of 442)
	for chromosome  X, 805 genes are empty (out of 805)
	for chromosome  Y, 47 genes are empty (out of 47)
	at least one SNP mapped to each of a total of 17894 genes (out of 19161)


End time is 13:21:22, Friday 26 May 2023 (elapsed: 00:00:05)

==== MAGMA Step 2: Generate genes.out ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--bfile /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur synonym-dup=skip-dup
	--pval /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpY4Jqxc/ieu-a-8.tsv ncol=N duplicate=first
	--gene-annot /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	--out /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 13:21:22, Friday 26 May 2023

Loading PLINK-format data...
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.fam... 503 individuals read
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bim... 22665064 SNPs read
Preparing file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bed... 

Reading SNP synonyms from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         mapped all IDs for each synonym entry to first ID occurring in data, except for other IDs present in data
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T/RtmpY4Jqxc/ieu-a-8.tsv... 
	detected 13 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	using variable: N (sample size; discarding SNPs with N < 50)
	read 2345521 lines from file, containing valid SNP p-values for 2333043 SNPs in data (99.47% of lines, 10.29% of SNPs in data)
Loading gene annotation from file /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot... 
	17894 gene definitions read from file
	found 17882 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.out
	writing intermediate output to file /Users/schilder/Downloads/biallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.raw


End time is 13:26:03, Friday 26 May 2023 (elapsed: 00:04:41)
Warning message:
! "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc"
  already exists, skipping download.
Set `overwrite = TRUE` to overwrite files. 

With multi-allelic SNPs

Munge

path_formatted2 <- MungeSumstats::import_sumstats(
      ids = "ieu-a-8",
      ref_genome = "GRCh37",
      dbSNP = dbSNP,
      save_dir = "~/Downloads/mutliallelic",
      bi_allelic_filter = FALSE) 
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Using previously downloaded VCF.
Formatted summary statistics will be saved to ==>  /Users/schilder/Downloads/mutliallelic/ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.7 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.

Map

genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
  path_formatted = path_formatted2[[1]],
  force_new = TRUE,
  duplicate = "first",
  synonym_dup = "skip-dup",
  genome_build = "hg19")
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Using previously downloaded VCF.
Formatted summary statistics will be saved to ==>  /Users/schilder/Downloads/mutliallelic/ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.7 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 47 seconds.
9,366 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,408,241 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 57 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 20 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Certain SNPs need to be flipped along with their effect columns and frequency column. However to flip the
FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so
non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note
this could lead to incorrect FRQ values.
Done with all processing in 5.31 minutes.

Interestingly, this returns an error when using dbSNP 144, but didn't cause an error using dbSNP 155. Is this difference expected @Al-Murphy ?

$`ieu-a-8`
<simpleError in check_allele_flip(sumstats_dt = sumstats_return$sumstats_dt,     path = path, ref_genome = ref_genome, rsids = rsids, allele_flip_check = allele_flip_check,     allele_flip_drop = allele_flip_drop, allele_flip_z = allele_flip_z,     allele_flip_frq = allele_flip_frq, bi_allelic_filter = bi_allelic_filter,     imputation_ind = imputation_ind, log_folder_ind = log_folder_ind,     check_save_out = check_save_out, tabix_index = tabix_index,     nThread = nThread, log_files = log_files, mapping_file = mapping_file,     dbSNP = dbSNP): Certain SNPs need to be flipped along with their effect columns and frequency column. However to flip the
FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so
non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note
this could lead to incorrect FRQ values.>

Map (attempt 2)

Rerunning with allele_flip_frq = FALSE works fine.

path_formatted2 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37",
          dbSNP = dbSNP,
          save_dir = "~/Downloads/mutliallelic",
          allele_flip_frq = FALSE,
          bi_allelic_filter = FALSE)
Processing 1 datasets from Open GWAS.

========== Processing dataset : ieu-a-8 ==========

Using previously downloaded VCF.
Formatted summary statistics will be saved to ==>  /Users/schilder/Downloads/mutliallelic/ieu-a-8/ieu-a-8.tsv.gz
Using local VCF.
File already tabix-indexed.
Finding empty VCF columns based on first 10,000 rows.
Dropping 1 duplicate column(s).
1 sample detected: ieu-a-8
Constructing ScanVcfParam object.
VCF contains: 2,417,476 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 4 columns.
Time difference of 2.1 mins
VCF data.table contains: 2,417,476 rows x 12 columns.
Time difference of 2.6 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP	chr	BP	end	REF	ALT	FILTER	AF	ES	SE	LP	SS	P	
Summary statistics report:
   - 2,417,476 rows
   - 2,417,476 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 23 chromosomes
Checking for multi-GWAS.
Checking for multiple RSIDs on one row.
Checking SNP RSIDs.
Checking for merged allele column.
Checking A1 is uppercase
Checking A2 is uppercase
Ensuring all SNPs are on the reference genome.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,417,476 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 37 seconds.
9,366 SNPs are not on the reference genome. These will be corrected from the reference genome.
Loading SNPlocs data.
Loading SNPlocs data.
Loading reference genome data.
Preprocessing RSIDs.
Validating RSIDs of 2,408,241 SNPs using BSgenome::snpsById...
BSgenome::snpsById done in 59 seconds.
Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 20 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.
Reordering so first three column headers are SNP, CHR and BP in this order.
Reordering so the fourth and fifth columns are A1 and A2.
Checking for missing data.
Checking for duplicate columns.
Ensuring that the N column is all integers.
The sumstats N column is not all integers, this could effect downstream analysis. These will be converted to integers.
Checking for duplicated rows.
INFO column not available. Skipping INFO score filtering step.
Filtering SNPs, ensuring SE>0.
Ensuring all SNPs have N<5 std dev above mean.
Removing 'chr' prefix from CHR.
Making X/Y/MT CHR uppercase.
4 SNPs are on chromosomes X, Y, MT and will be removed
N already exists within sumstats_dt.
717,034 SNPs (29.8%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
The FRQ column was mapped from one of the following from the inputted  summary statistics file:
FRQ, EAF, FREQUENCY, FRQ_U, F_U, MAF, FREQ, FREQ_TESTED_ALLELE, FRQ_TESTED_ALLELE, FREQ_EFFECT_ALLELE, FRQ_EFFECT_ALLELE, EFFECT_ALLELE_FREQUENCY, EFFECT_ALLELE_FREQ, EFFECT_ALLELE_FRQ, A1FREQ, A1FRQ, A2FREQ, A2FRQ, ALLELE_FREQUENCY, ALLELE_FREQ, ALLELE_FRQ, AF, MINOR_AF, EFFECT_AF, A2_AF, EFF_AF, ALT_AF, ALTERNATIVE_AF, INC_AF, A_2_AF, TESTED_AF, AF1, ALLELEFREQ, ALT_FREQ, EAF_HRC, EFFECTALLELEFREQ, FREQ.A1.1000G.EUR, FREQ.A1.ESP.EUR, FREQ.ALLELE1.HAPMAPCEU, FREQ.B, FREQ1, FREQ1.HAPMAP, FREQ_EUROPEAN_1000GENOMES, FREQ_HAPMAP, FREQ_TESTED_ALLELE_IN_HRS, FRQ_A1, FRQ_U_113154, FRQ_U_31358, FRQ_U_344901, FRQ_U_43456, POOLED_ALT_AF, AF_ALT, AF.ALT, AF-ALT, ALT.AF, ALT-AF, A2.AF, A2-AF, AF.EFF, AF_EFF, AF_EFF
As frq_is_maf=TRUE, the FRQ column will not be renamed. If the FRQ values were intended to represent major allele frequency,
set frq_is_maf=FALSE to rename the column as MAJOR_ALLELE_FRQ and differentiate it from minor/effect allele frequency.
Sorting coordinates.
Writing in tabular format ==> /Users/schilder/Downloads/mutliallelic/ieu-a-8/ieu-a-8.tsv.gz
Summary statistics report:                                                                                                           
   - 2,408,215 rows (99.6% of original 2,417,476 rows)
   - 2,408,215 unique variants
   - 160 genome-wide significant variants (P<5e-8)
   - 22 chromosomes
Done munging in 5.576 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
          SNP CHR     BP A1 A2    END FILTER       FRQ       BETA        SE        LP
1: rs12565286   1 721290  G  C 721290   PASS 0.0538073  0.1282010 0.0695073 1.1862800
2: rs11804171   1 723819  T  A 723819   PASS 0.0540315  0.1297870 0.0698828 1.1987200
3:  rs3094315   1 752566  G  A 752566   PASS 0.8249000 -0.0016609 0.0291187 0.0202171
4:  rs3131968   1 754192  A  G 754192   PASS 0.7698700 -0.0289617 0.0314537 0.4471260
       N          P
1: 27124 0.06512084
2: 26259 0.06328197
3: 36534 0.95451531
4: 22817 0.35716920
Returning path to saved data.

ieu-a-8 : Done in 5.576 minutes.

Done with all processing in 5.58 minutes.

Map

  genesOutPath2 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted2[[1]],
          force_new = TRUE,
          duplicate = "first",
          synonym_dup = "skip-dup",
          genome_build = "hg19") 
Installed MAGMA version: v1.10
Skipping MAGMA installation.
The desired_version of MAGMA is currently installed: v1.10
Using: magma_v1.10_mac
Using existing genome_ref found in storage_dir.
Saving decompressed copy of path_formatted ==>  /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv

==== MAGMA Step 1: Generate genes.annot file ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--annotate window=35,10
	--snp-loc /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv
	--gene-loc /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc
	--out /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 13:43:37, Friday 26 May 2023

Starting annotation...
Reading gene locations from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc... 
	adding window: 35000bp (before), 10000bp (after)
	19161 gene locations read from file
	chromosome  1: 2016 genes
	chromosome  2: 1226 genes
	chromosome  3: 1050 genes
	chromosome  4: 745 genes
	chromosome  5: 856 genes
	chromosome  6: 750 genes
	chromosome  7: 906 genes
	chromosome  8: 669 genes
	chromosome  9: 775 genes
	chromosome 10: 723 genes
	chromosome 11: 1275 genes
	chromosome 12: 1009 genes
	chromosome 13: 320 genes
	chromosome 14: 595 genes
	chromosome 15: 586 genes
	chromosome 16: 817 genes
	chromosome 17: 1147 genes
	chromosome 18: 271 genes
	chromosome 19: 1389 genes
	chromosome 20: 527 genes
	chromosome 21: 215 genes
	chromosome 22: 442 genes
	chromosome  X: 805 genes
	chromosome  Y: 47 genes
Reading SNP locations from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv... 
	WARNING: on line 1, chromosome code 'CHR' not recognised; skipping SNP (ID = SNP)
	2345520 SNP locations read from file                                                            
	of those, 1251310 (53.35%) mapped to at least one gene
Writing annotation to file /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	for chromosome  1, 98 genes are empty (out of 2016)
	for chromosome  2, 16 genes are empty (out of 1226)
	for chromosome  3, 6 genes are empty (out of 1050)
	for chromosome  4, 23 genes are empty (out of 745)
	for chromosome  5, 12 genes are empty (out of 856)
	for chromosome  6, 1 gene is empty (out of 750)
	for chromosome  7, 20 genes are empty (out of 906)
	for chromosome  8, 39 genes are empty (out of 669)
	for chromosome  9, 43 genes are empty (out of 775)
	for chromosome 10, 16 genes are empty (out of 723)
	for chromosome 11, 4 genes are empty (out of 1275)
	for chromosome 12, 15 genes are empty (out of 1009)
	for chromosome 14, 11 genes are empty (out of 595)
	for chromosome 15, 28 genes are empty (out of 586)
	for chromosome 16, 27 genes are empty (out of 817)
	for chromosome 17, 32 genes are empty (out of 1147)
	for chromosome 19, 9 genes are empty (out of 1389)
	for chromosome 21, 3 genes are empty (out of 215)
	for chromosome 22, 12 genes are empty (out of 442)
	for chromosome  X, 805 genes are empty (out of 805)
	for chromosome  Y, 47 genes are empty (out of 47)
	at least one SNP mapped to each of a total of 17894 genes (out of 19161)


End time is 13:43:42, Friday 26 May 2023 (elapsed: 00:00:05)

==== MAGMA Step 2: Generate genes.out ====

Welcome to MAGMA v1.10 (custom)
Using flags:
	--bfile /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur synonym-dup=skip-dup
	--pval /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv ncol=N duplicate=first
	--gene-annot /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot
	--out /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN

Start time is 13:43:42, Friday 26 May 2023

Loading PLINK-format data...
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.fam... 503 individuals read
Reading file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bim... 22665064 SNPs read
Preparing file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.bed... 

Reading SNP synonyms from file /Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/g1000_eur.synonyms (auto-detected)
	read 6016767 mapped synonyms from file, mapping to 3921040 SNPs in the data
	WARNING: detected 133 synonymous SNP pairs in the data
	         mapped all IDs for each synonym entry to first ID occurring in data, except for other IDs present in data
	         writing list of detected synonyms in data to supplementary log file
Reading SNP p-values from file /var/folders/zq/h7mtybc533b1qzkys_ttgpth0000gn/T//RtmpY4Jqxc/ieu-a-8.tsv... 
	detected 13 variables in file
	using variable: SNP (SNP id)
	using variable: P (p-value)
	using variable: N (sample size; discarding SNPs with N < 50)
	read 2345521 lines from file, containing valid SNP p-values for 2333043 SNPs in data (99.47% of lines, 10.29% of SNPs in data)
Loading gene annotation from file /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.annot... 
	17894 gene definitions read from file
	found 17882 genes containing valid SNPs in genotype data


Starting gene analysis... 
	using model: SNPwise-mean
	writing gene analysis results to file /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.out
	writing intermediate output to file /Users/schilder/Downloads/mutliallelic/ieu-a-8/MAGMA_Files/ieu-a-8.tsv.gz.35UP.10DOWN/ieu-a-8.tsv.gz.35UP.10DOWN.genes.raw


End time is 13:48:30, Friday 26 May 2023 (elapsed: 00:04:48)
Warning message:
! "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/NCBI37.3.gene.loc"
  already exists, skipping download.
Set `overwrite = TRUE` to overwrite files. 

Compare GWAS files

Sumstats without multi-allelic SNPs has less rows as expected. Also, the number of rows removed using dbSNP144 is far less than when using dbSNP155. This is also expected bc the latter annotates far more SNPs as multi-allelic.

dt1 <- data.table::fread( path_formatted1[[1]])
dt2 <- data.table::fread( path_formatted2[[1]])
testthat::expect_lt(nrow(dt1), nrow(dt2))
# TRUE 
# (2345520 rows vs. 2408215 rows)

Check for duplicated RSIDS

Both versions of the GWAS sumstats (with/without multiallelic SNPs) do not have any RSIDs that appear in >1 row! This confirms the theory @Al-Murphy and I discussed yesterday that there is a difference between two different ways of thinking about multi-allelic SNPs:

  1. Multi-allelic SNPs observed in the data. In other words, the RSID appears in >1 row. In this particular GWAS, there are no instances of this. I confirmed with @Al-Murphy that multi-allelic SNPs aren't simply getting dropped at an earlier MungeSumstats step (filtering duplicate SNPs) bc that steps takes into account RSID+A1+A2, which would make each multi-allelic SNP unique.
  2. Multi-allelic SNPs observed in the reference. Even when an RSID only appears once per row in the GWAS, it can still be counted as multi-allelic bc MungeSumstats uses dbSNP to determine which RSIDs count as a bi-allelic vs. multi-allelic SNPs. This is exactly what is happening in the example here.
  3. Multi-allelic SNPs observed in the reference (single vs. all populations). Another related issue is that SNPs can be biallelic in one population, but multi-allelic in another population. Currently, MungeSumstats does not differentiate between source populations when determining biallelic/multiallelic status. Thus, a SNP can be multi-allelic in a global population sense, but not within the population being used within a given GWAS. @Al-Murphy I'll make a note of this in the MungeSumstats repo to investigate this further.
testthat::expect_equal(sum(duplicated(dt1$SNP)),0)
# TRUE
testthat::expect_equal(sum(duplicated(dt2$SNP)),0)
# TRUE

Check for 1KG SNPs

MAGMA uses 1KG Phase 3 as a reference and only takes RSIDs found within that reference (or synonyms of those RSIDs) for further analysis.

By reading in the MAGMA-distributed 1KG Phase 3 ref files, the number of SNPs found in each GWAS version (with/without multi-allelic SNPs) differs by 45k+.

        f <- list.files(
          path = "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/",
          pattern = "bed$|bim$|fam$",
          full.names = TRUE)
        bed <- plinkr::read_plink_bed_file_from_files(
          bed_filename = f[1],
          bim_filename = f[2], 
          fam_filename = f[3])
        f_syn <- list.files(
          path = "/Users/schilder/Library/Caches/org.R-project.R/R/MAGMA.Celltyping/g1000_eur/",
          pattern = "synonyms$", 
          full.names = TRUE)
        syn <- readLines(f_syn)[-c(1,2)]
        snps_syn <- paste0("rs",unlist(strsplit(syn," "))) 
        snps <- unique(c(rownames(bed),snps_syn))
        snps1 <- intersect(snps,dt1$SNP)
        snps2 <- intersect(snps,dt2$SNP)

>         length(snps1)
[1] 2333043
>         length(snps2)
[1] 2378383
>         length(snps2)- length(snps1)
[1] 45340

Compare MAGMA files

As before, the MAGMA files are identical ,with or without multi-allelic SNPs included.

g1 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath1)
g2 <- MAGMA.Celltyping:::read_magma_genes_out(genesOutPath2)
all.equal(g1,g2)

@Al-Murphy
Copy link
Collaborator

Which bit throws an error? Can't really follow the above (sorry)!

@bschilder
Copy link
Collaborator Author

Which bit throws an error? Can't really follow the above (sorry)!

This bit:

$ieu-a-8
<simpleError in check_allele_flip(sumstats_dt = sumstats_return$sumstats_dt, path = path, ref_genome = ref_genome, rsids = rsids, allele_flip_check = allele_flip_check, allele_flip_drop = allele_flip_drop, allele_flip_z = allele_flip_z, allele_flip_frq = allele_flip_frq, bi_allelic_filter = bi_allelic_filter, imputation_ind = imputation_ind, log_folder_ind = log_folder_ind, check_save_out = check_save_out, tabix_index = tabix_index, nThread = nThread, log_files = log_files, mapping_file = mapping_file, dbSNP = dbSNP): Certain SNPs need to be flipped along with their effect columns and frequency column. However to flip the
FRQ column, only bi-allelic SNPs can be considered. It is recommended to set bi_allelic_filter to TRUE so
non-bi-allelic SNPs are removed. Otherwise, set allele_flip_frq to FALSE to not flip the FRQ column but note
this could lead to incorrect FRQ values.>

@Al-Murphy
Copy link
Collaborator

Ah okay so that's because certain SNP ref and alt didn't match reference database:

Checking for correct direction of A1 (reference) and A2 (alternative allele).
There are 22 SNPs where neither A1 nor A2 match the reference genome. These will be removed.
There are 20 SNPs where A1 doesn't match the reference genome.
These will be flipped with their effect columns.

The error is that we can't flip FRQ unless only bi-allelic SNPs are considered because otherwise the 1 minus current frq wouldn't work. So I would say there aren't any when you use the other dbSNP version, you can choose to not flip the FRQ column (allele_flip_frq parameter) if you aren't using FRQ this would be fine or otherwise you can choose to turn off flip checks allele_flip_check.
I guess it would be good to also have an extra parameter on this to have an option to remove these cases i.e. if flipping needed but can't flip because non-bi-allelic used but it's not something I have added

@bschilder
Copy link
Collaborator Author

bschilder commented May 26, 2023

Conclusion

After checking the 1KG Phase 3 paper, I found that they used dbSNP141.

The project has now contributed or validated 80 million of the 100 million variants in the public dbSNP catalogue (version 141 includes 40 million SNPs and indels newly contributed by this analysis).

Given that MAGMA relies on 1KG Phase 3 as a reference, I think it makes sense to align the GWAS sumstats to the dbSNP version that is closest to this.

In the case of the MungeSumstats, our best available option would be dbSNP=144, as a Bioc distribution of SNPlocs.Hsapiens.dbSNP141.GRCh37 does not exist for some reason (despite the fact that SNPlocs.Hsapiens.dbSNP141.GRCh38 does).

Another bonus that @Al-Murphy reminded me of it that using dbSNP144 makes MungeSumstats substantially faster (sometimes from ~1hr --> minutes). I observed this to a more limited degree in the above tests (~9m --> ~6m).

To summarise, here is a procedure I would currently recommended when preparing GWAS for input to MAGMA and/or MAGMA.Celltyping.

Recommendations

MungeSumstats args

  • dbSNP=144: Use dbSNP144 to more closely align with 1KGp3.
  • bi_allelic_filter=TRUE (default): When using dbSNP144, this lets you keep the vast majority of SNPs (97% when bi_allelic_filter=TRUE vs. 99.6% when bi_allelic_filter=FALSE). However, use caution if you choose to use dbSNP155 as setting bi_allelic_filter=TRUE may cause you to lose ~50% of your SNPs.

MAGMA.Celltyping args

  • duplicate = "drop" (default in both MAGMA and MAGMA.Celltyping): This argument won't affect anything unless there are duplicate RSIDs across multiple rows of the GWAS file, which can't happen when bi_allelic_filter=TRUE. However, if users do use bi_allelic_filter=FALSE, they may want to consider setting duplicate = "first" to avoid dropping multi-allelic SNPs from the analysis entirely, and potentially also setting allele_flip_frq = FALSE to avoid errors due to the inherent inability to flip allele frequencies for multi-allelic SNPs.
  • synonym_dup = "skip" (default in both MAGMA and MAGMA.Celltyping): Same as above.
         ## Munge
        path_formatted1 <- MungeSumstats::import_sumstats(
          ids = "ieu-a-8",
          ref_genome = "GRCh37", 
          dbSNP = 144,
          bi_allelic_filter = TRUE,
          save_dir = "~/Downloads/multiallelic") 
        ## Map
        genesOutPath1 <- MAGMA.Celltyping::map_snps_to_genes(
          path_formatted = path_formatted1[[1]],
          # force_new = TRUE,
          duplicate = "drop",
          synonym_dup = "skip",
          genome_build = "hg19")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants