Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to specify effect allele as A1 instead. #160

Closed
svenkatesh25 opened this issue Jul 19, 2023 · 16 comments
Closed

Option to specify effect allele as A1 instead. #160

svenkatesh25 opened this issue Jul 19, 2023 · 16 comments

Comments

@svenkatesh25
Copy link

Can an option be added to the main function to consider A1 as the effect allele instead? I have attempted to this manually, but it is inelegant without manipulating scripts contained within the MungeSumstats library.

@Al-Murphy
Copy link
Owner

Hey!

The easiest way to do this is just to rename A1 and A2 after MSS outputs the formatted file. Two things on doing this in MSS:

  1. MSS uses reference datasets for numerous of it's checks (for example check for correct effect column directions) which all depend on A1 being the reference and A2 the alternative allele. So I could no incorporate such a change fully to the code.
  2. I don't agree about making the change anyway - the point of MSS was to help standardise the formatting of summary statistics including the meaning of A1 and A2. As described in our manuscript we chose to standardise A1 as the ref and A2 as the alt in line with other large scale standardisation efforts like IEU Open GWAS. So based on all this, I would want to add functionality to MSS to allow this as then formatted sumstats from MSS won't even correspond to one another

Also it is worth noting that a lot of downstream tools expect A1 to be the ref and A2 the alt like LDSC so I would say to really consider whether this is a route you want to take.

Hope this makes sense and you understand my point of view? I'm happy to chat about it more if you like.

Cheers,
Alan.

@zichengwang98
Copy link

I encountered a similar question about the definition of A1 and A2. In LDSC, A1 is defined as the effect allele and A2 is defined as the non-effect allele (see bulik/ldsc#133 or ). However, in MungeSumstats, A2 is defined as the effect allele. Should I flip the column names of A1 and A2 before I use the munged summary statistics for LDSC?

@Al-Murphy
Copy link
Owner

Hi this was already answered here but see below for my explanation:

For LDSC, the required input format for a sumstats describe A1 as the effect columns and A2 as the other https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format. The term effect allele is confusing here since we, with MSS, mean effect allele to be the alt allele but they mean it to be the ref. You can see this if you look in their munge_sumstats.py file - https://github.com/bulik/ldsc/blob/master/munge_sumstats.py which shows other names for A1, one of which being reference. Thus A1 and A2 are the same for LDSC as we use in MSS.

Hope this clears things up for you!

Alan.

@zichengwang98
Copy link

Thank you for your detailed response! I checked the source code of munge_sumstats.py, and indeed they specified A1=EFFECT_ALLELE=REFERENCE_ALLELE. In the same script, they also included the description: 'BETA': '[linear/logistic] regression coefficient (0 --> no effect; above 0 --> A1 is trait/risk increasing)'.

My understanding is that LDSC assumes effect allele is the reference allele, and beta measures the effect of reference allele (A1). MungeSumstats assumes effect allele is the alternative allele, and beta measures the effect of alternative allele (A2). Thus, if I want to use the output of MungeSumstats as the input of LDSC, I have to multiply the beta and Z columns by -1 . If I don't care about the true "reference allele" in the reference panel, I can flip the column names of A1 and A2 (the estimates from LDSC are valid as long as beta measures the effect of A1). If I do nothing, the sign of rg would be the opposite but h2 would be the same.

Could you let me know which part is wrong? Thanks a lot for your time and patience!

@Al-Murphy
Copy link
Owner

Hi,

I see what you mean but I think this is going to have to be tested with running LDSC where we know there is an association and using the same sumstats just with A1 as the major and the minor allele and see which gives the correct result. I thought it would be easiest to do this with an example from the LDSC (sLDSC?) paper but I'm struggling to find any examples that aren't now behind a pay wall. I'll keep looking but let me know if you have any either?

@zichengwang98
Copy link

Unfortunately, I don't have the GWAS summary statistics for SCZ and BIP that were used for the tutorial. I downloaded some munged summary statistics released by the same group instead. The columns in PASS_Alzheimers_Jansen2019.sumstats.gz (A1, A2, Z) are consistent with the columns in the original GWAS. According to the readme file, A1 is defined as the effect allele.

Please feel free to check whether other LDSC-format summary statistics were munged in the same way. I'm not very sure about this situation, and many people have the same question (see bulik/ldsc#375).

My real concern is that for GWAS with column names EFFECT_ALLELE and OTHER_ALLELE (which is quite common), MungeSumstats would define A2 as the effect allele, and munge_sumstats.py from LDSC would define A1 as the effect allele (we know that from the script). Thus, it might be better to make sure that every GWAS is munged by the same software when using LDSC.

@NathanSkene
Copy link

If there is a clear difference from how munge_sumstats.py interprets A1 / A2, then we do need to deal with this. Thanks for pointing this out @zichengwang98. LDSC is one of the major use cases for MungeSumstats, so we need to be sure it's working correctly for it. Thoughts on options for dealing with this?

@Al-Murphy
Copy link
Owner

It's an easy fix to deal with but I'm not convinced MSS approach is wrong yet so I need to test running LDSC with both formats (i.e. with effect columns realting to 1. A1 and 2. A2). I think people have differing meanings for effect/major/minor/etc alleles so I want to make surethe format is right before changing anything. However, the issue is the google bucket where you download data to rerun the results from the LDSC paper appears to be down:

$ gsutil -u engaged-style-308012 cp gs://broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores/Cahoy_1000Gv3_ldscores.tgz .
Copying gs://broad-alkesgroup-public-requester-pays/LDSCORE/LDSC_SEG_ldscores/Cahoy_1000Gv3_ldscores.tgz...
AccessDeniedException: 403 HttpError accessing <https://storage.googleapis.com/download/storage/v1/b/broad-alkesgroup-public-requester-pays/o/LDSCORE%2FLDSC_SEG_ldscores%2FCahoy_1000Gv3_ldscores.tgz?generation=1679508265270442&alt=media&userProject=engaged-style-308012>: response: <{'x-guploader-uploadid': 'ABPtcPpLkVONBmMaqeI6FvEVzlvcjpCV3dufC0hvz4bDb7lOvRMp0puGCi7Vz5_pXamCUNNMD6s', 'content-type': 'text/html; charset=UTF-8', 'date': 'Fri, 12 Jan 2024 07:46:27 GMT', 'vary': 'Origin, X-Origin', 'expires': 'Fri, 12 Jan 2024 07:46:27 GMT', 'cache-control': 'private, max-age=0', 'content-length': '70', 'server': 'UploadServer', 'alt-svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000', 'status': '403'}>, content <The billing account for the owning project is disabled in state closed>

Key error message: The billing account for the owning project is disabled in state closed

Others seem to have noted this too, not sure of another way around this but I'll keep working on it.

@zichengwang98
Copy link

Thank you, Alan and Nathan, for your time and effort! I found this email from the LDSC user group. Raymond Walters from the Neale lab confirmed that in LDSC, A1 should be the effect allele/minor allele. I hope this information helps.

Best,
Zicheng

@Al-Murphy
Copy link
Owner

Hi,

Thanks for this, I will update MSS when using LDSC format so that A1 and A2 are flipped to match this. I'll comment here when the update is done. However, I don't think this will have an effect on people's analysis, I ran a test on the cell type specific tutorial from LDSC with no MSS formatting, the current MSS formatting and flipping A1 and A2 and found the result was the exact same:

**NO MSS**								
Name    Coefficient     Coefficient_std_error   Coefficient_P_value								
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384								
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827								
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984								
								
**MSS**								
Name    Coefficient     Coefficient_std_error   Coefficient_P_value								
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384								
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827								
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984								
								
**FLIPPED**								
Name    Coefficient     Coefficient_std_error   Coefficient_P_value								
Neuron  1.33021438504885e-11    2.8486851192986768e-09  0.498137116566384								
Oligodendrocyte -2.3708481905287363e-09 2.2878529147566418e-09  0.8499634147751827								
Astrocyte       -2.586495108996289e-09  2.3816110423640045e-09  0.8612665892065984

I guess this may be because, although we name A1 and A2 the opposite way to LDSC (our A1 is the reference genome, i.e. not the effect allele whereas A1 for LDSC is the effect allele, the non-reference genome), the effect column (Z) is the same direction. If you notice any differences @zichengwang98, please do let me know though.

@Al-Murphy
Copy link
Owner

Updated in v1.11.3 available from github master branch now or from Bioconductor devel in a few days

@zichengwang98
Copy link

Hi Alan,

I do not expect to observe a difference in stratified-LDSC because the enrichment coefficient τ is not related to the sign of the Z score in the summary statistics. It measures whether we observe enrichment (higher than expected partitioned h2) or depletion (lower than expected partitioned h2) in the tissue type. As Raymond explained in the email, flipped signed-statistics would not have an impact on the h2 estimation.

To show the consequences of the flipped signed statistics, you would need to perform the genetic correlation (rg) analysis, following this tutorial. Reversing your A1/A2 in one of the traits will flip the sign of the correlation. If they are reversed in both traits, then rg would be back to normal.

Since the original SCZ and BIP summary statistics are not available, here's what you can try:

  1. Find a pair of genetically correlated traits.
  2. Munge both traits with munge_sumstats.py from LDSC (ideally without specifying any flags), and calculate the rg between the traits.
  3. Munge one of the traits with MSS (ideally without specifying any flags), then re-calculate the rg_MSS.

I would expect to obtain an rg_MSS with flipped sign from step 3. The magnitude and p-value of rg_MSS should be the same.

Best,
Zicheng

@Al-Murphy
Copy link
Owner

Al-Murphy commented Jan 15, 2024

That makes sense, it would be great if you could try this with both the old and new MSS LDSC approaches to show the difference @zichengwang98?

Thanks,
Alan.

@Al-Murphy
Copy link
Owner

closing because of inactivity, @zichengwang98 if you get back to testing this, please reopen. However, I still believe the current specification of A1/A2 matches LDSC, see here.

@XuanxiaoLi
Copy link

I encountered a similar question about the definition of A1 and A2. In LDSC, A1 is defined as the effect allele and A2 is defined as the non-effect allele (see bulik/ldsc#133 or ). However, in MungeSumstats, A2 is defined as the effect allele. Should I flip the column names of A1 and A2 before I use the munged summary statistics for LDSC?
I have a question: it was mentioned here that "in MungeSumstats, A2 is defined as the effect allele." How is this determined? In Munge_sumstats.py, I found explanations regarding A1 and A2, and here A2 is defined as the non-effect allele, which makes me a bit confused.

截屏2024-11-07 09 22 40

@Al-Murphy
Copy link
Owner

Hi @XuanxiaoLi,

I encountered a similar question about the definition of A1 and A2. In LDSC, A1 is defined as the effect allele and A2 is defined as the non-effect allele (see bulik/ldsc#133 or ). However, in MungeSumstats, A2 is defined as the effect allele. Should I flip the column names of A1 and A2 before I use the munged summary statistics for LDSC?

If using LDSC, we advise setting save_format="LDSC". See our FAQ section on this but in short it will flip the A1 and A2 columns so A1 now represents the effect allele.

I have a question: it was mentioned here that "in MungeSumstats, A2 is defined as the effect allele."

See this section of the FAQs

In Munge_sumstats.py, I found explanations regarding A1 and A2, and here A2 is defined as the non-effect allele, which makes me a bit confused.

Yes Munge_sumstats.py from LDSC takes A1 as the effect allele, hence why we flip the A1/A2 columns when save_format="LDSC". Note that Munge_sumstats.py is not very robust and will not capture many of the common issues with summary statistics.

Thanks,
Alan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants