
feat(sumstat qc): adding methods for QC of summary statistics #455

Merged · 36 commits merged into dev on Apr 24, 2024

Conversation

addramir (Contributor):

Adding three QC methods for GWAS summary statistics.

codecov-commenter commented Feb 3, 2024

Codecov Report

Attention: 183 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (569cf82) 85.68%.
Report is 110 commits behind head on dev.

Additional details and impacted files


```diff
@@            Coverage Diff             @@
##              dev     #455      +/-   ##
==========================================
+ Coverage   85.67%   85.68%   +0.01%
==========================================
  Files          89       99      +10
  Lines        2101     2865     +764
==========================================
+ Hits         1800     2455     +655
- Misses        301      410     +109
```
| Files | Coverage Δ |
|-------|------------|
| src/airflow/dags/common_airflow.py | 90.38% <100.00%> (ø) |
| src/airflow/dags/dag_preprocess.py | 100.00% <ø> (ø) |
| src/airflow/dags/finngen_preprocess.py | 100.00% <100.00%> (ø) |
| src/airflow/dags/gwas_curation_update.py | 100.00% <100.00%> (ø) |
| src/gentropy/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/data/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/schemas/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/cli.py | 91.66% <100.00%> (ø) |
| src/gentropy/common/Liftover.py | 80.64% <ø> (ø) |

... and 73 more

... and 34 files with indirect coverage changes

github-actions bot added the size-M label on Mar 4, 2024
addramir marked this pull request as ready for review on March 4, 2024 16:25
addramir requested a review from DSuveges on March 4, 2024 16:25
DSuveges (Contributor) commented Mar 4, 2024

My comments at this point are purely design-related. I'm happy to discuss them further, and happy to chip in with the coding as well.

Moving code to a separate module

I would recommend moving all QC-related code under /methods/summary_statistics_quality_controls.py, because this logic, although tightly linked to sumstats, is big and complex enough to separate into its own module, similar to what we do with window-based clumping. It would also provide space to expand the documentation on quality controls. The main QC method can then be wired into SummaryStatistics, and the usage in the summary_statistics.py file could look something like this (assuming we always want to run all metrics):

```python
from __future__ import annotations

from pyspark.sql import DataFrame

from gentropy.methods.summary_statistics_quality_controls import quality_metrics


class SummaryStatistics:
    def get_quality_control_metrics(self: SummaryStatistics) -> DataFrame:
        return quality_metrics(self._df)
```

Any check has to be windowed

SummaryStatistics datasets might contain data from multiple studies. Not accounting for this could lead to nonsense output, so the input summary statistics need to be aggregated by studyId, and each metric should return a (set of) value(s) per study in the dataset.

The main QC function can call all of the separate metrics and then join the resulting columns together, like this:

| studyId | QC_1   | QC_2   | QC_3   |
|---------|--------|--------|--------|
| S1      | Passed | Passed | Passed |
| S2      | Passed | Passed | Failed |

Making sure the methods can handle multiple studies at the same time is quite important in my opinion, even if these functions are meant to be called on data from a single study.
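
A minimal sketch of this pattern, with hypothetical metric names (the actual metrics in this PR differ), could look like this:

```python
import pyspark.sql.functions as f
from pyspark.sql import DataFrame


def mean_beta_check(df: DataFrame) -> DataFrame:
    """One row per study: the mean beta, which should be close to zero."""
    return df.groupBy("studyId").agg(f.mean("beta").alias("mean_beta"))


def se_check(df: DataFrame) -> DataFrame:
    """One row per study: the fraction of rows with a non-positive standard error."""
    return df.groupBy("studyId").agg(
        f.mean((f.col("standardError") <= 0).cast("double")).alias("frac_bad_se")
    )


def quality_metrics(df: DataFrame) -> DataFrame:
    """Join all per-study metric columns into a single QC dataframe."""
    result = df.select("studyId").distinct()
    for metric in (mean_beta_check, se_check):
        result = result.join(metric(df), on="studyId", how="left")
    return result
```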

Run metrics - judge later

I think the QC methods should just return the metrics; I don't really like that the same functions decide whether these values are OK or not. I believe there should be a central QC configuration (in a YAML file) specifying which values are tolerated for each metric. It would be more transparent which criteria need to be met, and I think it would also be easier to test and to change the configuration. I'm not overly concerned about this issue if the thresholds are rock solid in the scientific field.
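
As a rough illustration of that idea (the metric names and thresholds below are made up, and PyYAML is assumed), the configuration and the judging step could be kept apart like this:

```python
import yaml

# Hypothetical thresholds; in practice these would live in a checked-in YAML file.
config = yaml.safe_load(
    """
qc_thresholds:
  mean_beta: {max_abs: 0.05}
  gc_lambda: {min: 0.7, max: 1.5}
"""
)


def judge(metric_name: str, value: float) -> str:
    """Compare a raw metric value against its configured tolerance."""
    bounds = config["qc_thresholds"][metric_name]
    if "max_abs" in bounds:
        ok = abs(value) <= bounds["max_abs"]
    else:
        ok = bounds["min"] <= value <= bounds["max"]
    return "Passed" if ok else "Failed"


judge("gc_lambda", 1.2)  # -> "Passed"
```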

Thinking about the process

My big concern here is how the results of the QC would be picked up and how they would impact the ingestion process... I see it as a major complexity. Even if a study fails the QC, it is not enough to simply not promote the dataset downstream; we need to propagate this information to the study index so that users of the released data know why certain summary statistics are missing. This means the QC needs to yield some sort of manifest that is picked up by the release or pre-process step.
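
For illustration only (the output path and variable names below are made up, building on the hypothetical quality_metrics sketch above), the manifest could simply be the per-study QC dataframe persisted where the release process can find it:

```python
# Hypothetical: write the per-study QC verdicts as a manifest for the
# release / pre-process step to pick up and propagate to the study index.
qc_results = quality_metrics(summary_statistics.df)
qc_results.write.mode("overwrite").parquet("path/to/qc_manifest")
```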

addramir (Contributor, Author) commented Mar 5, 2024

Thank you for the comments, Daniel; let's discuss when on campus. I think I can address 3 out of 4.

addramir (Contributor, Author) commented Mar 8, 2024

@DSuveges I have corrected the code according to your comments. It is now a separate class with several separate functions, including a wrapper that performs all the checks and returns a PySpark dataframe with QC metrics for each studyId. I have also added documentation. I left sanity_filter in SummaryStatistics because it belongs there.

DSuveges (Contributor) commented Mar 8, 2024

Thanks @addramir, these are exactly the modifications I had in mind. I'm looking into the code in detail now. Do you have any idea how long this branch takes to run compared to how long it took without the aggregation?

Comment on lines 138 to 139:

```python
self._df = gwas_df
return self
```

Review comment (Contributor):

This is very dangerous: the statement returns the updated self, mutating the original object in place. How it plays out:

```python
unfiltered_summary_stats.df.count()  # -> 1000
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()

filtered_summary_stats.df.count()    # -> 500
unfiltered_summary_stats.df.count()  # -> 500
```

Because there is no unfiltered unfiltered_summary_stats anymore, just the filtered one. This behaviour can lead to problems that are very hard to track down. (Been there, done that.)

Instead we have to do this:

```python
# Assumes `import pyspark.sql.functions as f` at module level.
return SummaryStatistics(
    _df=(
        self.df.dropna(
            subset=["beta", "standardError", "pValueMantissa", "pValueExponent"]
        ).filter(
            (f.col("pValueMantissa") * 10 ** f.col("pValueExponent") != 1)
            & (f.col("beta") != 0)
            & (f.col("standardError") != 0)
        )
    ),
    _schema=SummaryStatistics.get_schema(),
)
```

The above statement returns a completely new dataset, so the old, unfiltered one still remains.
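
For illustration (the counts reuse the hypothetical numbers above), the fixed version behaves like this:

```python
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()

unfiltered_summary_stats.df.count()  # -> 1000: the original is untouched
filtered_summary_stats.df.count()    # -> 500: only the new object is filtered
```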

Reply from the Contributor Author:
Done.

addramir (Contributor, Author) commented:

@DSuveges I updated the PZ check and removed the linear regression, so it should be a bit faster now. I also made a small correction in the neff check. All should work now.

addramir (Contributor, Author) commented:

@DSuveges there are no further actions from my side. Please have a look when you have time.

addramir added the documentation label on Mar 12, 2024
addramir merged commit bcc9a36 into dev on Apr 24, 2024
4 checks passed
addramir deleted the yt_sumstat_qc branch on April 24, 2024 08:55
Labels: Dataset, documentation, Feature, Method, size-M