
feat(sumstat qc): adding methods for QC of summary statistics #455

Merged · 36 commits merged into dev on Apr 24, 2024

Conversation

addramir (Contributor):

Adding three QC methods for GWAS summary statistics.

codecov-commenter commented Feb 3, 2024

Codecov Report

Attention: 183 lines in your changes are missing coverage. Please review.

Comparison is base (42b366c) 85.67% compared to head (569cf82) 85.68%.
Report is 110 commits behind head on dev.

Additional details and impacted files


```diff
@@            Coverage Diff             @@
##              dev     #455      +/-   ##
==========================================
+ Coverage   85.67%   85.68%   +0.01%
==========================================
  Files          89       99      +10
  Lines        2101     2865     +764
==========================================
+ Hits         1800     2455     +655
- Misses        301      410     +109
```
| Files | Coverage Δ |
|-------|------------|
| src/airflow/dags/common_airflow.py | 90.38% <100.00%> (ø) |
| src/airflow/dags/dag_preprocess.py | 100.00% <ø> (ø) |
| src/airflow/dags/finngen_preprocess.py | 100.00% <100.00%> (ø) |
| src/airflow/dags/gwas_curation_update.py | 100.00% <100.00%> (ø) |
| src/gentropy/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/data/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/assets/schemas/__init__.py | 100.00% <ø> (ø) |
| src/gentropy/cli.py | 91.66% <100.00%> (ø) |
| src/gentropy/common/Liftover.py | 80.64% <ø> (ø) |

... and 73 more

... and 34 files with indirect coverage changes

github-actions bot added the size-M label on Mar 4, 2024
addramir marked this pull request as ready for review on March 4, 2024 16:25
addramir requested a review from DSuveges on March 4, 2024 16:25
DSuveges (Contributor) commented Mar 4, 2024

My comments at this point are purely design-related. I'm happy to discuss them further, and happy to chip in with the coding as well.

Moving code to a separate module

I would recommend moving all QC-related code under /methods/summary_statistics_quality_controls.py, because this logic, although tightly linked to sumstats, is big and complex enough to separate into its own module, similar to what we do with window-based clumping. It would also provide space to expand the documentation on quality controls. The main QC method can then be wired into SummaryStatistics, and the usage in the summary_statistics.py file could look something like this (assuming we always want to run all metrics):

```python
from __future__ import annotations

from pyspark.sql import DataFrame

from gentropy.methods.summary_statistics_quality_controls import quality_metrics


class SummaryStatistics:
    def get_quality_control_metrics(self: SummaryStatistics) -> DataFrame:
        return quality_metrics(self._df)
```

Any check has to be windowed

SummaryStatistics datasets might contain data from multiple studies. Not accounting for this could lead to nonsense output, so the input summary statistics need to be aggregated by studyId, and each metric should return a (set of) value(s) per study in the dataset.

The main QC function can call all of the separate metrics and then join the resulting columns together, like this:

| studyId | QC_1   | QC_2   | QC_3   |
|---------|--------|--------|--------|
| S1      | Passed | Passed | Passed |
| S2      | Passed | Passed | Failed |

Making sure the methods can handle multiple studies at the same time is quite important in my opinion, even if these functions are meant to be called on data from a single study.
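
A minimal sketch of this pattern, with hypothetical metric names (the actual metrics in this PR differ), could look like this:

```python
import pyspark.sql.functions as f
from pyspark.sql import DataFrame


def mean_beta_check(df: DataFrame) -> DataFrame:
    """One row per study: the mean beta, which should be close to zero."""
    return df.groupBy("studyId").agg(f.mean("beta").alias("mean_beta"))


def se_check(df: DataFrame) -> DataFrame:
    """One row per study: the fraction of rows with a non-positive standard error."""
    return df.groupBy("studyId").agg(
        f.mean((f.col("standardError") <= 0).cast("double")).alias("frac_bad_se")
    )


def quality_metrics(df: DataFrame) -> DataFrame:
    """Join all per-study metric columns into a single QC dataframe."""
    result = df.select("studyId").distinct()
    for metric in (mean_beta_check, se_check):
        result = result.join(metric(df), on="studyId", how="left")
    return result
```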

Run metrics - judge later

I think the QC methods should just return the metrics; I don't really like that the same functions decide whether these values are OK or not. I believe there should be a central QC configuration (in a YAML file) specifying which values are tolerated for each metric. It would be more transparent which criteria need to be met, and I think it would also be easier to test and to change the configuration. I'm not overly concerned about this issue if the thresholds are rock solid in the scientific field.
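
As a rough illustration of that idea (the metric names and thresholds below are made up, and PyYAML is assumed), the configuration and the judging step could be kept apart like this:

```python
import yaml

# Hypothetical thresholds; in practice these would live in a checked-in YAML file.
config = yaml.safe_load(
    """
qc_thresholds:
  mean_beta: {max_abs: 0.05}
  gc_lambda: {min: 0.7, max: 1.5}
"""
)


def judge(metric_name: str, value: float) -> str:
    """Compare a raw metric value against its configured tolerance."""
    bounds = config["qc_thresholds"][metric_name]
    if "max_abs" in bounds:
        ok = abs(value) <= bounds["max_abs"]
    else:
        ok = bounds["min"] <= value <= bounds["max"]
    return "Passed" if ok else "Failed"


judge("gc_lambda", 1.2)  # -> "Passed"
```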

Thinking about the process

My big concern here is how the results of the QC would be picked up and how they would impact the ingestion process... I see it as a major complexity. Even if a study fails the QC, it is not enough to simply not promote the dataset downstream; we need to propagate this information to the study index so that users of the released data know why certain summary statistics are missing. This means the QC needs to yield some sort of manifest that is picked up by the release or pre-process step.
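
For illustration only (the output path and variable names below are made up, building on the hypothetical quality_metrics sketch above), the manifest could simply be the per-study QC dataframe persisted where the release process can find it:

```python
# Hypothetical: write the per-study QC verdicts as a manifest for the
# release / pre-process step to pick up and propagate to the study index.
qc_results = quality_metrics(summary_statistics.df)
qc_results.write.mode("overwrite").parquet("path/to/qc_manifest")
```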

addramir (Contributor, Author) commented Mar 5, 2024

Thank you for the comments, Daniel; let's discuss when on campus. I think I can address 3 out of 4.

addramir (Contributor, Author) commented Mar 8, 2024

@DSuveges I have corrected the code according to your comments. It is now a separate class with several separate functions, including a wrapper that performs all the checks and returns a PySpark dataframe with QC metrics for each studyId. I have also added documentation. I left sanity_filter in SummaryStatistics because it belongs there.

DSuveges (Contributor) commented Mar 8, 2024

Thanks @addramir, these are exactly the modifications I had in mind. I'm looking into the code in detail now. Do you have any idea how long this branch takes to run compared to how long it took without the aggregation?

Comment on lines 138 to 139:

```python
self._df = gwas_df
return self
```

Review comment (Contributor):

This is very dangerous: the statement returns the updated self, mutating the original object in place. How it plays out:

```python
unfiltered_summary_stats.df.count()  # -> 1000
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()

filtered_summary_stats.df.count()    # -> 500
unfiltered_summary_stats.df.count()  # -> 500
```

Because there is no unfiltered unfiltered_summary_stats anymore, just the filtered one. This behaviour can lead to problems that are very hard to track down. (Been there, done that.)

Instead we have to do this:

```python
# Assumes `import pyspark.sql.functions as f` at module level.
return SummaryStatistics(
    _df=(
        self.df.dropna(
            subset=["beta", "standardError", "pValueMantissa", "pValueExponent"]
        ).filter(
            (f.col("pValueMantissa") * 10 ** f.col("pValueExponent") != 1)
            & (f.col("beta") != 0)
            & (f.col("standardError") != 0)
        )
    ),
    _schema=SummaryStatistics.get_schema(),
)
```

The above statement returns a completely new dataset, so the old, unfiltered one still remains.
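
For illustration (the counts reuse the hypothetical numbers above), the fixed version behaves like this:

```python
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()

unfiltered_summary_stats.df.count()  # -> 1000: the original is untouched
filtered_summary_stats.df.count()    # -> 500: only the new object is filtered
```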

Reply from the Contributor Author:
Done.

addramir (Contributor, Author) commented:

@DSuveges I updated the PZ check and removed the linear regression, so it should be a bit faster now. I also made a small correction in the neff check. All should work now.

addramir (Contributor, Author) commented:

@DSuveges there are no further actions from my side. Please have a look when you have time.

addramir added the documentation label on Mar 12, 2024
addramir merged commit bcc9a36 into dev on Apr 24, 2024
4 checks passed
addramir deleted the yt_sumstat_qc branch on April 24, 2024 08:55
Labels: Dataset, documentation, Feature, Method, size-M