feat(sumstat qc): adding methods for QC of summary statistics #455
Conversation
Codecov Report

@@           Coverage Diff            @@
##              dev     #455    +/-   ##
========================================
+ Coverage   85.67%   85.68%   +0.01%
========================================
  Files          89       99      +10
  Lines        2101     2865     +764
========================================
+ Hits         1800     2455     +655
- Misses        301      410     +109
My comments at this point are purely design related. I'm happy to discuss them further, and happy to chip in with coding as well.

Moving code to a separate module

I would recommend moving all QC-related code under a separate module, gentropy.methods.summary_statistics_quality_controls, and exposing it on the dataset class like this:

from gentropy.methods.summary_statistics_quality_controls import quality_metrics

class SummaryStatistics:
    def get_quality_control_metrics(self: SummaryStatistics) -> DataFrame:
        return quality_metrics(self._df)

Any check has to be windowed

SummaryStatistics datasets might contain data from multiple studies. Not accounting for this could lead to nonsense output, so the input summary statistics need to be aggregated by studyId. The main QC function can call all the separate metrics and then join the resulting columns together like this:
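A minimal sketch of what that wrapper could look like, assuming each metric function aggregates the input down to one row per studyId with a single metric column; the metric functions are taken as a parameter here to keep the sketch self-contained, whereas the actual implementation would presumably hardcode the list:

from functools import reduce
from typing import Callable, Sequence

from pyspark.sql import DataFrame


def quality_metrics(
    df: DataFrame,
    metric_fns: Sequence[Callable[[DataFrame], DataFrame]],
) -> DataFrame:
    # Each metric function is assumed to return a dataframe keyed by
    # studyId carrying a single metric column.
    metric_tables = [fn(df) for fn in metric_fns]
    # Join all per-study metric columns into one QC dataframe.
    return reduce(
        lambda left, right: left.join(right, on="studyId", how="outer"),
        metric_tables,
    )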
Making sure the methods can handle multiple studies at the same time is quite important in my opinion, even if these functions are meant to be called on data from a single study.

Run metrics - judge later

I think the QC methods should just return the metrics; I don't really like that the same functions decide whether these values are OK or not. I believe there should be a central QC configuration (in a YAML) telling which values are tolerated for which metrics (a sketch of such a config follows this comment). It would be more transparent what criteria need to be met, and I think it would also be easier to test and to change the configuration. I'm not overly concerned about this issue if the thresholds are super solid in the scientific field.

Thinking about the process

My big concern here is how the results of the QC would be picked up and how they would impact the ingestion process... I see it as a major complexity. Even if a study fails the QC, it is not enough to simply not promote the dataset downstream; we need to propagate this information to the study index so the users of the released data would know why certain summary statistics are missing. It means the QC needs to yield some sort of manifest that gets picked up by the release process or the pre-process.
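A sketch of what such a central threshold configuration and its application could look like; the metric names and bounds here are purely illustrative, not agreed values:

import yaml
import pyspark.sql.functions as f
from pyspark.sql import DataFrame

# Illustrative QC tolerances; real keys and bounds would need scientific sign-off.
QC_CONFIG = yaml.safe_load(
    """
    mean_beta:
      max_abs: 0.05
    gc_lambda:
      min: 0.7
      max: 2.5
    """
)


def apply_qc_thresholds(metrics: DataFrame, config: dict = QC_CONFIG) -> DataFrame:
    # Flag each study whose metrics fall outside the configured tolerances.
    passed = (
        (f.abs(f.col("mean_beta")) <= config["mean_beta"]["max_abs"])
        & f.col("gc_lambda").between(
            config["gc_lambda"]["min"], config["gc_lambda"]["max"]
        )
    )
    return metrics.withColumn("qc_passed", passed)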
Thank you for the comments, Daniel; let's discuss when on campus. I think I can address 3 out of 4.
@DSuveges I have corrected the code according to your comments. It is now a separate class with several separate functions, including a wrapper that performs all the checks and returns a PySpark dataframe with QC metrics for each studyId. I have also added the documentation. I left sanity_filter in SummaryStatistics because it belongs there.
Thanks @addramir, I was thinking about exactly these modifications. I'm looking into the code in detail now. Do you have any idea how the runtime of this branch compares to what it was without the aggregation?
self._df = gwas_df
return self
This is very dangerous: the statement returns the mutated self, so the original object is filtered in place. How it plays out:
unfiltered_summary_stats.df.count() # -> 1000
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()
filtered_summary_stats.df.count() # -> 500
unfiltered_summary_stats.df.count() # -> 500
Because there will be no unfiltered_summary_stats anymore, just the filtered one. This behaviour can lead to problems that are very hard to track down. (Been there, done that.)
Instead we have to do this:

return SummaryStatistics(
    _df=(
        # Drop rows with missing essential statistics.
        self.df.dropna(
            subset=["beta", "standardError", "pValueMantissa", "pValueExponent"]
        )
        # Drop rows with degenerate values (f is pyspark.sql.functions).
        .filter(
            (f.col("pValueMantissa") * 10 ** f.col("pValueExponent") != 1)
            & (f.col("beta") != 0)
            & (f.col("standardError") != 0)
        )
    ),
    _schema=SummaryStatistics.get_schema(),
)
The above statement returns a completely new dataset, so the old, unfiltered one still remains.
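With the immutable version, the earlier example then behaves as one would expect:

unfiltered_summary_stats.df.count() # -> 1000
filtered_summary_stats = unfiltered_summary_stats.sanity_filter()
filtered_summary_stats.df.count() # -> 500
unfiltered_summary_stats.df.count() # -> 1000, unchanged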
Done.
@DSuveges I updated the P-Z check and removed the linear regression, so it should be a bit faster now. I also made a small correction in the N_eff check. All should work now.
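For context, a P-Z check compares the reported p-values against p-values recomputed from the beta/standardError z-scores. A minimal sketch of that idea, aggregated per study; the column choices and aggregation here are illustrative, not necessarily what this PR implements:

import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import DataFrame
from scipy.stats import norm


@f.pandas_udf("double")
def _log10_p_from_z(z: pd.Series) -> pd.Series:
    # Two-sided p-value implied by the z-score, on the log10 scale.
    return pd.Series((norm.logsf(z.abs()) + np.log(2)) / np.log(10))


def p_z_check(df: DataFrame) -> DataFrame:
    # log10 of the reported p-value from its mantissa/exponent form.
    log_p_reported = f.log10(f.col("pValueMantissa")) + f.col("pValueExponent")
    # log10 of the p-value implied by beta / standardError.
    log_p_from_z = _log10_p_from_z(f.col("beta") / f.col("standardError"))
    return (
        df.withColumn("logPDiff", log_p_reported - log_p_from_z)
        .groupBy("studyId")
        .agg(f.mean("logPDiff").alias("meanLogPDiff"))
    )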
@DSuveges there are no more actions from my side. Please have a look when you have time.
Adding three QC methods for GWAS summary statistics.