Ease-of-use/readability vs data integrity #6

Open
marcora opened this issue Sep 8, 2022 · 4 comments

Comments

@marcora

marcora commented Sep 8, 2022

I fully understand the need for a format with low bioinformatics requirement for data consumers, but sacrificing data integrity (e.g., by separating sumstats and metadata) to achieve that goal seems dangerous to me.

Providing a simple tool (or a "download format" option on the GWAS Catalog website) that can convert a format with superior bioinformatics qualities (e.g., GWAS-VCF) but inferior readability and ease of use to a format for the general population (e.g., MS Excel-compatible CSV or XLSX) would be a better solution in my opinion.

@seandavi

I have to agree with this comment and, in particular, with the idea of repurposing VCF. Excellent, performant tooling exists for VCF formats. Integration with existing annotation sources, including other VCFs, seems a common use case that is quickly and easily doable using VCF tooling. Conversion from VCF to TSV is straightforward when needed. Creating a tab-delimited format as a "standard" seems like a step backward, though the information content described in the spec document is clearly very well thought out.
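To illustrate how small the VCF-to-TSV step can be, here is a minimal sketch in plain Python. It assumes a single-sample GWAS-VCF-style file in which the per-variant statistics sit in the FORMAT/sample columns under the keys `ES`, `SE`, `LP` and `AF`; those key names, the output column names, and the function name `gwas_vcf_to_tsv` are assumptions for illustration, not something stated in this thread. Real files would normally be handled with dedicated VCF tooling rather than hand-rolled parsing.

```python
import csv
import io

# Assumed FORMAT keys for effect size, standard error, -log10(p), and allele frequency.
FIELDS = ["ES", "SE", "LP", "AF"]


def gwas_vcf_to_tsv(vcf_text: str) -> str:
    """Flatten single-sample GWAS-VCF-style text into a TSV string."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["chromosome", "position", "id", "ref", "alt"] + FIELDS)
    for line in vcf_text.splitlines():
        # Skip blank lines, meta-information lines (##...) and the #CHROM header.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        chrom, pos, variant_id, ref, alt = cols[:5]
        # Column 9 holds the FORMAT keys, column 10 the values for the one sample.
        stats = dict(zip(cols[8].split(":"), cols[9].split(":")))
        row = [chrom, pos, variant_id, ref, alt]
        row += [stats.get(field, "NA") for field in FIELDS]
        writer.writerow(row)
    return out.getvalue()
```

Note that this sketch throws away the `##` header lines, which is where a VCF carries its metadata, so the flattened TSV no longer travels with the study description.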

@ljwh2
Collaborator

ljwh2 commented Mar 20, 2023

Thanks for the comments.

The current state of the field is that many summary statistics files either lack key information (particularly effect allele, EAF), which hinders downstream use of the data, or are not shared at all. The main goal of GWAS-SSF is to identify key mandatory and non-mandatory data and metadata fields for usability and to encourage data sharing. We believe that, at this point in time, the community will benefit from a definition of these data fields, which can be applied to the simple TSV format described here, to GWAS-VCF, or to any other file format. We are updating the manuscript to focus on the data content and make this clearer.

It’s clear that including metadata in the header is an optimal choice for data integrity. With respect to the GWAS Catalog, we heard in our working groups that it could be a big stretch for some users to use this format, presenting an additional overhead and barrier to sharing and/or use of the data, which would be counterproductive. We believe that the risks in separating the data and metadata are already limited by sharing data via a FAIR resource. Therefore, we don’t feel it’s appropriate to commit resources to changing our ingest pipelines to adopt a file format with metadata in the header at the current time. However, we will continue to monitor the situation as the field evolves and more tooling becomes available.
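As a rough illustration of one way a separate metadata file can stay tied to its data file, the sketch below checks a data file's MD5 digest against a checksum recorded in the metadata. The key name `data_file_md5sum` and both function names are assumptions made for this example; the thread does not say how the GWAS Catalog links the two files.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of the raw data file contents."""
    return hashlib.md5(data).hexdigest()


def matches_metadata(data: bytes, metadata: dict) -> bool:
    """Check the data against the checksum stored in the sidecar metadata.

    `data_file_md5sum` is an assumed key name used only for illustration.
    """
    recorded = metadata.get("data_file_md5sum")
    return recorded is not None and md5_of_bytes(data) == recorded
```

A consumer could run such a check after download and refuse to pair a sumstats file with metadata whose recorded checksum does not match.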

@marcora
Author

marcora commented Mar 21, 2023

It takes one command to convert GWAS-VCF to a more "non-bioinformatician user"-friendly format. In my opinion, GWAS Catalog should offer summary stats in various formats for various users (since it seems you are aiming to satisfy non-expert users, I would recommend Excel with an additional tab for metadata, and GWAS-VCF for bioinformaticians). But whatever you propose as "standard" is going to become the de facto standard in the community of bioinformaticians and tool developers, and in my opinion that should be the format with data integrity (and therefore reproducibility) as the foremost priority.

@ljwh2
Collaborator

ljwh2 commented Mar 21, 2023

Yes, we would love to provide different formats for different users and this could be a future goal. For now, OpenGWAS (as I’m sure you know) are providing GWAS Catalog summary statistics in GWAS-VCF format, and the new mandatory fields should increase the number of data files that are suitable for MR and hence for them to ingest.
