Add page on data discoverability (#1517)
* Create data_discoverability.md

* Update data_discoverability.md

* Update tool_and_resource_list.yml

* Add files via upload

* Add content to data discoverability page

* Update data_discoverability.md

* Add tools to the tools table

* Update tool_and_resource_list.yml

* Add new tools in the text

* improve size of pic + part of list

* add to sidebar

* typo

* Add Aina to contributors

* Update pages/your_tasks/data_discoverability.md

Co-authored-by: Bert Droesbeke <44875756+bedroesb@users.noreply.github.com>

* Add related pages

* Update data_discoverability.md

replaced 'like' with 'such as'

* Update data_discoverability.md

Small updates in style and spelling

* Add changes based on Nazeefa's comments

* Update text based on Nazeefa's comments

* Changes based on Nazeefa's comments

* add news item

* capital in title

* adding acronym to tool snippet

* no task page linkage

* Remove B4OMOP documentation

* Update links

---------

Co-authored-by: bedroesb <bert.droesbeke@vib.be>
Co-authored-by: Bert Droesbeke <44875756+bedroesb@users.noreply.github.com>
Co-authored-by: Nazeeefa <fatima.nazeefa@gmail.com>
Co-authored-by: Federico Bianchini <72258479+bianchini88@users.noreply.github.com>
5 people authored Oct 11, 2024
1 parent abfcf46 commit 8b14d10
Showing 7 changed files with 93 additions and 0 deletions.
5 changes: 5 additions & 0 deletions _data/CONTRIBUTORS.yaml
@@ -641,3 +641,8 @@ Arshiya Merchant:
Xènia Pérez Sitjà:
  affiliation: Earlham Institute / ELIXIR-UK
  git: sitjart
Aina Jené Cortada:
  git: ainajene
  email: aina.jene@crg.eu
  orcid: 0000-0001-7721-7097
  affiliation: European Genome-phenome Archive (EGA) / CRG
4 changes: 4 additions & 0 deletions _data/news.yml
@@ -186,3 +186,7 @@
  date: 2024-10-07
  linked_pr: 1437
  description: Our [contribute section](how_to_contribute) got an overhaul, with clear ways in how to contribute and improved documentation.
- name: "New page: Data discoverability"
  date: 2024-10-10
  linked_pr: 1517
  description: A "your task" page about how to make your data more discoverable was added. [Discover the page here](data_discoverability).
2 changes: 2 additions & 0 deletions _data/sidebars/data_management.yml
@@ -74,6 +74,8 @@ subitems:
url: /data_analysis
- title: Data brokering
url: /data_brokering
- title: Data discoverability
url: /data_discoverability
- title: Data management coordination
url: /dm_coordination
- title: Data management plan
12 changes: 12 additions & 0 deletions _data/tool_and_resource_list.yml
@@ -2665,3 +2665,15 @@
  id: better-bibtex
  name: Better BibTeX (BBT)
  url: https://retorque.re/zotero-better-bibtex/
- description: Beacon for OMOP (B4OMOP) software allows for the integration of a Beacon onto any OMOP Common Data Model (CDM) database.
  id: b4omop
  name: Beacon for OMOP (B4OMOP)
  url: https://gitlab.bsc.es/impact-data/impd-beacon_omopcdm
- description: Developed through the Global Alliance for Genomics and Health (GA4GH) Discovery workstream with support from ELIXIR, Beacon is a data discovery protocol defining an open standard for discovering genomic and phenoclinic data in research and clinical applications.
  id: beacon-v2
  name: Beacon v2
  url: https://docs.genomebeacons.org/implementations-options/
- description: An open-source out-of-the-box toolkit to initiate a Beacon v2. B2RI includes tools for loading metadata, such as phenotypic data and genomic variants into a MongoDB database, and features a Beacon query engine (REST API).
  id: beacon-ri
  name: Beacon v2 Reference Implementation (B2RI)
  url: https://github.com/EGA-archive/beacon2-ri-tools-v2
Binary file added images/beacon-api.JPG
Binary file added images/beacon-ri.JPG
70 changes: 70 additions & 0 deletions pages/your_tasks/data_discoverability.md
@@ -0,0 +1,70 @@
---
title: Data discoverability
description: How to make data discoverable
contributors: [Aina Jené Cortada, Laura Portell Silva]
page_id: data_discoverability
---

## How can you make your data more discoverable?

### Description

Data discovery involves processes and tools that help users understand what data is available, where it is stored, and how to access it. It includes querying datasets to find specific information based on given conditions. However, for data to be discoverable, it must be well-prepared. Making your data discoverable maximises its impact and utility, enabling others to find, access, and use it effectively. Discoverable data promotes transparency, reproducibility, and scientific progress. Achieving this requires detailed metadata and documentation, depositing data in public and institutional repositories, and using standardised formats for interoperability.

### Considerations

* Detailed Metadata: Provide comprehensive metadata for your datasets, including titles, descriptions, keywords, creators, dates, and other relevant information.
* Documentation: Include thorough documentation that explains the dataset, its structure, how it was collected and processed, and any limitations.
* Standard Schemas: Use standardised metadata schemas to ensure consistency and interoperability.
* Standard Formats: Use widely accepted data formats to ensure compatibility and ease of use.
* Public Repositories: Deposit your data in reputable public data repositories that are indexed by search engines and widely used by the research community.

### Solutions

* Several tools can help you create detailed metadata and document your data properly. Check the [Documentation and metadata](metadata_management) page for more information.
* Some scientific communities utilise platforms such as {% tool "cedar" %}, {% tool "semares" %}, {% tool "fairdom-seek" %}, {% tool "fairdomhub" %}, and {% tool "copo" %} for managing metadata and data.
* Various standards exist for different data types, from general dataset descriptions such as DCAT, Dublin Core, and (bio)schema.org, to those tailored for specific data types, such as MIABIS for biosamples. Hence, selecting the appropriate standard at the project's outset is crucial. Typically, if you choose a suitable data repository for your data, it will come with an integrated metadata scheme, simplifying your work by eliminating the need to develop a separate metadata profile.
* Choose the right repository for your data type at the beginning of the project. To find one, you can use the {% tool "elixir-deposition-databases-for-biomolecular-data" %}, {% tool "re3data" %} or {% tool "fairsharing" %} at “Databases”.
* If your chosen repository lacks some of the metadata fields you wish to include and you need to add a separate file with this information (such as in Zenodo), you should adhere to the appropriate metadata schema. To identify the correct schema, you have several options:
    * {% tool "rda-standards" %}
    * {% tool "fairsharing" %} at “Standards” and “Collections”
    * {% tool "data-curation-centre-metadata-list" %}
* The ideal file formats vary based on the type of data, the availability and common acceptance of open file formats, and the research domain. There isn't a universal solution, so selecting the most suitable format for your specific needs is essential. The [Data Organisation](data_organisation) page provides a table with recommended file formats and best practices for research data management.
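
As a rough illustration of the metadata fields listed under Considerations, the sketch below builds a schema.org `Dataset` description as JSON-LD using only the Python standard library. The function name `build_dataset_metadata` and all example values (title, creator, DOI, licence) are hypothetical placeholders; the repository you choose may require a different schema or additional fields.

```python
import json


def build_dataset_metadata(title, description, keywords, creators,
                           date_published, license_url, identifier):
    """Return a schema.org Dataset description as a JSON-LD-ready dict."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": title,
        "description": description,
        "keywords": list(keywords),
        "creator": [{"@type": "Person", "name": name} for name in creators],
        "datePublished": date_published,
        "license": license_url,
        "identifier": identifier,
    }


# All values below are placeholders for illustration only.
record = build_dataset_metadata(
    title="Example cohort phenotypes",
    description="Synthetic phenotypic data for a demonstration cohort.",
    keywords=["phenotype", "synthetic data"],
    creators=["Jane Doe"],
    date_published="2024-10-10",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    identifier="https://doi.org/10.1234/example",
)

print(json.dumps(record, indent=2))
```

A description like this can be stored alongside the data files or embedded in a landing page, where generic search engines and catalogues that understand schema.org markup can pick it up.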

## How can you discover controlled access data?

### Description

Discovering research data for re-analysis can occur at different levels of granularity. Initially, researchers browse online catalogues that describe studies, datasets, related publications, variables, and some data distributions. This basic discovery may suffice if the datasets meet all the criteria. However, to find datasets that meet a specific combination of attributes (for example, 'adults diagnosed with COVID-19 in the last year, fully vaccinated, with no underlying health conditions'), researchers must either contact the authors or request data access and check the data themselves. This process is feasible for a small number of datasets and cooperative data controllers, but it is usually time-consuming and uncertain. To streamline this, data discovery at the source allows users to query data non-disclosively, determining its relevance before requesting full access.
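
To make the idea of a non-disclosive query more concrete, here is a minimal sketch in Python. It evaluates the example criteria from the paragraph above against a toy in-memory cohort and returns only whether matches exist and how many, never the records themselves. The `Participant` fields, the `discover` function and the small-count suppression threshold (`min_count`) are assumptions made for illustration; they are not part of any specific discovery service.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class Participant:
    age: int
    covid_diagnosis_days_ago: Optional[int]  # None means never diagnosed
    fully_vaccinated: bool
    underlying_conditions: Tuple[str, ...]


def matches(p: Participant) -> bool:
    """Adults diagnosed with COVID-19 in the last year, fully vaccinated, no underlying conditions."""
    return (
        p.age >= 18
        and p.covid_diagnosis_days_ago is not None
        and p.covid_diagnosis_days_ago <= 365
        and p.fully_vaccinated
        and not p.underlying_conditions
    )


def discover(records: Iterable[Participant],
             predicate: Callable[[Participant], bool],
             min_count: int = 5) -> dict:
    """Answer a discovery query with aggregate information only."""
    count = sum(1 for record in records if predicate(record))
    if count == 0:
        return {"exists": False, "count": 0}
    if count < min_count:
        # Assumption: small counts are hidden to reduce re-identification risk.
        return {"exists": True, "count": None}
    return {"exists": True, "count": count}


# Toy cohort: six matching participants and four that each fail one criterion.
cohort = [Participant(34, 120, True, ())] * 6 + [
    Participant(15, 30, True, ()),          # not an adult
    Participant(52, 400, True, ()),         # diagnosed more than a year ago
    Participant(47, 60, False, ()),         # not fully vaccinated
    Participant(61, 90, True, ("asthma",)), # has an underlying condition
]

result = discover(cohort, matches)
print(result)  # {'exists': True, 'count': 6}
```

The requester learns that the dataset is relevant before starting a formal access request, while the data controller keeps individual-level records behind controlled access.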

### Considerations

* Detailed Metadata: ensure comprehensive metadata for your datasets, including detailed descriptions of studies, datasets, variables, and any available distributions.
* Data Catalogues and Repositories: use well-maintained online catalogues and repositories that support controlled access data, and check for advanced search features to filter datasets by specific attributes.
* Data Access Policies: get familiar with the data access policies of different repositories and datasets, understanding the requirements and procedures for requesting access to controlled data.
* Ethical and Legal Compliance: ensure compliance with ethical guidelines and legal regulations governing data use and sharing, and obtain necessary approvals from institutional review boards or ethics committees if required. Check the [GDPR compliance](gdpr_compliance) and [Ethical aspects](ethics) pages for more information.
* Data Access Request Process: be aware that the process for requesting and obtaining data access can be time-consuming, and prepare detailed justifications for data access requests, including research objectives and intended analyses.
* Privacy and Security Measures: implement robust privacy and security measures to protect sensitive data during discovery and after access is granted, ensuring data handling practices comply with data protection regulations. Check the [Data sensitivity](data_sensitivity) page for more information.

### Solutions

{% tool "beacon" %}, developed through the Global Alliance for Genomics and Health (GA4GH) Discovery workstream, and with substantial support from ELIXIR, serves as a data discovery protocol and specification defining an open standard for discovering genomic and phenoclinic data in biomedical research and clinical applications.

The latest version, {% tool "beacon-v2" %}, introduced expanded query options, enabling the retrieval of biological or technical (meta)data through filters defined via CURIEs. This includes, but is not limited to, parameters such as phenotypes, disease codes, sex, or age, providing researchers with a nuanced approach to data inquiries.
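
The following Python sketch shows what a filter-based query body might look like. It is a simplified illustration based on our reading of the Beacon v2 request format, not a complete or authoritative client: the helper name `build_beacon_query`, the CURIE regular expression, the `apiVersion` string and the example ontology terms are assumptions, and a real Beacon will document which filtering terms it supports.

```python
import json
import re

# Simplified CURIE check (prefix:local_id); real CURIE syntax is more permissive.
CURIE_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_.-]*:[A-Za-z0-9_.-]+$")

# Granularity levels a requester can ask for.
GRANULARITIES = {"boolean", "count", "record"}


def build_beacon_query(curie_filters, granularity="count", limit=10):
    """Build a Beacon v2-style request body from a list of CURIE filters."""
    if granularity not in GRANULARITIES:
        raise ValueError(f"unknown granularity: {granularity!r}")
    for curie in curie_filters:
        if not CURIE_PATTERN.match(curie):
            raise ValueError(f"not a CURIE: {curie!r}")
    return {
        "meta": {"apiVersion": "2.0"},  # assumed version string
        "query": {
            "filters": [{"id": curie} for curie in curie_filters],
            "requestedGranularity": granularity,
            "pagination": {"skip": 0, "limit": limit},
        },
    }


# Illustrative terms: NCIT:C16576 (female) and MONDO:0100096 (intended as COVID-19).
payload = build_beacon_query(["NCIT:C16576", "MONDO:0100096"])
print(json.dumps(payload, indent=2))
```

Such a body would typically be sent with an HTTP POST to an entry-type endpoint of a Beacon (for example one listing individuals); the exact URL depends on the deployment.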

Beacon v2 is organised in two main blocks: the Beacon Framework and the Beacon Model.
The Framework defines the format for the requests and responses, whereas the Model defines the structure of the biological data response.
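
The split between Framework and Model can be seen in the shape of a response. In the hand-written, simplified example below, the envelope (`meta`, `responseSummary`, `resultSets`) follows the Framework, while the content of each entry in `results` follows the Model. The beacon identifier, dataset identifier and record values are invented for illustration, and `summarise_response` is a hypothetical helper rather than part of any Beacon library.

```python
def summarise_response(response):
    """Extract the Framework-level summary and the ids of Model-level records."""
    summary = response["responseSummary"]
    result_sets = response.get("response", {}).get("resultSets", [])
    records = [
        record
        for result_set in result_sets
        for record in result_set.get("results", [])
    ]
    return {
        "exists": summary["exists"],
        "total": summary.get("numTotalResults"),
        "record_ids": [record.get("id") for record in records],
    }


example_response = {
    # Framework: metadata about the request and the responding Beacon.
    "meta": {
        "apiVersion": "v2.0.0",
        "beaconId": "org.example.beacon",
        "returnedGranularity": "record",
    },
    # Framework: aggregate answer to the query.
    "responseSummary": {"exists": True, "numTotalResults": 2},
    "response": {
        "resultSets": [
            {
                "id": "example-dataset",
                "setType": "individual",
                "exists": True,
                "resultsCount": 2,
                # Model: the structure of each biological record.
                "results": [
                    {"id": "ind-1", "sex": {"id": "NCIT:C16576", "label": "female"}},
                    {"id": "ind-2", "sex": {"id": "NCIT:C16576", "label": "female"}},
                ],
            }
        ]
    },
}

print(summarise_response(example_response))
```

A Beacon answering at `boolean` or `count` granularity would return the same envelope without the record-level `results`, which is what keeps discovery low-risk for controlled access data.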

This dual-system approach not only broadens the scope for diverse Models (covering different domains such as images, pathogens, or infectious diseases) but also reinforces the adaptability of the Framework. Together, these components provide the instructions needed to design a REST API that can be implemented as a stand-alone product or, preferably, as an extension of existing data management solutions.

Consequently, the 'beaconised' data represents a significant enhancement in data discoverability with minimal risks. Currently, there are two ways to implement a Beacon:

* API on top of existing tools: This option is aimed at organisations that hold well-organised and structured data in databases, whether SQL or NoSQL, and that have the resources and expertise needed to interpret and implement the Beacon v2 specification and build an API on top of an existing tool.

{% include image.html file="beacon-api.JPG" inline=true caption="Figure 1. Beacon API functionality ([Source](https://docs.genomebeacons.org/implementations-options/))" alt="Beacon API" max-width="30em"%}

* {% tool "beacon-ri" %}: an out-of-the-box example implementation of the Beacon v2 protocol. It is an open-source toolkit written in Python and consists of tools for loading metadata, e.g. phenotypic data, from a CSV file and genomic variants from a VCF file into a MongoDB database. It also features the Beacon query engine (REST API) and comes bundled with an example dataset (CINECA synthetic cohort EUROPE UK1) comprising synthetic data. You can find the GitHub repository for the B2RI tools [here](https://github.com/EGA-archive/beacon2-ri-tools-v2).

{% include image.html file="beacon-ri.JPG" inline=true caption="Figure 2. Beacon RI functionality. ([Source](https://docs.genomebeacons.org/implementations-options/))" alt="Beacon RI" max-width="30em" %}

Building on B2RI, the {% tool "b4omop" %} software allows for the integration of a Beacon onto any OMOP Common Data Model (CDM) database. This enables organisations using the OMOP CDM to leverage the Beacon framework for querying and sharing genomic and phenotypic data.
