Commit b2506c7: Update data_discoverability.md

Small updates in style and spelling

bianchini88 authored Aug 7, 2024
1 parent 00f8c93

Showing 1 changed file with 15 additions and 15 deletions: pages/your_tasks/data_discoverability.md

### Description

Data discovery involves processes and tools that help users understand what data is available, where it is stored, and how to access it. It includes querying datasets to find specific information based on given conditions. However, for data to be discoverable, it must be well-prepared. Making your data discoverable maximises its impact and utility, enabling others to find, access, and use it effectively. Discoverable data promotes transparency, reproducibility, and scientific progress. Achieving this requires detailed metadata and documentation, depositing data in public and institutional repositories, and using standardised formats for interoperability.

### Considerations

* Detailed Metadata: Provide comprehensive metadata for your datasets, including titles, descriptions, keywords, creators, dates, and other relevant information (see the sketch after this list).
* Documentation: Include thorough documentation that explains the dataset, its structure, how it was collected and processed, and any limitations.
* Standard Schemas: Use standardised metadata schemas to ensure consistency and interoperability.
* Standard Formats: Use widely accepted data formats to ensure compatibility and ease of use.
* Public Repositories: Deposit your data in reputable public data repositories that are indexed by search engines and widely used by the research community.
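
As a minimal sketch of what such metadata can look like in machine-readable form, the snippet below builds a small dataset record using the schema.org Dataset vocabulary, one of the general standards mentioned under Solutions. All names, dates, and identifiers are hypothetical placeholders.

```python
import json

# A minimal, illustrative metadata record loosely following the
# schema.org/Dataset vocabulary. Every value is a hypothetical placeholder.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example RNA-seq time course",            # title
    "description": "Normalised gene counts for a 5-point time course.",
    "keywords": ["RNA-seq", "time course", "human"],
    "creator": {"@type": "Person", "name": "Jane Doe"},
    "dateCreated": "2024-08-07",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": "https://doi.org/10.1234/example",  # placeholder DOI
}

# Serialise to JSON-LD so search engines and catalogues can index it.
print(json.dumps(dataset_metadata, indent=2))
```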

### Solutions

* There are many appropriate tools for detailed metadata and data documentation during the project. Check the [Documentation and metadata page](metadata_management) for more information.
* Some scientific communities utilise platforms such as {% tool "cedar" %}, {% tool "semares" %}, {% tool "fairdom-seek" %}, {% tool "fairdomhub" %}, and {% tool "copo" %} for managing metadata and data.
* Various standards exist for different data types, ranging from general dataset descriptions, such as DCAT, Dublin Core, and (bio)schema.org, to those tailored to specific data types, such as MIABIS for biosamples. Hence, selecting the appropriate standard at the project's outset is crucial. Typically, a suitable data repository will come with an integrated metadata schema, simplifying your work by eliminating the need to develop a separate metadata profile.
* Decide on the right repository for your data type at the beginning of the project. To find one, you can use {% tool "elixir-deposition-databases-for-biomolecular-data" %}, {% tool "re3data" %}, or {% tool "fairsharing" %} (under “Databases”).
* If your chosen repository lacks some of the metadata fields you wish to include and you need to add a separate file with this information (as in Zenodo), you should adhere to the appropriate metadata schema; there are several ways to identify the correct one.

Discovering research data for re-analysis can occur at different levels of granularity.

### Considerations

* Detailed Metadata: ensure comprehensive metadata for your datasets, including detailed descriptions of studies, datasets, variables, and any available distributions.
* Data Catalogs and Repositories: use well-maintained online catalogs and repositories that support controlled access data, and check for advanced search features to filter datasets by specific attributes.
* Data Access Policies: get familiar with the data access policies of different repositories and datasets, understanding the requirements and procedures for requesting access to controlled data.
* Ethical and Legal Compliance: ensure compliance with ethical guidelines and legal regulations governing data use and sharing, and obtain necessary approvals from institutional review boards or ethics committees if required.
* Data Access Request Process: be aware that the process for requesting and obtaining data access can be time-consuming, and prepare detailed justifications for data access requests, including research objectives and intended analyses.
* Privacy and Security Measures: implement robust privacy and security measures to protect sensitive data during discovery and after access is granted, ensuring data handling practices comply with data protection regulations.

### Solutions

{% tool "beacon" %}, developed through the Global Alliance for Genomics and Health (GA4GH) Discovery workstream, and with substantial support from ELIXIR, serves as a data discovery protocol and specification defining an open standard for discovering genomic and phenoclinic data in biomedical research and clinical applications.

The latest version, {% tool "beacon-v2" %}, introduced expanded query options, enabling the retrieval of biological or technical (meta)data through filters defined via CURIEs. This includes, but is not limited to, parameters such as phenotypes, disease codes, sex, or age, providing researchers with a nuanced approach to data inquiries.
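
As an illustration of such a filtered query, the sketch below sends a Beacon v2 style request that counts individuals matching two ontology filters. The endpoint URL is hypothetical and the CURIEs are only examples; a real query would use the filtering terms advertised by the beacon being queried.

```python
import requests

BEACON_URL = "https://beacon.example.org/api/individuals"  # hypothetical endpoint

# Beacon v2 requests carry a "meta" block and a "query" block; ontology
# filters are expressed as CURIEs. These example terms are illustrative.
payload = {
    "meta": {"apiVersion": "2.0"},
    "query": {
        "filters": [
            {"id": "NCIT:C16576"},    # NCIT term for "female"
            {"id": "MONDO:0005015"},  # Mondo term for "diabetes mellitus"
        ],
        "requestedGranularity": "count",  # ask for counts, not full records
    },
}

response = requests.post(BEACON_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())
```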

Beacon v2 is organised in two main blocks: the Beacon Framework and the Beacon Model.
The Framework defines the format for the requests and responses, whereas the Model defines the structure of the biological data response.

This dual-system approach not only broadens the scope for diverse Models – covering domains such as images, pathogens, or infectious diseases – but also reinforces the adaptability of the Framework. Together, these components provide the instructions for designing a REST API that can be implemented as a stand-alone product or, preferably, as an extension of existing data management solutions.
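
To make the split concrete, here is a sketched response: the outer fields follow the Beacon Framework, while the records inside the result sets are the part shaped by the Beacon Model. All values are invented for illustration.

```python
# A sketched Beacon v2 response; every value below is invented.
example_response = {
    # Framework: information about the answering beacon
    "meta": {"beaconId": "org.example.beacon", "apiVersion": "2.0"},
    # Framework: does matching data exist, and how much of it?
    "responseSummary": {"exists": True, "numTotalResults": 42},
    # Model: the shape of the biological records returned
    "response": {
        "resultSets": [
            {"id": "example-dataset", "type": "dataset", "resultsCount": 42}
        ]
    },
}
```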

Consequently, 'beaconised' data represents a significant enhancement in data discoverability with minimal risk.

Currently, there are two ways to implement a Beacon:

* On top of existing tools (API): targeted at organisations that have well-organised, structured data housed in databases, whether SQL or NoSQL, and possess the necessary resources and expertise to interpret and implement the Beacon v2 specification and construct an API on top of an existing tool (a sketch of this translation layer follows Figure 1).

{% include image.html file="beacon-api.JPG" inline=true caption="Figure 1. Beacon API functionality ([Source](https://docs.genomebeacons.org/implementations-options/))" alt="Beacon API" max-width="30em"%}

* {% tool "beacon-ri" %}: Beacon v2 Reference Implementation, B2RI, is an out-of-the-box example implementation of the Beacon v2 protocol. It is an open-source toolkit based on Python programming language and consists of tools for loading metadata, e.g.phenotypic data, from a CSV file and genomic variants from a VCF file into a MongoDB database. It also features the Beacon query engine (REST API) and comes bundled with an example dataset (CINECA synthetic cohort EUROPE UK1) comprising synthetic data.
* {% tool "beacon-ri" %}: Beacon v2 Reference Implementation, B2RI, is an out-of-the-box example implementation of the Beacon v2 protocol. It is an open-source toolkit based on Python programming language and consists of tools for loading metadata, e.g. phenotypic data, from a CSV file and genomic variants from a VCF file into a MongoDB database. It also features the Beacon query engine (REST API) and comes bundled with an example dataset (CINECA synthetic cohort EUROPE UK1) comprising synthetic data.

{% include image.html file="beacon-ri.JPG" inline=true caption="Figure 2. Beacon RI functionality. ([Source](https://docs.genomebeacons.org/implementations-options/))" alt="Beacon RI" max-width="30em" %}
