Overview of uniFAIR process (idea) [unfinished]
Born out of the needs of the FAIRtracks project (Webpage: https://fairtracks.github.io/. Publication: https://doi.org/10.12688/f1000research.28449.1), uniFAIR (the Universal FAIRification Engine) is a general data and metadata processing framework with reusable modules (e.g., import/export, identifier matching, ontology mapping, batch string error correction, model conversion, validation) that allows users to create pipelines customized to their specific FAIRification process.
As initial use cases, we will consider the following two scenarios:
- Transforming ENCODE metadata into FAIRtracks format
- Transforming TCGA metadata into FAIRtracks format
- uniFAIR is designed to work with content that could be classified as either data or metadata in its original context. For simplicity, we will refer to all such content as "data".
- Input: API endpoint producing JSON data
- Output: Pandas DataFrames (possibly with embedded JSON objects/lists)
- Description: General interface to support various API endpoints. Import all data by crawling API endpoints that provide JSON content (see the sketch below the details).
- Generalizable: Partly (at least reuse of utility functions)
- Manual/automatic: Automatic
- Details:
  - TCGA:
    - More details to come...
  - ENCODE:
    - Identify where to start (Cart? Experiment?)
    - To get all data for a table (double-check this): https://www.encodeproject.org/experiments/@@listing?format=json&frame=object
    - Download all tables directly.
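As a rough illustration of what such an importer could look like (not an actual uniFAIR module), here is a minimal sketch using `requests` and `pandas`; the `import_json_endpoint` helper and the handling of the ENCODE response layout are assumptions:

```python
import pandas as pd
import requests


def import_json_endpoint(url: str) -> pd.DataFrame:
    """Fetch a JSON listing from an API endpoint and flatten the records into a
    DataFrame, keeping embedded objects/lists as cell values for later steps."""
    response = requests.get(url, headers={"Accept": "application/json"})
    response.raise_for_status()
    payload = response.json()
    # ENCODE-style listings wrap the records in an "@graph" array (assumption);
    # other endpoints may return the list of records directly.
    records = payload["@graph"] if isinstance(payload, dict) else payload
    return pd.json_normalize(records, max_level=0)  # keep nested objects embedded


# Example usage with the ENCODE experiments listing mentioned above:
experiments = import_json_endpoint(
    "https://www.encodeproject.org/experiments/@@listing?format=json&frame=object"
)
```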
- Input: JSON content as files
- Output: Pandas DataFrames (possibly with embedded JSON objects/lists)
- Description: Import data from files. Requires specific parsers to be implemented.
- Generalizable: Fully
- Manual/automatic: Automatic
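A corresponding sketch for file-based JSON import, assuming one JSON file per original table (the directory layout and helper name are made up for illustration):

```python
import json
from pathlib import Path

import pandas as pd


def import_json_file(path: Path) -> pd.DataFrame:
    """Load a JSON file containing a list of records into a DataFrame,
    keeping embedded objects/lists as cell values."""
    with path.open(encoding="utf-8") as json_file:
        records = json.load(json_file)
    return pd.json_normalize(records, max_level=0)


# Example: one DataFrame per JSON file in a local directory
tables = {path.stem: import_json_file(path) for path in Path("json_dumps").glob("*.json")}
```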
- Input: File content in some supported format (e.g. GSuite)
- Output: Pandas DataFrames (possibly containing lists of identifiers as Pandas Series) + reference metadata
- Description: Import data from files. Requires specific parsers to be implemented.
- Generalizable: Partly (generating reference metadata might be tricky)
- Manual/automatic: Automatic
- Input: Direct access to relational database
- Output: Pandas DataFrames (possibly containing lists of identifiers as Pandas Series) + reference metadata
- Description: Import data from database
- Generalizable: Fully
- Manual/automatic: Automatic
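For the database case, a sketch along these lines might work, assuming SQLAlchemy is used for the connection (the connection string is a placeholder); the reference metadata could then be derived from the foreign-key definitions in the schema:

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# Placeholder connection string; any database supported by SQLAlchemy would work
engine = create_engine("postgresql://user:password@localhost/source_db")
inspector = inspect(engine)

# One DataFrame per table in the source database
tables = {
    table_name: pd.read_sql_table(table_name, con=engine)
    for table_name in inspector.get_table_names()
}

# Reference metadata can be derived from the declared foreign keys
foreign_keys = {
    table_name: inspector.get_foreign_keys(table_name)
    for table_name in tables
}
```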
- Input: Pandas DataFrames (possibly with embedded JSON objects/lists)
- Output: Pandas DataFrames (possibly containing lists of identifiers as Pandas Series) + reference metadata
- Description: Replace embedded objects with identifiers (possibly as lists)
- Generalizable: Partly (generating reference metadata might be tricky)
- Manual/automatic: Depends on the original input
- Details:
  - If there are embedded objects from other tables:
    - ENCODE update: by using the `frame=object` parameter, we will not get any embedded objects from the APIs.
    - If the original table of the embedded objects can be retrieved directly from an API, replace such embedded objects with unique identifiers referring to the objects in the other table (maintaining a reference to the name of the table, if needed)
      - Record the reference metadata `(table_from, attr_from) -> (table_to, attr_to)` for joins
        - Example: `(table: "experiment", column: "replicates") -> (table: "replicate", column: "@id")`
    - If the original table of the embedded objects is not directly available from an API, one needs to fill out the other table with the content that is embedded in the current object, creating the table if needed.
  - If there are lists of identifiers:
    - Record the reference metadata `(table_from, attr_from) -> (table_to, attr_to)` for joins
    - Convert into Pandas Series
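A minimal sketch of the embedded-object replacement, assuming the reference metadata is kept as a simple dict of `(table_from, attr_from) -> (table_to, attr_to)` pairs (the helper name and metadata structure are illustrative, not a settled design):

```python
import pandas as pd

# (table_from, attr_from) -> (table_to, attr_to), as described above
reference_metadata = {}


def replace_embedded_objects(df, table_from, attr_from, table_to, id_attr="@id"):
    """Replace embedded objects (dicts, or lists of dicts) in a column with
    their identifiers, and record the reference metadata for later joins."""
    def extract_id(value):
        if isinstance(value, dict):
            return value[id_attr]
        if isinstance(value, list):
            return [item[id_attr] if isinstance(item, dict) else item for item in value]
        return value

    df = df.copy()
    df[attr_from] = df[attr_from].map(extract_id)
    reference_metadata[(table_from, attr_from)] = (table_to, id_attr)
    return df


# Example, mirroring the reference metadata example above:
# experiments = replace_embedded_objects(experiments, "experiment", "replicates", "replicate")
```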
- Input: Pandas DataFrames (possibly containing lists of identifiers as Pandas Series) + reference metadata
- Output: Pandas DataFrames (original tables without reference column) [1NF] + reference tables + reference metadata
- Description: Move references into separate tables, transforming the tables into first normal form (1NF)
- Generalizable: Fully
- Manual/automatic: Automatic
- Details:
  - For each reference pair:
    - Create a reference table
    - For each item in the "from"-reference column:
      - Add new rows in the reference table for each "to"-identifier, using the same "from"-identifier
        - Example: Table "experiment-replicate" with columns "experiment.@id" and "replicate.@id"
    - Delete the complete column from the original table
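A sketch of how a column of identifier lists could be split out into a separate reference table with pandas (the helper name and column-naming scheme are assumptions):

```python
import pandas as pd


def extract_reference_table(df, from_table, from_column, to_table, id_column="@id"):
    """Move a column of identifier lists into a separate two-column reference
    table and drop it from the original table (towards 1NF)."""
    reference_table = (
        df[[id_column, from_column]]
        .explode(from_column)  # one row per "to"-identifier in each list
        .rename(columns={id_column: f"{from_table}.@id", from_column: f"{to_table}.@id"})
        .reset_index(drop=True)
    )
    return df.drop(columns=[from_column]), reference_table


# Example: table "experiment-replicate" with columns "experiment.@id" and "replicate.@id"
# experiments, experiment_replicate = extract_reference_table(
#     experiments, "experiment", "replicates", "replicate")
```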
- Input: Pandas DataFrames (original tables without reference column) [1NF] + reference tables
- Output: Pandas DataFrames (original tables without reference column) [2NF] + reference tables
- Description: Automatic transformation of original tables into second normal form (2NF)
- Generalizable: Fully (if not, we skip it)
- Manual/automatic: Automatic
- Details:
  - Use existing library.
- Input: Pandas DataFrames (original tables without reference column) [2NF] + reference tables
- Output: Pandas DataFrames (original tables without reference column) [3NF] + reference tables
- Description: Automatic transformation of original tables into third normal form (3NF)
- Generalizable: Fully (if not, we skip it)
- Manual/automatic: Automatic
- Details:
  - Use existing library.
- Input: Pandas DataFrames (original tables without reference column) [Any NF] + reference tables + FAIRtracks JSON schemas
- Output: Model map [some data structure (to be defined) mapping FAIRtracks objects and attributes to tables/columns in the original data]
- Description: Manual mapping of FAIRtracks objects and attributes to corresponding tables and columns in the original data.
- Generalizable: Fully
- Manual/automatic: Manual
- Details:
  - For each FAIRtracks object:
    - Define a start table in the original data
    - For each FAIRtracks attribute:
      - Manually find the path (or paths) to the original table/column that this maps to
        - Example: `Experiments:organism` (FAIRtracks) -> `Experiments.Biosamples.Organism.scientific_name`
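The exact data structure for the model map is still to be defined; one possible shape is a plain nested dict, shown here with the examples from this page (all names are illustrative):

```python
# Hypothetical model map: for each FAIRtracks object, a start table in the
# original data and, per FAIRtracks attribute, one or more paths to follow.
model_map = {
    "Experiments": {
        "start_table": "experiment",
        "attributes": {
            # Single path: Experiments:organism (FAIRtracks) maps to
            # Experiments.Biosamples.Organism.scientific_name (example above)
            "organism": [["experiment", "biosample", "organism.scientific_name"]],
            # Multimapped attribute: two original columns both map to Experiment.target
            "target": [["experiment", "origcolumn1"], ["experiment", "origcolumn2"]],
        },
    },
}
```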
- Input: Pandas DataFrames (original tables without reference column) [Any NF] + reference tables + Model map
- Output: Pandas DataFrames (initial FAIRtracks tables, possibly with multimapped attributes)
  - Example: `Experiment.target_from_origcolumn1` and `Experiment.target_from_origcolumn2` contain content from two different attributes in the original data that both correspond to `Experiment.target`
- Description: Generate initial FAIRtracks tables by applying the model map, mapping each FAIRtracks attribute to one or more attributes (columns) in the original tables (see the join sketch below the details).
- Generalizable: Fully
- Manual/automatic: Automatic
- Details:
  - For every FAIRtracks object:
    - Create a new pandas DataFrame
    - For every FAIRtracks attribute:
      - From the model map, get the path to the corresponding original table/column, or a list of such paths in case of multimapping
      - For each path:
        - Automatically join tables to get primary keys and attribute values in the same table:
          - Example: `experiment-biosample JOIN biosample-organism JOIN organism` will create a mapping table with two columns: `Experiments.local_id` and `Organism.scientific_name`
          - Add the column to the FAIRtracks DataFrame
          - In case of multimapping, record the relation between the FAIRtracks attribute and the corresponding multimapped attributes, e.g. by generating unique attribute names for each path, such as `Experiment.target_from_origcolumn1` and `Experiment.target_from_origcolumn2`, which can be derived directly from the model map.
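A concrete, hand-written version of the join example above, assuming the reference tables "experiment-biosample" and "biosample-organism" were produced in the earlier reference-table step and that all tables are kept in dicts of DataFrames (all names are illustrative):

```python
import pandas as pd


def build_mapping_table(tables, reference_tables):
    """Join along the path experiment -> biosample -> organism, returning a
    mapping table with the columns "Experiments.local_id" and
    "Organism.scientific_name" (as in the example above)."""
    joined = (
        reference_tables["experiment-biosample"]  # columns: experiment.@id, biosample.@id
        .merge(reference_tables["biosample-organism"], on="biosample.@id")
        .merge(tables["organism"], left_on="organism.@id", right_on="@id")
        .rename(columns={
            "experiment.@id": "Experiments.local_id",
            "scientific_name": "Organism.scientific_name",
        })
    )
    return joined[["Experiments.local_id", "Organism.scientific_name"]]
```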
- Input: Pandas DataFrames (initial FAIRtracks tables, possibly with multimapped attributes) + model map
- Output: Pandas DataFrames (initial FAIRtracks tables)
- Description: Harmonize multimapped attributes manually, or possibly by applying scripts
- Generalizable: Limited (mostly by reusing util functions)
- Manual/automatic: Mixed (possibly scriptable)
- Details:
  - For all multimapped attributes:
    - Manually review values (in batch mode) and generate a single output value for each combination:
      - Hopefully Open Refine can be used for this. If so, one needs to implement data input/output mechanisms.
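Where Open Refine turns out not to be needed for a given attribute, a simple scripted harmonization could look like this (the column names reuse the multimapping example above; the combination rule is only an illustration):

```python
import pandas as pd


def harmonize_multimapped(df, target_column, source_columns, combine):
    """Collapse several multimapped columns into a single FAIRtracks attribute
    using an agreed combination rule, then drop the source columns."""
    df = df.copy()
    df[target_column] = df[source_columns].apply(combine, axis=1)
    return df.drop(columns=source_columns)


# Example: prefer the value from the first original column, fall back to the second
# experiments = harmonize_multimapped(
#     experiments,
#     "target",
#     ["target_from_origcolumn1", "target_from_origcolumn2"],
#     lambda row: row["target_from_origcolumn1"]
#     if pd.notna(row["target_from_origcolumn1"])
#     else row["target_from_origcolumn2"],
# )
```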
Remaining steps (not yet detailed):
- For all FAIRtracks attributes with ontology terms: convert terms using the required ontologies
- Other FAIRtracks-specific value conversions
- Manual batch correction of values (possibly with errors), probably using Open Refine
- Validation of the final FAIRtracks document
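For the validation step, the assembled FAIRtracks JSON document could for instance be checked against the FAIRtracks JSON schemas with the `jsonschema` package (the file names below are placeholders):

```python
import json

from jsonschema import ValidationError, validate

with open("fairtracks.schema.json", encoding="utf-8") as schema_file:
    fairtracks_schema = json.load(schema_file)

with open("fairtracks_document.json", encoding="utf-8") as document_file:
    fairtracks_document = json.load(document_file)

try:
    validate(instance=fairtracks_document, schema=fairtracks_schema)
    print("FAIRtracks document is valid")
except ValidationError as error:
    print(f"Validation failed: {error.message}")
```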
Suggestion: we will use Pandas DataFrames as the core data structure for tables, given that the library provides the required features (specifically foreign key and join capabilities).