Metadata #79
Comments
forum thread where metadata was discussed.
Summary of CDM WG discussion from Aug 1st:
What's the difference between the proposal and modified proposal? Which one is under consideration? What would be the use of the date and datetime fields? Wouldn't these always be the time of ETL?
Hi @schuemie, The modified proposal adds the date and datetime fields, and is the one under consideration. While it's certainly likely metadata records would be captured at ETL time, further observations of the data set could happen during the lifecycle of that particular CDM, as it becomes more utilized through various studies by multiple researchers. I feel it would be useful to track the date of the metadata record to help with traceability of the site's evolving understanding of the data set's nuances. I'm envisioning in AchillesWeb or eventually Atlas a set of metadata reports that would be able to expose this new information to all users of the CDM. Temporal attributes help in consuming this information more easily.
I feel that this approach is simple enough to allow us to get started with Metadata. There are more complex ways to think about Metadata, but from what I've seen they are more complicated than they are worth. Here at Janssen we have been using a similar table for a while now and it has been helpful for tracking high-level information about CDMs (like in the examples above).
All, In a poster proposal for the OHDSI Symposium, @vojtechhuser, @ericaVoss, and I wrote about the benefits of data quality annotations using this very metadata table proposal. With source data rife with temporal trend shifts, contextual nuance, data collection idiosyncrasies, and vocabulary changes, not to mention the ETL design choices needed to wrangle the data, it can be easy to employ poor study design that overlooks these issues, because the knowledge about them is not stored in a central repository, but usually sitting in a few key researchers' brains. This is troublesome in a study utilizing one data set, but considerably more problematic when conducting a network study in which many data sets are leveraged, but none of the known nuances within each data set are acknowledged or adjusted for in some way. A few case studies we discussed include (1) the loss of Social Security Death Master File access in a claims data set, which resulted in a significant drop-off in death records per patient in November 2011; and (2) the change in prevalence of the condition "malaise and fatigue" in that same claims data set due to the migration from ICD9CM to ICD10CM. Both are captured in the proposed metadata table below:
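As a rough sketch only, the two case studies could be held as rows like the ones below. The column names assume the METADATA layout published with CDM v5.3, `PLACEHOLDER_CONCEPT_ID` and the `annotate` helper are hypothetical, and the date is the date the annotation is recorded, not the date of the event it describes.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Placeholder only: real rows would use concept IDs from the vocabulary.
PLACEHOLDER_CONCEPT_ID = 0


@dataclass
class MetadataRecord:
    """One row of the proposed METADATA table (column names are assumed)."""
    metadata_concept_id: int
    metadata_type_concept_id: int
    name: str
    value_as_string: Optional[str] = None
    metadata_date: Optional[date] = None  # when the annotation was recorded


def annotate(name: str, note: str, recorded_on: date) -> MetadataRecord:
    """Build a human-authored annotation row with placeholder concept IDs."""
    return MetadataRecord(
        metadata_concept_id=PLACEHOLDER_CONCEPT_ID,
        metadata_type_concept_id=PLACEHOLDER_CONCEPT_ID,
        name=name,
        value_as_string=note,
        metadata_date=recorded_on,
    )


recorded_on = date.today()
annotations = [
    annotate(
        "Death records",
        "Loss of Social Security Death Master File access; "
        "death records per patient drop off in November 2011.",
        recorded_on,
    ),
    annotate(
        "Malaise and fatigue",
        "Prevalence changes with the migration from ICD9CM to ICD10CM.",
        recorded_on,
    ),
]

for record in annotations:
    print(f"{record.metadata_date} {record.name}: {record.value_as_string}")
```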
The ability to have a central place to store human-authored observations like these can benefit researchers utilizing this data set, particularly if they are exposed via our main web applications. Additionally, the user can be warned of potential "dangers" in attempting to utilize these problematic concepts in their cohort design. Another application enhancement made possible is the ability for data custodians to catalog design choices in building the CDM version of the data set, such as in Achilles Heel. We cannot begin to implement these standard practices of data set annotation without a metadata repository. The one being proposed here, while not as robust as something like ISO-11179, does satisfy the core requirements of a metadata repository as specified by the National Information Standards Organization (NISO):
I feel we should adopt this proposal, begin using it, and then share our experiences with its usage through a working group.
Visit is one domain, but now has two tables: visit_detail and visit_occurrence. Should METADATA_CONCEPT_ID represent the table, the domain, or both?
We have two OMOPs (one inpatient and one outpatient) that we want to merge. I wonder if it would be worth adding an optional column later, named row_id, that would hold row-level metadata. The main issue is that the metadata table would become very large. I know this is out of scope for this proposal, but I wanted to see if others have an opinion. Thanks.
We'd have to create concepts for all the CDM details: in addition to domains and vocabularies we'd need tables and fields.
Don't we already have a row_id? Like in condition_occurrence_id?
@cgreich you're right. That would be the purpose of row_id. We'd have the domain, so we know it's for the condition_occurrence table and then row_id would represent the condition_occurrence_id.
Oh. You want a shadow record for every other record to put some metadata on it? Really? To show provenance? Put it into the *_type_concept_id. Why do you care anyway? You have VISIT to declare whether it was outpatient or inpatient. What's the use case?
Right. We are already using the type_concept_id. There are data points, such as labs, that don't have a visit, but are in both instances. This issue of provenance is also relevant to the All of Us project, where we are merging essentially 20 OMOPs into one. One option is to append the source information to the source value field, but that gets pretty messy when you want to query for that information. Looking at this more carefully, it seems like my suggestion might not fully work because I was thinking the metadata_concept_id was the domain_id.
Ok. Why don't you come to the CDM WG and describe your problem and possible solutions. I wouldn't try to overload fields that are not supposed to carry that information.
@cgreich I agree about overloading fields. I'll ask to bring this up at the next WG meeting. In the meantime, I'll make a forum post.
We did tease this up a bit in the forums, but it didn't seem like many folks were storing multiple CDMs in one database. I think this makes it clear that we should allow for delineation of metadata records by data source. Perhaps CDM_SOURCE should be the place to identify multiple sources within a CDM, and then METADATA would need a cdm_source_id foreign key field.
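To make the cdm_source_id idea concrete, here is a minimal sketch using SQLite from Python. It is an assumption-laden illustration, not part of the proposal: the `cdm_source_id` key on both tables is hypothetical, the tables are cut down to a few columns, and `notes_for_source` is a helper invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript(
    """
    CREATE TABLE cdm_source (
        cdm_source_id   INTEGER PRIMARY KEY,  -- hypothetical key column
        cdm_source_name TEXT NOT NULL
    );
    CREATE TABLE metadata (
        cdm_source_id       INTEGER NOT NULL
                            REFERENCES cdm_source (cdm_source_id),
        metadata_concept_id INTEGER NOT NULL,
        name                TEXT NOT NULL,
        value_as_string     TEXT
    );
    """
)

# Two sources merged into one CDM, as in the inpatient/outpatient case above.
conn.executemany(
    "INSERT INTO cdm_source VALUES (?, ?)",
    [(1, "inpatient"), (2, "outpatient")],
)
# Concept ID 0 is a placeholder, not a real vocabulary concept.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        (1, 0, "Labs", "Lab results without a visit"),
        (2, 0, "Visits", "Outpatient visits only"),
    ],
)
conn.commit()


def notes_for_source(connection, source_name):
    """Return (name, value_as_string) metadata rows for one named source."""
    rows = connection.execute(
        """
        SELECT m.name, m.value_as_string
        FROM metadata AS m
        JOIN cdm_source AS s ON s.cdm_source_id = m.cdm_source_id
        WHERE s.cdm_source_name = ?
        ORDER BY m.name
        """,
        (source_name,),
    )
    return rows.fetchall()


print(notes_for_source(conn, "inpatient"))
```

With the foreign key enforced, a metadata row that points at an unknown source is rejected at insert time, which is the delineation by data source described above.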
Will this table have a unique identifier field (i.e. |
@clairblacketer The Metadata table is present in the DDL files and on the GitHub wiki pages for v5.3 and v5.3.1, but not in the PDF files of these releases.
Metadata Proposal
Owner: @vojtechhuser
Discussion: forum post
Table CDM_SOURCE provides metadata.
Use case
CDM changes
The proposal adds a single table to the CDM specs. Phase 1 provides a mechanism for sites to capture metadata. Concept-level standardization is planned for phase 2.
new METADATA table
Tablename: METADATA
This table relies on concept_ids that exist for CDM tables. In Atlas, search for those using advanced search and selecting Metadata.
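A minimal sketch of what such a table could look like, created in an in-memory SQLite database from Python. The column names are an assumption taken from the METADATA table as published in the CDM v5.3 DDL, including the date and datetime columns that the modified proposal adds; the SQLite types are simplifications.

```python
import sqlite3

# Assumed column list (CDM v5.3 METADATA table); SQLite types are simplified.
METADATA_DDL = """
CREATE TABLE metadata (
    metadata_concept_id      INTEGER NOT NULL,
    metadata_type_concept_id INTEGER NOT NULL,
    name                     TEXT    NOT NULL,
    value_as_string          TEXT,
    value_as_concept_id      INTEGER,
    metadata_date            TEXT,  -- ISO 8601 date
    metadata_datetime        TEXT   -- ISO 8601 timestamp
)
"""


def create_metadata_table(connection):
    """Create the sketch METADATA table on the given connection."""
    connection.execute(METADATA_DDL)


conn = sqlite3.connect(":memory:")
create_metadata_table(conn)

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(metadata)")]
print(columns)
```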
Modified proposal
Example records:
The proposal encourages all CDM adopters to fully populate and utilize the existing CDM_SOURCE table.
END OF PROPOSAL
Text below only reflects some historical notes related to the proposal above.
Details 1
Proposing person: Patrick Ryan, Martijn Schuemie, Ajit Londhe, & Erica Voss
(may need to be updated)
Additionally, we would like the CDM_SOURCE table to store metadata about each of the domains. Our idea is to add a column to CDM_SOURCE for each domain in the CDM (e.g. CDM_SOURCE.VISIT_OCCURRENCE, CDM_SOURCE.PERSON, etc.). This would allow us to display information about a specific domain on an ACHILLES report. For example, the VISIT_OCCURRENCE logic in PREMIER is fairly complex, and displaying a description of that logic at the point where someone is reviewing the data in ACHILLES would be beneficial.
Here is an example of some text for JMDC:
Database as a whole
(already has a column) The JMDC database consists of data from 60 Society-Managed Health Insurances covering workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). Elderly people (particularly those aged 66 or older) are less representative of the national population. Among people younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as in the nation as a whole. JMDC data includes data on the membership status of insured people and claims data provided by insurers under contract. Claims data are derived from monthly claims issued by clinics, hospitals and community pharmacies.
Person
JMDC covers workers aged 18 to 65 and their dependents (children younger than 18 years old and elderly people older than 65 years old). Elderly people (particularly those aged 66 or older) are less representative of the national population. Among people younger than 66 years old, the proportion of children younger than 18 years old in JMDC is approximately the same as in the nation as a whole.
Only the year of birth is available, not the day or month.
Observation_period
The observation period is defined as the time of enrollment in the health insurance. If the member is a dependent, the enrollment depends on the enrollment of the main beneficiary.
Care_site
Care sites in JMDC are institutions where care is provided, typically a department in a hospital.
Details 2
debate about CDM_SOURCE table
CDM_SOURCE table
improve the guidance for this table
(superseded by inclusion of the below information in the METADATA table)
Advanced Data Quality checks (inside Achilles Heel) would take advantage of this information in this new column.
DATASET_TYPE_CONCEPT_ID
"Predominantly" means that at least 51% of significant records come from a given source.
Inpatient vs outpatient data can be determined from visit types and does not need to be classified above.
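The 51% rule can be sketched as a small check. This is an illustration under stated assumptions: `predominant_source` is a hypothetical helper, each record is represented only by a label for the source it came from, and deciding which records count as "significant" is left to the caller.

```python
from collections import Counter


def predominant_source(record_sources, threshold=0.51):
    """Return the source supplying at least `threshold` of records, else None.

    `record_sources` holds one source label per significant record.
    """
    if not record_sources:
        return None
    counts = Counter(record_sources)
    source, count = counts.most_common(1)[0]
    share = count / len(record_sources)
    return source if share >= threshold else None


# 6 of 10 records from claims: claims is predominant.
print(predominant_source(["claims"] * 6 + ["ehr"] * 4))
# An even split has no predominant source.
print(predominant_source(["claims"] * 5 + ["ehr"] * 5))
```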
Details 3
Proposing person: Ajit Londhe & Erica Voss
We would like to propose the following table to hold metadata:
Tablename: METADATA
Example records:
NOTES
original table was