Metadata #79

Closed
clairblacketer opened this issue Jul 10, 2017 · 18 comments

Comments

@clairblacketer
Contributor

clairblacketer commented Jul 10, 2017

Metadata Proposal

Owner: @vojtechhuser

Discussion: forum post

Table CDM_SOURCE provides metadata.

Use case

  • display metadata within Atlas-Achilles Web (when reviewing data characterization plots and tables)
  • allow organizations with multiple OMOP CDM datasets to have a mechanism to store dataset metadata (analysis of this use will provide input for phase 2 of metadata standardization)
  • only run certain data quality checks when they are appropriate to the dataset (e.g., general population dataset; this use case depends on proper concept level standardization)

CDM changes

The proposal adds a single table to the CDM specification. Phase 1 provides a mechanism for sites to capture metadata; concept-level standardization is planned for phase 2.

new METADATA table

Tablename: METADATA

This table relies on concept_ids that already exist for CDM tables. In Atlas, find them by using advanced search and selecting Metadata.

Column | Description | Data_type
--- | --- | ---
METADATA_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the information you wish to track (e.g. 8 for metadata about a Visit) | INT
METADATA_TYPE_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the type of information you wish to track (e.g. 1 for metadata about Domains such as a Visit) | INT
NAME | Name of the CONCEPT_ID stored in METADATA_CONCEPT_ID; if there is no applicable CONCEPT_ID, NAME can be used to represent the data stored (e.g. CDM_BUILDER VERSION) | VARCHAR(250)
VALUE | Store the metadata value you wish to capture | NVARCHAR

Modified proposal

Column | Description | Data_type | Required
--- | --- | --- | ---
METADATA_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the information you wish to track (e.g. 8 for metadata about a Visit) | INT |
METADATA_TYPE_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the type of information you wish to track (e.g. 1 for metadata about Domains such as a Visit) | INT |
NAME | Name of the CONCEPT_ID stored in METADATA_CONCEPT_ID; if there is no applicable CONCEPT_ID, NAME can be used to represent the data stored (e.g. CDM_BUILDER VERSION) | VARCHAR(250) |
VALUE_AS_STRING | Store the metadata value (string) | NVARCHAR |
VALUE_AS_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that reflects the metadata value | int | No
METADATA_DATETIME | The date and time associated with metadata | datetime | No
METADATA_DATE | date | date | No
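To make the modified proposal concrete, here is a minimal sketch of the table using Python's built-in sqlite3 module. It is only an illustration, not official DDL: the SQLite type names are stand-ins for the types listed above, and since the Required column is only filled in (as No) for the last three columns, no NOT NULL constraints are assumed.

```python
import sqlite3

# Hypothetical DDL for the modified METADATA proposal. SQLite type names are
# stand-ins for the proposal's integer, string, datetime and date types.
# No NOT NULL constraints are assumed, because the proposal only states
# "Required: No" for the last three columns and leaves the rest blank.
DDL = """
CREATE TABLE metadata (
    metadata_concept_id      INTEGER,
    metadata_type_concept_id INTEGER,
    name                     VARCHAR(250),
    value_as_string          TEXT,
    value_as_concept_id      INTEGER,
    metadata_datetime        TEXT,
    metadata_date            TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# One of the example records from this proposal: no applicable concept,
# so both concept ids are 0 and NAME carries the meaning.
conn.execute(
    "INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
    (0, 0, "CDM_BUILDER VERSION", "1.8.0.9", None, None, None),
)

row = conn.execute(
    "SELECT name, value_as_string FROM metadata WHERE name = ?",
    ("CDM_BUILDER VERSION",),
).fetchone()
print(row)
```

A production DDL would use each database platform's own type names and whatever constraints the working group settles on.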

Example records:

METADATA_CONCEPT_ID | METADATA_TYPE_CONCEPT_ID | NAME | VALUE
--- | --- | --- | ---
51 | 1 | PERSON | Person information is pulled from insurance enrollment data where the individual has both medical and prescription benefits. The month of birth is not provided; however, for enrollees who start their enrollment the year they are born, we extrapolate their month of birth from the month in which their enrollment starts. For the majority of patients only year of birth is available. Persons who change gender over their enrollments or change year of birth are excluded.
0 | 1 | OBSERVATION PERIOD | An observation period is a representation of when a patient was enrolled in a health insurance plan and had prescription benefits. Periods of continuous enrollment are consolidated by combining monthly records as long as the time between the end of one enrollment period and the start of the next is 32 days or less.
57 | 1 | CARE SITE | There is no clear care site information in this source, so no data will be captured within this table.
8 | 1 | VISIT | For the outpatient visits, all activity that is recorded on a single day for a person is considered to have occurred during one visit with the visit start and end date corresponding to this date.
55 | 1 | PROVIDER | Unique list of health care providers (physicians). Truven does provide some provider information; however, some of the providers listed by Truven may also be considered care sites or organizations. Since there is no clear way to distinguish among the items identified as providers by Truven, all of them will be added to this table, regardless of whether they are truly organizations or care sites.
0 | 1 | DEATH | Death in Truven can be captured at discharge from an inpatient visit or in some cases by diagnosis code. The death data in this source should not be considered complete; for example, if a patient left a hospital and later died at home, that would not be captured. Additionally, if a death was recorded but the patient continues to have service charges more than 30 days after the death date, we assume the death data was faulty.
19 | 1 | CONDITION | Condition records are primarily recorded as codified claims data (e.g. ICD9 or ICD10 records that are submitted in association with a service). Additional condition information comes from patients who also have Health Risk Assessment data from Truven.
13 | 1 | DRUG | Drug exposure records are primarily recorded as codified claims data (e.g. an NDC code or a procedure code that includes a drug). If the OMOP Vocabulary deems that a code from a vocabulary not traditionally drug-centric is in fact a drug exposure, the record will move to this table (e.g. CPT4- 90690- “Typhoid vaccine, live, oral” maps to a drug concept in the OMOP Vocabularies, so the CDM_BUILDER will move the record to the DRUG_EXPOSURE table instead of the procedure table). Additional drug exposure information comes from patients who also have Health Risk Assessment data from Truven.
10 | 1 | PROCEDURE | Procedure occurrence records are recorded as codified claims data (e.g. a CPT4 code or ICD9 procedure code). If the OMOP Vocabulary deems a procedure code to belong to another domain, the record moves to that domain's table (e.g. CPT4- 90690- “Typhoid vaccine, live, oral” maps to a drug concept in the OMOP Vocabularies, so the CDM_BUILDER will move the record to the DRUG_EXPOSURE table instead of the procedure table); however, primary procedure codes will always write a record to this table in order to maintain cost data.
21 | 1 | MEASUREMENT | Measurement data traditionally comes from lab data supplied by laboratory service vendors; however, data vendors such as Truven do not have 100% representation of lab results (e.g. they will only receive lab data from vendors they have contracted with, like Quest Diagnostics). If the OMOP Vocabulary deems that a code from a vocabulary not traditionally measurement-centric is in fact a measurement, the record will move to this table (e.g. ICD9- V85.22- “Body Mass Index 26.0-26.9, adult”, usually thought of as a diagnosis code, maps to a measurement concept in the OMOP Vocabularies, so the CDM_BUILDER will move the record to the MEASUREMENT table). Additional measurement information comes from patients who also have Health Risk Assessment data from Truven.
27 | 1 | OBSERVATION | Codified data or Health Risk Assessment data that is not a diagnosis, drug exposure, procedure, or measurement will become an observation.
0 | 0 | CDM_BUILDER VERSION | 1.8.0.9
0 | 0 | DATASET_TYPE | Clinical Trial Data
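As a rough illustration of how a tool could consume records like these, the sketch below looks a record up by NAME (useful when the concept ids are 0) and collects the domain-level descriptions for display. The `MetadataRecord` type and both helper functions are hypothetical names for this example, not part of any OHDSI tool, and treating type concept 1 as "metadata about Domains" simply follows the example rows above.

```python
from typing import Dict, List, NamedTuple, Optional


class MetadataRecord(NamedTuple):
    """One row of the example table (hypothetical in-memory representation)."""
    metadata_concept_id: int
    metadata_type_concept_id: int
    name: str
    value: str


# Three rows copied from the example records above.
RECORDS: List[MetadataRecord] = [
    MetadataRecord(
        8, 1, "VISIT",
        "For the outpatient visits, all activity that is recorded on a single "
        "day for a person is considered to have occurred during one visit with "
        "the visit start and end date corresponding to this date.",
    ),
    MetadataRecord(0, 0, "CDM_BUILDER VERSION", "1.8.0.9"),
    MetadataRecord(0, 0, "DATASET_TYPE", "Clinical Trial Data"),
]


def find_by_name(records: List[MetadataRecord], name: str) -> Optional[MetadataRecord]:
    """Return the first record whose NAME matches, ignoring case and padding."""
    wanted = name.strip().upper()
    for record in records:
        if record.name.upper() == wanted:
            return record
    return None


def domain_descriptions(records: List[MetadataRecord]) -> Dict[str, str]:
    """Map NAME to VALUE for records typed with concept 1 (domain metadata)."""
    return {r.name: r.value for r in records if r.metadata_type_concept_id == 1}


version = find_by_name(RECORDS, "cdm_builder version")
print(version.value if version else "unknown")
print(sorted(domain_descriptions(RECORDS)))
```

Matching on NAME is case-insensitive here only as a convenience; the proposal does not define how free-text names should be compared.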

The proposal encourages all CDM adopters to fully populate and utilize the existing CDM_SOURCE table.

END OF PROPOSAL




Text below only reflects some historical notes related to the proposal above.

Details 1

Proposing person: Patrick Ryan, Martijn Schuemie, Ajit Londhe, & Erica Voss

(may need to be updated)

Additionally, we would like the CDM_SOURCE table to store metadata about each of the domains. Our idea is to add one column per CDM domain to the CDM_SOURCE table (i.e. CDM_SOURCE.VISIT_OCCURRENCE, CDM_SOURCE.PERSON, etc.). This would allow us to display information about a specific domain on an ACHILLES report. For example, VISIT_OCCURRENCE logic in PREMIER is fairly complex, and displaying a description of that logic at the point where someone is reviewing the data in ACHILLES would be beneficial.

Here is an example of some text for JMDC:

Database as a whole

(already has a column) The JMDC database consists of data from 60 Society-Managed Health Insurance plans covering workers aged 18 to 65 and their dependents (children younger than 18 and elderly people older than 65). Older people (particularly those aged 66 or older) are less representative of the national population. Among people younger than 66, the proportion of children younger than 18 in JMDC is approximately the same as in the national population. JMDC data includes membership status of the insured people and claims data provided by insurers under contract. Claims data are derived from monthly claims issued by clinics, hospitals and community pharmacies.

Person

JMDC covers workers aged 18 to 65 and their dependents (children younger than 18 and elderly people older than 65). Older people (particularly those aged 66 or older) are less representative of the national population. Among people younger than 66, the proportion of children younger than 18 in JMDC is approximately the same as in the national population.
Only the year of birth is available, not the day or month.

Observation_period

The observation period is defined as the time of enrollment in the health insurance. If the member is a dependent, the enrollment depends on the enrollment of the main beneficiary.

Care_site

Care sites in JMDC are institutions where care is provided, typically a department in a hospital.


Details 2

debate about CDM_SOURCE table

CDM_SOURCE table

improve the guidance for this table

(superseded by inclusion of the below information in the METADATA table)

  • capture DATASET_TYPE_CONCEPT_ID. Definition: Reference to a concept_id in the OHDSI/OMOP Terminology (class = "Dataset Type") that indicates what type of data is in the dataset. Set to NULL if none of the concepts correctly characterizes the data. For large samples of a population specialized by insurance (e.g., US Medicaid), use general population concepts.
    • Values are: General population EHR data, General population claims data, General Population EHR + Claims Data, Clinical Trial Data

Advanced Data Quality checks (inside Achilles Heel) would take advantage of this information in this new column.

DATASET_TYPE_CONCEPT_ID

  • if you don't want to (or can't) declare the type of data, use concept 0 (*)
  • Clinical trial data (dataset type) (*)
  • Multiple sources (dataset type)
  • Registry data (dataset type)
  • Predominantly Electronic Health Record data (dataset type)
  • Predominantly Administrative/Claims data (dataset type)
  • Predominantly Health Information Exchange data (dataset type)
  • Data limited to a single medical specialty/clinical domain, not covering general population (dataset type) (*)

"Predominantly" means that at least 51% of significant records come from a given source.
Inpatient vs outpatient data can be determined from visit types and does not need to be classified above.
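The 51% rule can be sketched as a small helper. This is only an illustration: the function name and the shape of the input (a count of records per source) are assumptions, and which records count as "significant" is not defined in this thread, so the caller has to decide that before counting.

```python
from typing import Dict, Optional


def predominant_source(record_counts: Dict[str, int], threshold: float = 0.51) -> Optional[str]:
    """Return the source holding at least `threshold` of all records, else None.

    `record_counts` maps a source label (e.g. "EHR", "claims") to the number
    of significant records from that source. Hypothetical helper.
    """
    total = sum(record_counts.values())
    if total <= 0:
        return None
    source, count = max(record_counts.items(), key=lambda item: item[1])
    return source if count / total >= threshold else None


print(predominant_source({"EHR": 60, "claims": 40}))
print(predominant_source({"EHR": 50, "claims": 50}))
```

When no source reaches the threshold the helper returns None and leaves the choice of dataset type to the data custodian.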


Column | Description | Data type
--- | --- | ---
DATASET_TYPE_CONCEPT_ID | Type of dataset. Reference to OMOP Concept that provides dataset type classification. | integer

Details 3

Proposing person: Ajit Londhe, & Erica Voss

We would like to propose the following table to hold metadata:

Tablename: METADATA

Column | Description | Data_type
--- | --- | ---
METADATA_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the information you wish to track (e.g. 8 for metadata about a Visit) | INT
METADATA_TYPE_CONCEPT_ID | OMOP Vocabulary CONCEPT_ID that identifies the type of information you wish to track (e.g. 1 for metadata about Domains such as a Visit) | INT
NAME | Name of the CONCEPT_ID stored in METADATA_CONCEPT_ID; if there is no applicable CONCEPT_ID, NAME can be used to represent the data stored (e.g. CDM_BUILDER VERSION) | VARCHAR(250)
VALUE | Store the metadata value you wish to capture | NVARCHAR

Example records:

METADATA_CONCEPT_ID | METADATA_TYPE_CONCEPT_ID | NAME | VALUE
--- | --- | --- | ---
8 | 1 | VISIT | For the outpatient visits, all activity that is recorded on a single day for a person is considered to have occurred during one visit with the visit start and end date corresponding to this date.
0 | 0 | CDM_BUILDER VERSION | 1.8.0.9

NOTES
The original table was:

Column | Description | Data type
--- | --- | ---
DATASET_TYPE_CONCEPT_ID | Type of dataset. Reference to OMOP Concept that provides dataset type classification. | integer
PERSON | | text
OBSERVATION_PERIOD | | text
VISIT_OCCURRENCE | Description of the logic used to populate the table (column name indicates the table). | text
PROCEDURE_OCCURRENCE | Description of the logic used to populate the table (column name indicates the table). | text
CONDITION_OCCURRENCE | Description of the logic used to populate the table (column name indicates the table). | text
DRUG_EXPOSURE | Description of the logic used to populate the table (column name indicates the table). | text
MEASUREMENT | Description of the logic used to populate the table (column name indicates the table). | text

@vojtechhuser
Collaborator

forum thread where metadata was discussed.
http://forums.ohdsi.org/t/metadata-extension-to-cdm/1746

@vojtechhuser
Collaborator

vojtechhuser commented Aug 1, 2017

Summary of CDM WG discussion from Aug 1st:

  • vote next time
  • phase 1 (table), phase 2 (conventions, better semantic restriction)
  • metadata: what can not be derived from data (mostly textual data)
  • call for people to comment on the current proposal (here on github (preferred) or forum thread)
  • use case clarification (I can not edit the top post by clair)

@schuemie
Member

schuemie commented Aug 7, 2017

What's the difference between the proposal and modified proposal? Which one is under consideration?

What would be the use of the date and datetime fields? Wouldn't these always be the time of ETL?

@alondhe
Collaborator

alondhe commented Aug 7, 2017

Hi @schuemie,

The modified proposal adds the date and datetime fields, and is the one under consideration. While it's certainly likely metadata records would be captured at ETL time, further observations of the data set could happen during the lifecycle of that particular CDM, as it becomes more utilized through various studies by multiple researchers.

I feel it would be useful to track the date of the metadata record to help with traceability of the site's evolving understanding of the data set's nuances. I'm envisioning in AchillesWeb or eventually Atlas a set of metadata reports that would be able to expose this new information to all users of the CDM. Temporal attributes help in consuming this information more easily.

@ericaVoss

I feel that this approach is simple enough to allow us to get started with Metadata. There are more complex ways to think about Metadata, but from what I've seen they are more complicated than they are worth. Here at Janssen we have been using a similar table for a while now, and it has been helpful for tracking high-level information about CDMs (like in the examples above).

@alondhe
Collaborator

alondhe commented Sep 5, 2017

All,

In a poster proposal for the OHDSI Symposium, @vojtechhuser, @ericaVoss, and I wrote about the benefits of data quality annotations using this very metadata table proposal.

With source data rife with temporal trend shifts, contextual nuance, data collection idiosyncrasies, and vocabulary changes, not to mention the ETL design choices needed to wrangle the data, it can be easy to employ poor study design that overlooks these issues, because the knowledge about them is not stored in a central repository, but usually sitting in a few key researchers' brains. This is troublesome in a study utilizing one data set, but considerably more problematic when conducting a network study in which many data sets are leveraged, but none of the known nuances within each data set are acknowledged or adjusted for in some way.

A few case studies we discussed include (1) the loss of Social Security Death Master File access in a claims data set, which resulted in a significant drop-off in death records per patient in November 2011; and (2) the change in prevalence of the condition "malaise and fatigue" in that same claims data set due to the migration from ICD9CM to ICD10CM. Both are captured in the proposed metadata table below:

metadata concept id | metadata type concept id | name | value as string | value as concept id | metadata datetime | metadata date
--- | --- | --- | --- | --- | --- | ---
44819056 | 1 | Death | Loss of access to Social Security Administration Death Master File | NULL | 2011-11-01 | 2017-08-23
439926 | 19 | Malaise and fatigue | ICD9CM to ICD10CM migration | NULL | 2015-10-01 | 2017-08-23

The ability to have a central place to store human-authored observations like these can benefit researchers utilizing this data set, particularly if they are exposed via our main web applications:

image

image

Additionally, the user can be warned of potential "dangers" in attempting to utilize these problematic concepts in their cohort design:

image

image
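A rough sketch of how such a warning could be driven from the metadata table, assuming the modified proposal's columns and the two case-study rows from this comment. The `annotations_for` function is a hypothetical name, not an existing Atlas or Achilles API; it simply returns any annotations recorded against the concepts used in a cohort definition.

```python
import sqlite3
from typing import List, Sequence, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metadata (
        metadata_concept_id      INTEGER,
        metadata_type_concept_id INTEGER,
        name                     TEXT,
        value_as_string          TEXT,
        value_as_concept_id      INTEGER,
        metadata_datetime        TEXT,
        metadata_date            TEXT
    )
    """
)
# The two case-study annotations described above.
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (44819056, 1, "Death",
         "Loss of access to Social Security Administration Death Master File",
         None, "2011-11-01", "2017-08-23"),
        (439926, 19, "Malaise and fatigue",
         "ICD9CM to ICD10CM migration",
         None, "2015-10-01", "2017-08-23"),
    ],
)


def annotations_for(db: sqlite3.Connection, concept_ids: Sequence[int]) -> List[Tuple]:
    """Return (concept_id, name, note, datetime) rows for the given concepts."""
    if not concept_ids:
        return []
    placeholders = ",".join("?" for _ in concept_ids)
    sql = (
        "SELECT metadata_concept_id, name, value_as_string, metadata_datetime "
        "FROM metadata "
        f"WHERE metadata_concept_id IN ({placeholders}) "
        "ORDER BY metadata_datetime"
    )
    return db.execute(sql, list(concept_ids)).fetchall()


# Concepts used in a hypothetical cohort definition; 999 has no annotation.
for concept_id, name, note, when in annotations_for(conn, [439926, 999]):
    print(f"Warning: concept {concept_id} ({name}): {note} (effective {when})")
```

An application could run a lookup like this when a concept set is saved and surface the notes next to the affected concepts.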

Another application enhancement made possible is the ability for data custodians to catalog design choices in building the CDM version of the data set, such as in Achilles Heel:

image

We cannot begin to implement these standard practices of data set annotation without a metadata repository. The one being proposed here, while not as robust as something like ISO-11179, does satisfy the core requirements of a metadata repository as specified by the National Information Standards Organization (NISO):

  1. descriptive: human-written observation about something (source, domain, concept)
  2. administrative: technologies utilized (.Net CDM Builder, SqlRender, JCdmBuilder, etc); versioning of source, CDM, or OMOP Vocab
  3. structural: relationships of parts of resources (mapping considerations like source tables to domains)
  4. markup: other structural or semantic features within context (perhaps summary statistics on a concept that are significant, but not within the scope of Achilles)

I feel we should adopt this proposal, begin using it, and then share our experiences with its usage through a working group.

@gowthamrao
Member

Visit is one domain, but now has two tables - visit_detail and visit_occurrence.
METADATA_CONCEPT_ID only has VISIT.

Should METADATA_CONCEPT_ID represent the table or domain or both?

@cukarthik
Contributor

We have two OMOPs (one inpatient and one outpatient) that we want to merge. This might be outside of this proposal, but I wonder if it would be worth adding an optional column later, named row_id, that would hold row-level metadata. The main issue is that the metadata table would be very large. I know this is out of scope for this proposal, but I wanted to see if others have an opinion. Thanks.

@cgreich
Contributor

cgreich commented Sep 7, 2017

@gowthamrao:

We'd have to create concepts for all the CDM details: in addition to domains and vocabularies we'd need tables and fields.

@cgreich
Contributor

cgreich commented Sep 7, 2017

@cukarthik:

Don't we already have a row_id? Like in condition_occurrence_id?

@cukarthik
Contributor

@cgreich you're right. That would be the purpose of row_id. We'd have the domain, so we know it's for the condition_occurrence table and then row_id would represent the condition_occurrence_id.

@cgreich
Contributor

cgreich commented Sep 7, 2017

@cukarthik:

Oh. You want a shadow record for every other record to put some metadata on it? Really? To show provenance? Put it into the *_type_concept_id. Why do you care anyway? You have VISIT to declare whether it was outpatient or inpatient. What's the use case?

@cukarthik
Contributor

cukarthik commented Sep 7, 2017

@cgreich

Right. We are already using the type_concept_id. There are data points, such as labs, that don't have a visit, but are in both instances. This issue of provenance also comes up in the All of Us project, where we are merging essentially 20 OMOPs into one. One option is to append the source information to the source value field, but it gets pretty messy when you want to query for that information.

Looking at this more carefully, it seems like my suggestion might not fully work b/c I was thinking the metadata_concept_id was the domain_id.

@cgreich
Contributor

cgreich commented Sep 7, 2017

@cukarthik:

Ok. Why don't you come to the CDM WG and describe your problem and possible solutions. I wouldn't try to overload fields that are not supposed to carry that information.

@cukarthik
Contributor

cukarthik commented Sep 7, 2017

@cgreich I agree about overloading fields. I'll ask to bring this up at the next WG meeting. In the meantime, I'll make a forum post.

@alondhe
Collaborator

alondhe commented Sep 7, 2017

We did tease this up a bit in the forums, but it didn't seem like many folks were storing multiple CDMs in one database. I think this makes it clear that we should allow for delineation of metadata records by data source. Perhaps CDM_SOURCE should be the place to identify multiple sources within a CDM, and then METADATA would need a cdm_source_id foreign key field.

@mark-velez

Will this table have a unique identifier field (i.e. metadata_id) like most of the other CDM tables? We will want a way to unambiguously refer to specific records. Composite key may be an option, but aside from being unwieldy, it could prevent us from supporting multiple annotations on the same object.
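To illustrate the two suggestions above (a metadata_id surrogate key and a cdm_source_id foreign key), here is a minimal sqlite3 sketch. Neither column is part of the proposal as written, the cdm_source table is reduced to two columns for the example, and the source names and note texts are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE cdm_source (
        cdm_source_id   INTEGER PRIMARY KEY,  -- hypothetical key suggested in this thread
        cdm_source_name TEXT NOT NULL
    );
    CREATE TABLE metadata (
        metadata_id              INTEGER PRIMARY KEY,  -- hypothetical surrogate key
        cdm_source_id            INTEGER NOT NULL REFERENCES cdm_source (cdm_source_id),
        metadata_concept_id      INTEGER,
        metadata_type_concept_id INTEGER,
        name                     VARCHAR(250),
        value_as_string          TEXT
    );
    """
)

# Two sources stored side by side (made-up names).
conn.executemany(
    "INSERT INTO cdm_source VALUES (?, ?)",
    [(1, "Inpatient CDM"), (2, "Outpatient CDM")],
)

# Two annotations on the same object (VISIT, concept 8) for the same source.
# A composite key over source and concept would reject the second row;
# the surrogate metadata_id allows both and gives each a stable handle.
conn.executemany(
    "INSERT INTO metadata (cdm_source_id, metadata_concept_id, "
    "metadata_type_concept_id, name, value_as_string) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 8, 1, "VISIT", "First note about visit logic"),
        (1, 8, 1, "VISIT", "Second note about visit logic"),
        (2, 8, 1, "VISIT", "Note for the other source"),
    ],
)

ids = [
    row[0]
    for row in conn.execute(
        "SELECT metadata_id FROM metadata WHERE cdm_source_id = 1 ORDER BY metadata_id"
    )
]
print(ids)
```

With foreign keys enabled, inserting a metadata row that points at an unknown cdm_source_id fails with sqlite3.IntegrityError, which is the delineation by source that the earlier comment asks for.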

@clairblacketer clairblacketer added this to the CDM vTBD milestone Oct 11, 2017
@clairblacketer clairblacketer modified the milestones: CDM vTBD, CDM v5.3.0 Oct 25, 2017
clairblacketer added a commit that referenced this issue Nov 9, 2017
… is still being updated to reflect these changes.
@Andreyyiv

@clairblacketer The Metadata table is present in the DDL files and on the github wiki pages for v5.3 and v5.3.1, but not in the PDF files of these releases.
