NOTE NLP table #85
Comments
Too late now I guess, but me and some other folks have requested the following (but not in the right places apparently, just in some e-mail communication): Please split up
Others can go into a |
@schuemie Just to be sure I'm clear, would the new table look something like this: NOTE_NLP
|
@schuemie: Have you talked to Hua? Because he was running the entire subgroup that came up with the definition. |
I don't have data with free text, so I don't consider myself a knowledgeable voice, but perhaps someone can clarify a basic point for me: it appears that the NOTE_NLP table will contain a row for every token from every note. In that case, this table will be very long (probably billions of records). So, just from a performance perspective, this dataset will get really big if it has a lot of required fields. Some of these fields, like NLP_SYSTEM, NLP_DATE, and NLP_DATETIME, would be data that could be contained on the NOTE table, and therefore could potentially be considered redundant and could all be Required = No. NOTE_NLP_ID follows the CDM standard of having a unique record id, but I wonder if that's necessary in this case (is there an analytical use case the workgroup had in mind?).
On Fri, Jul 14, 2017 at 8:15 AM, clairblacketer wrote:
@schuemie Just to be sure I'm clear, would the new table look something like this:
NOTE_NLP
| Field | Required | Type | Description |
| --- | --- | --- | --- |
| note_nlp_id | Yes | Big Integer | A unique identifier for each term extracted from a note. |
| note_id | Yes | integer | A foreign key to the Note table note the term was extracted from. |
| section_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies representing the section of the extracted term. |
| snippet | No | varchar(250) | A small window of text surrounding the term. |
| offset | No | varchar(50) | Character offset of the extracted term in the input note. |
| lexical_variant | Yes | varchar(250) | Raw text extracted from the NLP tool. |
| note_nlp_concept_id | No | integer | A foreign key to the predefined Concept in the Standardized Vocabularies reflecting the normalized concept for the extracted term. Domain of the term is represented as part of the Concept table. |
| note_nlp_source_concept_id | No | integer | A foreign key to a Concept that refers to the code in the source vocabulary used by the NLP system. |
| nlp_system | No | varchar(250) | Name and version of the NLP system that extracted the term. Useful for data provenance. |
| nlp_date | Yes | date | The date of the note processing. Useful for data provenance. |
| nlp_date_time | No | datetime | The date and time of the note processing. Useful for data provenance. |
| term_exists | No | varchar(1) | A summary modifier that signifies presence or absence of the term for a given patient. Useful for quick querying. * |
| term_temporal | No | varchar(50) | An optional time modifier associated with the extracted term. (for now “past” or “present” only). Standardize it later. |
| *term_negated* | *No* | *varchar(50)* | |
| *term_subject* | *No* | *varchar(50)* | |
| *term_certainty* | *No* | *varchar(50)* | |
| *other_term_modifiers* | *No* | *varchar(2000)* | |
|
Hey, guys. Hold on here. The NLP workgroup, under Hua and with this task led by Noemie, spent a year reviewing research NLP systems and commercial NLP systems, and also reviewing phenotyping from groups like eMERGE to see how they used NLP, taking the union of all modifiers, looking at their values, and voted to move forward with the planned table. The strategy was that a couple of fields could actually be agreed upon, and the rest was chaos. So those two fields that were universal would be coded (there was disagreement on a third “value” column). The rest would go into term_modifiers. We would use it for one year or so, and then decide what could be reasonably pulled out.
Adding three new columns based on an email after all the work that was put into this doesn’t make sense. If we are adding those three, then I certainly want my value column back (I was the proponent on that one, and let this email be my request). And of course several datetime columns were suggested just this week that could also go in. I remember several other columns that people requested, too.
If we think the Note_NLP table should not go forward, then let’s make that decision at the CDM workgroup level, but it already voted yes on this. Then these proposals would go back to the NLP workgroup, led by NLP experts, and decide what to do with the columns.
More specifically, I don’t see substantial value to a term_certainty column that has no agreement on values. It avoids a parse, yes, but then there is nothing but misunderstanding. It would only be useful to the site that filled it. That’s the kind of thing that the NLP workgroup decided to put into term_modifiers.
George
|
@schuemie, @cgreich, @pbr6cornell, @hripcsa I did contact Hua, Noemie and Rimma about this issue with these additional columns and I will post their response once I hear back from them. |
As George mentioned, we have gone through extensive discussion with the modifier fields for the NLP table. We have voted and decided on the current version. So the plan is to go forward with the current table and we start implementing it. After we learn more, we can discuss and make changes in the future versions. thanks. |
So we will keep the table as proposed with the one modifier field. |
Sorry, didn't mean to upset the decision making process. I don't agree with the decision (I can make a good case at least for a bit-field term_negated), but I'm not a member of the NLP working group so I will respect its decision. |
The fact that we should go forward with what we have doesn't mean we should stop discussion. It will help us with the next iteration a year from now.
What is the semantics of term negated and how does it relate to presence and certainty? Presence is an aggregate measure that summarizes what most phenotype measures want to know. Is it actually present? Uncertainty feeds into presence. In many systems uncertainty effectively goes from -1 through 0 to +1. Usually using words instead of numbers, but going from definitely absent to uncertain to definitely present. Are you suggesting that that is better split between uncertainty that goes 0 to 1 and term negated that is effectively a minus sign? I assume we shouldn't have both options.
Real phenotypes I see use presence or its equivalent all the time (presence includes other stuff like not rule out, not conditional, not another subject, not future, etc). I cannot think of any phenotypes that look for negation in clinical notes other than to skip it. I.e., you don't usually call for negation, usually because negated in one note is not helpful, as it depends on all the other notes. That's why it seemed there was time to sort this out.
George
On Jul 15, 2017, at 7:10 AM, Martijn Schuemie wrote:
Sorry, didn't mean to upset the decision making process.
I don't agree with the decision (I can make a good case at least for a bit-field term_negated), but I'm not a member of the NLP working group so I will respect its decision.
|
The context in which I was thinking of using the note_nlp table and negation flag is specifically the construction of features which would subsequently be used in (for example) prediction models. It seems to me that although the most important features would be things that are present, there would be considerable information in things that a doctor (or whoever wrote the note) took the trouble to negate, so I would additionally create negated features. The semantics of negation being a negative statement about something, so "... observed no rash..." or "... rule out pneumonia ..." would be examples of things (rash and pneumonia, resp.) that are negated. Whether those features are informative I don't know, I would have to see if the prediction algorithm selects them into the model. But my hypothesis is that they might be. But with the current structure of note_nlp, where negation isn't standardized, I cannot create these features. |
(Sorry I haven’t figured out how to quote from within email as opposed to on the forum.)
First, I do agree that machine learning is a little different because it takes advantage of the information whatever it is, as opposed to hand-written rules where you want it to mean what it says. So fields that are less useful for hand-written rules might still be useful for machine learning. So the use case is good.
But second, I am glad you gave those examples. We would absolutely NOT want to call "rule out pneumonia" a negated feature. Quite the opposite, it means possible pneumonia, because someone suspects it. Some systems may call that low certainty, and others literally call it “rule out”, and others call it conditional or something (actually not conditional, but some other concept; conditional is “if the patient gets pneumonia, go to the ED"). That’s very different from "the patient has no pneumonia," which IS a negation. My point in bringing it up is that it is not simple. We want something like negation, but it is actually a lot more complicated than that. Unless you meant “RULED out pneumonia,” which could be negation.
Even “observed no rash” is not so simple. Sometimes when a clinician says something was NOT OBSERVED, they mean they did not look for it. That is, they are warning you that they have no information. As opposed to “the patient had no rash,” which would be negation. Depending on the context you might not want to mix those two in a negation flag. Again, it comes down to each system and each researcher interpreting “negation” differently, which undoes its usefulness until we can come to some agreement.
George
|
I think the issues you mention are generic to NLP: we have no NLP that can figure out the full semantics of natural language, and we are likely to get the 'present' label wrong many times for the same reasons. Anyone using NLP output has to consider its noisy nature. Despite my poor attempt at defining negation (I guess I meant 'ruled out'), it is a common concept in NLP, for example as implemented in NegEx. And although the boundaries of what negation means are perhaps vague, it suggests quite different semantics than non-negated things, and that distinction may be informative for example for a machine learning algorithm. |
Friends: Usually it helps in these debates when you do concrete use cases. Then it is much easier to vote on adding the feature or not. |
All modifiers (including negation) will be stored in the term-modifiers field. Therefore it is possible to conduct machine learning studies that use these modifiers. You just need an extra step to parse needed modifiers from the field. The reason for storing all modifiers in one field is that we are not sure how these modifiers will be used. We want to see more actual use cases before we decide on next version. For example, one way to use NLP outputs is to export concepts together with modifiers to corresponding tables (e.g., lab tests and value modifiers go to MEASURES) and label their source as text document.
Keep in mind that another issue of modifiers is that we do not have a set of modifiers and allowable value sets that we all agree on. As people use different NLP systems, we expect that different modifiers will be stored in the term-modifiers field at this time. We are working on a recommendation of common modifiers and their allowable values (probably leverage existing work such as those at Wendy Chapman's lab).
The current Note_NLP table is just a start. We expect more changes will be suggested based on use cases. But for now, let's fix this version so that we can conduct some studies using it and identify potential improvements. Thx
Hua
On Jul 16, 2017, at 13:06, Martijn Schuemie wrote:
I think the issues you mention are generic to NLP: we have no NLP that can figure out the full semantics of natural language, and we are likely to get the 'present' label wrong many times for the same reasons. Anyone using NLP output has to consider its noisy nature.
Despite my poor attempt at defining negation (I guess I meant 'ruled out'), it is a common concept in NLP, for example as implemented in NegEx <http://blulab.chpc.utah.edu/content/contextnegex>. And although the boundaries of what negation means are perhaps vague, it suggests quite different semantics than non-negated things, and that distinction may be informative for example for a machine learning algorithm.
|
@cgreich, the specific use case I have is this: We want to use NLP features in predictive models. More specifically, right now we want to fit propensity models in a Dutch GP EHR system. We have an algorithm for identifying negations, and I want to implement a covariate builder in the FeatureExtraction package that creates separate features for negated and non-negated terms, because I hypothesize there may be value in that (better predictions). I can then plug the covariate builder into CohortMethod. Right now we would have to come up with a string we would put in the term_modifiers field, and FeatureExtraction would have to look for that string when creating features. But since that string is not standardized, another site will probably use a different string, so we can't create a covariate builder that automatically runs everywhere. |
Two things.
1. Hua, what did we decide was the format for the term_modifier string? There is a modifier delimiter to separate modifiers (semicolon ; or something) and a delimiter after the modifier name (colon : or something). So “negation: negated; uncertainty: certain” would be the kind of syntax. Once you pick a modifier name like “negation” then there is no difference between the column and the term_modifier string. Both are equally specified. More problematically, neither has a semantics or a value syntax associated with it, so sharing is unlikely with or without the column. If you planned to define a syntax for the column, then we can also define it for the string.
2. You could also just use Term_exists as input to ML. It’s mostly negation. Rarely it could include a disease on a different person (family history), but the bulk will be negation of the patient’s conditions. You are using machine learning, so a precise definition is not needed. It will learn its value to the prediction.
George
On Jul 16, 2017, at 3:17 PM, Martijn Schuemie wrote:
@cgreich, the specific use case I have is this: We want to use NLP features in predictive models. More specifically, right now we want to fit propensity models in a Dutch GP EHR system. We have an algorithm for identifying negations <https://www.ncbi.nlm.nih.gov/pubmed/?term=schuemie+negation>, and I want to implement a covariate builder in the FeatureExtraction package <https://github.com/OHDSI/FeatureExtraction> that creates separate features for negated and non-negated terms, because I hypothesize there may be value in that (better predictions). I can then plug the covariate builder into CohortMethod <https://github.com/OHDSI/CohortMethod>.
Right now we would have to come up with a string we would put in the term_modifiers field, and FeatureExtraction would have to look for that string when creating features. But since that string is not standardized, another site will probably use a different string, so we can't create a covariate builder that automatically runs everywhere.
|
Yes, we suggested a format like “negation: negated; uncertainty: certain”. You can query negation from this concatenated field. I agree the more problematic issue is standardizing the modifiers and their values. As different sites may use different NLP systems, the modifier outputs could differ, which makes it challenging to run studies across sites. It will take us some time to get everyone to agree on a standard set of modifiers and their values. |
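For anyone wanting to try the "extra parse step" on the suggested format, here is a minimal Python sketch. It assumes a semicolon between modifiers and a colon after the modifier name, as in the example string in this thread; the function name `parse_term_modifiers` and the choice to lowercase modifier names are illustrative assumptions, not part of the CDM specification.

```python
def parse_term_modifiers(raw, pair_sep=";", key_sep=":"):
    """Split a term_modifiers string into a {name: value} dict.

    Assumed format (taken from the example in this thread):
        "negation: negated; uncertainty: certain"
    The delimiters are parameters because no standard has been fixed.
    """
    modifiers = {}
    for chunk in (raw or "").split(pair_sep):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip empty pieces, e.g. after a trailing delimiter
        name, sep, value = chunk.partition(key_sep)
        if not sep:
            # No name/value delimiter: keep the token as a name with an empty value.
            modifiers[chunk.lower()] = ""
        else:
            modifiers[name.strip().lower()] = value.strip()
    return modifiers


mods = parse_term_modifiers("negation: negated; uncertainty: certain")
is_negated = mods.get("negation") == "negated"
```

Note that this only recovers the strings a site chose to store; it does nothing about the open problem raised above, namely that another site may use a different modifier name or value for the same idea.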
Closing this issue as the NOTE_NLP table was added to CDM v5.2 as it appears at the top, though the discussion is still open |
for my reference - the document ontology referred to |
Addition of NOTE NLP table and new fields in NOTE table
Proposal
Relevant table: NOTE
NOTE table additions
New Fields
Field Changes
note_text type depends on RDBMS, not all the engines support CLOB, e.g. in MS SQL server this will be VARCHAR(MAX).
Outstanding issues
note_id - convert to BIGINT due to a large table size.
Changing identifier fields from INT to BIGINT should be a larger group discussion/decision, as it would significantly affect all the existing implementations. We should consider whether to change all the identifier fields or a subset. CONDITION_OCCURRENCE and PROCEDURE_OCCURRENCE should be even larger tables.
NOTE_NLP table
This table will encode all output of NLP on clinical notes. Each row represents a single extracted term from a note.
Term_exists
Term_exists is defined as a flag that indicates if the patient actually has or had the condition. Any of the following modifiers would make Term_exists false:
A complete lack of modifiers would make Term_exists true.
For the modifiers that are there, they would have to have these values:
Term_temporal
Term_temporal is to indicate if a condition is “present” or just in the “past”.
The following would be past:
Term_modifiers
Term_modifiers will concatenate all modifiers for different types of entities (conditions, drugs, labs etc) into one string. Lab values will be saved as one of the modifiers. A list of allowable modifiers (e.g., signature for medications) and their possible values will be standardized later.
Mapping of clinical documents to Clinical Document Ontology (CDO) and standard terminology
HL7/LOINC CDO is a standard for consistent naming of documents to support a range of use cases: retrieval, organization, display, and exchange. It guides the creation of LOINC codes for clinical notes. CDO annotates each document with 5 dimensions:
Each combination of these 5 dimensions should roll up to a unique LOINC code. For example, Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1) has the following dimensions:
Automation of mapping of clinical notes to a standard terminology based on the note title is possible when it is driven by ontology (aka CDO). Mapping to individual LOINC codes which may or may not exist for a particular note type cannot be fully automated. To support mapping of clinical notes to CDO in OMOP CDM, we propose the following approach:
1. Add all LOINC concepts representing 5 CDO dimensions to the Concept table. For example:
2. Represent CDO hierarchy in the Concept_Relationship table using the “Subsumes” – “Is a” relationship pair. For example:
3. Add LOINC document codes to the Concept table (e.g. Dentistry Hygienist Outpatient Progress note, LOINC code 34127-1). For example:
4. Represent dimensions of each document concept in Concept_Relationship table by its relationships to the respective concepts from CDO. Use the “Member Of” – “Has Member” (new) relationship pair. Using example from the Dentistry Hygienist Outpatient Progress note (LOINC code 34127-1):
Where concept codes represent the following concepts:
Most of the codes will not have all 5 dimensions. Therefore, they may be represented by 2-5 relationship pairs.
5. If LOINC does not have a code corresponding to a permutation of the 5 CDO dimensions encountered in the source, this code will be generated as an OMOP vocabulary code. Its relationships to the CDO dimensions will be represented exactly as those of existing LOINC concepts (as described above). If/when a proper LOINC code for this permutation is released, the old code should be deprecated. Transition between the old and new codes should be represented by “Concept replaces” – “Concept replaced by” pairs.
6. Mapping from the source data will be performed to the 2-5 CDO dimensions.
The query below finds the LOINC code for the Dentistry Hygienist Outpatient Progress note (see example above), which has all 5 dimensions:
SELECT concept_id_2 FROM Concept_Relationship WHERE relationship_id = 'Has Member' AND (concept_id_1 = 55443322132 OR concept_id_1 = 55443322175 OR concept_id_1 = 55443322166 OR concept_id_1 = 55443322107 OR concept_id_1 = 55443322146) GROUP BY concept_id_2
If fewer than 5 dimensions are available, a HAVING COUNT(*) = n clause should be added to get a unique record at the intersection of these dimensions, where n is the number of dimensions available:
SELECT concept_id_2 FROM Concept_Relationship WHERE relationship_id = 'Has Member' AND (concept_id_1 = 55443322132 OR concept_id_1 = 55443322175 OR concept_id_1 = 55443322146) GROUP BY concept_id_2 HAVING COUNT(*) = 3
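The intersection logic can be tried out on toy data. The Python sketch below uses an in-memory SQLite database with a simplified three-column concept_relationship table. The dimension concept IDs are the ones from the queries above; the two document concept IDs (`DOC_FULL`, `DOC_OTHER`), the helper `find_documents`, and the simplified schema are hypothetical and only for illustration. Unlike the first query above, the sketch always applies HAVING COUNT(*) = n, and it assumes each dimension/document pair appears at most once.

```python
import sqlite3

# Dimension concept IDs as used in the queries above.
DIMENSIONS = [55443322132, 55443322175, 55443322166, 55443322107, 55443322146]
DOC_FULL = 9000001   # hypothetical document concept linked to all 5 dimensions
DOC_OTHER = 9000002  # hypothetical document concept linked to only 2 of them


def find_documents(conn, dimension_ids):
    """Return document concepts related to every one of the given dimensions."""
    placeholders = ",".join("?" for _ in dimension_ids)
    sql = (
        "SELECT concept_id_2 FROM concept_relationship "
        "WHERE relationship_id = 'Has Member' "
        f"AND concept_id_1 IN ({placeholders}) "
        "GROUP BY concept_id_2 HAVING COUNT(*) = ?"
    )
    params = [*dimension_ids, len(dimension_ids)]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept_relationship ("
    "concept_id_1 INTEGER, concept_id_2 INTEGER, relationship_id TEXT)"
)
rows = [(d, DOC_FULL, "Has Member") for d in DIMENSIONS]
rows += [(d, DOC_OTHER, "Has Member") for d in DIMENSIONS[:2]]
conn.executemany("INSERT INTO concept_relationship VALUES (?, ?, ?)", rows)

all_five = find_documents(conn, DIMENSIONS)
three_dims = find_documents(conn, [DIMENSIONS[0], DIMENSIONS[1], DIMENSIONS[4]])
```

One thing the toy data makes visible: with only three dimensions supplied, the query still returns the document that has all five, because the HAVING clause checks that every supplied dimension is present, not that no others are. A unique result therefore depends on the vocabulary containing a concept for exactly that combination.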
To identify appropriate dimension while mapping source documents, use the following concept classes:
The proposed approach will ensure that any combination of the 5 CDO dimensions encountered in the source data has a corresponding concept in the vocabulary. It will also support consistent approach to the OMOP CDM/Vocabulary conventions:
A similar mapping approach can be applied to labs.
Use Cases
Example 1 - Left ventricular ejection fraction
Left ventricular ejection fraction is an important indicator of heart health. It is measured during echocardiogram procedures but also during a range of various procedures. The value is frequently reported in clinical reports and has to be extracted using natural language processing.
Example 2 - eMERGE Phenotypes
Existence of specific report or specific note section
Term/Concept mentioning in notes or specific sections
Related terms mentioning in the same line or adjacent lines
Numeric values with/without temporal constraints