Combine Atlas Cohorts to a new cohort #398

gowthamrao · 2017-04-30T20:19:08Z

It would great to have a feature where cohorts maybe combined to create a new cohort.

Offer the following functionality

Set-Union - present in either
Set-Intersect - present in both
Set-difference - present in first, not present in second

Approach

Union - append both cohorts and de-duplicate
Intersect - append both cohorts only when subject_id is present in both cohorts, and restrict to dates that are present in both cohorts
Set-difference - append both cohorts, but only when subject_id is present in both cohort, delete the dates from first cohort that are present in second cohort

gowthamrao · 2017-09-05T21:06:09Z

@chrisknoll For combining cohorts in Atlas, can we do something like this in new Circe-be?

Take a fully qualified cohort and it becomes an input for a new cohort. i.e. we add a new criteriaQueries that references another cohort-definition or cohort itself.

criteriaQueries have the following fields: person_id, event_id, start_date, end_date, target_concept_id
cohort have the following fields: cohort_definition_id, subject_id, cohort_start_date, cohort_end_date

select subject_id as person_id
                            , row_number() over(order by subject_id, cohort_start_date, cohort_end_date) as event_id
                            , cohort_start_date as start_date
                            , cohort_end_date as end_date
                            , 0 as target_concept_id
from cohort

Can we use another cohort-definition or cohort as a criteria for initial event cohort?
What is the purpose of target_concept_id in criteriaQueries?

chrisknoll · 2017-09-06T14:49:57Z

Hi, @gowthamrao,
The challenge, in my mind, is that the cohort definition expressions were designed to be completely self contained. So, creating a cohort definition that references another cohort definition had the problem of identity (what is the ID of this other cohort definition, and will it be the same id when the cohort def is imported into another environment?) and circular refrences (Cohort A refs Cohort B refs Cohort C which Refs Cohort A).

But with the new CIRCE repo and the refactoring, we have the pieces of a cohort expression that we could arrange into a new form to support the functionality you are proposing....something like a 'compound cohort expression' which contains 1..n cohort expressions and a datastructure that describes how the cohorts should interact (merge, intersect, excude, etc). The SQL generation in the 'compound' case would be to materialize each individual cohort into a temp table, and then do the final merge/intersect to produce a final cohort result. Just brainstorming here, I don't have any actual plans for that sort of implementation, but it would be an interesting community contribution...

chrisknoll · 2017-09-06T14:52:52Z

What is the purpose of target_concept_id in criteriaQueries?

Part of the cohort criteria is the correlated query where you can say 'at least N occurrences of...'. You can also say 'at least N distinct occurrences of...'. This is where the 'target_concept_id' comes into play. For conditions, it would use the condition_concept_id field...for drug_exposure, it will be drug_concept_id, etc. It's simply used to indicate how to count distinct correlated events.

gowthamrao · 2017-09-08T11:02:09Z

will it be the same id when the cohort def is imported into another environment?

I see what you are saying. If we reference a cohort_definition_id, since cohort_definition_id is specific to a local dbms environment (it depends on the sequence function of dbms) - the cohort-constructor expression is no longer 'encapsulated'. i.e. if we migrate from one environment to another by copying the cohort constructor JSON specifications, the inclusion of a local cohort_definition_id in the cohort-constructor JSON creates an identity problem as the new dbms may assign a different cohort_definiton_id for the same cohort being referenced or may not have the referenced cohort.

We handle similar situations in two others places in Atlas (that I can think of):

concept-sets: We import pre-defined concept-sets during cohort-constructor and these concept-sets have a local-id. But, since we import the 'expression' of the concept-sets rather than the definition_id, we retain the encapsulation. The risk here is that if we change the concept-set expression in source -- it does not propagate into the cohort-constructor. Maintenance of the imported cohort-set expression is done within the cohort-constructor UI and is now separate and independent from the original local-source concept-set expression. Because we imported and maintain the concept-set expression in side the cohort-constructor, the cohort-constructor JSON portable and encapsulated.
incidence-rate. Similarly, I looked at the incidence rate -- and it does not look like incidence rate is designed to be portable. It seems to point to actual cohort_definition_id (not expression). Still we allow for JSON specification that allows for exporting the construction of incidence rate. Between environments there may be conflicts in cohort_definition_ids but all others components of the constructor is portable - the end user just has to confirm the cohort_definition_id's are accurate for the local environment. Similar to incidence-rate -- we have PLP package, FeatureExtraction package, Population level estimation -- all referencing local cohort_definition_id.

So, we have two options --
Option 1: use the 'encapsulated' approach where we export entire cohort-expression as 'nested compound cohort expressions' within another cohort-expressions, similar to concept-set example. This makes maintaining cohort definition very complex -- how do we modify the imported nested cohort-expressions. How would the UX look like - it will cause confusion. How do we build a UI to modify the cohort-constructor that was imported - similar to the concept-set expression. It seems very difficult and probably not worth it.
Option 2: use the non-encapsulated approach with the identity problem, with the solution relying on the end-user to confirm and maintain the integrity and accuracy of referenced local cohort_definition_id. I think this is the way to go, we already precedence for this -- with incidence rate, PLP, FeatureExtraction etc. The major advantage here is -- we can reference cohort that was built outside Atlas - e.g. if a cohort was built thru manual SQL code and inserted into the cohort-table. We could also allow for a) 'file import' that allows end-user to import a cohort from a csv (or other file) that has a certain structure, b) allow for creating a cohort by 'deleting' records of subject_ids that we want excluded, c) have an interface to write manual SQL code that is a cohort-constructor by itself but allow Atlas to assign cohort-definition-id using sequence. etc. This approach addresses issues like OHDSI/circe-be#4 where standardized method is just not possible. This approach would be different from #398 (comment) and will only address combining cohorts as described #398 (comment)

This makes the solution simple -- UI in Altas could be a new section (we need a new name for it other than cohorts - maybe new -> 'local cohorts' vs current -> 'global cohorts'?) -- and we could clone the reporting and explore portion of the cohorts-section. The new 'local' cohort will also be available for population-level estimation, plp, FeatureExtraction, incidence rate etc.

Thoughts?

chrisknoll · 2017-09-08T14:05:39Z

The risk here is that if we change the concept-set expression in source -- it does not propagate into the cohort-constructor. Maintenance of the imported cohort-set expression is done within the cohort-constructor UI and is now separate and independent from the original local-source concept-set expression. Because we imported and maintain the concept-set expression in side the cohort-constructor, the cohort-constructor JSON portable and encapsulated.

That's not a risk, that's by design. If you are working with a cohort definition and producing evidence for patient care, you do not want someone to change one of the dependent concept sets externally from the cohort definition and alter your study design. When putting together the design for the cohort builder, we had to balance convenience with 'gotchas', and the added complexity of maintaining the cohorts independently of the concept sets that might have been imported was the cost of keeping the cohort definition consistent.

I think this is the way to go, we already precedence for this -- with incidence rate, PLP, FeatureExtraction etc.

What you're picking up on was where we drew the line at encapsulation vs. convenience. We recognized that making a copy of the cohort expression inside the FE/IR/PLP analysis would add complexity that we didn't want to manage. So the code that generates the analysis SQL just takes a set of cohort definition IDs that should be used in the analysis, we don't care where they came from. This also de-couples the origin of the cohort records: they could have been generated by anything (CIRCE just being one way to generate cohorts).

A note on the Atlas UX:
The Atlas UI is a 'raw' way of modifying the cohort expression, in that it's just simply modifying the raw object model that goes into a CIRCE cohort expression. The UX could have been made to instead work with pre-defined expression fragments that get assembled into a concrete CIRCE expression at generation time. Or we could have loaded concept sets from a repository at generation time. Both of these cases leads to the problem of someone changing the elements outside of the cohort definition workflow and so we opted to go with a copy. But now that we have CIRCE-BE as a standalone library, anyone can make their own UI that constructs the cohort expression elements differently, and when they want to generate it, just use CIRCE BE to take the 'raw' expression form to give you your cohort records. Atlas could be extended such that when you create a cohort definition, you can choose the type of cohort expression you want to use in the definition: raw (simple), composite, compound, templated, etc.

So this is a lot of text, but just wanted to give you the background context, without pushing you one way or another. I think if you have an idea for enhancing the Atlas UI by introducing an alternative form of defining cohorts, if you have the time then go for it! I can tell you about the decisions that were made that might help avoid some of the pitfalls. I would also say that I'd like to avoid rewriting CIRCE when I think it's possible to just use CIRCE as building blocks to construct a more complex feature. If some additional refactoring to CIRCE is required to make some of those blocks more accessible, that's fine, but I'm talking about more invasive changes like the expression now has to be able to self-identify (expressions don't have any information about their cohort ID, they just know how to query the records from a CDM). That's why I think building a new construct for 'compound cohort expression' that leverages the parts of CIRCE that query the DB would work. It doesn't solve any of the issues of maintainability that you raise above, that would involve just rethinking the 'raw' vs. 'templated' approach for the UX.

gowthamrao · 2018-10-19T13:32:32Z

What you're picking up on was where we drew the line at encapsulation vs. convenience. We recognized that making a copy of the cohort expression inside the FE/IR/PLP analysis would add complexity that we didn't want to manage. So the code that generates the analysis SQL just takes a set of cohort definition IDs that should be used in the analysis, we don't care where they came from. This also de-couples the origin of the cohort records: they could have been generated by anything (CIRCE just being one way to generate cohorts).

@chrisknoll @anthonysena with the new hydra functionality - is the cohort definitions exported as JSON into R?

Circling back on this topic - i think it would be a cool functionality to create a new cohort based off of other cohorts in the same data source. In addition to creating such a cohort, making that cohort available for other analytic functions of Atlas like incident rate, plp, ple would be great too.

e.g. for large scale analytics, where we can build a library of 100s of condition cohorts, and 100s of procedure cohorts, and 100s of payer contract cohorts -- now we have the ability, using interset/union/difference, be able to analyze 100100100 cohorts without the need for building and maintaining 1,000,000 cohorts!

anthonysena · 2018-10-29T20:07:45Z

@gowthamrao - with the work done in #953, the cohort definitions that are part of the PLP/PLE specifications are exported as JSON and kept in the exported R Package.

anthonysena added cohort enhancement labels Jul 20, 2017

gowthamrao mentioned this issue Aug 5, 2017

WebAPI Release 2.2.0 Planning OHDSI/WebAPI#217

Closed

23 tasks

gowthamrao mentioned this issue Nov 8, 2018

Request - add capability to choose custom cohort as entry event #1092

Closed

anthonysena added this to the V3.0.0. "Atlas 3.0" milestone Jul 2, 2019

pbr6cornell assigned chrisknoll Jul 14, 2019

chrisknoll added this to Atlas/WebAPI Issue Triage Apr 18, 2023

github-project-automation bot moved this to Needs Review in Atlas/WebAPI Issue Triage Apr 18, 2023

rkboyce moved this from Needs Review to Under Review in Atlas/WebAPI Issue Triage Apr 18, 2023

rkboyce moved this from Under Review to Review Complete in Atlas/WebAPI Issue Triage Apr 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine Atlas Cohorts to a new cohort #398

Combine Atlas Cohorts to a new cohort #398

gowthamrao commented Apr 30, 2017

gowthamrao commented Sep 5, 2017

chrisknoll commented Sep 6, 2017

chrisknoll commented Sep 6, 2017

gowthamrao commented Sep 8, 2017 •

edited

Loading

chrisknoll commented Sep 8, 2017

gowthamrao commented Oct 19, 2018

anthonysena commented Oct 29, 2018

Combine Atlas Cohorts to a new cohort #398

Combine Atlas Cohorts to a new cohort #398

Comments

gowthamrao commented Apr 30, 2017

gowthamrao commented Sep 5, 2017

chrisknoll commented Sep 6, 2017

chrisknoll commented Sep 6, 2017

gowthamrao commented Sep 8, 2017 • edited Loading

chrisknoll commented Sep 8, 2017

gowthamrao commented Oct 19, 2018

anthonysena commented Oct 29, 2018

gowthamrao commented Sep 8, 2017 •

edited

Loading