Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine Atlas Cohorts to a new cohort #398

Open
gowthamrao opened this issue Apr 30, 2017 · 7 comments
Open

Combine Atlas Cohorts to a new cohort #398

gowthamrao opened this issue Apr 30, 2017 · 7 comments

Comments

@gowthamrao
Copy link
Member

It would great to have a feature where cohorts maybe combined to create a new cohort.

Offer the following functionality

  • Set-Union - present in either
  • Set-Intersect - present in both
  • Set-difference - present in first, not present in second

Approach

  • Union - append both cohorts and de-duplicate
  • Intersect - append both cohorts only when subject_id is present in both cohorts, and restrict to dates that are present in both cohorts
  • Set-difference - append both cohorts, but only when subject_id is present in both cohort, delete the dates from first cohort that are present in second cohort
@gowthamrao
Copy link
Member Author

@chrisknoll For combining cohorts in Atlas, can we do something like this in new Circe-be?

Take a fully qualified cohort and it becomes an input for a new cohort. i.e. we add a new criteriaQueries that references another cohort-definition or cohort itself.

criteriaQueries have the following fields: person_id, event_id, start_date, end_date, target_concept_id
cohort have the following fields: cohort_definition_id, subject_id, cohort_start_date, cohort_end_date

select subject_id as person_id
                            , row_number() over(order by subject_id, cohort_start_date, cohort_end_date) as event_id
                            , cohort_start_date as start_date
                            , cohort_end_date as end_date
                            , 0 as target_concept_id
from cohort

Can we use another cohort-definition or cohort as a criteria for initial event cohort?
What is the purpose of target_concept_id in criteriaQueries?

@chrisknoll
Copy link
Collaborator

Hi, @gowthamrao,
The challenge, in my mind, is that the cohort definition expressions were designed to be completely self contained. So, creating a cohort definition that references another cohort definition had the problem of identity (what is the ID of this other cohort definition, and will it be the same id when the cohort def is imported into another environment?) and circular refrences (Cohort A refs Cohort B refs Cohort C which Refs Cohort A).

But with the new CIRCE repo and the refactoring, we have the pieces of a cohort expression that we could arrange into a new form to support the functionality you are proposing....something like a 'compound cohort expression' which contains 1..n cohort expressions and a datastructure that describes how the cohorts should interact (merge, intersect, excude, etc). The SQL generation in the 'compound' case would be to materialize each individual cohort into a temp table, and then do the final merge/intersect to produce a final cohort result. Just brainstorming here, I don't have any actual plans for that sort of implementation, but it would be an interesting community contribution...

@chrisknoll
Copy link
Collaborator

What is the purpose of target_concept_id in criteriaQueries?

Part of the cohort criteria is the correlated query where you can say 'at least N occurrences of...'. You can also say 'at least N distinct occurrences of...'. This is where the 'target_concept_id' comes into play. For conditions, it would use the condition_concept_id field...for drug_exposure, it will be drug_concept_id, etc. It's simply used to indicate how to count distinct correlated events.

@gowthamrao
Copy link
Member Author

gowthamrao commented Sep 8, 2017

will it be the same id when the cohort def is imported into another environment?

I see what you are saying. If we reference a cohort_definition_id, since cohort_definition_id is specific to a local dbms environment (it depends on the sequence function of dbms) - the cohort-constructor expression is no longer 'encapsulated'. i.e. if we migrate from one environment to another by copying the cohort constructor JSON specifications, the inclusion of a local cohort_definition_id in the cohort-constructor JSON creates an identity problem as the new dbms may assign a different cohort_definiton_id for the same cohort being referenced or may not have the referenced cohort.

We handle similar situations in two others places in Atlas (that I can think of):

  • concept-sets: We import pre-defined concept-sets during cohort-constructor and these concept-sets have a local-id. But, since we import the 'expression' of the concept-sets rather than the definition_id, we retain the encapsulation. The risk here is that if we change the concept-set expression in source -- it does not propagate into the cohort-constructor. Maintenance of the imported cohort-set expression is done within the cohort-constructor UI and is now separate and independent from the original local-source concept-set expression. Because we imported and maintain the concept-set expression in side the cohort-constructor, the cohort-constructor JSON portable and encapsulated.
  • incidence-rate. Similarly, I looked at the incidence rate -- and it does not look like incidence rate is designed to be portable. It seems to point to actual cohort_definition_id (not expression). Still we allow for JSON specification that allows for exporting the construction of incidence rate. Between environments there may be conflicts in cohort_definition_ids but all others components of the constructor is portable - the end user just has to confirm the cohort_definition_id's are accurate for the local environment. Similar to incidence-rate -- we have PLP package, FeatureExtraction package, Population level estimation -- all referencing local cohort_definition_id.

So, we have two options --
Option 1: use the 'encapsulated' approach where we export entire cohort-expression as 'nested compound cohort expressions' within another cohort-expressions, similar to concept-set example. This makes maintaining cohort definition very complex -- how do we modify the imported nested cohort-expressions. How would the UX look like - it will cause confusion. How do we build a UI to modify the cohort-constructor that was imported - similar to the concept-set expression. It seems very difficult and probably not worth it.
Option 2: use the non-encapsulated approach with the identity problem, with the solution relying on the end-user to confirm and maintain the integrity and accuracy of referenced local cohort_definition_id. I think this is the way to go, we already precedence for this -- with incidence rate, PLP, FeatureExtraction etc. The major advantage here is -- we can reference cohort that was built outside Atlas - e.g. if a cohort was built thru manual SQL code and inserted into the cohort-table. We could also allow for a) 'file import' that allows end-user to import a cohort from a csv (or other file) that has a certain structure, b) allow for creating a cohort by 'deleting' records of subject_ids that we want excluded, c) have an interface to write manual SQL code that is a cohort-constructor by itself but allow Atlas to assign cohort-definition-id using sequence. etc. This approach addresses issues like OHDSI/circe-be#4 where standardized method is just not possible. This approach would be different from #398 (comment) and will only address combining cohorts as described #398 (comment)

This makes the solution simple -- UI in Altas could be a new section (we need a new name for it other than cohorts - maybe new -> 'local cohorts' vs current -> 'global cohorts'?) -- and we could clone the reporting and explore portion of the cohorts-section. The new 'local' cohort will also be available for population-level estimation, plp, FeatureExtraction, incidence rate etc.

Thoughts?

@chrisknoll
Copy link
Collaborator

The risk here is that if we change the concept-set expression in source -- it does not propagate into the cohort-constructor. Maintenance of the imported cohort-set expression is done within the cohort-constructor UI and is now separate and independent from the original local-source concept-set expression. Because we imported and maintain the concept-set expression in side the cohort-constructor, the cohort-constructor JSON portable and encapsulated.

That's not a risk, that's by design. If you are working with a cohort definition and producing evidence for patient care, you do not want someone to change one of the dependent concept sets externally from the cohort definition and alter your study design. When putting together the design for the cohort builder, we had to balance convenience with 'gotchas', and the added complexity of maintaining the cohorts independently of the concept sets that might have been imported was the cost of keeping the cohort definition consistent.

I think this is the way to go, we already precedence for this -- with incidence rate, PLP, FeatureExtraction etc.

What you're picking up on was where we drew the line at encapsulation vs. convenience. We recognized that making a copy of the cohort expression inside the FE/IR/PLP analysis would add complexity that we didn't want to manage. So the code that generates the analysis SQL just takes a set of cohort definition IDs that should be used in the analysis, we don't care where they came from. This also de-couples the origin of the cohort records: they could have been generated by anything (CIRCE just being one way to generate cohorts).

A note on the Atlas UX:
The Atlas UI is a 'raw' way of modifying the cohort expression, in that it's just simply modifying the raw object model that goes into a CIRCE cohort expression. The UX could have been made to instead work with pre-defined expression fragments that get assembled into a concrete CIRCE expression at generation time. Or we could have loaded concept sets from a repository at generation time. Both of these cases leads to the problem of someone changing the elements outside of the cohort definition workflow and so we opted to go with a copy. But now that we have CIRCE-BE as a standalone library, anyone can make their own UI that constructs the cohort expression elements differently, and when they want to generate it, just use CIRCE BE to take the 'raw' expression form to give you your cohort records. Atlas could be extended such that when you create a cohort definition, you can choose the type of cohort expression you want to use in the definition: raw (simple), composite, compound, templated, etc.

So this is a lot of text, but just wanted to give you the background context, without pushing you one way or another. I think if you have an idea for enhancing the Atlas UI by introducing an alternative form of defining cohorts, if you have the time then go for it! I can tell you about the decisions that were made that might help avoid some of the pitfalls. I would also say that I'd like to avoid rewriting CIRCE when I think it's possible to just use CIRCE as building blocks to construct a more complex feature. If some additional refactoring to CIRCE is required to make some of those blocks more accessible, that's fine, but I'm talking about more invasive changes like the expression now has to be able to self-identify (expressions don't have any information about their cohort ID, they just know how to query the records from a CDM). That's why I think building a new construct for 'compound cohort expression' that leverages the parts of CIRCE that query the DB would work. It doesn't solve any of the issues of maintainability that you raise above, that would involve just rethinking the 'raw' vs. 'templated' approach for the UX.

@gowthamrao
Copy link
Member Author

What you're picking up on was where we drew the line at encapsulation vs. convenience. We recognized that making a copy of the cohort expression inside the FE/IR/PLP analysis would add complexity that we didn't want to manage. So the code that generates the analysis SQL just takes a set of cohort definition IDs that should be used in the analysis, we don't care where they came from. This also de-couples the origin of the cohort records: they could have been generated by anything (CIRCE just being one way to generate cohorts).

@chrisknoll @anthonysena with the new hydra functionality - is the cohort definitions exported as JSON into R?

Circling back on this topic - i think it would be a cool functionality to create a new cohort based off of other cohorts in the same data source. In addition to creating such a cohort, making that cohort available for other analytic functions of Atlas like incident rate, plp, ple would be great too.

e.g. for large scale analytics, where we can build a library of 100s of condition cohorts, and 100s of procedure cohorts, and 100s of payer contract cohorts -- now we have the ability, using interset/union/difference, be able to analyze 100100100 cohorts without the need for building and maintaining 1,000,000 cohorts!

@anthonysena
Copy link
Collaborator

@gowthamrao - with the work done in #953, the cohort definitions that are part of the PLP/PLE specifications are exported as JSON and kept in the exported R Package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: Review Complete
Development

No branches or pull requests

3 participants