Cohort def optimizations #142
Added order by _date, _id to criteria queries to use the event ID as the tiebreaker. Fixes #139.
Optimized #primary_events and #qualified_events into a single query, eliminating one temp table creation. Partitioning on person_id for event_ids allows MPP architectures to leverage hashing on person_id. Fixes #141.
Please hold off on approving this; I might have found an unexpected issue.
Switched from qualified_events to included_events for end date selection.
OK, this is ready for review now. Thank you.
Nice job adding them everyone. :)
@pbr6cornell, @anthonysena, @fdefalco: this has been open for 10 days without a review. Please review and either approve or notify me of concerns.
Chris, I do not see anything concerning with these changes and am inclined to approve. That said, it would be useful to do some testing on a few cohorts in our test environment to be sure that we're not missing anything here. Your thoughts?
Yes, I ran internal tests using a version of Joel's cohort definition where the primary events were 'Any Visit' qualified with death events within 180 days after. You can see this definition here: From a review perspective, I think it's just a matter of reviewing the files, making a local build, and generating a few cohorts of your own to see if everything looks square. I've done this myself, but I'm looking for a PR reviewer.
OK - I'll pull this down and test it locally. Please let me know if there are any other cohorts that are good candidates for testing and I'll review those as well.
@anthonysena, thank you very much for taking this on. I'd say if you have any current cohort defs that you test with in hixbeta, you can copy them to a new definition (so as to keep the old results), then execute the copy under the new codebase and check that the numbers line up. That is what I did, but a second check would be good.
* add initial set of refactored class for summary service
* Added SQL for summary service functionality. See issue OHDSI/issues/108
* imported SQL file from Achilles Export2JSON, add'l refactoring of CDMResultsService impl
* added stub method for retrieval of drug and procedure reports to CDMResultsService
* Demographic Criteria Implementation.
* Removed BETWEEN usage in criteria. A BETWEEN X and Y becomes A >= X and A <= Y. BETWEEN is not consistent between all db platforms.
* Incidence Rate Analysis Implementation. Implementation of Incidence Rate web services, tasklets and repositories and IR Analysis Report Generation. Added new dependency: apache commons collections v4.1.
* Remove BETWEEN expression from numeric and date range criteria input types.
* Add null check for correlated criteria end date. Existing cohorts will have this null, so a null check is required.
* enhancements for person profile service and cohort comparison service
* sql syntax error
* Fix to allow more than 1 person per cohort.
* Flyway deployment scripts for postgreSQL and Oracle.
* adding outcome model retrieval and updates to comparative cohort analysis
* Negative controls implementation. modified: src/test/java/org/ohdsi/webapi/test/feasibility/StudyInfoTest.java
* ConceptSet optimization and comparison utilities
* ConceptSet utilties
* Moving GenerationStatus enum
* Delete concept sets per OHDSI/Atlas#87
* Fixing ir calc flyway script for postgreSql
* sql cleanup
* Flyway fixes for sql server
* Fixed oracle migration scripts.
* Added @generated to NegativeControlRecord. JPA will manage this entity's auto-number.
* Fix IR Strata TAR. IR Strata was not calculating TAR in years.
* resolve interface changes with r packages
* logging
* resolve missing placeholder
* made several sql translation changes
* moved mappers and model objects to report package in prep for refactoring.
* added DTO to report package and stubbed out service and runner classes.
* Use same method (getTreemap) for all (non-temporal) treemaps
* Add generic row mapper, text hack impls
* new drugService call: http://localhost:8080/WebAPI/OPTUM-PDW/drugexp/personId/33034143174. working as is, but planning to add some fields and allow drug_concept_id to be specified in addition to or instead of person_id
* Meant to push this to fork and then make pull request. Reverting. Revert "new drugService call: http://localhost:8080/WebAPI/OPTUM-PDW/drugexp/personId/33034143174. working as is, but planning to add some fields and allow drug_concept_id to be specified in addition to or instead of person_id" This reverts commit 19fee59.
* Reverting revert. Will push this to fork instead of ohdsi. Revert "Meant to push this to fork and then make pull request. Reverting." This reverts commit 14e1f79.
* made the same mistake twice. Revert "Reverting revert. Will push this to fork instead of ohdsi" This reverts commit 5b699ad.
* Fix OHDSI#137. Updated criteria group query to left join to @eventTable to capture events that matched 0 of the criteria in the group.
* Cohort def optimizations (OHDSI#142)
* Make first occurrence of event deterministic. Added order by _date, _id to criteria queries to use the event ID as the tiebreaker. Fixes OHDSI#139.
* Cohort Definition Query Optimization. Optimized #primary_events and #qualified_events into single query, eliminating one temp table creation. Partitioning on person_id for event_ids allowing MPP architectures to leverage hashing on person_id. Fixes OHDSI#141.
* Added proper partitioning on person and event for gain counts.
* Switched from qualified_events to included_events for end date selection.
* Fixes OHDSI#144. Removed ExceptionHandler from generate cohort step; it was masking the failed job completion status by returning 'COMPLETED' after failure.
* Use same method (getTreemap) for all (non-temporal) treemaps
* Add generic row mapper, text hack impls
* Reformat cdmresults sql scripts
* Use AS keyword in all cdm report query aliases
* Remove text hack and camel case column names in cdm report queries
* Replace all treemap endpoints with a single generic implementation
* Rename all treemap CDM report files to treemap.sql
* Enquote aliases in CDM report queries to preserve case
* Add GenericRowMapper ctor with ObjectMapper for performance
* CDMResultsAnalysisRunner.getTreemap returns ArrayNode
* Fix condition CDM results queries (by concept)
* include conceptId parameter
* rename sqlAgeAtFirstDiagnosis to sqlAgeAtFirstOccurrence
* modify sqlConditionsByType to check vocabulary_id, not domain_id
* cast xCalendarYear to INT in sqlPrevalenceByGenderAgeYear
* Fix queries associated with generic drilldown reports
* Data types include conditionera, drug, drugera, measurement, observation, procedure, visit
* Queries include AgeAtFirstOccurrence, PrevalenceByGenderAgeYear, ByType, ByMonth
* Include parameter @conceptId
* Cast xCalendarYear to INT (where applicable)
* Remove obsoleted code from CDMResultsService (activates observation reports)
* Standardize names of "by type" query files for cdm reports to sqlByType.sql. Applies to condition, drug, measurement, observation, procedure
* Add cdm_database_schema to CDMResultsAnalysisRunner standard columns (for measurement treemap)
* Fix typo in sqlVisitDurationByType.sql
* added fixes to row mapping in /person and dashboard
* implemented /achillesheel, /datadensity; started implementation of /death; refactored CohortAttributeMapper --> CDMAttributeMapper; created new report data objects; drop unused/superfluous CDMResultsAnalysisRunner
* Fixes for demographic criteria. (OHDSI#146)
* Added correct fields to select to match UNION of other criteria queries.
* Change IR Analysis strata queries to properly partition by person_id.
* Updated sql to match SqlRender. SqlRender does not support 'TOP' instruction, changed
* HTTP verb normalization. POST = create, PUT = update. Fixes OHDSI#147.
* fixed /datadensity
* IR Enhancments. Calculate distribution of Time At Risk and Time to Outcome. Provide new export to zip function to export definition and raw data from analysis results.
* Cohort Definition Correlated Criteria. Extends cohort definition criteria to allow specification of a criteria group whose members must result in a true result in order for the criteria to be satisfied.
* Allow censoring criteria in cohort definitions.
* Gracefully handle empty criteria groups in cohort definitions. Fixes OHDSI#155
* cleaned up and changes /achillesheel --> /heel
* Add cdm_database_schema to CDMResultsAnalysisRunner standard columns as per mark-velez/* #e862d53
* changed endpoint signature (/heel -->/achillesheel), fixed summary yob and yob in person report
* fixed null totalRecords element in Data Density report
* added summary to dashboard report in CDM results service
* implemented death report
This PR addresses two issues: #139 and #141.
The major enhancement is that cohort queries should run faster in an MPP environment, since we now partition on person_id for the generated event_ids. It should also show improvement on SMP platforms with clustered indexes on person_id.
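As a rough illustration of why partitioning on person_id helps on MPP platforms, the sketch below numbers each person's events once over the whole data set and once after splitting the rows across hypothetical nodes by person_id. The row layout, the function names, and the modulo "hash" are assumptions made for this example, not the actual WebAPI SQL or any platform's distribution scheme.

```python
from collections import defaultdict

# Hypothetical event rows: (person_id, event_id, start_date as ISO string).
rows = [
    (1, 11, "2015-03-01"),
    (1, 12, "2015-03-01"),
    (2, 20, "2014-07-09"),
    (3, 31, "2016-01-05"),
    (3, 30, "2016-01-04"),
    (4, 40, "2013-11-30"),
]

def number_events(event_rows):
    """Give each person's events a 1-based ordinal, ordered by (start_date, event_id)."""
    by_person = defaultdict(list)
    for person_id, event_id, start_date in event_rows:
        by_person[person_id].append((start_date, event_id))
    numbered = {}
    for person_id, events in by_person.items():
        for ordinal, (_, event_id) in enumerate(sorted(events), start=1):
            numbered[(person_id, event_id)] = ordinal
    return numbered

def number_events_distributed(event_rows, node_count):
    """Split rows across nodes by person_id, number locally, then merge the results."""
    nodes = defaultdict(list)
    for row in event_rows:
        # All rows for one person land on the same node (modulo stands in for a hash).
        nodes[row[0] % node_count].append(row)
    merged = {}
    for node_rows in nodes.values():
        merged.update(number_events(node_rows))
    return merged

single_node = number_events(rows)
three_nodes = number_events_distributed(rows, 3)
```

Because every person's rows stay on one node, each node can number its events without exchanging data with the others, and the merged result matches the single-node numbering.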
The fix for #139 was to include the _id in the ordering as a tiebreaker, which removes the non-deterministic behavior seen on MPP platforms when two events share the same date.
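The tiebreaker idea can be sketched in a few lines. This is a minimal illustration, assuming hypothetical rows of (person_id, event_id, start_date); it is not the query generated by WebAPI. Sorting on the date alone leaves same-day events in whatever order they arrived, while sorting on (date, id) always selects the same first event.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (person_id, event_id, start_date).
# Person 1 has two events on the same day, so the date alone cannot order them.
events = [
    (1, 12, date(2015, 3, 1)),
    (1, 11, date(2015, 3, 1)),
    (1, 10, date(2016, 1, 5)),
    (2, 20, date(2014, 7, 9)),
]

def first_events(rows):
    """Return each person's first event, ordering by start_date and then event_id."""
    ordered = sorted(rows, key=lambda r: (r[0], r[2], r[1]))
    return [next(group) for _, group in groupby(ordered, key=itemgetter(0))]

result = first_events(events)
# The outcome does not depend on the order in which the rows arrive.
same_result = first_events(list(reversed(events)))
```

With the event ID in the sort key, person 1's first event is always event 11, regardless of input order.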