-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(profiler): dynamically combine queries #3572
feat(profiler): dynamically combine queries #3572
Conversation
metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/ingestion/source/ge_data_profiler.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
metadata-ingestion/src/datahub/utilities/sqlalchemy_query_combiner.py
Outdated
Show resolved
Hide resolved
metadata-ingestion/src/datahub/utilities/sqlalchemy_query_combiner.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks Harshal!
The GE profiler issues a ton of queries that each return a single number or a single row. This PR adds support for dynamically combining these queries together at runtime, which reduces the total number of queries issued.
Performance
Using the same testing setup as #3369 and #3510.
Small table - 9 columns, 500 rows
Medium table - 16 columns, 60k rows
These changes really shine when
turn_off_expensive_profiling_metrics
is enabled, where it yields a 2-3x speedup.Checklist