-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Auto populate -pks when not provided by user #1324
Conversation
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few suggestions and a question
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
data_validation/cli_tools.py
Outdated
help=( | ||
"Comma separated list of primary key columns 'col_a,col_b'" "" | ||
if is_custom_query | ||
else " (defaults to table primary key if available)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we make --primary-keys an optional argument, not be required just for custom query. Generate table partitions can use custom queries, in which case primary keys are required, but if it uses tables, primary-keys may not be required (could trigger a different error in main.py line 321 above).
The better place to make this check is in cli_tools.py. Basically at that point you have received all the arguments, they are OK syntactically. There is something not quite right with the "semantics" - that is - if it is custom query row or generate-table-partitions with custom-query, primary keys are required - in all other cases, primary keys are not required. So that check can be made there.
_check_custom_query_args
does something very similar, and is currently only used by generate-table-partitions. In the future, it can be extended for all validations if we decide to allow the user to specify either -tbls
or --source-query
and infer that it is validate row
or validate custom-query row
based on this specification. Today, anything other than generate-table-validations will fall to this line because if it is validate row
and validate column
only --tables-list
is allowed and if it is validate custom query row
and validate custom query column
, source-query
will be an attribute even if it is None
, tables_list
will not be an attribute.
Also Helen the extra ""
at the end of the first line is not a typo - it is actually "" if is_custom_query else " (defaults to primary key(s) of the table if available)"
. The linter for some reason split it in the wrong place. In light of the earlier comment, can you say something like "Primary keys when not specified will be inferred from the source or target table if available" - it is not specifically for custom query only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good points, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall - looks good - I would like another quick look at the changes made before approving.
The documentation needs to be updated - there are 10 references to primary keys and some of them are now optional.
Thank you for the excellent work.
/gcbrun |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
primary_keys = config_manager.auto_list_primary_keys() | ||
if not primary_keys: | ||
raise ValueError( | ||
"No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried a custom query without any primary keys and got the following error ValueError: No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument
. Is this sufficient or should we say Custom query validations must provide primary keys. No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument
I am OK either way.
Sundar Mudupalli
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it as it is, it's not totally accurate but makes it clear you need to provide the primary keys. Thanks for the good review, both you and Helen provided valuable direction for me.
/gcbrun |
* feat: Auto populate -pks from Oracle constraint * feat: Auto populate -pks from Teradata constraint * feat: Auto populate -pks from PostgreSQL constraint * feat: Auto populate -pks from SQL Server, DB2 constraint * feat: Auto populate -pks no-op for BigQuery * feat: Auto populate -pks no-op for Snowflake and MySQL * tests: Fix DB2 test * Update third_party/ibis/ibis_biquery/api.py Co-authored-by: Helen Cristina <helencristina@google.com> * Update third_party/ibis/ibis_cloud_spanner/__init__.py Co-authored-by: Helen Cristina <helencristina@google.com> * Update third_party/ibis/ibis_redshift/__init__.py Co-authored-by: Helen Cristina <helencristina@google.com> * Update data_validation/cli_tools.py Co-authored-by: Helen Cristina <helencristina@google.com> * Update data_validation/config_manager.py Co-authored-by: Helen Cristina <helencristina@google.com> * chore: Refactor PostgreSQL catalog query * chore: PR comment changes * docs: Doc updates for --primay-keys changes --------- Co-authored-by: Helen Cristina <helencristina@google.com>
This PR merged in code and tests to auto populate the --primary-keys option when not provided by the user.
It add a new
list_primary_key_columns()
method to IbisBackend
class for each supported engine. For a few engines, e.g. BigQuery/Hive, the new method returns None. For most engines we run an appropriate dictionary query and retrieve the primary columns.In DVT we first run the
list_primary_key_columns()
method for the source client. If that cannot give us the columns then we run the method on the target client. If we cannot get a list of columns from either location then we throw an exception requesting the user provides the id columns in the usual way.