Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Auto populate -pks when not provided by user #1324

Merged
merged 22 commits into from
Nov 21, 2024

Conversation

nj1973
Copy link
Contributor

@nj1973 nj1973 commented Nov 11, 2024

This PR merged in code and tests to auto populate the --primary-keys option when not provided by the user.

It add a new list_primary_key_columns() method to Ibis Backend class for each supported engine. For a few engines, e.g. BigQuery/Hive, the new method returns None. For most engines we run an appropriate dictionary query and retrieve the primary columns.

In DVT we first run the list_primary_key_columns() method for the source client. If that cannot give us the columns then we run the method on the target client. If we cannot get a list of columns from either location then we throw an exception requesting the user provides the id columns in the usual way.

@nj1973 nj1973 requested a review from a team as a code owner November 11, 2024 12:28
@nj1973 nj1973 linked an issue Nov 11, 2024 that may be closed by this pull request
@nj1973 nj1973 marked this pull request as draft November 11, 2024 12:28
@nj1973 nj1973 marked this pull request as ready for review November 15, 2024 16:01
@helensilva14
Copy link
Collaborator

/gcbrun

Copy link
Collaborator

@helensilva14 helensilva14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few suggestions and a question

data_validation/cli_tools.py Outdated Show resolved Hide resolved
third_party/ibis/ibis_biquery/api.py Outdated Show resolved Hide resolved
third_party/ibis/ibis_cloud_spanner/__init__.py Outdated Show resolved Hide resolved
third_party/ibis/ibis_redshift/__init__.py Outdated Show resolved Hide resolved
third_party/ibis/ibis_postgres/client.py Outdated Show resolved Hide resolved
data_validation/config_manager.py Outdated Show resolved Hide resolved
data_validation/__main__.py Outdated Show resolved Hide resolved
nj1973 and others added 8 commits November 19, 2024 17:08
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
Co-authored-by: Helen Cristina <helencristina@google.com>
@nj1973
Copy link
Contributor Author

nj1973 commented Nov 19, 2024

/gcbrun

Copy link
Collaborator

@helensilva14 helensilva14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

help=(
"Comma separated list of primary key columns 'col_a,col_b'" ""
if is_custom_query
else " (defaults to table primary key if available)"
Copy link
Collaborator

@sundar-mudupalli-work sundar-mudupalli-work Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we make --primary-keys an optional argument, not be required just for custom query. Generate table partitions can use custom queries, in which case primary keys are required, but if it uses tables, primary-keys may not be required (could trigger a different error in main.py line 321 above).

The better place to make this check is in cli_tools.py. Basically at that point you have received all the arguments, they are OK syntactically. There is something not quite right with the "semantics" - that is - if it is custom query row or generate-table-partitions with custom-query, primary keys are required - in all other cases, primary keys are not required. So that check can be made there.

_check_custom_query_args does something very similar, and is currently only used by generate-table-partitions. In the future, it can be extended for all validations if we decide to allow the user to specify either -tbls or --source-query and infer that it is validate row or validate custom-query row based on this specification. Today, anything other than generate-table-validations will fall to this line because if it is validate row and validate column only --tables-list is allowed and if it is validate custom query row and validate custom query column, source-query will be an attribute even if it is None, tables_list will not be an attribute.

Also Helen the extra "" at the end of the first line is not a typo - it is actually "" if is_custom_query else " (defaults to primary key(s) of the table if available)". The linter for some reason split it in the wrong place. In light of the earlier comment, can you say something like "Primary keys when not specified will be inferred from the source or target table if available" - it is not specifically for custom query only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points, thanks.

Copy link
Collaborator

@sundar-mudupalli-work sundar-mudupalli-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall - looks good - I would like another quick look at the changes made before approving.

The documentation needs to be updated - there are 10 references to primary keys and some of them are now optional.

Thank you for the excellent work.

@nj1973
Copy link
Contributor Author

nj1973 commented Nov 20, 2024

/gcbrun

Copy link
Collaborator

@sundar-mudupalli-work sundar-mudupalli-work left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

primary_keys = config_manager.auto_list_primary_keys()
if not primary_keys:
raise ValueError(
"No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a custom query without any primary keys and got the following error ValueError: No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument. Is this sufficient or should we say Custom query validations must provide primary keys. No primary keys were provided and neither the source or target tables have primary keys. Please include --primary-keys argument

I am OK either way.

Sundar Mudupalli

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left it as it is, it's not totally accurate but makes it clear you need to provide the primary keys. Thanks for the good review, both you and Helen provided valuable direction for me.

@nj1973
Copy link
Contributor Author

nj1973 commented Nov 21, 2024

/gcbrun

@nj1973 nj1973 merged commit 6119c18 into develop Nov 21, 2024
5 checks passed
@nj1973 nj1973 deleted the 1253-row-validation-primary-keys-auto branch November 21, 2024 09:18
helensilva14 added a commit that referenced this pull request Jan 20, 2025
* feat: Auto populate -pks from Oracle constraint

* feat: Auto populate -pks from Teradata constraint

* feat: Auto populate -pks from PostgreSQL constraint

* feat: Auto populate -pks from SQL Server, DB2 constraint

* feat: Auto populate -pks no-op for BigQuery

* feat: Auto populate -pks no-op for Snowflake and MySQL

* tests: Fix DB2 test

* Update third_party/ibis/ibis_biquery/api.py

Co-authored-by: Helen Cristina <helencristina@google.com>

* Update third_party/ibis/ibis_cloud_spanner/__init__.py

Co-authored-by: Helen Cristina <helencristina@google.com>

* Update third_party/ibis/ibis_redshift/__init__.py

Co-authored-by: Helen Cristina <helencristina@google.com>

* Update data_validation/cli_tools.py

Co-authored-by: Helen Cristina <helencristina@google.com>

* Update data_validation/config_manager.py

Co-authored-by: Helen Cristina <helencristina@google.com>

* chore: Refactor PostgreSQL catalog query

* chore: PR comment changes

* docs: Doc updates for --primay-keys changes

---------

Co-authored-by: Helen Cristina <helencristina@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Row validation primary keys option could be automated
3 participants