Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Auto split row concat/hash validations when many columns #1233

Merged
merged 24 commits into from
Aug 21, 2024

Conversation

nj1973
Copy link
Contributor

@nj1973 nj1973 commented Aug 19, 2024

Closes #1216.

A few of our supported SQL engines have limits on the number of columns that can be concatenated into a string.

For example:

  • PostgreSQL: psycopg2.errors.TooManyArguments) cannot pass more than 100 arguments to a function
  • Oracle: ORA-01489: result of string concatenation is too long
  • SQL Server: The concat function requires 2 to 254 arguments. (189) (SQLExecDirectW)
  • Teradata: [Error 3556] Too many columns defined for this table

This PR introduces a new option for row validations (including custom-query) to limit the number of columns processed in a single query: --max-concat-columns. If the number of columns to be processed by a validation is greater than --max-concat-columns then the validation is implicitly split into multiple validations.

The --max-concat-columns option had an adaptive default. For engines with no known limit it is uncapped and prior behaviour is retained. For engines defined in the new constant consts.MAX_CONCAT_COLUMNS_DEFAULTS we impose a limit. Users can override this at runtime. If we find new limits for other engines then the user has the runtime option as a workaround and we can easily modify the constant.

When a validation is split into multiple validations due to the number of columns we output an INFO level log message to inform the user.

I've added a new integration test table pso_data_validator.dvt_many_cols and tests for both standard and custom-query validations. Note that I have disabled the Hive tests due to time they take and that they often cause our test infrastructure to become unstable.

Even without the Hive tests this PR adds 16 more integration tests, I wonder if we don't need custom-query tests for every engine. Just one or two of the ones with a lower limit perhaps?

@nj1973 nj1973 requested a review from a team as a code owner August 19, 2024 14:05
@nj1973 nj1973 linked an issue Aug 19, 2024 that may be closed by this pull request
@nj1973 nj1973 marked this pull request as draft August 19, 2024 14:05
@nj1973 nj1973 marked this pull request as ready for review August 20, 2024 15:33
@nj1973
Copy link
Contributor Author

nj1973 commented Aug 20, 2024

/gcbrun

Copy link
Collaborator

@helensilva14 helensilva14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just minor comments, super cool implementations Neil, good job!

data_validation/cli_tools.py Outdated Show resolved Hide resolved
return ibis_query.schema()
else:
# NJ: I'm not happy about calling a private method but don't see how I can avoid it.
return client._get_schema_using_query(query_str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can discuss more about that!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibis does not expose a public method like it does for get_schema(). I didn't feel it was right for me to add a public method to Ibis objects, I know we already patch the private method but adding new methods seemed a step too far. Happy to be influenced in another direction though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, since you added a comment about this current implementation I'd be okay to move forward as it is so we can unblock the customer. It seems like something we can analyze further later on.

tests/resources/sqlserver_test_tables.sql Show resolved Hide resolved
tests/system/data_sources/test_mysql.py Show resolved Hide resolved
tests/system/data_sources/test_hive.py Show resolved Hide resolved
@nj1973
Copy link
Contributor Author

nj1973 commented Aug 21, 2024

/gcbrun

@nj1973 nj1973 requested a review from helensilva14 August 21, 2024 13:38
Copy link
Collaborator

@helensilva14 helensilva14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

return ibis_query.schema()
else:
# NJ: I'm not happy about calling a private method but don't see how I can avoid it.
return client._get_schema_using_query(query_str)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, since you added a comment about this current implementation I'd be okay to move forward as it is so we can unblock the customer. It seems like something we can analyze further later on.

@nj1973 nj1973 merged commit ae9b72d into develop Aug 21, 2024
5 checks passed
@nj1973 nj1973 deleted the 1216-tables-with-many-columns branch August 21, 2024 15:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exception in PostgreSQL row validation for table with many columns
2 participants