feat: Auto split row concat/hash validations when many columns #1233
Conversation
/gcbrun
Just minor comments, super cool implementations Neil, good job!
    return ibis_query.schema()
else:
    # NJ: I'm not happy about calling a private method but don't see how I can avoid it.
    return client._get_schema_using_query(query_str)
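For context, the snippet above prefers the public Ibis `schema()` API and only falls back to the private client method for a raw SQL string. A minimal self-contained sketch of the pattern (the helper name and fake objects below are hypothetical, not DVT's actual code):

```python
def get_query_schema(client, ibis_query=None, query_str=None):
    """Hypothetical helper illustrating the fallback discussed above.

    Prefer the public Ibis expression API when an expression is available;
    otherwise fall back to the client's private method, since Ibis exposes
    no public equivalent of get_schema() for a raw SQL string.
    """
    if ibis_query is not None:
        return ibis_query.schema()
    # Not ideal: _get_schema_using_query() is private, but there is no
    # public Ibis API for deriving a schema from a query string.
    return client._get_schema_using_query(query_str)
```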
We can discuss more about that!
Ibis does not expose a public method for this the way it does for get_schema(). I didn't feel it was right for me to add a public method to Ibis objects; I know we already patch the private method, but adding new methods seemed a step too far. Happy to be influenced in another direction, though.
Well, since you added a comment about the current implementation, I'd be okay moving forward as it is so we can unblock the customer. It seems like something we can analyze further later on.
/gcbrun
LGTM!
Closes #1216.
A few of our supported SQL engines have limits on the number of columns that can be concatenated into a string. For example:

- `psycopg2.errors.TooManyArguments) cannot pass more than 100 arguments to a function`
- `ORA-01489: result of string concatenation is too long`
- `The concat function requires 2 to 254 arguments. (189) (SQLExecDirectW)`
- `[Error 3556] Too many columns defined for this table`
This PR introduces a new option for row validations (including custom-query) to limit the number of columns processed in a single query: `--max-concat-columns`. If the number of columns to be processed by a validation is greater than `--max-concat-columns`, then the validation is implicitly split into multiple validations.

The `--max-concat-columns` option has an adaptive default. For engines with no known limit it is uncapped and prior behaviour is retained. For engines defined in the new constant `consts.MAX_CONCAT_COLUMNS_DEFAULTS` we impose a limit. Users can override this at runtime, so if we find new limits for other engines the user has a workaround and we can easily modify the constant.

When a validation is split into multiple validations due to the number of columns, we output an INFO-level log message to inform the user.
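The splitting described above amounts to chunking the validation's column list so no single query exceeds the cap. A minimal sketch (hypothetical helper; the real DVT implementation may differ):

```python
def split_columns(columns, max_concat_columns=None):
    """Split a column list into chunks so that no single validation
    query concatenates more than max_concat_columns columns.

    A value of None means the engine has no known limit, so the full
    column list is validated in one query (prior behaviour).
    """
    if max_concat_columns is None or len(columns) <= max_concat_columns:
        return [columns]
    return [
        columns[i : i + max_concat_columns]
        for i in range(0, len(columns), max_concat_columns)
    ]
```

Each resulting chunk would then be run as its own concat/hash validation.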
I've added a new integration test table `pso_data_validator.dvt_many_cols` and tests for both standard and custom-query validations. Note that I have disabled the Hive tests due to the time they take and because they often cause our test infrastructure to become unstable.

Even without the Hive tests this PR adds 16 more integration tests, so I wonder whether we really need custom-query tests for every engine. Just one or two of the engines with a lower limit, perhaps?