-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(fuzzer): Add ability in expression fuzzer to run multiple batches #11903
feat(fuzzer): Add ability in expression fuzzer to run multiple batches #11903
Conversation
✅ Deploy Preview for meta-velox canceled.
|
This pull request was exported from Phabricator. Differential Revision: D67368974 |
facebookincubator#11903) Summary: This change adds the ability to run 2 input batches for each expression fuzzer iteration which will re-use the ExprSet to simulate its typical usage in actual use-cases like in the ProjectFilter Operator. The full execution loop of each iteration is modified to accommodate this change, including input generation and modification, result verification, re-running input using TRY, finding the minimal breaking expression tree, and the facility to serialize the input and repro using the ExpressionRunner utility. Side note: this exposed a bug in Simplified path where the inputs are not cleared if during eval of inputs an exception is thrown. The fix is also a part of this change. Differential Revision: D67368974
c0cee4e
to
c87a6ab
Compare
This pull request was exported from Phabricator. Differential Revision: D67368974 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bikramSingh91, thank you for adding this support to enable fuzzer coverage of dictionary memoization! I left a few comments.
std::vector<VectorPtr> children = inputVector->children(); | ||
auto firstEncodedChild = children[indices[0]]; | ||
VELOX_CHECK_EQ( | ||
firstEncodedChild->encoding(), VectorEncoding::Simple::DICTIONARY); | ||
auto commonDictionaryIndices = firstEncodedChild->wrapInfo(); | ||
auto commonNulls = firstEncodedChild->nulls(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the assumption is that all child vectors at indices
have the same top-level (and only top-level) indices and nulls, right? Could you add a comment telling this assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its mentioned in the method comment above at L140. Lemme know if it sounds unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think the comment at L140 can be revised a bit like this "Make all children of the input row vector at 'indices' wrapped in the same dictionary Buffers. These children are assumed to have already been wrapped in the same dictionary but through separate Buffers. Making them wrapped in the same Buffers is necessary to trigger peeling."
"Input vector is not a RowVector: {}", | ||
inputVector->toString()); | ||
VELOX_CHECK_GT(inputVector->size(), 0, "Input vector must not be empty."); | ||
if (inputSelectivityPaths.size() > i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have missed something, but why could we have unmatched number of selectivity paths and input paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are 2 cases I am trying to handle, one where there are no input selectivity paths, the other where all inputs have one. However since I am allowing the user to input a comma separated list of paths, this check allows me to handle the case where the user only adds selectivity vector paths for some inputs.
The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.
Could you share an example of the printout of the persisted repro paths and the cmd line arguments to ExpressionRunnerTest with multiple input paths and selectivity vector paths? I wonder whether we need to/should save all selectivity vector paths to a subdirectory when persisting the repro info, so that we only need to provide one path to the subdirectory here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kagamiori you can either specify the folder where all files where generated by the fuzzer like before:
--fuzzer_repro_path /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/ --mode verify
Or you can specify the input, etc explicitly like:
--input_paths /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_vector_0,/tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_vector_1 --input_selectivity_vector_paths /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_selectivity_vector_0,/tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_selectivity_vector_1 --sql_path /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/sql --mode verify
c87a6ab
to
d4b17a7
Compare
facebookincubator#11903) Summary: This change adds the ability to run 2 input batches for each expression fuzzer iteration which will re-use the ExprSet to simulate its typical usage in actual use-cases like in the ProjectFilter Operator. The full execution loop of each iteration is modified to accommodate this change, including input generation and modification, result verification, re-running input using TRY, finding the minimal breaking expression tree, and the facility to serialize the input and repro using the ExpressionRunner utility. Side note: this exposed a bug in Simplified path where the inputs are not cleared if during eval of inputs an exception is thrown. The fix is also a part of this change. Differential Revision: D67368974
This pull request was exported from Phabricator. Differential Revision: D67368974 |
d4b17a7
to
705f1e2
Compare
facebookincubator#11903) Summary: This change adds the ability to run 2 input batches for each expression fuzzer iteration which will re-use the ExprSet to simulate its typical usage in actual use-cases like in the ProjectFilter Operator. The full execution loop of each iteration is modified to accommodate this change, including input generation and modification, result verification, re-running input using TRY, finding the minimal breaking expression tree, and the facility to serialize the input and repro using the ExpressionRunner utility. Side note: this exposed a bug in both Simplified and common paths where the inputs are not cleared if during eval of inputs an exception is thrown. The fix is also a part of this change. Differential Revision: D67368974
This pull request was exported from Phabricator. Differential Revision: D67368974 |
std::vector<VectorPtr> children = inputVector->children(); | ||
auto firstEncodedChild = children[indices[0]]; | ||
VELOX_CHECK_EQ( | ||
firstEncodedChild->encoding(), VectorEncoding::Simple::DICTIONARY); | ||
auto commonDictionaryIndices = firstEncodedChild->wrapInfo(); | ||
auto commonNulls = firstEncodedChild->nulls(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I think the comment at L140 can be revised a bit like this "Make all children of the input row vector at 'indices' wrapped in the same dictionary Buffers. These children are assumed to have already been wrapped in the same dictionary but through separate Buffers. Making them wrapped in the same Buffers is necessary to trigger peeling."
"Input vector is not a RowVector: {}", | ||
inputVector->toString()); | ||
VELOX_CHECK_GT(inputVector->size(), 0, "Input vector must not be empty."); | ||
if (inputSelectivityPaths.size() > i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.
Could you share an example of the printout of the persisted repro paths and the cmd line arguments to ExpressionRunnerTest with multiple input paths and selectivity vector paths? I wonder whether we need to/should save all selectivity vector paths to a subdirectory when persisting the repro info, so that we only need to provide one path to the subdirectory here.
"table 't'. Note: if selectivity vectors are specified for those inputs " | ||
"using --input_selectivity_vector_paths, then those will not be used for " | ||
"the query."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe simply say "Note: --input_selectivity_vector_paths is ignored in this mode".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you for expanding the coverage of expression fuzzer! (Don't forget to fix the two comments and it's good to merge.)
facebookincubator#11903) Summary: This change adds the ability to run 2 input batches for each expression fuzzer iteration which will re-use the ExprSet to simulate its typical usage in actual use-cases like in the ProjectFilter Operator. The full execution loop of each iteration is modified to accommodate this change, including input generation and modification, result verification, re-running input using TRY, finding the minimal breaking expression tree, and the facility to serialize the input and repro using the ExpressionRunner utility. Side note: this exposed a bug in both Simplified and common paths where the inputs are not cleared if during eval of inputs an exception is thrown. The fix is also a part of this change. Reviewed By: kagamiori Differential Revision: D67368974
705f1e2
to
f67f6c5
Compare
This pull request was exported from Phabricator. Differential Revision: D67368974 |
Failures are unrelated: one in Spark Fuzzer is #11462 |
This pull request has been merged in 883b989. |
Summary:
This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.
Side note: this exposed a bug in Simplified path where the inputs are
not cleared if during eval of inputs an exception is thrown. The fix is
also a part of this change.
Differential Revision: D67368974