Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(fuzzer): Add ability in expression fuzzer to run multiple batches #11903

Closed

Conversation

bikramSingh91
Copy link
Contributor

Summary:
This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.

Side note: this exposed a bug in Simplified path where the inputs are
not cleared if during eval of inputs an exception is thrown. The fix is
also a part of this change.

Differential Revision: D67368974

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 18, 2024
Copy link

netlify bot commented Dec 18, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit f67f6c5
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/676ef1e10d5b8a0008792c0e

@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67368974

bikramSingh91 pushed a commit to bikramSingh91/velox that referenced this pull request Dec 18, 2024
facebookincubator#11903)

Summary:

This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.

Side note: this exposed a bug in Simplified path where the inputs are
not cleared if during eval of inputs an exception is thrown. The fix is
also a part of this change.

Differential Revision: D67368974
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67368974

Copy link
Contributor

@kagamiori kagamiori left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @bikramSingh91, thank you for adding this support to enable fuzzer coverage of dictionary memoization! I left a few comments.

velox/expression/fuzzer/ExpressionFuzzerVerifier.cpp Outdated Show resolved Hide resolved
Comment on lines +149 to +153
std::vector<VectorPtr> children = inputVector->children();
auto firstEncodedChild = children[indices[0]];
VELOX_CHECK_EQ(
firstEncodedChild->encoding(), VectorEncoding::Simple::DICTIONARY);
auto commonDictionaryIndices = firstEncodedChild->wrapInfo();
auto commonNulls = firstEncodedChild->nulls();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the assumption is that all child vectors at indices have the same top-level (and only top-level) indices and nulls, right? Could you add a comment telling this assumption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its mentioned in the method comment above at L140. Lemme know if it sounds unclear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think the comment at L140 can be revised a bit like this "Make all children of the input row vector at 'indices' wrapped in the same dictionary Buffers. These children are assumed to have already been wrapped in the same dictionary but through separate Buffers. Making them wrapped in the same Buffers is necessary to trigger peeling."

"Input vector is not a RowVector: {}",
inputVector->toString());
VELOX_CHECK_GT(inputVector->size(), 0, "Input vector must not be empty.");
if (inputSelectivityPaths.size() > i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed something, but why could we have unmatched number of selectivity paths and input paths?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are 2 cases I am trying to handle, one where there are no input selectivity paths, the other where all inputs have one. However since I am allowing the user to input a comma separated list of paths, this check allows me to handle the case where the user only adds selectivity vector paths for some inputs.

The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.

Could you share an example of the printout of the persisted repro paths and the cmd line arguments to ExpressionRunnerTest with multiple input paths and selectivity vector paths? I wonder whether we need to/should save all selectivity vector paths to a subdirectory when persisting the repro info, so that we only need to provide one path to the subdirectory here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kagamiori you can either specify the folder where all files where generated by the fuzzer like before:

--fuzzer_repro_path /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/ --mode verify 

Or you can specify the input, etc explicitly like:

--input_paths /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_vector_0,/tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_vector_1 --input_selectivity_vector_paths /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_selectivity_vector_0,/tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/input_selectivity_vector_1 --sql_path /tmp/fuzzer_repro/velox_expressionVerifier_Gu1Seu/sql --mode verify 

velox/expression/tests/ExpressionRunner.cpp Outdated Show resolved Hide resolved
velox/expression/tests/ExpressionVerifier.cpp Show resolved Hide resolved
bikramSingh91 pushed a commit to bikramSingh91/velox that referenced this pull request Dec 19, 2024
facebookincubator#11903)

Summary:

This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.

Side note: this exposed a bug in Simplified path where the inputs are
not cleared if during eval of inputs an exception is thrown. The fix is
also a part of this change.

Differential Revision: D67368974
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67368974

bikramSingh91 pushed a commit to bikramSingh91/velox that referenced this pull request Dec 20, 2024
facebookincubator#11903)

Summary:

This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.

Side note: this exposed a bug in both Simplified and common paths
where the inputs are not cleared if during eval of inputs an exception is
thrown. The fix is also a part of this change.

Differential Revision: D67368974
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67368974

Comment on lines +149 to +153
std::vector<VectorPtr> children = inputVector->children();
auto firstEncodedChild = children[indices[0]];
VELOX_CHECK_EQ(
firstEncodedChild->encoding(), VectorEncoding::Simple::DICTIONARY);
auto commonDictionaryIndices = firstEncodedChild->wrapInfo();
auto commonNulls = firstEncodedChild->nulls();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think the comment at L140 can be revised a bit like this "Make all children of the input row vector at 'indices' wrapped in the same dictionary Buffers. These children are assumed to have already been wrapped in the same dictionary but through separate Buffers. Making them wrapped in the same Buffers is necessary to trigger peeling."

"Input vector is not a RowVector: {}",
inputVector->toString());
VELOX_CHECK_GT(inputVector->size(), 0, "Input vector must not be empty.");
if (inputSelectivityPaths.size() > i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cmd line params don't offer the best user experience here, so I am definitely open to suggestions.

Could you share an example of the printout of the persisted repro paths and the cmd line arguments to ExpressionRunnerTest with multiple input paths and selectivity vector paths? I wonder whether we need to/should save all selectivity vector paths to a subdirectory when persisting the repro info, so that we only need to provide one path to the subdirectory here.

Comment on lines 95 to 97
"table 't'. Note: if selectivity vectors are specified for those inputs "
"using --input_selectivity_vector_paths, then those will not be used for "
"the query.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe simply say "Note: --input_selectivity_vector_paths is ignored in this mode".

Copy link
Contributor

@kagamiori kagamiori left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you for expanding the coverage of expression fuzzer! (Don't forget to fix the two comments and it's good to merge.)

facebookincubator#11903)

Summary:

This change adds the ability to run 2 input batches for each
expression fuzzer iteration which will re-use the ExprSet to simulate
its typical usage in actual use-cases like in the ProjectFilter
Operator. The full execution loop of each iteration is modified to
accommodate this change, including input generation and modification,
result verification, re-running input using TRY, finding the minimal
breaking expression tree, and the facility to serialize the input and
repro using the ExpressionRunner utility.

Side note: this exposed a bug in both Simplified and common paths
where the inputs are not cleared if during eval of inputs an exception is
thrown. The fix is also a part of this change.

Reviewed By: kagamiori

Differential Revision: D67368974
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D67368974

@bikramSingh91
Copy link
Contributor Author

Failures are unrelated:
one in Build with GCC / Linux release with adapters is #11857

one in Spark Fuzzer is #11462

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 883b989.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants