-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core-clp: Rewrite wildcard matching method and add systematic unit tests (fixes #427). #428
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed implementation. Overall I think it is now much easier to understand compared to the previous implementation.
// Handle `tame` or `wild` being empty | ||
if (wild.empty()) { | ||
return tame.empty(); | ||
} | ||
if (tame.empty()) { | ||
return "*" == wild; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not moving the empty check to the beginning of the function?
|
||
// Handle boundary conditions | ||
if (tame_end_it == tame_it) { | ||
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is repeated in the next for loop. How about we make a helper, sth like is_wild_reaching_end_or_trailing_star
? (maybe we can discuss to come up with a better name)
// Handle boundary conditions | ||
if (tame_end_it == tame_it) { | ||
return (wild_end_it == wild_it) || (wild_end_it == wild_it + 1 && '*' == *wild_it); | ||
} else if (wild_end_it == wild_it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think using a new if
instead of else if
would be more clear? Essentially these are two different cases to handle.
} else if (wild_end_it == wild_it) { | ||
if (tame_end_it == tame_it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if tame_end_it == tame_it
and wild_end_it == wild_it
, we should already return in the above if right?
// Reset to bookmarks | ||
tame_it = tame_bookmark_it + 1; | ||
wild_it = wild_bookmark_it; | ||
if (false | ||
== advance_tame_to_next_match(tame_end_it, tame_it, tame_bookmark_it, wild_it)) | ||
{ | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct me if I'm wrong: if this branch is triggered, wild has already reached the end without consuming the entire tame. We should be handling the last group of tame after the last *
. In this case, we only need to match the last n characters (determined by wild_it - wild_bookmark_it
, and properly counting escape chars in between) in tame right? For example, if tame is "aaaaaaaa" and wild is "*a", we don't have to advance tame to match every single "a" but jump to match the last one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Are you suggesting, in this case, that we should iterate backwards from the end of tame
to see if it matches the last group in wild
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, iterating backwards is non-trivial because of escaped characters. If we see a ?
in wild
, we have to check the character before it to know if it's an escape character. But even if it is, we don't know if it's escaping the ?
or it's preceded by an escape itself. So it's easier to always iterate forwards.
* @param tame_it Returns `tame`'s updated iterator. | ||
* @param tame_bookmark_it Returns `tame`'s updated bookmark. | ||
* @param wild_it Returns `wild`'s updated iterator. | ||
* @return true on success, false if `tame` cannot match `wild`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @return true on success, false if `tame` cannot match `wild`. | |
* @return Whether `tame` can successfully match `wild`, |
* | ||
* NOTE: | ||
* - This method expects that `tame_it` < `tame_end_it` | ||
* - This method should be inlined for performance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this is enforced. afaik we are just suggesting the compiler to inline the method. But I'm also not sure whether we should use always_intline
attribute since we may need to compile this file using none-gnu compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't enforced, but at the same time I don't know if forcing it to be inlined is necessary. The reason I said it should be inlined is because in past performance tests, the inline
hint did make a difference. Nowadays though, it seems like gcc inlines it regardless of the hint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two high level questions:
- The generation of the wild and tame is deterministic. Can we pre-generate them and add a script used for generation? Or we don't really care about the time cost
- Without annotating any baseline runtime or run time comparison, how do we detect performance regression in the updated performance test?
// possibilities---to generate this variety. For each wildcard string, we also generate one or | ||
// more strings that can be matched by the wildcard string. | ||
|
||
// The template below is meant to test 1-3 groups of WildcardStringAlphabets separated by '*'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the generated wild
, *
is added only at beginning three times (since i
is set to iterate 3 times). Did we miss to add *
in between when creating template_wildcard_str
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
return; | ||
} | ||
|
||
size_t const tame_size_before_modification = tame.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t const tame_size_before_modification = tame.size(); | |
auto const tame_size_before_modification = tame.size(); |
|
||
size_t const tame_size_before_modification = tame.size(); | ||
auto alphabet = chosen_alphabets.front(); | ||
auto next_chosen_alphabets = chosen_alphabets.subspan(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto next_chosen_alphabets = chosen_alphabets.subspan(1); | |
auto const next_chosen_alphabets = chosen_alphabets.subspan(1); |
return; | ||
} | ||
|
||
size_t const wild_size_before_modification = wild.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
size_t const wild_size_before_modification = wild.size(); | |
autoconst wild_size_before_modification = wild.size(); |
string tame1{"a?c"}; | ||
string wild1{"*a\\??*"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use constexpr std::string_view
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These were committed accidentally. I removed them now.
REQUIRE(wildcard_match_unsafe_case_sensitive(tameString, wildString) == true); | ||
} | ||
TEST_CASE("wildcard_match_unsafe", "[wildcard]") { | ||
string tame{"0!2#4%6&8(aBcDeFgHiJkLmNoPqRsTuVwXyZ"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use constexpr std::string_view
?
REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true); | ||
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") { | ||
auto const tests_dir = std::filesystem::path{__FILE__}.parent_path(); | ||
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto log_file_path = tests_dir / "test_network_reader_src" / "random.log"; | |
auto const log_file_path = tests_dir / "test_network_reader_src" / "random.log"; |
GIVEN("All lower case tame and mixed lower and upper case wild") { | ||
tameString = "abcde", wildString = "A?c*"; | ||
REQUIRE(wildcard_match_unsafe(tameString, wildString, isCaseSensitive) == true); | ||
SCENARIO("wildcard_match_unsafe_case_sensitive performance", "[wildcard performance]") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By reading the doc, we're not using any of GIVEN
, THEN
, or WHEN
. Can we just make it a normal TEST_CASE
?
} | ||
} | ||
} | ||
auto end_timestamp = high_resolution_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto end_timestamp = high_resolution_clock::now(); | |
auto const end_timestamp = high_resolution_clock::now(); |
bool isCaseSensitive = false; | ||
allPassed &= wildcard_match_unsafe("mississippi", "*issip*PI", isCaseSensitive); | ||
REQUIRE(allPassed == true); | ||
auto begin_timestamp = high_resolution_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto begin_timestamp = high_resolution_clock::now(); | |
auto const begin_timestamp = high_resolution_clock::now(); |
Co-authored-by: Lin Zhihao <59785146+LinZhihao-723@users.noreply.github.com>
True. I can move the generation code to a Python script and pregenerate them. With the generation bug fix, it takes something like 15 minutes to finish, so pregeneration is definitely worthwhile.
The previous performance test didn't detect regressions since it was comparing against itself. I'm a little reluctant to set a hard performance target since that's machine specific. For now, I hope everyone who modifies (and reviews) this code will benchmark the implementation before and after to ensure they aren't introducing regressions. That said, I do think there's a way to speed up and simplify the implementation, but that's for another PR. |
For 1: sure. I'm ok to delay it to a new PR. |
…se complexity to reduce runtime.
All the cases end up being a 41GiB file, so I decided to just simplify the test case. I think it still covers everything and it's performant enough imo.
Done. |
Description
This PR:
clp::string_utils::wildcard_match_unsafe_case_sensitive
both to fixclp::string_utils::wildcard_match_unsafe_case_sensitive
fails to match with certain queries #427, to slightly simplify the logic, and to add more comments to explain the algorithm;Validation performed
Validated unit tests passed.