Vectorize find_first_not_of/find_last_not_of member functions (multiple characters overloads) #5206

Open
wants to merge 4 commits into
base: main

AlexGuteniev
Contributor

@AlexGuteniev AlexGuteniev commented Dec 25, 2024

These are the two remaining functions in the find_meow_of family.
Together with #5102, this should complete basic_string vectorization coverage.

This turned out to be a surprisingly non-trivial change. The not flavor has no early return from the inner (needle) loop, which severely affects the paths that have this inner loop.

⚙️ Product code changes

Added the implementation of find_meow_not_of for 8-bit and 16-bit characters.

There is no vectorization for 32-bit and 64-bit characters. We happen to support them in find_first_of, because it exists as a free function callable with integers or pointers, but supporting them in find_first_not_of would require severely altering the specific AVX2 algorithm, which otherwise doesn't need to change.
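
For context, here is a minimal usage sketch of the two kinds of entry points mentioned above: the string member functions, which only ever see character types, and the free function std::find_first_of, which accepts arbitrary ranges such as ranges of int. The helper names trim and first_of_ints are made up for illustration and are not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing characters found in `set`.
inline std::string trim(const std::string& s, const std::string& set) {
    const std::size_t first = s.find_first_not_of(set); // first char NOT in set
    if (first == std::string::npos) {
        return {}; // every character belongs to the set
    }
    const std::size_t last = s.find_last_not_of(set); // last char NOT in set
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: the free function works on any element type, e.g. int.
inline std::ptrdiff_t first_of_ints(const std::vector<int>& hay, const std::vector<int>& needle) {
    const auto it = std::find_first_of(hay.begin(), hay.end(), needle.begin(), needle.end());
    return it == hay.end() ? -1 : it - hay.begin();
}

inline void usage_demo() {
    assert(trim("  hello \t", " \t") == "hello");
    assert(first_of_ints({1, 2, 3, 4}, {9, 3}) == 2);
}
```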

The implementation is added to the existing functions via a template parameter, as in #5102. For the bitmap algorithms and the small-needle path, it is only a matter of negating the results or inverting the bit mask.
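
To illustrate why the bitmap case is simple, here is a scalar sketch for 8-bit characters, not the PR's vectorized code. The Predicate enum and the find_first_bitmap name are assumptions made up for this example; the only point is that one template parameter flips the comparison.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

enum class Predicate { any_of, none_of }; // hypothetical names

// Scalar sketch: one bit per possible 8-bit value marks needle membership.
template <Predicate Pred>
std::size_t find_first_bitmap(std::string_view hay, std::string_view needle) {
    std::bitset<256> in_needle;
    for (const char c : needle) {
        in_needle.set(static_cast<unsigned char>(c));
    }
    // The only difference between the two flavors is the expected lookup result.
    constexpr bool want = Pred == Predicate::any_of;
    for (std::size_t i = 0; i < hay.size(); ++i) {
        if (in_needle.test(static_cast<unsigned char>(hay[i])) == want) {
            return i;
        }
    }
    return std::string_view::npos;
}

inline void bitmap_demo() {
    assert(find_first_bitmap<Predicate::any_of>("abcdef", "dc") == 2);
    assert(find_first_bitmap<Predicate::none_of>("abcdef", "bac") == 3);
}
```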

The fallback nested loop has a separate compile-time branch without an early return.
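
A rough sketch of what such a compile-time split can look like, assuming a bool template parameter named Not; the function name and the not_found constant are hypothetical, and the real fallback may be structured differently.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t not_found = static_cast<std::size_t>(-1); // hypothetical sentinel

// Not == false: find_first_of flavor; Not == true: find_first_not_of flavor.
template <bool Not, class T>
std::size_t find_first_fallback(const T* hay, std::size_t hay_len,
                                const T* needle, std::size_t needle_len) {
    for (std::size_t i = 0; i < hay_len; ++i) {
        if constexpr (Not) {
            // A match only rules out this haystack element; the function can
            // return only after the needle has failed to produce a match.
            bool in_needle = false;
            for (std::size_t j = 0; j < needle_len; ++j) {
                if (hay[i] == needle[j]) {
                    in_needle = true;
                    break;
                }
            }
            if (!in_needle) {
                return i;
            }
        } else {
            for (std::size_t j = 0; j < needle_len; ++j) {
                if (hay[i] == needle[j]) {
                    return i; // early return straight from the inner loop
                }
            }
        }
    }
    return not_found;
}

inline void fallback_demo() {
    assert((find_first_fallback<true>("aabca", 5, "ab", 2) == 3));
    assert((find_first_fallback<false>("aabca", 5, "cb", 2) == 2));
}
```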

For the SSE4.2 large-needle branch, in addition to the negation in the intrinsic parameter, the inner loop also has to switch to a no-early-return form and combine the results. The _Test_whole_needle lambda now has a different loop depending on the template parameter. It was also changed to return the position and to have an inner lambda, _Step, instead of them both. The lambda change can potentially affect codegen on the non-not control path, but I don't expect much of an impact, if any.
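
The "combine the results" part can be pictured with a portable emulation. The sketch below is not the PR's intrinsic code: equal_any_mask is a scalar stand-in for a PCMPESTRM-style "equal any" bit mask over a 16-byte block, and all names here are assumptions. It shows that every needle chunk has to contribute to the mask before the inverted result can be trusted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

constexpr std::size_t block = 16; // bytes per 128-bit vector

// Scalar stand-in for an "equal any" bit mask: bit i is set when
// hay_block[i] equals any character of the needle chunk.
inline std::uint32_t equal_any_mask(std::string_view hay_block, std::string_view needle_chunk) {
    std::uint32_t mask = 0;
    for (std::size_t i = 0; i < hay_block.size(); ++i) {
        if (needle_chunk.find(hay_block[i]) != std::string_view::npos) {
            mask |= std::uint32_t{1} << i;
        }
    }
    return mask;
}

inline std::size_t find_first_not_of_blocks(std::string_view hay, std::string_view needle) {
    for (std::size_t pos = 0; pos < hay.size(); pos += block) {
        const std::string_view hay_block = hay.substr(pos, block);
        std::uint32_t matched = 0;
        // No early exit: a single needle chunk cannot prove absence,
        // so the masks of all chunks are OR-ed together first.
        for (std::size_t n = 0; n < needle.size(); n += block) {
            matched |= equal_any_mask(hay_block, needle.substr(n, block));
        }
        const std::uint32_t valid = (std::uint32_t{1} << hay_block.size()) - 1;
        const std::uint32_t not_matched = ~matched & valid;
        if (not_matched != 0) {
            std::size_t bit = 0;
            while (((not_matched >> bit) & 1) == 0) {
                ++bit; // lowest set bit is the first non-matching position
            }
            return pos + bit;
        }
    }
    return std::string_view::npos;
}

inline void blocks_demo() {
    assert(find_first_not_of_blocks("the quick brown fox!", "abcdefghijklmnopqrstuvwxyz") == 3);
}
```

In the positive flavor the search can stop as soon as one chunk reports a match in the block, which is the asymmetry described above.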

🏁 Benchmark code changes

The fill strategy was altered to:

  • Avoid limits on the needle length, so that any needle length can be benchmarked
  • Provide different coverage for the not member functions, which makes more sense for them

So the iota was dropped. Incremental values are still used to fill the needle, because it is boring to just std::fill it.

💹 Performance expectations

The not functions are expected to perform almost the same as their positive counterparts. But of course we can't have supersymmetry here.

The noticeably distinct part is the SSE4.2 path, which uses different instructions. It has less control flow, but it uses PCMPESTRM instead of PCMPESTRI. Their performance is roughly the same overall, but there are small differences on some CPUs: decent Intels tend to like PCMPESTRI, decent AMDs tend to show no difference, and older AMDs and power-saving Intels tend to like PCMPESTRM.

See the comparison on uops.info.

Apparently we're good at the large scale, and fine-tuning cannot be addressed anyway, so I didn't attempt to look for new thresholds for the not functions.

⏱️ Benchmark results

i5 1235U

Benchmark Time (before) Time (after)
bm<AlgType::str_member_first_not, char>/2/3 5.98 ns 5.58 ns
bm<AlgType::str_member_first_not, char>/6/81 36.7 ns 23.8 ns
bm<AlgType::str_member_first_not, char>/7/4 12.5 ns 18.6 ns
bm<AlgType::str_member_first_not, char>/9/3 12.9 ns 16.6 ns
bm<AlgType::str_member_first_not, char>/22/5 16.2 ns 17.8 ns
bm<AlgType::str_member_first_not, char>/58/2 25.8 ns 17.6 ns
bm<AlgType::str_member_first_not, char>/75/85 63.8 ns 45.8 ns
bm<AlgType::str_member_first_not, char>/102/4 43.8 ns 19.5 ns
bm<AlgType::str_member_first_not, char>/200/46 82.3 ns 39.9 ns
bm<AlgType::str_member_first_not, char>/325/1 95.8 ns 38.2 ns
bm<AlgType::str_member_first_not, char>/400/50 131 ns 58.0 ns
bm<AlgType::str_member_first_not, char>/1011/11 262 ns 115 ns
bm<AlgType::str_member_first_not, char>/1280/46 338 ns 122 ns
bm<AlgType::str_member_first_not, char>/1502/23 380 ns 133 ns
bm<AlgType::str_member_first_not, char>/2203/54 563 ns 237 ns
bm<AlgType::str_member_first_not, char>/3056/7 740 ns 268 ns
bm<AlgType::str_member_first_not, wchar_t>/2/3 5.65 ns 6.15 ns
bm<AlgType::str_member_first_not, wchar_t>/6/81 38.3 ns 49.3 ns
bm<AlgType::str_member_first_not, wchar_t>/7/4 11.6 ns 14.5 ns
bm<AlgType::str_member_first_not, wchar_t>/9/3 11.3 ns 14.6 ns
bm<AlgType::str_member_first_not, wchar_t>/22/5 15.8 ns 15.3 ns
bm<AlgType::str_member_first_not, wchar_t>/58/2 29.8 ns 20.7 ns
bm<AlgType::str_member_first_not, wchar_t>/75/85 69.6 ns 52.8 ns
bm<AlgType::str_member_first_not, wchar_t>/102/4 52.7 ns 27.9 ns
bm<AlgType::str_member_first_not, wchar_t>/200/46 106 ns 50.4 ns
bm<AlgType::str_member_first_not, wchar_t>/325/1 132 ns 58.4 ns
bm<AlgType::str_member_first_not, wchar_t>/400/50 180 ns 65.7 ns
bm<AlgType::str_member_first_not, wchar_t>/1011/11 375 ns 139 ns
bm<AlgType::str_member_first_not, wchar_t>/1280/46 488 ns 155 ns
bm<AlgType::str_member_first_not, wchar_t>/1502/23 555 ns 186 ns
bm<AlgType::str_member_first_not, wchar_t>/2203/54 897 ns 266 ns
bm<AlgType::str_member_first_not, wchar_t>/3056/7 1120 ns 333 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 15.6 ns 17.4 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 22.6 ns 29.4 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 23.1 ns 15.3 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 34.4 ns 15.6 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 51.7 ns 16.0 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 112 ns 20.6 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 177 ns 196 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 203 ns 27.7 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 480 ns 275 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 445 ns 67.9 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 963 ns 623 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 2773 ns 443 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 3156 ns 1671 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 3413 ns 867 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 5532 ns 3279 ns
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 8053 ns 566 ns
bm<AlgType::str_member_last_not, char>/2/3 5.12 ns 5.42 ns
bm<AlgType::str_member_last_not, char>/6/81 32.4 ns 22.7 ns
bm<AlgType::str_member_last_not, char>/7/4 10.8 ns 17.3 ns
bm<AlgType::str_member_last_not, char>/9/3 11.1 ns 14.7 ns
bm<AlgType::str_member_last_not, char>/22/5 14.9 ns 15.6 ns
bm<AlgType::str_member_last_not, char>/58/2 27.6 ns 15.8 ns
bm<AlgType::str_member_last_not, char>/75/85 53.4 ns 40.5 ns
bm<AlgType::str_member_last_not, char>/102/4 45.5 ns 17.9 ns
bm<AlgType::str_member_last_not, char>/200/46 86.0 ns 36.5 ns
bm<AlgType::str_member_last_not, char>/325/1 103 ns 37.8 ns
bm<AlgType::str_member_last_not, char>/400/50 138 ns 57.5 ns
bm<AlgType::str_member_last_not, char>/1011/11 276 ns 116 ns
bm<AlgType::str_member_last_not, char>/1280/46 363 ns 138 ns
bm<AlgType::str_member_last_not, char>/1502/23 415 ns 144 ns
bm<AlgType::str_member_last_not, char>/2203/54 601 ns 210 ns
bm<AlgType::str_member_last_not, char>/3056/7 826 ns 263 ns
bm<AlgType::str_member_last_not, wchar_t>/2/3 5.82 ns 5.77 ns
bm<AlgType::str_member_last_not, wchar_t>/6/81 37.8 ns 43.8 ns
bm<AlgType::str_member_last_not, wchar_t>/7/4 9.71 ns 14.2 ns
bm<AlgType::str_member_last_not, wchar_t>/9/3 10.4 ns 14.2 ns
bm<AlgType::str_member_last_not, wchar_t>/22/5 15.7 ns 15.1 ns
bm<AlgType::str_member_last_not, wchar_t>/58/2 36.6 ns 19.0 ns
bm<AlgType::str_member_last_not, wchar_t>/75/85 78.3 ns 52.8 ns
bm<AlgType::str_member_last_not, wchar_t>/102/4 55.8 ns 26.5 ns
bm<AlgType::str_member_last_not, wchar_t>/200/46 114 ns 46.8 ns
bm<AlgType::str_member_last_not, wchar_t>/325/1 166 ns 42.5 ns
bm<AlgType::str_member_last_not, wchar_t>/400/50 187 ns 62.5 ns
bm<AlgType::str_member_last_not, wchar_t>/1011/11 381 ns 127 ns
bm<AlgType::str_member_last_not, wchar_t>/1280/46 539 ns 150 ns
bm<AlgType::str_member_last_not, wchar_t>/1502/23 563 ns 170 ns
bm<AlgType::str_member_last_not, wchar_t>/2203/54 847 ns 265 ns
bm<AlgType::str_member_last_not, wchar_t>/3056/7 1242 ns 375 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2/3 13.2 ns 14.7 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/6/81 25.4 ns 29.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/7/4 21.4 ns 14.1 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/9/3 32.0 ns 14.4 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/22/5 49.6 ns 14.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/58/2 110 ns 19.5 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/75/85 186 ns 211 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/102/4 203 ns 26.9 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/200/46 489 ns 309 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/325/1 474 ns 65.4 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/400/50 1151 ns 707 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1011/11 2455 ns 620 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1280/46 3207 ns 1924 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/1502/23 4029 ns 1346 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/2203/54 5595 ns 3755 ns
bm<AlgType::str_member_last_not, wchar_t, L'\x03B1'>/3056/7 7376 ns 557 ns

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner December 25, 2024 12:59
@StephanTLavavej StephanTLavavej added the performance Must go faster label Jan 4, 2025
@StephanTLavavej StephanTLavavej self-assigned this Jan 4, 2025
@StephanTLavavej StephanTLavavej removed their assignment Jan 14, 2025
@StephanTLavavej StephanTLavavej self-assigned this Feb 24, 2025
Labels
performance Must go faster
Projects
Status: Initial Review
2 participants