[stdlib] Make utf8 validation ~10-13x faster on neon and sse4 #3401
Conversation
Feel free to get in touch with me. The algorithm is used in Node.js and Bun, it is in the PHP interpreter, and in many other important systems. Note that .NET is currently considering adopting this approach (dotnet/runtime#104199), based on our C# implementation: https://github.com/simdutf/SimdUnicode |
Thanks a lot @lemire! I'm a big fan of your work! I need to polish this pull request a bit; then, if I have any questions, I'll contact you by PM. Given the number of times this function will be called in Mojo, I see the need to push for an implementation that is as fast as possible. |
@JoeLoser |
Yeah, someone just filed an issue internally about this as it failed in today's nightly release. Taking a look now. |
There is a fast UTF-8 validation algorithm that is used in multiple systems and has been ported to several programming languages. As far as I know, it is the fastest algorithm... plus or minus some instruction-level optimizations. The key insight is that you have three table lookups followed by two bitwise ANDs. These three lookups do almost all the work. It is very difficult to beat and it works well on various instruction sets. You will find an independent implementation in the PHP interpreter (in C): https://github.com/php/php-src/blob/9147687b6d5c4491e1e19cb0d80ffabc479593ef/ext/mbstring/mbstring.c#L5266 The reference implementation is found in simdutf: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h (this is an ISA-agnostic implementation). This is maybe what is implemented here, but I do not recognize it. The fast lookup algorithm should look as follows (ARM NEON version in C#):

```csharp
Vector128<byte> prev1 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 1));
Vector128<byte> byte_1_high = AdvSimd.Arm64.VectorTableLookup(shuf1, AdvSimd.ShiftRightLogical(prev1.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> byte_1_low = AdvSimd.Arm64.VectorTableLookup(shuf2, (prev1 & v0f));
Vector128<byte> byte_2_high = AdvSimd.Arm64.VectorTableLookup(shuf3, AdvSimd.ShiftRightLogical(currentBlock.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> sc = AdvSimd.And(AdvSimd.And(byte_1_high, byte_1_low), byte_2_high);
Vector128<byte> prev2 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 2));
Vector128<byte> prev3 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 3));
prevInputBlock = currentBlock;
Vector128<byte> isThirdByte = AdvSimd.SubtractSaturate(prev2, thirdByte);
Vector128<byte> isFourthByte = AdvSimd.SubtractSaturate(prev3, fourthByte);
Vector128<byte> must23 = AdvSimd.Or(isThirdByte, isFourthByte);
Vector128<byte> must23As80 = AdvSimd.And(must23, v80);
Vector128<byte> error = AdvSimd.Xor(must23As80, sc);
if (error != Vector128<byte>.Zero)
{
    // report error
}
```

The key ingredients are these constants:

```csharp
Vector128<byte> shuf1 = Vector128.Create(TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
        TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
        TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS,
        TOO_SHORT | OVERLONG_2,
        TOO_SHORT,
        TOO_SHORT | OVERLONG_3 | SURROGATE,
        TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4);
Vector128<byte> shuf2 = Vector128.Create(CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
        CARRY | OVERLONG_2,
        CARRY,
        CARRY,
        CARRY | TOO_LARGE,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
        CARRY | TOO_LARGE | TOO_LARGE_1000,
        CARRY | TOO_LARGE | TOO_LARGE_1000);
Vector128<byte> shuf3 = Vector128.Create(TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
        TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
        TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
        TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
        TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
        TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
        TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT);
```

based on the following constants:

```csharp
const byte TOO_SHORT = 1 << 0;
const byte TOO_LONG = 1 << 1;
const byte OVERLONG_3 = 1 << 2;
const byte SURROGATE = 1 << 4;
const byte OVERLONG_2 = 1 << 5;
const byte TWO_CONTS = 1 << 7;
const byte TOO_LARGE = 1 << 3;
const byte TOO_LARGE_1000 = 1 << 6;
const byte OVERLONG_4 = 1 << 6;
const byte CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS;
```
|
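To make the three-lookup trick concrete, here is a minimal scalar Python sketch (not part of the PR or of the C# code above) that applies the same shuf1/shuf2/shuf3 tables to single byte pairs. It only illustrates the AND-of-lookups classification; the real algorithm works on whole SIMD vectors and adds the must23/must23As80 handling for 3- and 4-byte sequences.

```python
# Scalar sketch of the lookup classification step, built from the constants quoted above.
TOO_SHORT, TOO_LONG, OVERLONG_3, TOO_LARGE = 1 << 0, 1 << 1, 1 << 2, 1 << 3
SURROGATE, OVERLONG_2, TOO_LARGE_1000, OVERLONG_4 = 1 << 4, 1 << 5, 1 << 6, 1 << 6
TWO_CONTS = 1 << 7
CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS

shuf1 = [TOO_LONG] * 8 + [TWO_CONTS] * 4 + [
    TOO_SHORT | OVERLONG_2,
    TOO_SHORT,
    TOO_SHORT | OVERLONG_3 | SURROGATE,
    TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4,
]
shuf2 = [
    CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
    CARRY | OVERLONG_2,
    CARRY,
    CARRY,
    CARRY | TOO_LARGE,
] + [CARRY | TOO_LARGE | TOO_LARGE_1000] * 8 + [
    CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
    CARRY | TOO_LARGE | TOO_LARGE_1000,
    CARRY | TOO_LARGE | TOO_LARGE_1000,
]
shuf3 = [TOO_SHORT] * 8 + [
    TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
    TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
    TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
    TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
] + [TOO_SHORT] * 4

def special_cases(prev_byte: int, cur_byte: int) -> int:
    # AND of the three table lookups; non-zero bits name the suspected error class.
    return shuf1[prev_byte >> 4] & shuf2[prev_byte & 0x0F] & shuf3[cur_byte >> 4]

assert special_cases(0x41, 0x42) == 0          # ASCII followed by ASCII: no error
assert special_cases(0xC3, 0xA9) == 0          # lead byte of 'é' then continuation: no error
assert special_cases(0x41, 0x80) == TOO_LONG   # stray continuation byte is flagged
assert special_cases(0xC3, 0x41) & TOO_SHORT   # lead byte not followed by a continuation
```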
@lemire thank you for the insights. First I looked at the code you provided in C# and looked for similarities with the implementation I took from lemire/fastvalidate-utf-8. While the main ingredients were present (three table lookups followed by two bitwise ANDs), there was a lot more going on and the values in the tables were not the same. Rather than trying to see if the two implementations were equivalent, which is non-trivial since operations don't have a defined order, I decided to implement a new version in Mojo using the C# implementation you provided, with some help from the source file. The results are, well, here, and clear. On my system, Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, Windows 11):
So many thanks for providing this improved version! You can look at the diff, it's really close to the C# code. We don't have the fast path for ASCII yet, but that can come in another pull request. |
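As a rough illustration of the ASCII fast path mentioned above (an assumption about how it could look, not code from the PR): when every byte in a chunk has its high bit clear, the chunk is pure ASCII and the expensive lookup work can be skipped for it.

```python
# Hypothetical sketch of the ASCII fast-path check, in scalar Python for clarity.
# A real implementation must still verify that the previous chunk did not end in the
# middle of a multi-byte character before skipping.
def chunk_is_ascii(chunk: bytes) -> bool:
    # Equivalent to a SIMD reduce-or of (byte & 0x80) followed by a zero test.
    return all(b < 0x80 for b in chunk)

assert chunk_is_ascii(b"plain ASCII text")
assert not chunk_is_ascii("héllo".encode("utf-8"))
```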
@gabrieldemarmiesse you may also want to update the PR's main description and title according to the latest changes. The speedup is quite impressive, so it is worth mentioning IMHO. Btw, great work! |
Awesome work @gabrieldemarmiesse, thank you for tackling this! As I mentioned in Discord, you might wanna look at the repo I linked. There might also be a problem IMO with the benchmarks I've seen: most have many non-ASCII characters and/or are very fabricated (strings in real life aren't randomly distributed). It might be worth having separate benchmarks for "typical" English, Spanish, Mandarin, and Hindi, which are realistically the most written on the internet, using the lorem ipsum generator. |
@martinvuyk Thanks for chiming in. About trying newer algorithms: re-implementing a new algorithm in a new programming language, profiling it and optimizing it can be a lot of work (at least a day), so it doesn't really fit in the scope of this pull request. Let's focus on this pull request and merge it if the maintainers agree that the improvements are clear. Afterwards, I or someone else from the community can investigate newer algorithms and open another PR with the benchmark results. Let's try to tackle one thing at a time. About benchmarking quality, I totally agree with you. We would benefit from multiple benchmarks on multiple corpora. For example, I didn't add the "ASCII fast path" (skip iterations when we recognize consecutive ascii blocks) to this PR; it should be pretty trivial to add and can bring huge improvements on benchmarks with lots of ASCII in the corpus. When I have more time, I'll try to do a more complete benchmark with more corpora. |
Regarding benchmarks, I offer the following repository, which contains a wide range of different files, including lipsum files, but also files containing a lot of ASCII mixed with richer Unicode characters. See https://github.com/lemire/unicode_lipsum I also recommend the twitter JSON file: https://github.com/simdutf/SimdUnicode/blob/main/benchmark/data/twitter.json It is an interesting mix of ASCII (JSON) and international characters.
Note that the reference you offer predates the reference of the lookup algorithm: Validating UTF-8 In Less Than One Instruction Per Byte. The UTF-8 lookup algorithm 'evolved' over the years; there have been four distinct lookup algorithms, and these algorithms followed various other approaches. I believe that the latest one, implemented here (which is lookup... or lookup4 as we used to call it), is likely the fastest available SIMD validation algorithm. To be fair, the implementation matters and tuning can help. |
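A corpus-based benchmark harness along these lines could look like the following Python sketch. It is not the PR's benchmark: the file names are hypothetical stand-ins for files from the unicode_lipsum repository and twitter.json, and Python's built-in decoder stands in for `_is_valid_utf8`.

```python
# Hypothetical corpus benchmark harness (illustrative only).
import time

CORPUS_FILES = ["lipsum_english.txt", "lipsum_mandarin.txt", "twitter.json"]  # placeholder paths

def is_valid_utf8(data: bytes) -> bool:
    # Stand-in validator; the real target would be the stdlib's _is_valid_utf8().
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def bench(path: str, repetitions: int = 100) -> None:
    with open(path, "rb") as f:
        data = f.read()
    start = time.perf_counter()
    for _ in range(repetitions):
        assert is_valid_utf8(data)
    elapsed = time.perf_counter() - start
    gigabytes = len(data) * repetitions / 1e9
    print(f"{path}: {gigabytes / elapsed:.3f} GB/s")

if __name__ == "__main__":
    for path in CORPUS_FILES:
        bench(path)
```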
Yeah, those look good to me 👍. I had just read the code for the random string generation and hadn't seen the twitter.json use.
Oh, I thought they were only "revisions" of the same algo for different ISAs using fewer instructions/mem loads.
Totally understandable. Now that I've actually read Lemire's paper, it might be a better algo than what I linked, since it actually registers the error type (awesome bit recycling for the continuation byte, btw.) and does seem to only require 3 table lookups, whereas the range algo of the repo I linked builds an index table and then adjusts some indices based on the first and second byte (2 to 5 extra instructions, branchless), doing around 6 lookups.
Honestly, it might be a bit overkill; this is awesome work as it is, and what I coded was just a "good enough" approach until someone took the torch and built something better. It might be better to leave it as a pending addition to the stdlib benchmarks for when someone wants to try out a new algorithm. I just mentioned what I thought of the benchmarks as something that we could improve in the future. |
Building good benchmarks is like building good tests. It is usually a net positive long term. |
!sync |
✅🟣 This contribution has been merged 🟣✅ Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours. We use Copybara to merge external contributions, click here to learn more. |
…se4 (#47462) [External] [stdlib] Make utf8 validation ~10-13x faster on neon and sse4

## Description of the changes

In the future `_is_valid_utf8()` will be used massively. As such, we need every performance improvement possible, as long as the complexity cost is reasonable. This PR changes the implementation of the function `_is_valid_utf8()` without changing the signature. It's a drop-in replacement.

This implementation is described in the paper [Validating UTF-8 in less than one instruction per byte](https://arxiv.org/abs/2010.03090) by John Keiser and Daniel Lemire, which is pretty close to the state of the art on the subject. A reference C++ implementation can be found in the repository [lemire/fastvalidate-utf-8](https://github.com/lemire/fastvalidate-utf-8), precisely in [this file](https://github.com/lemire/fastvalidate-utf-8/blob/master/include/simdutf8check.h). Notice how Mojo makes this more generic and readable, as well as portable.

Note that the only improvement that I'm aware of to this algorithm is the [is_utf8 library of simdutf](https://github.com/simdutf/is_utf8), which is based on the same algorithm as the one used in this PR. It is significantly harder to implement, as it's a production-grade library full of macros and other things that I have a harder time reading than fastvalidate-utf-8.

Two good blog posts have been made on the subject:

* [Validating UTF-8 strings using as little as 0.7 cycles per byte](https://lemire.me/blog/2018/05/16/validating-utf-8-strings-using-as-little-as-0-7-cycles-per-byte/)
* [Validating UTF-8 bytes using only 0.45 cycles per byte, AVX edition](https://lemire.me/blog/2018/10/19/validating-utf-8-bytes-using-only-0-45-cycles-per-byte-avx-edition/)

## Types of utf-8 errors

While I'm not sure that the current implementation can detect all classes of errors, this algorithm checks the following rules:

a) **5+ Byte**. The leading byte must have fewer than 5 header bits.
b) **Too Short**. The leading byte must be followed by N-1 continuation bytes, where N is the UTF-8 character length.
c) **Too Long**. The leading byte must not be a continuation byte.
d) **Overlong**. The decoded character must be above U+7F for two-byte characters, U+7FF for three-byte characters, and U+FFFF for four-byte characters.
e) **Too Large**. The decoded character must be less than or equal to U+10FFFF.
f) **Surrogate**. The decoded character must not be in U+D800...DFFF.

## Why is this implementation so much faster than the current one?

### The current implementation

The current implementation was using simd, but not in an optimal way:

1) It loaded N bytes into a simd vector.
2) It checked with simd instructions whether the chunk was ascii (a fast path) and skipped it if that was the case. If only some bytes were ascii, it did a plain for loop on each byte to skip them.
3) If the chunk was not full ascii, it looked at the first byte and got the number of bytes in the character that is at the start of the simd vector.
4) It incremented the counter by the number of bytes in the character (2, 3 or 4) and went back to step 1.

Since the index can increment by 2, 3 or 4 before loading the next chunk of bytes, the following problems were present:

1) We do a lot of iterations, one per character if it's non-ascii, and since each iteration involves a `LOAD`, that's expensive.
2) Doing a for loop on each byte to check ascii requires a lot of instructions to be executed per byte and has poor branch predictability (if not ascii, continue the loop; it's a hard-to-predict branch).
3) Since the data isn't aligned anymore (`idx` is not a multiple of the simd size), loading the chunk into a simd vector is quite slow.
4) Overall, many if-statements are present in the loop.

### The new implementation

The new algorithm improves on this by reading the data chunk by chunk, keeping information about the previous chunk in simd vectors. You can look at the `_is_valid_utf8` function, which has the main loop. Since it's chunk by chunk, the index jump is always the simd size. It has the following properties, which usually make the hardware happy:

1) One single LOAD per iteration.
2) Each byte is loaded once.
3) Once a chunk is loaded into SIMD, no branching is done.
4) Once a chunk is loaded into simd, no more loads from memory are done; we only work with simd vectors in registers.
5) Many simd operations don't depend on each other, which gives flexibility to the cpu with out-of-order execution.
6) The jump size is constant (the simd size), which makes it very easy to know which data will be accessed next, and the data to load into simd is always aligned correctly.

Basically, you load 32 bytes of data into simd, do a bunch of computation in registers without branching (~60 assembly instructions, look at the `_check_utf8_bytes` function; those instructions don't all depend on each other), and only take the next 32 bytes when you're done with the current chunk. Keep a bunch of vectors here and there to be able to validate characters which span across two simd vectors. This type of computation is the one cpus are optimized for.

Since we only do simd operations, the number of assembly instructions is the same whether the simd vector is of size 8 or size 64, meaning that if the cpu has bigger simd sizes available (AVX512), a considerable speedup can be achieved.

## Benchmark code

We don't make use of the `benchmark` module, because it is not available when an external contributor recompiles the stdlib.

```mojo
import sys
from testing import assert_true, assert_false
from utils.string_slice import _is_valid_utf8
import time


@no_inline
fn keep(x: Bool) -> Int:
    return not x


fn get_big_string() -> String:
    var string = str(
        "안녕하세요,세상 hello mojo! 🔥🔥hopefully this string is complicated enough :p"
        " éç__çè"
    )
    # The string is 100 bytes long.
    return string * 100_000  # 10MB


def main():
    print("Has neon:", sys.has_neon())
    print("Has sse4:", sys.has_sse4())
    print("SIMD size:", sys.simdbytewidth())

    var big_string = get_big_string()

    @parameter
    fn utf8_simd_validation_benchmark() raises:
        # we want to validate ~1gb of data
        for _ in range(100):
            var result = _is_valid_utf8(big_string.unsafe_ptr(), len(big_string))
            assert_true(result)

    # warmup
    for _ in range(3):
        utf8_simd_validation_benchmark()

    iterations = 10
    t1 = time.now()
    for _ in range(iterations):
        utf8_simd_validation_benchmark()
    t2 = time.now()
    _ = big_string

    average_ns = (t2 - t1) / iterations
    average_s = average_ns / 1_000_000_000

    print("Validate 1GB of UTF-8 data in", average_s, "s")
    print(1.0 / average_s, "GB/s")
```

Put it in a file called `bench.mojo`. Bench the nightly version with

```
MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true mojo build bench.mojo && ./bench
```

Bench this branch by doing a checkout on it and then:

```
MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true MODULAR_MOJO_NIGHTLY_IMPORT_PATH=./build mojo build bench.mojo && ./bench
```

## Benchmark results

* On AMD Ryzen 9 7945HX with Radeon Graphics we get a **x10.8** speedup.
* On Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, Windows 11) I get a **x7.3** speedup.

I don't have the numbers for less capable CPUs, notably the ones that don't have the instruction to do the `dynamic_shuffle()` in one single instruction. If you have other cpus, please run the benchmark and report the numbers here. Thanks!

## Future work in this area

To further improve the algorithm, a few paths can be taken:

1) The reference implementation has a "fast path" for ASCII chunks where many instructions are skipped if the chunk is ASCII. This can improve the speed for some common cases, e.g. when there are a lot of ascii chars in a string. This fast path has not been implemented yet.
2) The original author states in the repo (and we could look at the changes that were made):
   > NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: [please adopt the simdutf library](https://github.com/simdutf/). It is much more powerful, faster and better tested.
3) This algorithm could inform the user which precise byte is causing an issue and why. This has not been implemented yet. Note that it will add branching to the loop and will require benchmarking to make sure there is no performance penalty.
4) I looked at the assembly, and some instructions that were present in the original implementation were not used here, meaning there is some room for using specialized instructions that would save a few cycles. A good example is the `_mm_alignr_epi8` function, which should in theory be one single instruction.

I am not currently working on those improvements. Someone else can investigate.

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3401

MODULAR_ORIG_COMMIT_REV_ID: 3dabaa99b60da779630c7be36f01f8d4468eeab9
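To make the error classes above concrete, here is a small Python check (not part of the PR) that uses Python's built-in UTF-8 decoder as a reference oracle; each byte sequence below exercises one of the rules a)-f).

```python
# Illustrative only: Python's strict decoder stands in as a reference validator.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

assert is_valid_utf8("héllo 🔥".encode("utf-8"))   # well-formed input
assert not is_valid_utf8(b"\xf8\x88\x80\x80\x80")  # a) 5-byte sequence
assert not is_valid_utf8(b"\xc3A")                 # b) too short: lead byte without its continuation
assert not is_valid_utf8(b"\x80")                  # c) too long: stray continuation byte
assert not is_valid_utf8(b"\xc0\xaf")              # d) overlong encoding of U+002F
assert not is_valid_utf8(b"\xf4\x90\x80\x80")      # e) too large: would encode U+110000
assert not is_valid_utf8(b"\xed\xa0\x80")          # f) surrogate U+D800
```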
Landed in 8e41b7b! Thank you for your contribution 🎉 |
@gabrieldemarmiesse I realized yesterday that you aren't on the changelog for the latest stable Mojo, and I was sad, especially knowing all you did for […]. Honestly, I'm +1000 on graduating […]. |
I guess I was too lazy to add entries to the changelog haha. Thanks for the reminder :) |
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

## Dependencies

The following PR should be merged first:

* #3397

## Description of the changes

`b64encode` is the function that encodes bytes to base 64. Base 64 encoding is massively used across the industry, be it to write secrets as text or to send data across the internet. Since it's going to be used a lot, we should make sure it is fast. As such, this PR provides a new implementation of `b64encode` around 5 times faster than the current one.

This implementation was taken from the following papers:

Wojciech Muła, Daniel Lemire, Base64 encoding and decoding at almost the speed of a memory copy, Software: Practice and Experience 50 (2), 2020. https://arxiv.org/abs/1910.05109

Wojciech Muła, Daniel Lemire, Faster Base64 Encoding and Decoding using AVX2 Instructions, ACM Transactions on the Web 12 (3), 2018. https://arxiv.org/abs/1704.00605

Note that there are substantial differences between the papers and this implementation. There are two reasons for this:

* We want to avoid using assembly/llvm intrinsics directly and try to use the functions provided by the stdlib.
* We want to keep the complexity low, so we don't make a slightly different algorithm for each simd size and each cpu architecture.

In a nutshell, we decide on a simd size, let's say 32. So at each iteration, we load 32 bytes, reshuffle the first 24 bytes, convert them to base 64, which turns them into 32 bytes, and then we store those 32 bytes in the output buffer. We have a final iteration for the last incomplete chunk, where we shouldn't load everything at once, otherwise we would get out-of-bounds errors. There we use partial loads and stores and masking, but the main SIMD algorithm is still used. The reasons for the speedups are similar to the ones provided in #3401.

## API changes

The existing api is

```mojo
fn b64encode(str: String) -> String:
```

and has several limitations:

1) The input of the function is raw bytes. It doesn't have to represent text. Requiring the user to provide a `String` forces the user to handle null termination on its bytes and whatever other requirements `String` might have on the bytes it uses.
2) It is not possible to write the produced bytes into an existing buffer.
3) It is hard to benchmark, as the signature implies that the function allocates memory on the heap.
4) It supposes that the input value owns the underlying data, meaning that it's not possible to use the function if the data is not owned. `Span` would be a better choice here.

We keep the existing signature in this PR for backward compatibility and add new overloads. Now the signatures are:

```mojo
fn b64encode(input_bytes: List[UInt8, _], inout result: List[UInt8, _])
fn b64encode(input_bytes: List[UInt8, _]) -> String
fn b64encode(input_string: String) -> String
```

Note that it could be further improved in future PRs, as currently `Span` is not easy to use but would be the right fit for the input value. We could also remove `fn b64encode(input_string: String) -> String` in the future. Note that the python api takes `bytes` as input and returns `bytes`.

## Benchmarking

Benchmarking is harder than usual here because the base function does memory allocation. To avoid having the alloc in the benchmark, we must modify the original function to add the overloads described above.
In this case we can benchmark, and on my system

```
WSL2 windows 11
Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz
Base speed: 3,80 GHz
Sockets: 1
Cores: 8
Logical processors: 16
Virtualization: Enabled
L1 cache: 512 KB
L2 cache: 2,0 MB
L3 cache: 16,0 MB
```

we get around a 5x speedup. I don't provide the benchmark script here because it won't work out of the box (see the issue mentioned above), but if that's really necessary to get this merged, I'll provide the diff + the benchmark script.

## Future work

As said before, this PR is not an exact re-implementation of the papers and the state-of-the-art implementation that comes with them, the [simdutf](https://github.com/simdutf/simdutf) library. This is to keep this implementation simple and portable, as it will work on any CPU that has a simd size of at least 4 bytes and at most 64 bytes. In future PRs, we could provide further speedups by using simd algorithms that are specific to each architecture. This will greatly increase the complexity of the code. I'll leave this decision to the maintainers.

We can also re-write `b64decode` using simd, and it's also expected that we'll get speedups. This can be the topic of another PR too.

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3443

MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299
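To illustrate the 24-bytes-in / 32-bytes-out relationship described above, here is a small scalar Python sketch (not the PR's SIMD code; the function and constant names are made up for the example) that performs the 3-byte to 4-character expansion and checks it against Python's `base64` module.

```python
# Scalar illustration of the base64 expansion: every 3 input bytes become 4 output
# characters, so a 24-byte block expands to 32 bytes.
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64encode_scalar(data: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(data) - len(data) % 3, 3):   # full 3-byte groups only, no padding
        n = int.from_bytes(data[i:i + 3], "big")        # 24 bits
        out += bytes(ord(ALPHABET[(n >> shift) & 0x3F]) for shift in (18, 12, 6, 0))
    return bytes(out)

block = b"abcdefghijklmnopqrstuvwx"        # 24 bytes
encoded = b64encode_scalar(block)
assert len(encoded) == 32
assert encoded == base64.b64encode(block)  # matches the standard library
```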