
[stdlib] Make utf8 validation ~10-13x faster on neon and sse4 #3401

Conversation

gabrieldemarmiesse
Contributor

@gabrieldemarmiesse gabrieldemarmiesse commented Aug 20, 2024

Dependencies

The following PR should be merged first:

Description of the changes

In the future _is_valid_utf8() will be used massively. As such, we need every performance improvement possible, as long as the complexity cost is reasonable.

This PR changes the implementation of the function _is_valid_utf8() without changing the signature. It's a drop-in replacement.
This implementation is described in the paper Validating UTF-8 in less than one instruction per byte by John Keiser and Daniel Lemire, which is pretty close to the state of the art on the subject.

A reference C++ implementation can be found in the repository lemire/fastvalidate-utf-8, precisely in the file include/simdutf8check.h. Notice how Mojo makes this more generic and readable, as well as portable.

Note that the only improvement that I'm aware of to this algorithm is the is_utf8 library of simdutf, which is based on the same algorithm as this PR. It is significantly harder to follow, as it's a production-grade library full of macros and other things that I have a harder time reading than fastvalidate-utf-8.

Two good blog posts have been written on the subject:

  • Validating UTF-8 strings using as little as 0.7 cycles per byte
  • Validating UTF-8 bytes using only 0.45 cycles per byte, AVX edition

Types of utf-8 errors

While I'm not sure that the current implementation can detect all classes of errors, this algorithm checks the following rules (see the small example after this list):
a) 5+ Byte. The leading byte must have fewer than 5 header bits.
b) Too Short. The leading byte must be followed by N-1 continuation bytes, where N is the UTF-8 character length.
c) Too Long. The leading byte must not be a continuation byte.
d) Overlong. The decoded character must be above U+7F for two-byte characters, U+7FF for three-byte characters, and U+FFFF for four-byte characters.
e) Too Large. The decoded character must be less than or equal to U+10FFFF.
f) Surrogate. The decoded character must not be in U+D800...U+DFFF.
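
To make these error classes concrete, here is a small check (a sketch only, assuming the same pointer-and-length signature of _is_valid_utf8 that the benchmark below uses) that feeds a few hand-crafted byte sequences to the validator:

from testing import assert_true, assert_false
from utils.string_slice import _is_valid_utf8


def main():
    # Valid: "a" followed by "é" (0xC3 0xA9).
    var ok = List[UInt8](0x61, 0xC3, 0xA9)
    assert_true(_is_valid_utf8(ok.unsafe_ptr(), len(ok)))

    # Overlong: U+002F ("/") encoded on two bytes (0xC0 0xAF) instead of one.
    var overlong = List[UInt8](0xC0, 0xAF)
    assert_false(_is_valid_utf8(overlong.unsafe_ptr(), len(overlong)))

    # Too short: a three-byte leading byte followed by only one continuation byte.
    var too_short = List[UInt8](0xE2, 0x82)
    assert_false(_is_valid_utf8(too_short.unsafe_ptr(), len(too_short)))

    # Surrogate: U+D800 (0xED 0xA0 0x80) is never valid UTF-8.
    var surrogate = List[UInt8](0xED, 0xA0, 0x80)
    assert_false(_is_valid_utf8(surrogate.unsafe_ptr(), len(surrogate)))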

Why is this implementation so much faster than the current one?

The current implementation

The current implementation used SIMD, but not in an optimal way (see the sketch after the lists below):

  1. It loaded N bytes into a SIMD vector.
  2. It checked with SIMD instructions whether the chunk was pure ASCII (a fast path) and skipped the chunk if that was the case. If only some bytes were ASCII, it did a plain for loop on each byte to skip them.
  3. If the chunk was not full ASCII, it looked at the first byte to get the number of bytes in the character that sits at the start of the SIMD vector.
  4. It incremented the index by the number of bytes in the character (2, 3 or 4) and went back to step 1.

Since the index can increment by 2, 3 or 4 before loading the next chunk of bytes, the following problems were present:

  1. We do a lot of iterations, one per character if the text is non-ASCII, and since each iteration involves a LOAD, that's expensive.
  2. Doing a for loop on each byte to check for ASCII requires a lot of instructions to be executed per byte and has poor branch predictability (if not ASCII, continue the loop; it's a hard-to-predict branch).
  3. Since the data isn't aligned anymore (idx is not a multiple of the SIMD size), loading the chunk into a SIMD vector is quite slow.
  4. Overall, many if-statements are present in the loop.
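
For reference, here is a rough Mojo sketch of that per-character loop shape. It is not the actual previous stdlib code: the per-character checks are simplified away, the helper is made up for illustration, and the pointer load syntax assumes the nightly UnsafePointer.load[width=...] API.

from memory import UnsafePointer


fn _utf8_char_length(leading_byte: UInt8) -> Int:
    # Number of bytes in the UTF-8 character that starts with this leading
    # byte (assuming the leading byte is valid).
    if leading_byte < 0x80:
        return 1
    if leading_byte < 0xE0:
        return 2
    if leading_byte < 0xF0:
        return 3
    return 4


fn validate_per_character(ptr: UnsafePointer[UInt8], length: Int) -> Bool:
    alias simd_size = 16
    var i = 0
    while i + simd_size <= length:
        # One (usually unaligned) LOAD per iteration.
        var chunk = ptr.load[width=simd_size](i)
        if (chunk < 0x80).reduce_and():
            # Fast path: the whole chunk is ASCII, skip it entirely.
            i += simd_size
            continue
        # Otherwise validate a single character (checks omitted in this
        # sketch) and advance by 1, 2, 3 or 4 bytes, so the next LOAD starts
        # at an unpredictable, unaligned offset.
        i += _utf8_char_length(ptr[i])
    return True  # per-character checks and tail handling omitted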

The new implementation

The new algorithm improves on this by reading the data chunk by chunk, keeping information about the previous chunk in SIMD vectors. You can look at the _is_valid_utf8 function, which has the main loop. Since it's chunk by chunk, the index jump is always the SIMD size.

It has the following properties, which usually make the hardware happy:

  1. A single LOAD per iteration.
  2. Each byte is LOADed only once.
  3. Once a chunk is loaded into SIMD, no branching is done.
  4. Once a chunk is loaded into SIMD, no further loads from memory are done; we only work with SIMD vectors in registers.
  5. Many SIMD operations don't depend on each other, which gives the CPU flexibility for out-of-order execution.
  6. The jump size is constant (the SIMD size), which makes it very easy to know which data will be accessed next, and the data to load into SIMD is always aligned correctly.

Basically, you load 32 bytes of data into SIMD, do a bunch of computation in registers without branching (~60 assembly instructions, look at the _check_utf8_bytes function; those instructions don't all depend on each other), and only take the next 32 bytes when you're done with the current chunk. You keep a few vectors here and there to be able to validate characters which span across two SIMD vectors. This is the type of computation CPUs are optimized for.

Since we only do SIMD operations, the number of assembly instructions is the same whether the SIMD vector is of size 8 or size 64, meaning that if the CPU has bigger SIMD sizes available (AVX512), a considerable speedup can be achieved.
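
To make the shape of the main loop concrete, here is a rough Mojo sketch. It is only a sketch under assumptions: the import path and the (current, previous) -> error-vector signature of _check_utf8_bytes are assumed (check the PR diff for the real ones), and the pointer load uses the nightly UnsafePointer.load[width=...] API.

from memory import UnsafePointer
from sys.info import simdwidthof
from utils.string_slice import _check_utf8_bytes  # assumed import path and signature


fn validate_chunk_by_chunk(ptr: UnsafePointer[UInt8], length: Int) -> Bool:
    alias simd_size = simdwidthof[DType.uint8]()
    var previous = SIMD[DType.uint8, simd_size](0)
    var has_error = SIMD[DType.uint8, simd_size](0)
    var i = 0
    # Main loop: one LOAD per iteration and a constant jump of simd_size bytes.
    while i + simd_size <= length:
        var current = ptr.load[width=simd_size](i)
        # All the validation work happens in registers, without branching.
        has_error |= _check_utf8_bytes(current, previous)
        previous = current
        i += simd_size
    # The tail (fewer than simd_size bytes) and the "last character must be
    # complete" check are handled outside of this sketch.
    return int(has_error.reduce_or()) == 0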

Benchmark code

We don't make use of the benchmark module, because it is not available when an external contributor recompiles the stdlib.

import sys
from testing import assert_true, assert_false
from utils.string_slice import _is_valid_utf8
import time


@no_inline
fn keep(x: Bool) -> Int:
    return not x

fn get_big_string() -> String:
    var string = str(
        "안녕하세요,세상 hello mojo! 🔥🔥hopefully this string is complicated enough :p"
        " éç__çè"
    )
    # The string is 100 bytes long.
    return string * 100_000  # 10MB


def main():
    print("Has neon:", sys.has_neon())
    print("Has sse4:", sys.has_sse4())
    print("SIMD size:", sys.simdbytewidth())
    
    var big_string = get_big_string()

    @parameter
    fn utf8_simd_validation_benchmark() raises:
        # we want to validate ~1gb of data
        for _ in range(100):
            var result = _is_valid_utf8(big_string.unsafe_ptr(), len(big_string))
            assert_true(result)

    # warmup
    for _ in range(3):
        utf8_simd_validation_benchmark()

    iterations = 10
    t1 = time.now()
    for _ in range(iterations):
        utf8_simd_validation_benchmark()
    t2 = time.now()
    _ = big_string

    average_ns = (t2 - t1) / iterations
    average_s = average_ns / 1_000_000_000
    
    print("Validate 1GB of UTF-8 data in", average_s, "s")
    print(1.0 / average_s, "GB/s")

Put it in a file called bench.mojo.
Bench the nightly version with

MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true mojo build bench.mojo && ./bench

Bench this branch by doing a checkout on it and then:

MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true MODULAR_MOJO_NIGHTLY_IMPORT_PATH=./build mojo build bench.mojo &&  ./bench

Benchmark results:

  • On AMD Ryzen 9 7945HX with Radeon Graphics we get a x10.8 speedup.
  • On Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, Windows 11) I get a x7.3 speedup.

I don't have the numbers for less capable CPUs, notably the ones that don't have an instruction to do the dynamic_shuffle() in one single instruction.

If you have other CPUs, please run the benchmark and report the numbers here. Thanks!

Future work in this area

To further improve the algorithm, a few paths can be taken:

  1. The reference implementation has a "fast path" for ASCII chunks where many instructions are skipped if the chunk is ASCII. This can improve the speed for some common cases, namely when there are a lot of ASCII chars in a string. This fast path has not yet been implemented.
  2. The original author states in the repo (and we could follow the changes made there):

NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library (https://github.com/simdutf/). It is much more powerful, faster and better tested.

  3. This algorithm could inform the user which precise byte is causing an issue and why. This has not been implemented yet. Note that it will add branching to the loop and will require benchmarking to make sure there is no performance penalty.
  4. I looked at the assembly and some instructions that were present in the original implementation were not used here, meaning there is some room for using specialized instructions that would save a few cycles. A good example is the _mm_alignr_epi8 function, which should in theory be one single instruction.

I am not currently working on those improvements. Someone else can investigate.

@gabrieldemarmiesse gabrieldemarmiesse changed the title [stdlib] Make utf8 validation ~6x faster on neon and sse4 [stdlib] Make utf8 validation ~6-10x faster on neon and sse4 Aug 21, 2024
@lemire

lemire commented Aug 21, 2024

Feel free to get in touch with me.

The algorithm is used in Node.js, Bun. It is in the PHP interpreter. And many other important systems.

Note that .NET is currently considering adopting this approach: dotnet/runtime#104199 based on our C# implementation https://github.com/simdutf/SimdUnicode

@gabrieldemarmiesse
Contributor Author

Thanks a lot @lemire ! I'm a big fan of your work! I need to polish this pull request a bit, then if I have any questions I'll contact you by PM.

Based on the number of times this function will be called in Mojo, I see the need to push for an implementation that is as fast as possible.

@lemire

lemire commented Aug 21, 2024

@gabrieldemarmiesse ❤️

@JoeLoser JoeLoser self-assigned this Aug 22, 2024
@gabrieldemarmiesse
Contributor Author

@JoeLoser nbody.mojo has been failing recently; this seems unrelated to the changes here, as I got this error in another PR.

@JoeLoser
Collaborator

@JoeLoser nbody.mojo has been failing recently; this seems unrelated to the changes here, as I got this error in another PR.

Yeah, someone just filed an issue internally about this as it failed in today's nightly release. Taking a look now.

@lemire

lemire commented Aug 22, 2024

@gabrieldemarmiesse

There is a fast UTF-8 validation algorithm that is used in multiple systems and has been ported to several programming languages. As far as I know, it is the fastest algorithm... plus or minus some instruction-level optimizations.

The key insight is that you have three table lookups followed by two bitwise ANDs. These three lookups do almost all the work. It is very difficult to beat and it works well on various instruction sets.

You will find an independent implementation in the PHP interpreter (in C): https://github.com/php/php-src/blob/9147687b6d5c4491e1e19cb0d80ffabc479593ef/ext/mbstring/mbstring.c#L5266

The reference implementation is found in simdutf: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h (this is an ISA-agnostic implementation).

This is maybe what is implemented here, but I do not recognize it.

The fast lookup algorithm should look as follows (ARM NEON version in C#):

Vector128<byte> prev1 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 1));
Vector128<byte> byte_1_high = AdvSimd.Arm64.VectorTableLookup(shuf1, AdvSimd.ShiftRightLogical(prev1.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> byte_1_low = AdvSimd.Arm64.VectorTableLookup(shuf2, (prev1 & v0f));
Vector128<byte> byte_2_high = AdvSimd.Arm64.VectorTableLookup(shuf3, AdvSimd.ShiftRightLogical(currentBlock.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> sc = AdvSimd.And(AdvSimd.And(byte_1_high, byte_1_low), byte_2_high);

Vector128<byte> prev2 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 2));
Vector128<byte> prev3 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 3));
prevInputBlock = currentBlock;
Vector128<byte> isThirdByte = AdvSimd.SubtractSaturate(prev2, thirdByte);
Vector128<byte> isFourthByte = AdvSimd.SubtractSaturate(prev3, fourthByte);
Vector128<byte> must23 = AdvSimd.Or(isThirdByte, isFourthByte);
Vector128<byte> must23As80 = AdvSimd.And(must23, v80);
Vector128<byte> error = AdvSimd.Xor(must23As80, sc);
if (error != Vector128<byte>.Zero)
{
    // report error
}

The key ingredients are these constants:

                    Vector128<byte> shuf1 = Vector128.Create(TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
                            TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
                            TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS,
                            TOO_SHORT | OVERLONG_2,
                            TOO_SHORT,
                            TOO_SHORT | OVERLONG_3 | SURROGATE,
                            TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4);

                    Vector128<byte> shuf2 = Vector128.Create(CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
                            CARRY | OVERLONG_2,
                            CARRY,
                            CARRY,
                            CARRY | TOO_LARGE,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000);
                    Vector128<byte> shuf3 = Vector128.Create(TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
                            TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
                            TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT);

                    

based on the following constants:

        const byte TOO_SHORT = 1 << 0;
        const byte TOO_LONG = 1 << 1;
        const byte OVERLONG_3 = 1 << 2;
        const byte SURROGATE = 1 << 4;
        const byte OVERLONG_2 = 1 << 5;
        const byte TWO_CONTS = 1 << 7;
        const byte TOO_LARGE = 1 << 3;
        const byte TOO_LARGE_1000 = 1 << 6;
        const byte OVERLONG_4 = 1 << 6;
        const byte CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS;


@gabrieldemarmiesse
Contributor Author

@lemire thank you for the insights.

First I looked at the code you provided in C# and looked for similarities with the implementation I took from lemire/fastvalidate-utf-8. While the main ingredients were present (three table lookups followed by two bitwise ANDs), there was a lot more going on and the values in the tables were not the same.

Rather than trying to see if the two implementations were equivalent, which is non-trivial since operations don't have a defined order, I decided to implement a new version in Mojo using the C# implementation you provided, with some help from the source file.

The results are, well, here, and clear. On my system, Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, windows 11):

  • utf-8 validation currently in the stdlib: 1.16 GB/s
  • Mojo implementation taken from lemire/fastvalidate-utf-8: 8.03 GB/s (x7.0)
  • Mojo implementation taken from the C# file: 13.16 GB/s (x11.3)

So many thanks for providing this improved version! You can look at the diff, it's really close to the C# code. We don't have the fast path for ASCII yet but that can come in another pull request.
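
For readers following along, here is a tiny scalar Mojo sketch of the data flow of those three lookups for a single byte pair. It is only an illustration (the classify helper is hypothetical): the real code applies the same idea to every lane of a SIMD vector at once with a table-lookup/shuffle instruction, and the table contents are the shuf1/shuf2/shuf3 constants above.

fn classify(prev: UInt8, curr: UInt8, shuf1: List[UInt8], shuf2: List[UInt8], shuf3: List[UInt8]) -> UInt8:
    # Each table has 16 entries and is indexed by a nibble.
    var byte_1_high = shuf1[int(prev >> 4)]    # high nibble of the previous byte
    var byte_1_low = shuf2[int(prev & 0x0F)]   # low nibble of the previous byte
    var byte_2_high = shuf3[int(curr >> 4)]    # high nibble of the current byte
    # A non-zero bit in the result flags one of the error classes.
    return byte_1_high & byte_1_low & byte_2_high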

@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review August 23, 2024 11:18
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner August 23, 2024 11:18
@gryznar
Contributor

gryznar commented Aug 23, 2024

@gabrieldemarmiesse you may also want to update the PR description and title according to the latest changes. The speedup is quite impressive, so it is worth mentioning IMHO. Btw, great work!

@gabrieldemarmiesse gabrieldemarmiesse changed the title [stdlib] Make utf8 validation ~6-10x faster on neon and sse4 [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 Aug 25, 2024
@martinvuyk
Contributor

Awesome work @gabrieldemarmiesse thank you for tackling this !

As I mentioned in Discord, you might wanna look at the implementations in https://github.com/cyb70289/utf8 (the repo referenced in the `# TODO: implement a faster algorithm` comment) for inspiration (there are 2 algos). They have a benchmark results table that shows it's faster than lemire/fastvalidate-utf-8 (though it might be slower than the implementations lemire mentioned in the comments on this PR). They explain their algo in the readme. I think it might be useful to also benchmark it on your machine, though I am not sure their benchmark is very good (the text file is very unrealistic IMO).

There might also be a problem IMO with the benchmarks I've seen: most have many non-ASCII characters and/or are very fabricated (strings in real life aren't randomly distributed). It might be worth having separate benchmarks for "typical" English, Spanish, Mandarin, and Hindi, which are realistically the most written on the internet, using a lorem ipsum generator.

@gabrieldemarmiesse
Contributor Author

gabrieldemarmiesse commented Sep 3, 2024

@martinvuyk Thanks for chiming in.

About trying newer algorithms: re-implementing a new algorithm in a new programming language, profiling it and optimizing it can be a lot of work (at least a day), so it doesn't really fit in the scope of this pull request. We can focus on this pull request and merge it if the maintainers agree that the improvements are clear. Afterwards, someone else from the community or I can investigate newer algorithms and open another PR with the benchmark results. Let's try to tackle one thing at a time.

About benchmarking quality, I totally agree with you. We would benefit from multiple benchmarks on multiple corpora. For example, I didn't add the "ASCII fast path" (skip iterations when we recognize consecutive ASCII blocks) to this PR; it should be pretty trivial to add and can bring huge improvements on benchmarks with lots of ASCII in the corpus. When I have more time, I'll try to do a more complete benchmark with more corpora.

@lemire

lemire commented Sep 3, 2024

@martinvuyk

Regarding benchmarks, I offer the following repository which contains a wide range of different files, including lipsum files, but also files containing a lot of ASCII mixed with richer Unicode characters...

See https://github.com/lemire/unicode_lipsum

I also recommend the twitter JSON file:

https://github.com/simdutf/SimdUnicode/blob/main/benchmark/data/twitter.json

It is an interesting mix of ASCII (JSON) and international characters.

As I mentioned in Discord, you might wanna look at # TODO: implement a faster algorithm like https://github.com/cyb70289/utf8 implementations for inspiration (there are 2 algos). They have a benchmark results table that shows it's faster than lemire/fastvalidate-utf-8 (might be slower than the implementations lemire mentioned in the comments on this PR).

Note that the reference you offer predates the reference of the lookup algorithm:

Validating UTF-8 In Less Than One Instruction Per Byte
Software: Practice and Experience 51 (5), 2021
https://arxiv.org/abs/2010.03090

The lookup algorithm 'evolved' over the years; there have been four distinct lookup algorithms, and these algorithms followed various other approaches.

I believe that the latest one implemented here (which is lookup... or lookup4 as we used to call it) is likely the fastest available SIMD validation algorithm.

To be fair, the implementation matters and tuning can help.

@martinvuyk
Contributor

Regarding benchmarks, I offer the following repository which contains a wide range of different files, including lipsum files, but also files containing a lot of ASCII mixed with richer Unicode characters...

yeah those look good to me 👍. I just read the code for the random string generation; I hadn't seen the twitter.json use.

The lookup algorithm 'evolved' UTF-8 over years and, there has been four distinct lookup algorithms, and these algorithms followed various other approaches.

Oh, I thought they were only "revisions" of the same algo for different ISAs using fewer instructions/mem loads.

About trying newer algorithms, re-implementing a new algorithm in a new programming language, profiling it and optimizing it can be a lot of work (at least a day), thus it's not really fitting to do this in the scope of this pull request. We can focus on this pull request, merge it if the maintainers agree that the improvements are clear. Afterwards, me or someone else from the community can investigate newer algorithms and open another PR with the benchmark results. Let's try to tackle one thing at a time.

Totally understandable. Now that I've actually read Lemire's paper, it might be a better algo than what I linked, since it actually registers the error type (awesome bit recycling for the continuation byte, btw) and it does seem to require only 3 table lookups, whereas the range algo of the repo I linked builds an index table and then adjusts some indexes based on the first and second byte (2 to 5 extra instructions, branchless), doing around 6 lookups.

When I have more time, I'll try to do a more complete benchmark with more corpus.

Honestly, it might be a bit overkill; this is awesome work as it is, and what I coded was just a "good enough" approach until someone took up the torch and built something better. It might be better to leave it as a pending addition to the stdlib benchmarks for when someone wants to try out a new algorithm. I just mentioned what I thought of the benchmarks as something that we could improve in the future.

@lemire

lemire commented Sep 3, 2024

@martinvuyk

I just mentioned what I thought of the benchmarks as something that we could improve in the future.

Building good benchmarks is like building good tests. It is usually a net positive long term.

@JoeLoser
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Sep 18, 2024
@modularbot
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Sep 19, 2024
modularbot pushed a commit that referenced this pull request Sep 21, 2024
[External] [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 (#47462)

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3401
MODULAR_ORIG_COMMIT_REV_ID: 3dabaa99b60da779630c7be36f01f8d4468eeab9
@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Sep 21, 2024
@modularbot
Collaborator

Landed in 8e41b7b! Thank you for your contribution 🎉

@modularbot modularbot closed this Sep 21, 2024
@martinvuyk
Contributor

@gabrieldemarmiesse I realized yesterday that you aren't in the changelog for the latest stable Mojo, and I was sad, especially knowing all you did for UInt and a lot of other behind-the-scenes work. I think it was an honest mistake, but it happened because you aren't adding contributions as important as this one to the changelog with your PR and GitHub user!

Honestly, I'm +1000 on graduating is_valid_utf8() to utils/string_slice.mojo and exposing it as a public API; there are many use cases for validating UTF-8/user input, and very few people will take the time and effort to do an implementation like this one 🔥. Maybe open a PR doing that ;) Do give credit to yourself sometimes!

@gabrieldemarmiesse
Contributor Author

I guess I was too lazy to add entries to the changelog haha. Thanks for the reminder :)

modularbot pushed a commit that referenced this pull request Oct 30, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299
Ahajha pushed a commit to Ahajha/mojo that referenced this pull request Oct 31, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes modular#3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299
modularbot pushed a commit that referenced this pull request Dec 17, 2024
[External] [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 (#47462)

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3401
MODULAR_ORIG_COMMIT_REV_ID: 3dabaa99b60da779630c7be36f01f8d4468eeab9
modularbot pushed a commit that referenced this pull request Dec 17, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299