
[stdlib] Make utf8 validation ~10-13x faster on neon and sse4 #3401

Conversation

gabrieldemarmiesse
Contributor

@gabrieldemarmiesse gabrieldemarmiesse commented Aug 20, 2024

Dependencies

The following PR should be merged first:

Description of the changes

In the future _is_valid_utf8() will be used massively. As such, we need every performance improvement possible, as long as the complexity cost is reasonable.

This PR changes the implementation of the function _is_valid_utf8() without changing the signature. It's a drop-in replacement.
This implementation is described in the paper Validating UTF-8 in less than one instruction per byte by John Keiser and Daniel Lemire, which is pretty close to the state of the art on the subject.

A reference C++ implementation can be found in the repository lemire/fastvalidate-utf-8, precisely in the file include/simdutf8check.h. Notice how Mojo makes this more generic and readable, as well as portable.

Note that the only improvement that I'm aware of to this algorithm is the is_utf8 library of simdutf, which is based on the same algorithm as this PR. It is significantly harder to follow, as it's a production-grade library full of macros and other things that I have a harder time reading than fastvalidate-utf-8.

Two good blog posts have been written on the subject:

  • Validating UTF-8 strings using as little as 0.7 cycles per byte
  • Validating UTF-8 bytes using only 0.45 cycles per byte, AVX edition

Types of utf-8 errors

While I'm not sure that the current implementation can detect all classes of errors, this algorithm checks the following rules (see the small example after this list):
a) 5+ Byte. The leading byte must have fewer than 5 header bits.
b) Too Short. The leading byte must be followed by N-1 continuation bytes, where N is the UTF-8 character length.
c) Too Long. The leading byte must not be a continuation byte.
d) Overlong. The decoded character must be above U+7F for two-byte characters, U+7FF for three-byte characters, and U+FFFF for four-byte characters.
e) Too Large. The decoded character must be less than or equal to U+10FFFF.
f) Surrogate. The decoded character must not be in U+D800...U+DFFF.
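
To make these error classes concrete, here is a small check (a sketch only, assuming the same pointer-and-length signature of _is_valid_utf8 that the benchmark below uses) that feeds a few hand-crafted byte sequences to the validator:

from testing import assert_true, assert_false
from utils.string_slice import _is_valid_utf8


def main():
    # Valid: "a" followed by "é" (0xC3 0xA9).
    var ok = List[UInt8](0x61, 0xC3, 0xA9)
    assert_true(_is_valid_utf8(ok.unsafe_ptr(), len(ok)))

    # Overlong: U+002F ("/") encoded on two bytes (0xC0 0xAF) instead of one.
    var overlong = List[UInt8](0xC0, 0xAF)
    assert_false(_is_valid_utf8(overlong.unsafe_ptr(), len(overlong)))

    # Too short: a three-byte leading byte followed by only one continuation byte.
    var too_short = List[UInt8](0xE2, 0x82)
    assert_false(_is_valid_utf8(too_short.unsafe_ptr(), len(too_short)))

    # Surrogate: U+D800 (0xED 0xA0 0x80) is never valid UTF-8.
    var surrogate = List[UInt8](0xED, 0xA0, 0x80)
    assert_false(_is_valid_utf8(surrogate.unsafe_ptr(), len(surrogate)))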

Why is this implementation so much faster than the current one?

The current implementation

The current implementation used SIMD, but not in an optimal way (see the sketch after the lists below):

  1. It loaded N bytes into a SIMD vector.
  2. It checked with SIMD instructions whether the chunk was pure ASCII (a fast path) and skipped the chunk if that was the case. If only some bytes were ASCII, it did a plain for loop on each byte to skip them.
  3. If the chunk was not full ASCII, it looked at the first byte to get the number of bytes in the character that sits at the start of the SIMD vector.
  4. It incremented the index by the number of bytes in the character (2, 3 or 4) and went back to step 1.

Since the index can increment by 2, 3 or 4 before loading the next chunk of bytes, the following problems were present:

  1. We do a lot of iterations, one per character if the text is non-ASCII, and since each iteration involves a LOAD, that's expensive.
  2. Doing a for loop on each byte to check for ASCII requires a lot of instructions to be executed per byte and has poor branch predictability (if not ASCII, continue the loop; it's a hard-to-predict branch).
  3. Since the data isn't aligned anymore (idx is not a multiple of the SIMD size), loading the chunk into a SIMD vector is quite slow.
  4. Overall, many if-statements are present in the loop.
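
For reference, here is a rough Mojo sketch of that per-character loop shape. It is not the actual previous stdlib code: the per-character checks are simplified away, the helper is made up for illustration, and the pointer load syntax assumes the nightly UnsafePointer.load[width=...] API.

from memory import UnsafePointer


fn _utf8_char_length(leading_byte: UInt8) -> Int:
    # Number of bytes in the UTF-8 character that starts with this leading
    # byte (assuming the leading byte is valid).
    if leading_byte < 0x80:
        return 1
    if leading_byte < 0xE0:
        return 2
    if leading_byte < 0xF0:
        return 3
    return 4


fn validate_per_character(ptr: UnsafePointer[UInt8], length: Int) -> Bool:
    alias simd_size = 16
    var i = 0
    while i + simd_size <= length:
        # One (usually unaligned) LOAD per iteration.
        var chunk = ptr.load[width=simd_size](i)
        if (chunk < 0x80).reduce_and():
            # Fast path: the whole chunk is ASCII, skip it entirely.
            i += simd_size
            continue
        # Otherwise validate a single character (checks omitted in this
        # sketch) and advance by 1, 2, 3 or 4 bytes, so the next LOAD starts
        # at an unpredictable, unaligned offset.
        i += _utf8_char_length(ptr[i])
    return True  # per-character checks and tail handling omitted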

The new implementation

The new algorithm improves on this by reading the data chunk by chunk, keeping information about the previous chunk in SIMD vectors. You can look at the _is_valid_utf8 function, which has the main loop. Since it's chunk by chunk, the index jump is always the SIMD size.

It has the following properties, which usually make the hardware happy:

  1. A single LOAD per iteration.
  2. Each byte is LOADed only once.
  3. Once a chunk is loaded into SIMD, no branching is done.
  4. Once a chunk is loaded into SIMD, no further loads from memory are done; we only work with SIMD vectors in registers.
  5. Many SIMD operations don't depend on each other, which gives the CPU flexibility for out-of-order execution.
  6. The jump size is constant (the SIMD size), which makes it very easy to know which data will be accessed next, and the data to load into SIMD is always aligned correctly.

Basically, you load 32 bytes of data into SIMD, do a bunch of computation in registers without branching (~60 assembly instructions, look at the _check_utf8_bytes function; those instructions don't all depend on each other), and only take the next 32 bytes when you're done with the current chunk. You keep a few vectors here and there to be able to validate characters which span across two SIMD vectors. This is the type of computation CPUs are optimized for.

Since we only do SIMD operations, the number of assembly instructions is the same whether the SIMD vector is of size 8 or size 64, meaning that if the CPU has bigger SIMD sizes available (AVX512), a considerable speedup can be achieved.
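
To make the shape of the main loop concrete, here is a rough Mojo sketch. It is only a sketch under assumptions: the import path and the (current, previous) -> error-vector signature of _check_utf8_bytes are assumed (check the PR diff for the real ones), and the pointer load uses the nightly UnsafePointer.load[width=...] API.

from memory import UnsafePointer
from sys.info import simdwidthof
from utils.string_slice import _check_utf8_bytes  # assumed import path and signature


fn validate_chunk_by_chunk(ptr: UnsafePointer[UInt8], length: Int) -> Bool:
    alias simd_size = simdwidthof[DType.uint8]()
    var previous = SIMD[DType.uint8, simd_size](0)
    var has_error = SIMD[DType.uint8, simd_size](0)
    var i = 0
    # Main loop: one LOAD per iteration and a constant jump of simd_size bytes.
    while i + simd_size <= length:
        var current = ptr.load[width=simd_size](i)
        # All the validation work happens in registers, without branching.
        has_error |= _check_utf8_bytes(current, previous)
        previous = current
        i += simd_size
    # The tail (fewer than simd_size bytes) and the "last character must be
    # complete" check are handled outside of this sketch.
    return int(has_error.reduce_or()) == 0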

Benchmark code

We don't make use of the benchmark module, because it is not available when an external contributor recompiles the stdlib.

import sys
from testing import assert_true, assert_false
from utils.string_slice import _is_valid_utf8
import time


@no_inline
fn keep(x: Bool) -> Int:
    return not x

fn get_big_string() -> String:
    var string = str(
        "안녕하세요,세상 hello mojo! 🔥🔥hopefully this string is complicated enough :p"
        " éç__çè"
    )
    # The string is 100 bytes long.
    return string * 100_000  # 10MB


def main():
    print("Has neon:", sys.has_neon())
    print("Has sse4:", sys.has_sse4())
    print("SIMD size:", sys.simdbytewidth())
    
    var big_string = get_big_string()

    @parameter
    fn utf8_simd_validation_benchmark() raises:
        # we want to validate ~1gb of data
        for _ in range(100):
            var result = _is_valid_utf8(big_string.unsafe_ptr(), len(big_string))
            assert_true(result)

    # warmup
    for _ in range(3):
        utf8_simd_validation_benchmark()

    iterations = 10
    t1 = time.now()
    for _ in range(iterations):
        utf8_simd_validation_benchmark()
    t2 = time.now()
    _ = big_string

    average_ns = (t2 - t1) / iterations
    average_s = average_ns / 1_000_000_000
    
    print("Validate 1GB of UTF-8 data in", average_s, "s")
    print(1.0 / average_s, "GB/s")

Put it in a file called bench.mojo.
Bench the nightly version with

MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true mojo build bench.mojo && ./bench

Bench this branch by doing a checkout on it and then:

MOJO_OVERRIDE_COMPILER_VERSION_CHECK=true MODULAR_MOJO_NIGHTLY_IMPORT_PATH=./build mojo build bench.mojo &&  ./bench

Benchmark results:

  • On AMD Ryzen 9 7945HX with Radeon Graphics we get a x10.8 speedup.
  • On Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, Windows 11) I get a x7.3 speedup.

I don't have the numbers for less capable CPUs, notably the ones that don't have an instruction to do the dynamic_shuffle() in one single instruction.

If you have other CPUs, please run the benchmark and report the numbers here. Thanks!

Future work in this area

To further improve the algorithm, a few paths can be taken:

  1. The reference implementation has a "fast path" for ASCII chunks where many instructions are skipped if the chunk is ASCII. This can improve the speed for some common cases, namely when there are a lot of ASCII chars in a string. This fast path has not yet been implemented.
  2. The original author states in the repo (and we could follow the changes made there):

NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library (https://github.com/simdutf/). It is much more powerful, faster and better tested.

  3. This algorithm could inform the user which precise byte is causing an issue and why. This has not been implemented yet. Note that it will add branching to the loop and will require benchmarking to make sure there is no performance penalty.
  4. I looked at the assembly and some instructions that were present in the original implementation were not used here, meaning there is some room for using specialized instructions that would save a few cycles. A good example is the _mm_alignr_epi8 function, which should in theory be one single instruction.

I am not currently working on those improvements. Someone else can investigate.

@gabrieldemarmiesse gabrieldemarmiesse changed the title [stdlib] Make utf8 validation ~6x faster on neon and sse4 [stdlib] Make utf8 validation ~6-10x faster on neon and sse4 Aug 21, 2024
@lemire

lemire commented Aug 21, 2024

Feel free to get in touch with me.

The algorithm is used in Node.js, Bun. It is in the PHP interpreter. And many other important systems.

Note that .NET is currently considering adopting this approach: dotnet/runtime#104199 based on our C# implementation https://github.com/simdutf/SimdUnicode

@gabrieldemarmiesse
Contributor Author

Thanks a lot @lemire ! I'm a big fan of your work! I need to polish this pull request a bit, then if I have any questions I'll contact you by PM.

Based on the number of times this function will be called in Mojo, I see the need to push for an implementation that is as fast as possible.

@lemire

lemire commented Aug 21, 2024

@gabrieldemarmiesse ❤️

@JoeLoser JoeLoser self-assigned this Aug 22, 2024
@gabrieldemarmiesse
Contributor Author

@JoeLoser nbody.mojo has been failing recently; this seems unrelated to the changes here, as I got this error in another PR.

@JoeLoser
Collaborator

@JoeLoser nbody.mojo has been failing recently; this seems unrelated to the changes here, as I got this error in another PR.

Yeah, someone just filed an issue internally about this as it failed in today's nightly release. Taking a look now.

@lemire

lemire commented Aug 22, 2024

@gabrieldemarmiesse

There is a fast UTF-8 validation algorithm that is used in multiple systems and has been ported to several programming languages. As far as I know, it is the fastest algorithm... plus or minus some instruction-level optimizations.

The key insight is that you have three table lookups followed by two bitwise ANDs. These three lookups do almost all the work. It is very difficult to beat and it works well on various instruction sets.

You will find an independent implementation in the PHP interpreter (in C): https://github.com/php/php-src/blob/9147687b6d5c4491e1e19cb0d80ffabc479593ef/ext/mbstring/mbstring.c#L5266

The reference implementation is found in simdutf: https://github.com/simdutf/simdutf/blob/master/src/generic/utf8_validation/utf8_lookup4_algorithm.h (this is an ISA-agnostic implementation).

This is maybe what is implemented here, but I do not recognize it.

The fast lookup algorithm should look as follows (ARM NEON version in C#):

Vector128<byte> prev1 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 1));
Vector128<byte> byte_1_high = AdvSimd.Arm64.VectorTableLookup(shuf1, AdvSimd.ShiftRightLogical(prev1.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> byte_1_low = AdvSimd.Arm64.VectorTableLookup(shuf2, (prev1 & v0f));
Vector128<byte> byte_2_high = AdvSimd.Arm64.VectorTableLookup(shuf3, AdvSimd.ShiftRightLogical(currentBlock.AsUInt16(), 4).AsByte() & v0f);
Vector128<byte> sc = AdvSimd.And(AdvSimd.And(byte_1_high, byte_1_low), byte_2_high);

Vector128<byte> prev2 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 2));
Vector128<byte> prev3 = AdvSimd.ExtractVector128(prevInputBlock, currentBlock, (byte)(16 - 3));
prevInputBlock = currentBlock;
Vector128<byte> isThirdByte = AdvSimd.SubtractSaturate(prev2, thirdByte);
Vector128<byte> isFourthByte = AdvSimd.SubtractSaturate(prev3, fourthByte);
Vector128<byte> must23 = AdvSimd.Or(isThirdByte, isFourthByte);
Vector128<byte> must23As80 = AdvSimd.And(must23, v80);
Vector128<byte> error = AdvSimd.Xor(must23As80, sc);
if (error != Vector128<byte>.Zero)
{
    // report error
}

The key ingredients are these constants:

                    Vector128<byte> shuf1 = Vector128.Create(TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
                            TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
                            TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS,
                            TOO_SHORT | OVERLONG_2,
                            TOO_SHORT,
                            TOO_SHORT | OVERLONG_3 | SURROGATE,
                            TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4);

                    Vector128<byte> shuf2 = Vector128.Create(CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
                            CARRY | OVERLONG_2,
                            CARRY,
                            CARRY,
                            CARRY | TOO_LARGE,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
                            CARRY | TOO_LARGE | TOO_LARGE_1000,
                            CARRY | TOO_LARGE | TOO_LARGE_1000);
                    Vector128<byte> shuf3 = Vector128.Create(TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
                            TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
                            TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
                            TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT);

                    

based on the following constants:

        const byte TOO_SHORT = 1 << 0;
        const byte TOO_LONG = 1 << 1;
        const byte OVERLONG_3 = 1 << 2;
        const byte SURROGATE = 1 << 4;
        const byte OVERLONG_2 = 1 << 5;
        const byte TWO_CONTS = 1 << 7;
        const byte TOO_LARGE = 1 << 3;
        const byte TOO_LARGE_1000 = 1 << 6;
        const byte OVERLONG_4 = 1 << 6;
        const byte CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS;


@gabrieldemarmiesse
Contributor Author

@lemire thank you for the insights.

First I looked at the code you provided in C# and looked for similarities with the implementation I took from lemire/fastvalidate-utf-8. While the main ingredients were present (three table lookups followed by two bitwise ANDs), there was a lot more going on and the values in the tables were not the same.

Rather than trying to see if the two implementations were equivalent, which is non-trivial since operations don't have a defined order, I decided to implement a new version in Mojo using the C# implementation you provided, with some help from the source file.

The results are, well, here, and clear. On my system, Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz (WSL2, windows 11):

  • utf-8 validation currently in the stdlib: 1.16 GB/s
  • Mojo implementation taken from lemire/fastvalidate-utf-8: 8.03 GB/s (x7.0)
  • Mojo implementation taken from the C# file: 13.16 GB/s (x11.3)

So many thanks for providing this improved version! You can look at the diff, it's really close to the C# code. We don't have the fast path for ASCII yet but that can come in another pull request.
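
For readers following along, here is a tiny scalar Mojo sketch of the data flow of those three lookups for a single byte pair. It is only an illustration (the classify helper is hypothetical): the real code applies the same idea to every lane of a SIMD vector at once with a table-lookup/shuffle instruction, and the table contents are the shuf1/shuf2/shuf3 constants above.

fn classify(prev: UInt8, curr: UInt8, shuf1: List[UInt8], shuf2: List[UInt8], shuf3: List[UInt8]) -> UInt8:
    # Each table has 16 entries and is indexed by a nibble.
    var byte_1_high = shuf1[int(prev >> 4)]    # high nibble of the previous byte
    var byte_1_low = shuf2[int(prev & 0x0F)]   # low nibble of the previous byte
    var byte_2_high = shuf3[int(curr >> 4)]    # high nibble of the current byte
    # A non-zero bit in the result flags one of the error classes.
    return byte_1_high & byte_1_low & byte_2_high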

@gabrieldemarmiesse gabrieldemarmiesse marked this pull request as ready for review August 23, 2024 11:18
@gabrieldemarmiesse gabrieldemarmiesse requested a review from a team as a code owner August 23, 2024 11:18
@gryznar
Contributor

gryznar commented Aug 23, 2024

@gabrieldemarmiesse you may also want to update the PR description and title according to the latest changes. The speedup is quite impressive, so it is worth mentioning IMHO. Btw, great work!

@gabrieldemarmiesse gabrieldemarmiesse changed the title [stdlib] Make utf8 validation ~6-10x faster on neon and sse4 [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 Aug 25, 2024
@martinvuyk
Contributor

Awesome work @gabrieldemarmiesse thank you for tackling this !

As I mentioned in Discord, you might wanna look at the implementations in https://github.com/cyb70289/utf8 (the repo referenced in the `# TODO: implement a faster algorithm` comment) for inspiration (there are 2 algos). They have a benchmark results table that shows it's faster than lemire/fastvalidate-utf-8 (though it might be slower than the implementations lemire mentioned in the comments on this PR). They explain their algo in the readme. I think it might be useful to also benchmark it on your machine, though I am not sure their benchmark is very good (the text file is very unrealistic IMO).

There might also be a problem IMO with the benchmarks I've seen: most have many non-ASCII characters and/or are very fabricated (strings in real life aren't randomly distributed). It might be worth having separate benchmarks for "typical" English, Spanish, Mandarin, and Hindi, which are realistically the most written on the internet, using a lorem ipsum generator.

@gabrieldemarmiesse
Contributor Author

gabrieldemarmiesse commented Sep 3, 2024

@martinvuyk Thanks for chiming in.

About trying newer algorithms: re-implementing a new algorithm in a new programming language, profiling it and optimizing it can be a lot of work (at least a day), so it doesn't really fit in the scope of this pull request. We can focus on this pull request and merge it if the maintainers agree that the improvements are clear. Afterwards, someone else from the community or I can investigate newer algorithms and open another PR with the benchmark results. Let's try to tackle one thing at a time.

About benchmarking quality, I totally agree with you. We would benefit from multiple benchmarks on multiple corpora. For example, I didn't add the "ASCII fast path" (skip iterations when we recognize consecutive ASCII blocks) to this PR; it should be pretty trivial to add and can bring huge improvements on benchmarks with lots of ASCII in the corpus. When I have more time, I'll try to do a more complete benchmark with more corpora.

@lemire

lemire commented Sep 3, 2024

@martinvuyk

Regarding benchmarks, I offer the following repository which contains a wide range of different files, including lipsum files, but also files containing a lot of ASCII mixed with richer Unicode characters...

See https://github.com/lemire/unicode_lipsum

I also recommend the twitter JSON file:

https://github.com/simdutf/SimdUnicode/blob/main/benchmark/data/twitter.json

It is an interesting mix of ASCII (JSON) and international characters.

As I mentioned in Discord, you might wanna look at # TODO: implement a faster algorithm like https://github.com/cyb70289/utf8 implementations for inspiration (there are 2 algos). They have a benchmark results table that shows it's faster than lemire/fastvalidate-utf-8 (might be slower than the implementations lemire mentioned in the comments on this PR).

Note that the reference you offer predates the reference of the lookup algorithm:

Validating UTF-8 In Less Than One Instruction Per Byte
Software: Practice and Experience 51 (5), 2021
https://arxiv.org/abs/2010.03090

The lookup algorithm 'evolved' over the years; there have been four distinct lookup algorithms, and these algorithms followed various other approaches.

I believe that the latest one implemented here (which is lookup... or lookup4 as we used to call it) is likely the fastest available SIMD validation algorithm.

To be fair, the implementation matters and tuning can help.

@martinvuyk
Contributor

Regarding benchmarks, I offer the following repository which contains a wide range of different files, including lipsum files, but also files containing a lot of ASCII mixed with richer Unicode characters...

yeah those look good to me 👍. I just read the code for the random string generation; I hadn't seen the twitter.json use.

The lookup algorithm 'evolved' UTF-8 over years and, there has been four distinct lookup algorithms, and these algorithms followed various other approaches.

Oh, I thought they were only "revisions" of the same algo for different ISAs using fewer instructions/mem loads.

About trying newer algorithms, re-implementing a new algorithm in a new programming language, profiling it and optimizing it can be a lot of work (at least a day), thus it's not really fitting to do this in the scope of this pull request. We can focus on this pull request, merge it if the maintainers agree that the improvements are clear. Afterwards, me or someone else from the community can investigate newer algorithms and open another PR with the benchmark results. Let's try to tackle one thing at a time.

Totally understandable. Now that I've actually read Lemire's paper, it might be a better algo than what I linked, since it actually registers the error type (awesome bit recycling for the continuation byte, btw) and it does seem to require only 3 table lookups, whereas the range algo of the repo I linked builds an index table and then adjusts some indexes based on the first and second byte (2 to 5 extra instructions, branchless), doing around 6 lookups.

When I have more time, I'll try to do a more complete benchmark with more corpus.

Honestly, it might be a bit overkill; this is awesome work as it is, and what I coded was just a "good enough" approach until someone took up the torch and built something better. It might be better to leave it as a pending addition to the stdlib benchmarks for when someone wants to try out a new algorithm. I just mentioned what I thought of the benchmarks as something that we could improve in the future.

@lemire

lemire commented Sep 3, 2024

@martinvuyk

I just mentioned what I thought of the benchmarks as something that we could improve in the future.

Building good benchmarks is like building good tests. It is usually a net positive long term.

@JoeLoser
Collaborator

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Sep 18, 2024
@modularbot
Collaborator

✅🟣 This contribution has been merged 🟣✅

Your pull request has been merged to the internal upstream Mojo sources. It will be reflected here in the Mojo repository on the nightly branch during the next Mojo nightly release, typically within the next 24-48 hours.

We use Copybara to merge external contributions.

@modularbot modularbot added the merged-internally Indicates that this pull request has been merged internally label Sep 19, 2024
modularbot pushed a commit that referenced this pull request Sep 21, 2024
[External] [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 (#47462)

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3401
MODULAR_ORIG_COMMIT_REV_ID: 3dabaa99b60da779630c7be36f01f8d4468eeab9
@modularbot modularbot added the merged-externally Merged externally in public mojo repo label Sep 21, 2024
@modularbot
Collaborator

Landed in 8e41b7b! Thank you for your contribution 🎉

@modularbot modularbot closed this Sep 21, 2024
@martinvuyk
Contributor

@gabrieldemarmiesse I realized yesterday that you aren't in the changelog for the latest stable Mojo, and I was sad, especially knowing all you did for UInt and a lot of other behind-the-scenes work. I think it was an honest mistake, but it happened because you aren't adding contributions as important as this one to the changelog with your PR and GitHub user!

Honestly, I'm +1000 on graduating is_valid_utf8() to utils/string_slice.mojo and exposing it as a public API; there are many use cases for validating UTF-8/user input, and very few people will take the time and effort to do an implementation like this one 🔥. Maybe open a PR doing that ;) Do give credit to yourself sometimes!

@gabrieldemarmiesse
Contributor Author

I guess I was too lazy to add entries to the changelog haha. Thanks for the reminder :)

modularbot pushed a commit that referenced this pull request Oct 30, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299
Ahajha pushed a commit to Ahajha/mojo that referenced this pull request Oct 31, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes modular#3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299
modularbot pushed a commit that referenced this pull request Dec 17, 2024
[External] [stdlib] Make utf8 validation ~10-13x faster on neon and sse4 (#47462)

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3401
MODULAR_ORIG_COMMIT_REV_ID: 3dabaa99b60da779630c7be36f01f8d4468eeab9
modularbot pushed a commit that referenced this pull request Dec 17, 2024
[External] [stdlib] Use SIMD to make `b64encode` 4.7x faster

Co-authored-by: Gabriel de Marmiesse <gabrieldemarmiesse@gmail.com>
Closes #3443
MODULAR_ORIG_COMMIT_REV_ID: 0cd01a091ba8cfdaac49dcf43280de22d9c8b299