ByteVec: immediate / remote immutable byte vectors + intrinsics. #8964
Conversation
I ended up going with @vtjnash's suggestion of representing this as an `Int128`, since that makes the code generation easier: you can cast the `Int128` to various vector types like `Uint8 x 16` or `Int x 2`, which are both handy for writing the `bytevec_len` and `bytevec_ref` intrinsics. The code generation is already pretty decent, e.g.:

```julia
julia> s = Str("Hello")
"Hello"

julia> @code_native sizeof(s)
    .section    __TEXT,__text,regular,pure_instructions
Filename: bytes.jl
Source line: 64
    push    RBP
    mov     RBP, RSP
Source line: 64
    mov     RAX, QWORD PTR [RDI + 8]
    vmovq   XMM0, QWORD PTR [RAX + 16]
    vmovq   XMM1, QWORD PTR [RAX + 8]
    vpunpcklqdq XMM0, XMM1, XMM0    ## xmm0 = xmm1[0],xmm0[0]
    vpextrq RAX, XMM0, 1
    neg     RAX
    vpextrb ECX, XMM0, 15
    test    CL, CL
    cmovns  RAX, RCX
    pop     RBP
    ret
```

One area to investigate for optimization is making sure that the checks involved in loading a single byte can be hoisted out of loops, *and* ideally that, instead of a loop with a branch inside of it, we get a branch with a loop inside of each path – but that's a trickier optimization than I think we're doing now. With that, code that iterates over strings could really fly.

Next steps are to implement `bytevec_eq` and `bytevec_cmp` intrinsics. I know these aren't strictly necessary, but I suspect that the built-in versions can be made very efficient.
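A minimal sketch of the layout being described, pieced together from the diff hunks quoted below and the generated code above. The field names follow the hunks where they are visible (`here`, `data`, `neglen`); the 15-byte inline capacity, the length byte, and the type names are inferences, not necessarily the PR's exact definitions:

```cpp
#include <cstddef>
#include <cstdint>

// Layout sketch: 16 bytes total, i.e. one Int128 on a 64-bit target.
typedef union {
    struct {                // "immediate": short data lives inline
        uint8_t data[15];
        int8_t  length;     // >= 0; its sign bit doubles as the discriminant
    } here;
    struct {                // "remote": data lives behind a pointer
        uint8_t *data;      // low word
        intptr_t neglen;    // negated length, so the top byte is negative
    } there;
} bytevec_t;

// bytevec_len: the sign of the last byte selects the representation --
// the same select/cmovns pattern visible in the @code_native output above.
static inline size_t bytevec_len(const bytevec_t *b)
{
    return b->here.length >= 0 ? (size_t)b->here.length
                               : (size_t)(-b->there.neglen);
}
```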
```c
    } here;
    struct {
        uint8_t *data;
        long neglen;
```
If you want a signed integer type that is big enough to hold the length of an arbitrary string (up to the factor of 2 lost by the sign bit), shouldn't this be `intptr_t` or `ptrdiff_t`? The `long` type always scares me because I'm never quite sure what it means for an arbitrary compiler.
Given the rampant assumptions we've made, I suspect that `size_t` would do, but yes, `long` may be a bit sketchy.
The first issue that was standing in the way of SIMD and vectorization was that the …
I noticed that the clever trick that I used to avoid a branch in the fast path for `bytevec_ref` when the bytes are immediate was causing a needless dependency on the index in the branch condition, even in the case with no bounds checking. This could, of course, prevent certain optimization transforms, so I got rid of it.
@JeffBezanson, @ArchRobison, @vtjnash, @Keno, @jakebolewski, do any of you have thoughts on coaxing LLVM into doing such a transformation? This is the current code, from `@code_llvm length(s)`:

```llvm
define i64 @"julia_length;41343"(%jl_value_t*) {
top:
%1 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0, !dbg !1017
%2 = load %jl_value_t** %1, align 8, !dbg !1017, !tbaa %jtbaa_immut
%3 = getelementptr %jl_value_t* %2, i64 1, !dbg !1017
%4 = bitcast %jl_value_t* %3 to i128*, !dbg !1017
%5 = load i128* %4, align 8, !dbg !1017, !tbaa %jtbaa_immut, !julia_type !1020
%6 = bitcast i128 %5 to <16 x i8>, !dbg !1017
%7 = extractelement <16 x i8> %6, i32 15, !dbg !1017
%8 = bitcast i128 %5 to <2 x i64>, !dbg !1017
%9 = extractelement <2 x i64> %8, i32 1, !dbg !1017
%10 = icmp slt i8 %7, 0, !dbg !1017
%11 = sub i64 0, %9, !dbg !1017
%12 = zext i8 %7 to i64, !dbg !1017
%13 = select i1 %10, i64 %11, i64 %12, !dbg !1017
%14 = icmp slt i64 %13, 1, !dbg !1017
br i1 %14, label %L3, label %L.preheader, !dbg !1017
L.preheader: ; preds = %top
%15 = load i128* %4, align 8, !dbg !1017, !tbaa %jtbaa_immut, !julia_type !1020
%16 = bitcast i128 %15 to <16 x i8>, !dbg !1017
%17 = extractelement <16 x i8> %16, i32 15, !dbg !1017
%18 = icmp sgt i8 %17, -1, !dbg !1017
%19 = bitcast i128 %15 to <2 x i64>, !dbg !1021
%20 = extractelement <2 x i64> %19, i32 1, !dbg !1021
%21 = icmp slt i8 %17, 0, !dbg !1021
%22 = sub i64 0, %20, !dbg !1021
%23 = zext i8 %17 to i64, !dbg !1021
%24 = select i1 %21, i64 %22, i64 %23, !dbg !1021
%25 = extractelement <2 x i64> %19, i32 0, !dbg !1017
br label %L, !dbg !1017
L: ; preds = %L.preheader, %cont
%"#s480.0" = phi i64 [ %33, %cont ], [ 1, %L.preheader ]
%n.0 = phi i64 [ %37, %cont ], [ 0, %L.preheader ]
%26 = add i64 %"#s480.0", -1, !dbg !1017
br i1 %18, label %here, label %there, !dbg !1017
here: ; preds = %L
%27 = trunc i64 %26 to i32, !dbg !1017
%28 = extractelement <16 x i8> %16, i32 %27, !dbg !1017
br label %cont, !dbg !1017
there: ; preds = %L
%29 = add i64 %25, %26, !dbg !1017
%30 = inttoptr i64 %29 to i8*, !dbg !1017
%31 = load i8* %30, align 1, !dbg !1017
br label %cont, !dbg !1017
cont: ; preds = %there, %here
%32 = phi i8 [ %28, %here ], [ %31, %there ], !dbg !1017, !julia_type !1022
%33 = add i64 %"#s480.0", 1, !dbg !1017
%34 = and i8 %32, -64, !dbg !1021, !julia_type !1022
%35 = icmp ne i8 %34, -128, !dbg !1021
%36 = zext i1 %35 to i64, !dbg !1021
%37 = add i64 %36, %n.0, !dbg !1021
%38 = icmp slt i64 %24, %33, !dbg !1021
br i1 %38, label %L3, label %L, !dbg !1021
L3: ; preds = %cont, %top
%n.1 = phi i64 [ 0, %top ], [ %37, %cont ]
ret i64 %n.1, !dbg !1023
}
```

The native code mirrors this branch structure but is less readable. This is pretty good, but it seems like it could be really tight if the two loops were separated and each was vectorized.
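For concreteness, roughly the shape the unswitched code would take: the immediate/remote test decided once, with a tight loop in each arm. The `(b & 0xC0) != 0x80` test is the non-continuation-byte check that the IR expresses as `and i8 %32, -64` / `icmp ne i8 %34, -128`. The function and parameter names here are illustrative stand-ins, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical unswitched form of the length loop: branch on the
// representation once, then run a straight-line loop in each arm that the
// vectorizer can handle.  `immediate`, `inline_bytes`, and `remote_ptr`
// stand in for the here/there representations in the IR above.
static size_t utf8_length_unswitched(bool immediate,
                                     const uint8_t inline_bytes[16],
                                     const uint8_t *remote_ptr,
                                     size_t len)
{
    size_t n = 0;
    if (immediate) {
        for (size_t i = 0; i < len; i++)            // loop over inline bytes
            n += (inline_bytes[i] & 0xC0) != 0x80;  // count non-continuation bytes
    } else {
        for (size_t i = 0; i < len; i++)            // loop over remote bytes
            n += (remote_ptr[i] & 0xC0) != 0x80;
    }
    return n;
}
```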
I think the relevant LLVM pass is Loop Unswitch; it's in Julia's pass list, but I don't know why it didn't kick in. We'll need to step through it to figure out why. It's likely a cost/benefit estimate issue, and there's a knob we can play with in the LLVM 3.5.0 source: the unswitch size threshold.
Thanks, @ArchRobison. I will try to figure out how to step through that and see what it's doing – or maybe just see if bumping up the threshold does it first. It's possible that LLVM doesn't think this transformation is worthwhile, and it may be right, since branch prediction here may be perfect and thus this can execute quite fast.
```cpp
Value *lo_word = builder.CreateExtractElement(words, ConstantInt::get(T_int32, 0));
Value *addr = builder.CreateAdd(lo_word, i);
Value *ptr = builder.CreateIntToPtr(addr, T_pint8);
Value *there_byte = builder.CreateLoad(ptr, false);
```
Should have a `tbaa_decorate(tbaa_user, ...)` around here, if the load is always from user-modifiable storage. Check the hierarchy described in `src/codegen.cpp`, around the comment `// type-based alias analysis nodes`, for more details.
Thanks for pointing that out – this memory should never change and so should be decorated as `tbaa_const`, I believe, but @JeffBezanson may have something to say about that. One of the major goals of this rewrite is to leverage much more heavily the fact that strings are immutable.
@JeffBezanson, do I need to add …
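For concreteness, the suggested decoration would look roughly like this on the load from the hunk above, assuming the `tbaa_decorate` helper from `src/codegen.cpp` wraps and returns the load instruction (a sketch only – whether `tbaa_const` or `tbaa_user` is the right node is exactly the open question here):

```cpp
// Sketch: tag the remote-byte load with TBAA metadata so LLVM knows it
// cannot alias mutable Julia data (tbaa_const assumed, per the reply above).
Value *there_byte = tbaa_decorate(tbaa_const, builder.CreateLoad(ptr, false));
```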
We should maybe not even force this to go through `Int32`, since that's not what the LLVM instructions need anyway. I worry that this may cause weird extra ops that aren't really necessary.
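A possible sketch of what that would mean in the codegen (illustrative only – `bytes` and `i` are stand-in names): `extractelement` accepts an index of any integer width, so the 64-bit Julia index could be fed to the builder directly instead of being materialized as an `i32` and forcing a `trunc` like the one visible in the IR above:

```cpp
// Sketch: pass the i64 index straight through; no ConstantInt::get(T_int32, ...)
// or trunc needed, since extractelement allows any integer index type.
Value *byte = builder.CreateExtractElement(bytes, i);
```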
This took some careful tweaking, but I managed to get this to generate the same code I was trying to get an intrinsic to produce. This makes me wonder if I shouldn't try to do the same thing with `bytevec_eq` and maybe some of the other bytevec intrinsics.
It turns out I can actually generate better, slightly faster code for the length of a `ByteVec` using bitshifts in Julia.

This one also turns out to be shorter and faster.

This speeds up `s[i]` significantly, but reveals that `endof(s)` is a real bottleneck as it is.
Ok, it seems that if I pepper the pass list with `EarlyCSE` passes (the lines marked `//// ****` below), things improve somewhat:

```diff
diff --git a/src/codegen.cpp b/src/codegen.cpp
index 47eeeb5..d93a39c 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -4662,8 +4662,10 @@ static void init_julia_llvm_env(Module *m)
FPM->add(createLoopRotatePass()); // Rotate loops.
// LoopRotate strips metadata from terminator, so run LowerSIMD afterwards
FPM->add(createLowerSimdLoopPass()); // Annotate loop marked with "simdloop" as LLVM parallel loop
+ FPM->add(createEarlyCSEPass()); //// ****
+ FPM->add(createJumpThreadingPass());
FPM->add(createLICMPass()); // Hoist loop invariants
- FPM->add(createLoopUnswitchPass()); // Unswitch loops.
+ FPM->add(createLoopUnswitchPass(500)); // Unswitch loops.
// Subsequent passes not stripping metadata from terminator
#ifndef INSTCOMBINE_BUG
FPM->add(createInstructionCombiningPass());
@@ -4683,6 +4685,7 @@ static void init_julia_llvm_env(Module *m)
#ifndef INSTCOMBINE_BUG
FPM->add(createInstructionCombiningPass()); // Clean up after the unroller
#endif
+ FPM->add(createEarlyCSEPass()); //// ****
FPM->add(createGVNPass()); // Remove redundancies
//FPM->add(createMemCpyOptPass()); // Remove memcpy / form memset
FPM->add(createSCCPPass()); // Constant prop with SCCP
@@ -4699,6 +4702,7 @@ static void init_julia_llvm_env(Module *m)
FPM->add(createAggressiveDCEPass()); // Delete dead instructions
//FPM->add(createCFGSimplificationPass()); // Merge & remove BBs
+ FPM->add(createEarlyCSEPass()); //// ****
FPM->doInitialization();
}
```

Still haven't coaxed the loop unswitching into happening, but it's a bit closer...
So the one that seems to have mattered is this:

```diff
diff --git a/src/codegen.cpp b/src/codegen.cpp
index 47eeeb5..0513916 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -4663,6 +4663,7 @@ static void init_julia_llvm_env(Module *m)
// LoopRotate strips metadata from terminator, so run LowerSIMD afterwards
FPM->add(createLowerSimdLoopPass()); // Annotate loop marked with "simdloop" as LLVM parallel loop
FPM->add(createLICMPass()); // Hoist loop invariants
+ FPM->add(createEarlyCSEPass());
FPM->add(createLoopUnswitchPass()); // Unswitch loops.
// Subsequent passes not stripping metadata from terminator
#ifndef INSTCOMBINE_BUG
```

Unfortunately, while the LLVM code for this looks nicer, it is about 2x slower. Sigh.
Just kidding, that was a different data set. This change improves the code but has no measurable impact on performance. I'm still hoping that coaxing the loop to unswitch might have a positive impact on performance.
This doesn't cause loop unswitching to kick in but it does produce cleaner code in some cases. The CSE pass could probably be placed elsewhere but this spot seems to work well enough.
This isn't really sufficient to handle Latin-1 data smoothly, since most non-ASCII Latin-1 characters are not UTF-8 continuation bytes. For that, you need to check whether the decoded UTF-8 value is valid.
I tried applying my PR #6271, and with `-O` and LLVM 3.5, it appeared to not generate the duplicate … The code difference arises from including …
Interesting – thanks for checking that out, @ArchRobison. I've decided that for now the best way forward is to just replace the …
I used the `sumchars` example for simple benchmarking:

```julia
function sumchars{S<:String}(a::Array{S})
    t = Uint32(0)
    @inbounds for s in a, c in s
        t += Uint32(c)
    end
    return t
end
```

With this change, benchmarking via

```julia
median([ @elapsed sumchars(words) for _ in 1:100 ])
```

I found the following performance characteristics:

* Str vs. ASCIIString with `@inbounds` – same speed
* Str vs. ASCIIString without `@inbounds` – 14% slowdown
* Str vs. UTF8String with `@inbounds` – 2.55x speedup
* Str vs. UTF8String without `@inbounds` – 75% speedup

This strikes me as good enough to replace both string types with a single string type – the artist currently known as `Str`.
I would prefer to just drop the … Or is that planned as a separate PR after this one lands? I'm confused because this PR includes a …
```c
{
    jl_bytevec_struct_t b;
    if (n < 2*sizeof(void*)) {
        memcpy(b.here.data, data, n);
```
If `n == 2*sizeof(void*) - 1`, then it looks like the string data cannot be NUL-terminated. Julia currently guarantees NUL-termination (in `array.c`) so that strings can be passed to external C code expecting NUL-terminated strings.

Or is the plan to implement this on top of `ByteVec`s, similar to how it is implemented for UTF-16 and UTF-32 strings (i.e. `str.data` is actually a bytevec of length one more than the length of the string, and contains an explicit NUL terminator)? This might be cleaner, although it is a subtle breaking change for any code that currently looks at the `data` field of strings. (We could rename all of the string-type `data` fields to `data0` in order to make the breakage noisy.)
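To make the boundary concrete, here is one purely illustrative way the constructor hunk above could keep the NUL guarantee by giving up one byte of inline capacity. `jl_bytevec_struct_t` is the type named in the hunk, assumed here to be laid out like the `bytevec_t` sketch near the top of the thread; the PR may well resolve this differently (e.g. via the bytevec-of-length-n+1 approach suggested above):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative variant: only use the immediate representation when there is
// room for the data *and* a terminating NUL (n <= 14 on a 64-bit target),
// so C callers still see NUL-terminated strings.
// (jl_bytevec_struct_t assumed to have the here.data / here.length layout
// sketched earlier in the thread.)
static void bytevec_init_nul_safe(jl_bytevec_struct_t *b,
                                  const uint8_t *data, size_t n)
{
    if (n < 2 * sizeof(void *) - 1) {
        memcpy(b->here.data, data, n);
        b->here.data[n] = '\0';          // explicit NUL inside the inline buffer
        b->here.length  = (int8_t)n;     // non-negative length = immediate
    }
    else {
        // remote path: allocate n + 1 bytes, copy, and NUL-terminate there
        // (allocation elided -- it depends on the GC/allocation interface)
    }
}
```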
What about doing the following, more like Python 3: always store strings in a direct-index encoding, picking the smallest one that can represent the data. You wouldn't get rid of UTF8String or UTF16String – they would still be available for people who need to convert. So, lower memory requirements and better performance all around... do you guys like it?
I think this sounds interesting. DirectIndexes are nice and simple, and operations that require working back from the end of a string are a little painful with UTF-8.

Unicode2 is UCS-2, is that right? If so, could we call it that? Some folks might still require a full implementation of UTF-16. Presumably file and stream IO would still generally need to be UTF-8.

You've probably seen the very nice work done to support Unicode characters like math symbols, e.g. in the REPL and in IJulia. How would your strings play there? Some folks also seem to like emojis; even pizza, ahem.

Presumably you're getting accustomed to thinking about type stability as you write Julia. It might be worth thinking through how that would work with your approach. Look forward to seeing more!
Yes, Unicode2 could be considered UCS-2; you just have to remember that UCS-2 doesn't allow the characters between 0xd800 and 0xdfff. (I didn't call it that on purpose, because many people confuse UCS-2 with UTF-16.) Unicode1 is also ANSI Latin-1... I wanted to indicate that it is really just a 1-byte subset of Unicode, as Unicode2 (UCS-2) is a 2-byte subset of Unicode.

I believe the experience with Python 3 was that this scheme saved space and greatly improved performance...

Note: I'm not talking about removing the UTF8String or UTF16String types, just not using the combination of ASCIIString & UTF8String to encode string literals, and instead only using DirectIndexString types, i.e. ASCIIString, Latin1String, UCS2String, and UTF32String.

Yes – I think there is probably a bit less of a problem with type stability...
@ScottPJones I'm not sure this is the best place to discuss this – better to open a new thread. Anyway, my two cents: the default string type must be able to handle all Unicode chars, to avoid the current situation where you end up with either an ASCIIString or a UTF8String depending on the contents. That said, all kinds of custom string types which are more efficient in specific use cases can get first-class support in Julia. I don't see why you care so much about the type used by string literals.
@nalimilan Happy to open a new thread – but I'm rather new to GitHub – do you mean make a new issue? I'll explain my reasoning to you then.

@nalimilan Just one thing though – do you insist on only a single number type? That's where your reasoning leads...

Yes, please open a new issue. (I don't think that's a valid reductio ad absurdum; there are only two default number types in Julia as it is. It doesn't preclude the existence or use of other types.)

@pao, I'm sorry, but that's not at all what I've seen in Julia, and what is worse, the default numeric types are not even consistent in their behavior!

Oh, and by the way, I think those inconsistencies can lead to hard-to-spot bugs...

@ScottPJones Please move this to yet another issue or mailing list thread. This is completely unrelated.

@nalimilan Was that the right way? Thanks!