ByteVec: immediate / remote immutable byte vectors + intrinsics. #8964
Conversation
I ended up going with @vtjnash's suggestion of representing this as an `Int128`, since that makes the code generation easier: you can cast the `Int128` to various vector types like `Uint8 x 16` or `Int x 2`, which are both handy for writing the `bytevec_len` and `bytevec_ref` intrinsics. The code generation is already pretty decent, e.g.:

```julia
julia> s = Str("Hello")
"Hello"

julia> @code_native sizeof(s)
    .section    __TEXT,__text,regular,pure_instructions
Filename: bytes.jl
Source line: 64
    push    RBP
    mov     RBP, RSP
Source line: 64
    mov     RAX, QWORD PTR [RDI + 8]
    vmovq   XMM0, QWORD PTR [RAX + 16]
    vmovq   XMM1, QWORD PTR [RAX + 8]
    vpunpcklqdq XMM0, XMM1, XMM0    ## xmm0 = xmm1[0],xmm0[0]
    vpextrq RAX, XMM0, 1
    neg     RAX
    vpextrb ECX, XMM0, 15
    test    CL, CL
    cmovns  RAX, RCX
    pop     RBP
    ret
```

One area to investigate for optimization is making sure that the checks involved in loading a single byte can be hoisted out of loops, *and* ideally that, instead of a loop with a branch inside of it, we get a branch with a loop inside of each path – but that's a trickier optimization than I think we're doing now. With that, code that iterates over strings could really fly.

Next steps are to implement `bytevec_eq` and `bytevec_cmp` intrinsics. I know these aren't strictly necessary, but I suspect that the built-in versions can be made very efficient.
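A minimal sketch of the layout being described, pieced together from the diff hunks quoted below and the generated code above. The field names follow the hunks where they are visible (`here`, `data`, `neglen`); the 15-byte inline capacity, the length byte, and the type names are inferences, not necessarily the PR's exact definitions:

```cpp
#include <cstddef>
#include <cstdint>

// Layout sketch: 16 bytes total, i.e. one Int128 on a 64-bit target.
typedef union {
    struct {                // "immediate": short data lives inline
        uint8_t data[15];
        int8_t  length;     // >= 0; its sign bit doubles as the discriminant
    } here;
    struct {                // "remote": data lives behind a pointer
        uint8_t *data;      // low word
        intptr_t neglen;    // negated length, so the top byte is negative
    } there;
} bytevec_t;

// bytevec_len: the sign of the last byte selects the representation --
// the same select/cmovns pattern visible in the @code_native output above.
static inline size_t bytevec_len(const bytevec_t *b)
{
    return b->here.length >= 0 ? (size_t)b->here.length
                               : (size_t)(-b->there.neglen);
}
```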
```c
    } here;
    struct {
        uint8_t *data;
        long neglen;
```
If you want a signed integer type that is big enough to hold the length of an arbitrary string (up to the factor of 2 lost by the sign bit), shouldn't this be `intptr_t` or `ptrdiff_t`? The `long` type always scares me because I'm never quite sure what it means for an arbitrary compiler.
Given the rampant assumptions we've made, I suspect that `size_t` would do, but yes, `long` may be a bit sketchy.
The first issue that was standing in the way of SIMD and vectorization was that the …
I noticed that the clever trick that I used to avoid a branch in the fast path for `bytevec_ref` when the bytes are immediate was causing a needless dependency on the index in the branch condition, even in the case with no bounds checking. This could, of course, prevent certain optimization transforms, so I got rid of it.
@JeffBezanson, @ArchRobison, @vtjnash, @Keno, @jakebolewski, do any of you have thoughts on coaxing LLVM into doing such a transformation? This is the current code, from `@code_llvm length(s)`:

```llvm
define i64 @"julia_length;41343"(%jl_value_t*) {
top:
%1 = getelementptr inbounds %jl_value_t* %0, i64 1, i32 0, !dbg !1017
%2 = load %jl_value_t** %1, align 8, !dbg !1017, !tbaa %jtbaa_immut
%3 = getelementptr %jl_value_t* %2, i64 1, !dbg !1017
%4 = bitcast %jl_value_t* %3 to i128*, !dbg !1017
%5 = load i128* %4, align 8, !dbg !1017, !tbaa %jtbaa_immut, !julia_type !1020
%6 = bitcast i128 %5 to <16 x i8>, !dbg !1017
%7 = extractelement <16 x i8> %6, i32 15, !dbg !1017
%8 = bitcast i128 %5 to <2 x i64>, !dbg !1017
%9 = extractelement <2 x i64> %8, i32 1, !dbg !1017
%10 = icmp slt i8 %7, 0, !dbg !1017
%11 = sub i64 0, %9, !dbg !1017
%12 = zext i8 %7 to i64, !dbg !1017
%13 = select i1 %10, i64 %11, i64 %12, !dbg !1017
%14 = icmp slt i64 %13, 1, !dbg !1017
br i1 %14, label %L3, label %L.preheader, !dbg !1017
L.preheader: ; preds = %top
%15 = load i128* %4, align 8, !dbg !1017, !tbaa %jtbaa_immut, !julia_type !1020
%16 = bitcast i128 %15 to <16 x i8>, !dbg !1017
%17 = extractelement <16 x i8> %16, i32 15, !dbg !1017
%18 = icmp sgt i8 %17, -1, !dbg !1017
%19 = bitcast i128 %15 to <2 x i64>, !dbg !1021
%20 = extractelement <2 x i64> %19, i32 1, !dbg !1021
%21 = icmp slt i8 %17, 0, !dbg !1021
%22 = sub i64 0, %20, !dbg !1021
%23 = zext i8 %17 to i64, !dbg !1021
%24 = select i1 %21, i64 %22, i64 %23, !dbg !1021
%25 = extractelement <2 x i64> %19, i32 0, !dbg !1017
br label %L, !dbg !1017
L: ; preds = %L.preheader, %cont
%"#s480.0" = phi i64 [ %33, %cont ], [ 1, %L.preheader ]
%n.0 = phi i64 [ %37, %cont ], [ 0, %L.preheader ]
%26 = add i64 %"#s480.0", -1, !dbg !1017
br i1 %18, label %here, label %there, !dbg !1017
here: ; preds = %L
%27 = trunc i64 %26 to i32, !dbg !1017
%28 = extractelement <16 x i8> %16, i32 %27, !dbg !1017
br label %cont, !dbg !1017
there: ; preds = %L
%29 = add i64 %25, %26, !dbg !1017
%30 = inttoptr i64 %29 to i8*, !dbg !1017
%31 = load i8* %30, align 1, !dbg !1017
br label %cont, !dbg !1017
cont: ; preds = %there, %here
%32 = phi i8 [ %28, %here ], [ %31, %there ], !dbg !1017, !julia_type !1022
%33 = add i64 %"#s480.0", 1, !dbg !1017
%34 = and i8 %32, -64, !dbg !1021, !julia_type !1022
%35 = icmp ne i8 %34, -128, !dbg !1021
%36 = zext i1 %35 to i64, !dbg !1021
%37 = add i64 %36, %n.0, !dbg !1021
%38 = icmp slt i64 %24, %33, !dbg !1021
br i1 %38, label %L3, label %L, !dbg !1021
L3: ; preds = %cont, %top
%n.1 = phi i64 [ 0, %top ], [ %37, %cont ]
ret i64 %n.1, !dbg !1023
}
```

The native code mirrors this branch structure but is less readable. This is pretty good, but it seems like it could be really tight if the two loops were separated and each was vectorized.
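For concreteness, roughly the shape the unswitched code would take: the immediate/remote test decided once, with a tight loop in each arm. The `(b & 0xC0) != 0x80` test is the non-continuation-byte check that the IR expresses as `and i8 %32, -64` / `icmp ne i8 %34, -128`. The function and parameter names here are illustrative stand-ins, not the PR's code:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical unswitched form of the length loop: branch on the
// representation once, then run a straight-line loop in each arm that the
// vectorizer can handle.  `immediate`, `inline_bytes`, and `remote_ptr`
// stand in for the here/there representations in the IR above.
static size_t utf8_length_unswitched(bool immediate,
                                     const uint8_t inline_bytes[16],
                                     const uint8_t *remote_ptr,
                                     size_t len)
{
    size_t n = 0;
    if (immediate) {
        for (size_t i = 0; i < len; i++)            // loop over inline bytes
            n += (inline_bytes[i] & 0xC0) != 0x80;  // count non-continuation bytes
    } else {
        for (size_t i = 0; i < len; i++)            // loop over remote bytes
            n += (remote_ptr[i] & 0xC0) != 0x80;
    }
    return n;
}
```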
I think the relevant LLVM pass is Loop Unswitch; it's in Julia's pass list, but I don't know why it didn't kick in. We'll need to step through it to figure out why. It's likely a cost/benefit estimate issue, and there's a knob we can play with in the LLVM 3.5.0 source: the unswitch size threshold.
Thanks, @ArchRobison. I will try to figure out how to step through that and see what it's doing – or maybe just see if bumping up the threshold does it first. It's possible that LLVM doesn't think this transformation is worthwhile, and it may be right, since branch prediction here may be perfect and thus this can execute quite fast.
```cpp
Value *lo_word = builder.CreateExtractElement(words, ConstantInt::get(T_int32, 0));
Value *addr = builder.CreateAdd(lo_word, i);
Value *ptr = builder.CreateIntToPtr(addr, T_pint8);
Value *there_byte = builder.CreateLoad(ptr, false);
```
Should have a `tbaa_decorate(tbaa_user, ...)` around here, if the load is always from user-modifiable storage. Check the hierarchy described in `src/codegen.cpp`, around the comment `// type-based alias analysis nodes`, for more details.
Thanks for pointing that out – this memory should never change and so should be decorated as `tbaa_const`, I believe, but @JeffBezanson may have something to say about that. One of the major goals of this rewrite is to leverage much more heavily the fact that strings are immutable.
@JeffBezanson, do I need to add …
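For concreteness, the suggested decoration would look roughly like this on the load from the hunk above, assuming the `tbaa_decorate` helper from `src/codegen.cpp` wraps and returns the load instruction (a sketch only – whether `tbaa_const` or `tbaa_user` is the right node is exactly the open question here):

```cpp
// Sketch: tag the remote-byte load with TBAA metadata so LLVM knows it
// cannot alias mutable Julia data (tbaa_const assumed, per the reply above).
Value *there_byte = tbaa_decorate(tbaa_const, builder.CreateLoad(ptr, false));
```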
We should maybe not even force this to go through `Int32`, since that's not what the LLVM instructions need anyway. I worry that this may cause weird extra ops that aren't really necessary.
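A possible sketch of what that would mean in the codegen (illustrative only – `bytes` and `i` are stand-in names): `extractelement` accepts an index of any integer width, so the 64-bit Julia index could be fed to the builder directly instead of being materialized as an `i32` and forcing a `trunc` like the one visible in the IR above:

```cpp
// Sketch: pass the i64 index straight through; no ConstantInt::get(T_int32, ...)
// or trunc needed, since extractelement allows any integer index type.
Value *byte = builder.CreateExtractElement(bytes, i);
```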
This took some careful tweaking, but I managed to get this to generate the same code I was trying to get an intrinsic to produce. This makes me wonder if I shouldn't try to do the same thing with `bytevec_eq` and maybe some of the other bytevec intrinsics.
It turns out I can actually generate better, slightly faster code for the length of a `ByteVec` using bitshifts in Julia.

This one also turns out to be shorter and faster.

This speeds up `s[i]` significantly, but reveals that `endof(s)` is a real bottleneck as it is.
Ok, it seems that if I pepper the pass list with `EarlyCSE` passes (the lines marked `//// ****` below), things improve somewhat:

```diff
diff --git a/src/codegen.cpp b/src/codegen.cpp
index 47eeeb5..d93a39c 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -4662,8 +4662,10 @@ static void init_julia_llvm_env(Module *m)
FPM->add(createLoopRotatePass()); // Rotate loops.
// LoopRotate strips metadata from terminator, so run LowerSIMD afterwards
FPM->add(createLowerSimdLoopPass()); // Annotate loop marked with "simdloop" as LLVM parallel loop
+ FPM->add(createEarlyCSEPass()); //// ****
+ FPM->add(createJumpThreadingPass());
FPM->add(createLICMPass()); // Hoist loop invariants
- FPM->add(createLoopUnswitchPass()); // Unswitch loops.
+ FPM->add(createLoopUnswitchPass(500)); // Unswitch loops.
// Subsequent passes not stripping metadata from terminator
#ifndef INSTCOMBINE_BUG
FPM->add(createInstructionCombiningPass());
@@ -4683,6 +4685,7 @@ static void init_julia_llvm_env(Module *m)
#ifndef INSTCOMBINE_BUG
FPM->add(createInstructionCombiningPass()); // Clean up after the unroller
#endif
+ FPM->add(createEarlyCSEPass()); //// ****
FPM->add(createGVNPass()); // Remove redundancies
//FPM->add(createMemCpyOptPass()); // Remove memcpy / form memset
FPM->add(createSCCPPass()); // Constant prop with SCCP
@@ -4699,6 +4702,7 @@ static void init_julia_llvm_env(Module *m)
FPM->add(createAggressiveDCEPass()); // Delete dead instructions
//FPM->add(createCFGSimplificationPass()); // Merge & remove BBs
+ FPM->add(createEarlyCSEPass()); //// ****
FPM->doInitialization();
}
```

Still haven't coaxed the loop unswitching into happening, but it's a bit closer...
So the one that seems to have mattered is this:

```diff
diff --git a/src/codegen.cpp b/src/codegen.cpp
index 47eeeb5..0513916 100644
--- a/src/codegen.cpp
+++ b/src/codegen.cpp
@@ -4663,6 +4663,7 @@ static void init_julia_llvm_env(Module *m)
// LoopRotate strips metadata from terminator, so run LowerSIMD afterwards
FPM->add(createLowerSimdLoopPass()); // Annotate loop marked with "simdloop" as LLVM parallel loop
FPM->add(createLICMPass()); // Hoist loop invariants
+ FPM->add(createEarlyCSEPass());
FPM->add(createLoopUnswitchPass()); // Unswitch loops.
// Subsequent passes not stripping metadata from terminator
#ifndef INSTCOMBINE_BUG
```

Unfortunately, while the LLVM code for this looks nicer, it is about 2x slower. Sigh.
Just kidding, that was a different data set. This change improves the code but has no measurable impact on performance. I'm still hoping that coaxing the loop to unswitch might have a positive impact on performance.
This doesn't cause loop unswitching to kick in but it does produce cleaner code in some cases. The CSE pass could probably be placed elsewhere but this spot seems to work well enough.
This isn't really sufficient to handle Latin-1 data smoothly, since most non-ASCII Latin-1 characters are not UTF-8 continuation bytes. For that, you need to check whether the decoded UTF-8 value is valid.
I tried applying my PR #6271, and with `-O` and LLVM 3.5, it appeared to not generate the duplicate … The code difference arises from including …
Interesting – thanks for checking that out, @ArchRobison. I've decided that for now the best way forward is to just replace the …
I used the `sumchars` example for simple benchmarking:

```julia
function sumchars{S<:String}(a::Array{S})
    t = Uint32(0)
    @inbounds for s in a, c in s
        t += Uint32(c)
    end
    return t
end
```

With this change, benchmarking via

```julia
median([ @elapsed sumchars(words) for _ in 1:100 ])
```

I found the following performance characteristics:

* Str vs. ASCIIString with `@inbounds` – same speed
* Str vs. ASCIIString without `@inbounds` – 14% slowdown
* Str vs. UTF8String with `@inbounds` – 2.55x speedup
* Str vs. UTF8String without `@inbounds` – 75% speedup

This strikes me as good enough to replace both string types with a single string type – the artist currently known as `Str`.
I would prefer to just drop the … Or is that planned as a separate PR after this one lands? I'm confused because this PR includes a …
```c
{
    jl_bytevec_struct_t b;
    if (n < 2*sizeof(void*)) {
        memcpy(b.here.data, data, n);
```
If `n == 2*sizeof(void*) - 1`, then it looks like the string data cannot be NUL-terminated. Julia currently guarantees NUL-termination (in `array.c`) so that strings can be passed to external C code expecting NUL-terminated strings.

Or is the plan to implement this on top of `ByteVec`s, similar to how it is implemented for UTF-16 and UTF-32 strings (i.e. `str.data` is actually a bytevec of length one more than the length of the string, and contains an explicit NUL terminator)? This might be cleaner, although it is a subtle breaking change for any code that currently looks at the `data` field of strings. (We could rename all of the string-type `data` fields to `data0` in order to make the breakage noisy.)
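To make the boundary concrete, here is one purely illustrative way the constructor hunk above could keep the NUL guarantee by giving up one byte of inline capacity. `jl_bytevec_struct_t` is the type named in the hunk, assumed here to be laid out like the `bytevec_t` sketch near the top of the thread; the PR may well resolve this differently (e.g. via the bytevec-of-length-n+1 approach suggested above):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative variant: only use the immediate representation when there is
// room for the data *and* a terminating NUL (n <= 14 on a 64-bit target),
// so C callers still see NUL-terminated strings.
// (jl_bytevec_struct_t assumed to have the here.data / here.length layout
// sketched earlier in the thread.)
static void bytevec_init_nul_safe(jl_bytevec_struct_t *b,
                                  const uint8_t *data, size_t n)
{
    if (n < 2 * sizeof(void *) - 1) {
        memcpy(b->here.data, data, n);
        b->here.data[n] = '\0';          // explicit NUL inside the inline buffer
        b->here.length  = (int8_t)n;     // non-negative length = immediate
    }
    else {
        // remote path: allocate n + 1 bytes, copy, and NUL-terminate there
        // (allocation elided -- it depends on the GC/allocation interface)
    }
}
```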
What about doing the following, more like Python 3: always store strings in a direct-index encoding, picking the smallest one that can represent the data. You wouldn't get rid of UTF8String or UTF16String – they would still be available for people who need to convert. So, lower memory requirements and better performance all around... do you guys like it?
I think this sounds interesting. DirectIndexes are nice and simple, and operations that require working back from the end of a string are a little painful with UTF-8.

Unicode2 is UCS-2, is that right? If so, could we call it that? Some folks might still require a full implementation of UTF-16. Presumably file and stream IO would still generally need to be UTF-8.

You've probably seen the very nice work done to support Unicode characters like math symbols, e.g. in the REPL and in IJulia. How would your strings play there? Some folks also seem to like emojis; even pizza, ahem.

Presumably you're getting accustomed to thinking about type stability as you write Julia. It might be worth thinking through how that would work with your approach. Look forward to seeing more!
Yes, Unicode2 could be considered UCS-2; you just have to remember that UCS-2 doesn't allow the characters between 0xd800 and 0xdfff. (I didn't call it that on purpose, because many people confuse UCS-2 with UTF-16.) Unicode1 is also ANSI Latin-1... I wanted to indicate that it is really just a 1-byte subset of Unicode, as Unicode2 (UCS-2) is a 2-byte subset of Unicode.

I believe the experience with Python 3 was that this scheme saved space and greatly improved performance...

Note: I'm not talking about removing the UTF8String or UTF16String types, just not using the combination of ASCIIString & UTF8String to encode string literals, and instead only using DirectIndexString types, i.e. ASCIIString, Latin1String, UCS2String, and UTF32String.

Yes – I think there is probably a bit less of a problem with type stability...
@ScottPJones I'm not sure this is the best place to discuss this – better to open a new thread. Anyway, my two cents: the default string type must be able to handle all Unicode chars, to avoid the current situation where you end up with either an ASCIIString or a UTF8String depending on the contents. That said, all kinds of custom string types which are more efficient in specific use cases can get first-class support in Julia. I don't see why you care so much about the type used by string literals.
@nalimilan Happy to open a new thread – but I'm rather new to GitHub – do you mean make a new issue? I'll explain my reasoning to you then.

@nalimilan Just one thing though – do you insist on only a single number type? That's where your reasoning leads...

Yes, please open a new issue. (I don't think that's a valid reductio ad absurdum; there are only two default number types in Julia as it is. It doesn't preclude the existence or use of other types.)

@pao, I'm sorry, but that's not at all what I've seen in Julia, and what is worse, the default numeric types are not even consistent in their behavior!

Oh, and by the way, I think those inconsistencies can lead to hard-to-spot bugs...

@ScottPJones Please move this to yet another issue or mailing list thread. This is completely unrelated.

@nalimilan Was that the right way? Thanks!