Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

why not an NTuple? #6

Closed
stevengj opened this issue Oct 1, 2021 · 9 comments
Closed

why not an NTuple? #6

stevengj opened this issue Oct 1, 2021 · 9 comments

Comments

@stevengj
Copy link
Member

stevengj commented Oct 1, 2021

I was looking at the source code here, and I was a little confused about why you chose to use a primitive type. Wouldn't it be simpler to work with a definition like:

struct InlineString{N} <: AbstractString
    data::NTuple{N, UInt8}
end

similar to StaticArrays.jl?

I'm guessing you already discussed/tried this option and found it problematic for some reason, but I couldn't find any information so I wanted to double-check.

@quinnj
Copy link
Member

quinnj commented Oct 2, 2021

Good question! I didn't explicitly code up the NTuple approach, but did consider it when contemplating various approaches to the representation (ShortStrings.jl, the original idea for this package, just wrapped UInts for the representation).

The answer for primitive vs. NTuple is that it's much easier to "work" with primitive types vs. NTuple for the operations I needed. In particular doing the byte-swap operation is just a call to the native LLVM instruction Base.bswap_int. This is a critical operation for construction and conversion to/from String/InlineStrings, so it's nice to just "pass off" the operation to LLVM instead of trying to fiddle something fast w/ NTuple.

I was also pleasantly surprised to learn that the native LLVM instructions "just work" on any size primitive, even up to the InlineString255 type that is defined like primitive type InlineString255 256 end, i.e. 256 bytes. So no additional code needed between different InlineString types/sizes, we can just generically call the same LLVM instructions no matter the primitive size.

I think if Julia had native support for "updating immutables" (I think Keno started a PR around this idea at one point), then it would probably be easier to use NTuple, but as of the current state of things, I think it ends up being a very simple, clever solution to have these strings types be so "close to the metal".

@stevengj
Copy link
Member Author

stevengj commented Oct 2, 2021

One big advantage of an NTuple is then you can easily write methods that work generically for InlineString{N} for any N, relying on Julia to do unrolling on N, without using generated functions. Nor would you need to define types like String7 — people could just use InlineString{7} directly.

I don't see why you would need byte-swapping at all if you used NTuples — byte-swapping seems like an artifact of trying to represent byte sequences by multibyte Integer types.

@mkitti
Copy link
Contributor

mkitti commented Sep 27, 2022

I started to implement a NTuple based AbstractString here: https://github.com/mkitti/NStrings.jl

@nickrobinson251
Copy link
Collaborator

Since there's both Jacob's answer above and now also the NStrings.jl package exploring the NTuple design option, i will close this issue.

Feel free to comment if there's a reason to keep this open. And thanks for the good question!

@mkitti
Copy link
Contributor

mkitti commented Oct 3, 2022

Update: NStrings.jl is now https://github.com/mkitti/StaticStrings.jl

@quinnj
Copy link
Member

quinnj commented Oct 3, 2022

I'm interested to see how the NTuple approach works @mkitti; it'd be great to compare benchmark notes at some point to see InlineString vs. StaticString vs. Base.String. There are additional bit-level optimizations I'd like to do for InlineStrings, but just haven't found the bandwidth. In general, there are some strong wins for InlineString vs. Base.String, but a couple corner cases where it ends up being just on-par or slightly slower. Nothing significantly worse though yet.

I've also sketched out an idea where you'd essentially have a string type like:

struct HybridString{T <: InlineString}
    str::Union{T, Base.String}
end

so essentially you have a string that might be inlined, but might also point to a Base.String. The idea here is that in big data workloads, you could work with a single "string" column type, like HybridString{String15}, and all the string columns would have the same type. And they're either inline (String15), or a pointer to a string (Base.String).

Anyway, I like seeing more experimentation with alternative string types in Julia, and I think eventually we'll be able to improve Julia's Base.String type to be enhanced with some inline-ability like InlineStrings or StaticStrings.

@mkitti
Copy link
Contributor

mkitti commented Oct 3, 2022

It would be good to reconcile the two packages at some point. There is currently some endian weirdness:

julia> using InlineStrings, StaticStrings

julia> is = String15("hello")
"hello"

julia> ss = StaticString{16}("hello")
"hello\0\0\0\0\0\0\0\0\0\0\0"

julia> reinterpret(StaticString{16}, [is])
1-element reinterpret(StaticString{16}, ::Vector{String15}):
 "\x05\0\0\0\0\0\0\0\0\0\0olleh"

julia> reinterpret(UInt8, [ss])
16-element reinterpret(UInt8, ::Vector{StaticString{16}}):
 0x68
 0x65
 0x6c
 0x6c
 0x6f
 0x0a
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00

julia> reinterpret(UInt8, [is])
16-element reinterpret(UInt8, ::Vector{String15}):
 0x06
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x00
 0x0a
 0x6f
 0x6c
 0x6c
 0x65
 0x68

@quinnj
Copy link
Member

quinnj commented Oct 3, 2022

Yes, we store strings in opposite endianess so that sorting an array of inlinestrings can use radixsort and be super fast (we have our own radixsort implementation in InlineStrings).

@mkitti
Copy link
Contributor

mkitti commented Oct 3, 2022

At least the AbstractString interface seems to be working as expected.

julia> using InlineStrings, StaticStrings

julia> is = String15("hello\n")
"hello\n"

julia> css = CStaticString{16}("hello\n")
"hello\n"

julia> CStaticString(is)
"hello\n"

julia> String15(css)
"hello\n"

We could likely optimize the interface between the packages with a byte swap.

One of my primary motivations for StaticStrings.jl is compatible storage with C's char cstring[N], while I think the objective of InlineStrings.jl is performance. For example, the following should work with CStaticString but not String15:

julia> ccall(:printf, Cint, (Ptr{Nothing},), Ref(css))
hello
6

julia> ccall(:printf, Cint, (Ptr{Nothing},), Ref(is))
1

I know that the following works due to Base.cconvert(...)::String

julia> ccall(:printf, Cint, (Ptr{Cchar},), is)
hello
6

julia> Base.cconvert(Ptr{Cchar}, is) |> typeof
String

julia> @which Base.cconvert(Ptr{Cchar}, is)
cconvert(::Type{Ptr{Int8}}, s::AbstractString) in Base at pointer.jl:63

This is good though. The packages are optimized for distinct purposes, yet are interoperable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants