copyuntil(out::IO, in::IO, delim) #48273

stevengj · 2023-01-14T01:54:55Z

This PR defines and exports ~~a new function readuntil!(s::IO, buffer::AbstractVector{UInt8}, delim)~~ new functions:

copyuntil(out::IO, s::IO, delim; keep=false)
copyline(out::IO, s::IO; keep=false)

that read data from s into ~~buffer in-place (resized if needed)~~ the out stream until delim is read/written or the end of the stream is reached.
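As a rough illustration of the semantics described above (a minimal sketch, assuming Julia 1.11 or later, where copyuntil and copyline are available in Base; the sample data and variable names are made up for this example):

```julia
# Assumes Julia >= 1.11. The input text is illustrative only.
input = IOBuffer("alpha,beta\ngamma\n")
out = IOBuffer()

# Copy up to the first comma; with keep=false (the default) the delimiter
# is consumed from `input` but not written to `out`.
copyuntil(out, input, ',')
first_field = String(take!(out))

# Copy the rest of the line, keeping the trailing newline this time.
copyline(out, input; keep=true)
rest_of_line = String(take!(out))

# Copy the last line without its newline.
copyline(out, input)
last_line = String(take!(out))
```

Both functions write into whatever `out` stream is passed, so the same calls work with a file or pipe as the destination.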

The PR was inspired by this post from @jakobnissen: the goal is to make it easier to implement an allocation-free eachline iterator. This can be done in a package, given readuntil with an IOBuffer, using the StringViews.jl package to return a string view of the in-place buffer on each iteration.
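The buffer-reuse pattern behind such an iterator might look roughly like the following. This is only a sketch using the final copyline name (Julia 1.11 or later assumed); the function name `total_line_bytes` is hypothetical, and it counts bytes via `position` rather than wrapping the buffer in a StringView, so that it needs no packages.

```julia
# Hypothetical helper: sums the byte length of every line in `io`,
# reusing one IOBuffer instead of allocating a String per line.
function total_line_bytes(io::IO)
    buf = IOBuffer()
    total = 0
    while !eof(io)
        truncate(buf, 0)               # reset the buffer, keeping its storage
        copyline(buf, io; keep=true)   # copy one line (with its newline) into buf
        total += position(buf)         # bytes written for this line
    end
    return total
end

nbytes = total_line_bytes(IOBuffer("ab\ncde\n\nf"))
```

A real iterator would hand out a view of the buffer contents at the point where this sketch reads `position(buf)`.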

The reason it seemed like this needed a Base function, instead of living completely in a package, is that readuntil relies on a low-level jl_readuntil C function that would be difficult to replicate in a package. To obtain comparable performance, it seems like we need an analogous jl_readuntil_buf method (implemented in this PR) and a corresponding Julia API.

Moreover, relatively little new code was required because many of the existing readuntil methods used an IOBuffer internally, so it was merely a matter of refactoring and exporting this functionality. Also, we already had an optimized ios_copyuntil function for copying between IOStreams, which can now be exported in the new API.

To do:

Fix bootstrapping failures
Benchmarks. (Maybe it is faster just to read the file in 4k blocks into a buffer with readbytes! and then return StringViews on top of that? This could happen completely in a package. It's a lot easier to use something like readuntil!, however.)
Tests
More docs
NEWS
Fixes and tests for new out::IO variant
add readline(out::IO, in::IO) too?
more tests for readline(out::IO, in::IO) methods
more benchmarks and optimization

Before I do much more work on this, what do people think?

jakobnissen · 2023-01-14T09:13:08Z

I think it looks great, especially if it's a performance boon. Thanks for working on this.
Do Julia IOStreams have their own buffer? Otherwise I could imagine having to interact with the OS for every line might be quite slow.

stevengj · 2023-01-14T13:25:56Z

Yes, they do have an internal buffer, though I'm not sure how large it is. That's the main reason for this PR — there's no good way to re-implement jl_readline except in C because it needs to access the internal IOStream buffer (hidden inside a C-only data structure).

As I mentioned, the alternative is to do some manual buffering, e.g. reading a file in 4k chunks similar to #42225.

Of course, the advantage of manual buffering à la #42225 is that it can be implemented in pure Julia code in a package. On the other hand, the big disadvantage of manual buffering is that it is problematic for non-seekable streams (e.g. pipes): if you want to read only a few lines and then do something else with the remainder of the file, you want to be able to continue reading from the end of the last line, not from the end of the last buffer chunk.
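To make the over-read problem concrete, here is a small sketch (illustrative data; a 12-byte chunk stands in for the 4k blocks mentioned above, and an IOBuffer stands in for a pipe):

```julia
data = "line one\nline two\nrest"

# Manual buffering: grab a fixed-size block from the stream.
io = IOBuffer(data)
chunk = read(io, 12)
newline_at = findfirst(==(UInt8('\n')), chunk)  # the first line ends inside the chunk
after_chunk = read(io, String)                  # the stream resumes after the chunk

# Line-oriented read: the stream stops right after the delimiter.
io = IOBuffer(data)
line = readline(io)
after_line = read(io, String)
```

In the first case the bytes between the newline and the end of the chunk are stranded in the user's buffer; on a non-seekable stream there is no way to push them back.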

(You can also use BufferedStreams.jl for additional buffering. That's composable with this PR, though in the long run we may want to add a more specialized method for readuntil!(::BufferedStream, ...))

stevengj · 2023-01-15T01:12:11Z

I ran some benchmarks on master with the current code in this PR. My benchmark was to open io = open("HISTORY.md", "r"), using the HISTORY.md file in the julia repo (7037 lines, ≈400kB), and count the length of each line:

old — sum(length, eachline(io)) — 860.583 μs (14091 allocations: 623.17 KiB)
this PR — readuntil! with a preallocated buffer and StringViews to wrap the buffer in a copy-free string: 671.451 μs (0 allocations: 0 bytes)
readuntil! with a preallocated buffer and StringViews, but without the low-level jl_readline_buf function: 18.419 ms (0 allocations: 0 bytes)
sum(length, eachline(io)) with a BufferedInputStream: 3.781 ms (31347 allocations: 1.97 MiB)
readuntil! with a preallocated buffer and StringViews on a BufferedInputStream: 1.144 ms (0 allocations: 0 bytes)

So, only about a 30% speed improvement by eliminating the String allocations. Decent but not overwhelming. Still satisfying to see 0 allocations, however. Also, note that the current speed of eachline relies heavily on the low-level jl_readuntil function, so it doesn't carry over to other types of streams like BufferedInputStream.

On the other hand, it is almost 30× slower if you use a plain IOStream but don't exploit the internal ios_t buffer via jl_readline_buf. That underscores why this functionality is hard to implement outside of Base.

If this PR is merged, it would be good to update BufferedStreams.jl to add a specialized readuntil! method for BufferedInputStream that just copies bytes in a big chunk directly from the buffer rather than the fallback of reading bytes one-by-one. (If you don't use memcpy, you lose.) Hopefully this will improve the speed even further.

Benchmark code

using StringViews, BenchmarkTools, BufferedStreams

io = open("HISTORY.md", "r")

@show sum(length, eachline(seekstart(io), keep=true))
@btime sum(length, eachline(seekstart($io), keep=true))

function doit!(io, buf, delim)
    s = 0
    while !eof(io)
        n = readuntil!(io, buf, delim)
        s += length(StringView(@view buf[1:n]))
    end
    return s
end

buf = Array{UInt8}(undef, 1024)
@show doit!(seekstart(io), buf, '\n')
@btime doit!(seekstart($io), $buf, '\n')

bio = BufferedInputStream(seekstart(io))

@show sum(length, eachline((seekstart(bio); bio), keep=true))
@btime sum(length, eachline((seekstart($bio); $bio), keep=true))

@show doit!((seekstart(bio); bio), buf, '\n')
@btime doit!((seekstart($bio); $bio), $buf, '\n')

close(bio)

stevengj · 2023-01-15T21:59:16Z

I added an optimized readuntil! for IOBuffer, and modified readuntil(::IOBuffer, ...) to use it. It seems to be a net win even for traditional eachline usage that allocates a string on each call. e.g.

iob = IOBuffer(read("HISTORY.md"))
@btime sum(length, eachline(seekstart($iob), keep=true))

previously gave 1.622 ms (23488 allocations: 1.96 MiB) and now gives 1.162 ms (23484 allocations: 1.96 MiB).

If I use an IOBuffer with readuntil! and a pre-allocated buffer with StringView as above, it gives 697.035 μs (0 allocations: 0 bytes).

One complication is that the optimal strategy for readuntil!(::IOBuffer, ...) depends on the length of the "line" (or other data) being read. For very short lines (< 20 bytes) it is better to have a single "manual" loop that simultaneously copies the data and checks for the delimiter. For longer lines, it is better to call findfirst followed by copyto! (i.e. memchr followed by memcpy). I chose to optimize it for longer lines. I didn't see a way to get both without having a Julia-native memcpy. (This is a win for HISTORY.md because the mean line length is about 52 bytes.)
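The two strategies can be sketched on plain byte vectors as follows. This is not the code from the PR; the function names `copy_loop!` and `copy_bulk!` are hypothetical, and both assume `dest` is at least as long as `src`.

```julia
# Strategy A: a single loop that copies and checks for the delimiter together.
function copy_loop!(dest::Vector{UInt8}, src::Vector{UInt8}, delim::UInt8)
    n = 0
    for b in src
        b == delim && break
        n += 1
        dest[n] = b
    end
    return n                      # number of bytes copied (delimiter excluded)
end

# Strategy B: find the delimiter first (memchr-like), then copy in bulk (memcpy-like).
function copy_bulk!(dest::Vector{UInt8}, src::Vector{UInt8}, delim::UInt8)
    i = findfirst(==(delim), src)
    n = i === nothing ? length(src) : i - 1
    copyto!(dest, 1, src, 1, n)
    return n
end

src = Vector{UInt8}("hello\nworld")
dest_a = Vector{UInt8}(undef, length(src))
dest_b = Vector{UInt8}(undef, length(src))
n_a = copy_loop!(dest_a, src, UInt8('\n'))
n_b = copy_bulk!(dest_b, src, UInt8('\n'))
```

Both produce the same result; which one is faster depends on the line length, as described above.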

Of course, if you have all of your data in an in-memory buffer already, it is almost certainly even better to use the eachsplit iterator to loop over SubStrings, since that doesn't copy data at all.
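For example (a small sketch with made-up text; eachsplit requires Julia 1.8 or later):

```julia
text = "first line\nsecond\nthird"

# Each piece is a SubString that refers to `text`; no line data is copied.
total = sum(length, eachsplit(text, '\n'))
pieces = collect(eachsplit(text, '\n'))
```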

stevengj · 2023-01-15T23:02:13Z

Should be ready for review.

stevengj · 2023-01-16T17:19:18Z

Other related threads:

stevengj · 2023-01-16T20:57:27Z

cc @rickbeeloo, author of the https://github.com/rickbeeloo/ViewReader package. On my machine, sum(length, ViewReader.eachlineV("HISTORY.md")) is currently around 600µs, a bit faster than readline! directly on the IOStream as above — probably because ViewReader implements its own buffered I/O with much larger buffers than IOStream.

Once we settle on an API, however, BufferedStreams.jl should be able to implement a specialized readline! method and get similar benefits. Then ViewReader (or a similar package) can become quite tiny, basically just combining BufferedStreams with StringViews and readuntil!.

stevengj · 2023-01-18T04:09:28Z

I was thinking about it some more, and arguably if you want to have maximum performance for an in-place eachline-like iteration, then you maybe want to read from the stream in large blocks (containing many lines) à la BufferedStreams and then not copy at all if possible: just return a StringView directly into the buffer. (Only a little data motion is required when you get to the end of the buffer, and need to move the final line fragment back to the beginning to read more.) I think this is what ViewReader.jl does?

From that perspective, however, a readuntil! API is unnecessary? Though it may still be more convenient for some applications, especially if you want to read other data in between reading lines.
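The block-reading idea can be sketched as follows. This is an illustration under simplifying assumptions, not what ViewReader.jl or BufferedStreams.jl actually do: the function name `count_lines_blockwise` is hypothetical, it only counts lines instead of yielding views, it stages each read through a scratch vector for simplicity, and it raises an error if a single line does not fit in the buffer.

```julia
function count_lines_blockwise(io::IO; bufsize::Int=4096)
    buf = Vector{UInt8}(undef, bufsize)      # line views would point into this
    scratch = Vector{UInt8}(undef, bufsize)  # staging area for each read
    filled = 0                               # valid bytes at the start of buf
    nlines = 0
    while true
        # Top up the free tail of buf with new data from the stream.
        nread = readbytes!(io, scratch, bufsize - filled)
        copyto!(buf, filled + 1, scratch, 1, nread)
        filled += nread
        # Scan every complete line currently in the buffer.
        start = 1
        while true
            nl = findnext(==(UInt8('\n')), buf, start)
            (nl === nothing || nl > filled) && break   # ignore stale bytes past `filled`
            nlines += 1      # an iterator would yield view(buf, start:nl-1) here
            start = nl + 1
        end
        # Move the unfinished trailing fragment back to the front of the buffer.
        fragment = filled - start + 1
        copyto!(buf, 1, buf, start, fragment)
        filled = fragment
        if nread == 0                        # stream exhausted
            filled > 0 && (nlines += 1)      # last line without a trailing newline
            return nlines
        end
        filled == bufsize && error("line longer than buffer")
    end
end

n = count_lines_blockwise(IOBuffer("a\nbb\nccc\ndddd"); bufsize=8)
```

Note that any view handed out before the fragment is moved becomes invalid afterwards, which is the kind of bookkeeping a real implementation has to address.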

jakobnissen · 2023-01-18T05:55:29Z

True, and indeed something like Rust's std::io::BufReader might be nice to have. However, how would you keep track of whether it's safe to shift data in the buffer, given that StringViews may hold a reference to any data in the buffer?

rickbeeloo · 2023-01-18T08:33:58Z

I also agree reading bigger chunks would do better, especially for those reading from HDDs

Only a little data motion is required when you get to the end of the buffer, and need to move the final line fragment back to the beginning to read more

This works as long as the "to be finished line" is shorter than the currently allocated buffer. If not it might again not find the newline. That would require some extra logic to see if the newline is found yet and otherwise still increase the buffer and extend. Which, in the worst case would give a buffer ~2x the longest line in the file. I didn't really spend time optimizing that, instead, I allocated two buffers (each being the max line length - given by the user) then flip them around like you said and warn if no newline is found after

Though it may still be more convenient for some applications, especially if you want to read other data in between reading lines

readuntil! will also still be increasingly faster for very long lines (not present in HISTORY.md)

stevengj · 2023-02-09T17:53:21Z

Marking for triage since it would be good to get some feedback from core devs on whether this API is desired.

stevengj · 2023-02-10T15:50:12Z

I was thinking about this some more in connection with #48625, and I'm starting to feel that a better API would be:

readuntil(out::IO, in::IO, delim)

i.e. add an optional out::IO argument instead of a buffer. Advantages:

Includes functionality of a buffer, since you can pass a preallocated IOBuffer.
Greater functionality because now you can output to arbitrary streams.
Little new code since much of the code already uses IOBuffer internally; just needs to be refactored.
We already have an optimized ios_copyuntil function for copying between IOStreams. The new API would let us export this.

Philosophically, an IOBuffer is already essentially the way to do "in-place" string-like operations in Julia, so it makes a lot of sense to me to add ::IO output as an option to more string functions.

stevengj · 2023-02-11T20:48:37Z

Updated to use the new readuntil(out::IO, in::IO, delim) API. Not tested yet and probably slightly broken, but should give a good sense of what the code will look like.

fredrikekre · 2023-02-12T13:17:32Z

There is already write(out::IO, in::IO), which is more or less readuntil(out, in, eof). Perhaps this should also be a write method? Edit: Although I guess what you are doing "until" is to read, so write(out, in, until=...) doesn't read so nicely.
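For reference, the existing stream-to-stream write behaves like this (a small sketch with made-up data):

```julia
src = IOBuffer("abc\ndef")
dst = IOBuffer()

# Copies everything remaining in `src` to `dst`, ignoring any delimiters,
# and returns the number of bytes written.
nbytes = write(dst, src)
copied = String(take!(dst))
```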

stevengj · 2023-02-12T14:14:34Z

@fredrikekre, I think it's also more discoverable to stick to the convention that you can add an out::IO method as the first argument of a "string" function to make it write to a buffer instead of returning a string. That way you don't have to go hunting around for "what is the buffer equivalent of foo(...)", since it's always foo(io, ...).

For example, we already have join(...) and join(io, ...), and I think we should also add an io argument to replace (#48625), and probably to other functions as well. They can't all be methods of write or print.
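The join pair shows the convention in miniature (a sketch with illustrative values):

```julia
words = ["a", "b", "c"]

# String-returning form.
s = join(words, ", ")

# Same operation with an output stream as the first argument:
# the result is written to `io` instead of being returned as a new String.
io = IOBuffer()
join(io, words, ", ")
written = String(take!(io))
```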

JeffBezanson · 2023-02-16T20:21:10Z

This is great functionality, triage is 👍 . However, we do like the idea of calling it copy and/or copyuntil since it both reads and writes. This can be merged when a name is agreed on, but it seems to me with the code here we should also be able to get versions that (1) copy the entire stream and (2) copy a set number of bytes as well.

Seelengrab · 2023-02-16T20:38:47Z

Posting a reference to the IO blocking behavior issue for future reference #24526

stevengj · 2023-02-16T20:45:05Z

copyuntil seems fine to me.

stevengj · 2023-07-05T17:49:24Z

I'm having trouble seeing how the Asan segfault (in src/flisp/print.c) could be due to this PR.

(I added an additional parameter keep to ios_copyuntil, which is called by flisp, but by passing keep=1 the behavior should be identical to before.)

vtjnash · 2023-07-06T19:56:00Z

base/iostream.jl

+        end
+        (eof(s) || len == out.maxsize) && break
+        len = min(2len + 64, out.maxsize)
+        resize!(d, len)


This seems like it might not quite respect maxsize (not copying the last chunk if it exceeded maxsize during that read). Should it use ensureroom instead? The ensureroom call at the top also seems wrong, since if someone knew the out stream was already sized correctly for a smaller read, that will forcibly try to reallocate it.

stevengj · 2023-07-09

I'm not sure what you mean by "not copying the last chunk". By the time it hits the len == out.maxsize branch, jl_readuntil_buf has already copied out.maxsize bytes into the buffer. The jl_readuntil_buf never reads a chunk exceeding maxsize, because the number of bytes that it reads is bounded by len - ptr + 1.

ensureroom seems like it has quite a few extra checks that aren't needed in the inner loop here given the ensureroom call at the top.

I guess it wouldn't hurt to change the ensureroom call at the top to something like ensureroom(out, isempty(out.data) ? 16 : 0)? Hmm, no, that wouldn't work either… I need at least ensureroom(out, 1) to be certain that the jl_readuntil_buf will read at least 1 byte if there is something to read, as otherwise the iszero(n) && break check is wrong. Probably it's fine to just do ensureroom(out, 1) here. (The main application of this method is probably to read repeatedly into the same seekstart(buf), as in the examples above, in which case ensureroom will do nothing … the buffer will already be as big as the largest line previously read.)

vtjnash · 2023-07-11

Ah, I guess that makes sense. The checks were a bit distant so it wasn't immediately obvious that was being ensured.

Yes, I think checking isempty first makes sense. I realized now that ensureroom already truncates the request to maxsize, so there is no issue with calling that.

stevengj · 2023-07-11

Should be resolved by #50485.

JeffBezanson · 2023-07-07T19:14:13Z

base/io.jl

    end
    return out
 end
+readuntil(s::IO, delim::T; keep::Bool=false) where T =
+    _copyuntil(Vector{T}(), s, delim, keep)
+readuntil(s::IO, delim::UInt8; keep::Bool=false) =


This overwrites the definition on line 525, giving a warning during sysimage build.

giordano · 2023-07-07

Is there a way to catch those warnings and turn them into errors on CI when building the sysimage? I guess we don't want this kind of situation.

stevengj · 2023-07-09

~~Will file a PR to fix this shortly.~~ Filed #50485

stevengj added domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" labels Jan 14, 2023

stevengj added performance Must go faster needs tests Unit tests are required for this change needs docs Documentation for this change is required needs news A NEWS entry is required for this change labels Jan 14, 2023

stevengj removed the needs docs Documentation for this change is required label Jan 15, 2023

stevengj removed needs tests Unit tests are required for this change needs news A NEWS entry is required for this change labels Jan 15, 2023

stevengj added the status:triage This should be discussed on a triage call label Feb 9, 2023

stevengj changed the title ~~in-place readuntil!~~ readuntil(out::IO, in::IO, delim) Feb 11, 2023

JeffBezanson removed the status:triage This should be discussed on a triage call label Feb 16, 2023

stevengj and others added 18 commits July 5, 2023 13:48

add at least 128 bytes on resize, in case caller starts with an empty buffer

942bead

bug fixes and improvements in jl_readuntil_buf

1e7831f

add readuntil! to manual

d7d7b72

optimized IOBuffer readuntil

88b2155

tests, fixes

53e1731

NEWS

635d50e

bugfix

2489441

readuntil(out::IO, ...) instead of readuntil!

dcdff44

bugfixes

b4b3e58

rm stray semicolon

20644c8

readline(out, ...)

b170971

readuntil -> copyuntil

06703c5

add manual entries for copyuntil and copyline

6580502

try calling cleanup() more often in test for Windows

6eb8917

use _unsafe_take and a few other tweaks

243a15b

bugfix: missing ensureroom

1f4ef8a

Update base/io.jl

7061e7f

Co-authored-by: Rafael Fourquet <fourquet.rafael@gmail.com>

Update base/io.jl

13bbd48

Co-authored-by: Rafael Fourquet <fourquet.rafael@gmail.com>

stevengj force-pushed the sgj/readuntil_inplace branch from c50ee4e to 13bbd48 Compare July 5, 2023 17:48

vtjnash merged commit c14d4bb into master Jul 6, 2023
1 check passed

vtjnash deleted the sgj/readuntil_inplace branch July 6, 2023 19:50

vtjnash reviewed Jul 6, 2023

JeffBezanson reviewed Jul 7, 2023

This was referenced Jul 9, 2023

cleanups to copyuntil #50485

Merged

optimized copyuntil JuliaIO/BufferedStreams.jl#76

Merged

remove allocation in skipchars #50526

Open

gbaraldi mentioned this pull request Jul 20, 2023

Regression in string readuntil benchmark #50615

Open

brenhinkeller removed the status:merge me PR is reviewed. Merge when all tests are passing label Aug 8, 2023

stevengj mentioned this pull request Mar 21, 2024

register this package? JuliaStrings/ViewReader.jl#2

Open
