copyuntil(out::IO, in::IO, delim) #48273

Merged
vtjnash merged 24 commits into master from sgj/readuntil_inplace
Jul 6, 2023

Conversation

stevengj
Member

@stevengj stevengj commented Jan 14, 2023

This PR defines and exports new functions:

copyuntil(out::IO, s::IO, delim; keep=false)
copyline(out::IO, s::IO; keep=false)

that read data from s into the out stream until delim is read/written or the end of the stream is reached.
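As a minimal sketch of how the two functions behave (assuming Julia 1.11 or later, where `copyuntil` and `copyline` are exported, and using in-memory `IOBuffer`s in place of real files; the variable names are illustrative only):

```julia
# Requires Julia 1.11 or later, where copyuntil and copyline are exported.
src = IOBuffer("alpha,beta\ngamma")
out = IOBuffer()

copyuntil(out, src, ',')                # delimiter is consumed but not written
first_field = String(take!(out))        # take! also empties `out` for reuse

copyuntil(out, src, '\n'; keep=true)    # keep=true writes the delimiter too
second_field = String(take!(out))

copyline(out, src)                      # the last line has no trailing newline
last_line = String(take!(out))
```

With the default `keep=false`, the delimiter is read from the input but left out of the output.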

The PR was inspired by this post from @jakobnissen: the goal is to make it easier to implement an allocation-free eachline iterator. This can be done in a package, given readuntil with an IOBuffer, using the StringViews.jl package to return a string view of the in-place buffer on each iteration.
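A rough sketch of that line loop, assuming Julia 1.11 or later. The name `sum_line_bytes` is hypothetical, and the sketch only counts bytes per line so that it needs no packages; a real iterator would instead wrap the buffer contents in a StringViews.jl view:

```julia
# Sketch only: reuse one IOBuffer for every line instead of allocating a String.
# `sum_line_bytes` is a hypothetical helper, not part of Base.
function sum_line_bytes(io::IO)
    buf = IOBuffer()
    total = 0
    while !eof(io)
        truncate(buf, 0)              # reset the buffer, keeping its storage
        copyline(buf, io; keep=true)  # copy one line, newline included
        total += position(buf)        # bytes written for this line
    end
    return total
end

nbytes = sum_line_bytes(IOBuffer("ab\ncde\n\nf"))
```

Once the buffer has grown to the longest line seen so far, later iterations should not need to reallocate it.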

The reason it seemed like this needed a Base function, instead of living completely in a package, is that readuntil relies on a low-level jl_readuntil C function that would be difficult to replicate in a package. To obtain comparable performance, it seems like we need an analogous jl_readuntil_buf method (implemented in this PR) and a corresponding Julia API.

Moreover, relatively little new code was required because many of the existing readuntil methods used an IOBuffer internally, so it was merely a matter of refactoring and exporting this functionality. Also, we already had an optimized ios_copyuntil function for copying between IOStreams, which can now be exported in the new API.

To do:

  • Fix bootstrapping failures
  • Benchmarks. (Maybe it is faster just to read the file in 4k blocks into a buffer with readbytes! and then return StringViews on top of that? This could happen completely in a package. It's a lot easier to use something like readuntil!, however.)
  • Tests
  • More docs
  • NEWS
  • Fixes and tests for new out::IO variant
  • add readline(out::IO, in::IO) too?
  • more tests for readline(out::IO, in::IO) methods
  • more benchmarks and optimization

Before I do much more work on this, what do people think?

@stevengj stevengj added domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" labels Jan 14, 2023
@jakobnissen
Contributor

I think it looks great, especially if it's a performance boon. Thanks for working on this.
Do Julia IOStreams have their own buffer? Otherwise I could imagine having to interact with the OS for every line might be quite slow.

@stevengj
Member Author

stevengj commented Jan 14, 2023

Yes, they do have an internal buffer, though I'm not sure how large it is. That's the main reason for this PR — there's no good way to re-implement jl_readline except in C because it needs to access the internal IOStream buffer (hidden inside a C-only data structure).

As I mentioned, the alternative is to do some manual buffering, e.g. reading a file in 4k chunks similar to #42225.

Of course, the advantage of manual buffering à la #42225 is that it can be implemented in pure Julia code in a package. On the other hand, the big disadvantage of manual buffering is that it is problematic for non-seekable streams (e.g. pipes): if you want to read only a few lines and then do something else with the remainder of the stream, you want to be able to continue reading from the end of the last line, not from the end of the last buffer chunk.
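To illustrate the difference, here is a small sketch (assuming Julia 1.11 or later for `copyline`; an `IOBuffer` stands in for a pipe, and the data is made up). Reading a line with `copyline` leaves the stream positioned right after that line, whereas a fixed-size chunk read consumes whatever happens to follow:

```julia
data = "header line\npayload bytes"

# Line-oriented read: the stream is left exactly at the start of the payload.
src = IOBuffer(data)
hdr = IOBuffer()
copyline(hdr, src)
header = String(take!(hdr))
rest = read(src, String)

# Chunked read: one 4k block swallows the header and the payload together.
chunked = IOBuffer(data)
block = read(chunked, 4096)
leftover = read(chunked, String)
```

On a non-seekable stream, a chunked reader would have to hand the unread part of `block` back to the caller itself.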

(You can also use BufferedStreams.jl for additional buffering. That's composable with this PR, though in the long run we may want to add a more specialized method for readuntil!(::BufferedStream, ...))

@stevengj stevengj added performance Must go faster needs tests Unit tests are required for this change needs docs Documentation for this change is required needs news A NEWS entry is required for this change labels Jan 14, 2023
@stevengj
Member Author

stevengj commented Jan 15, 2023

I ran some benchmarks on master with the current code in this PR. My benchmark was to open io = open("HISTORY.md", "r"), using the HISTORY.md file in the julia repo (7037 lines, ≈400kB), and count the length of each line:

  • old: sum(length, eachline(io)): 860.583 μs (14091 allocations: 623.17 KiB)
  • this PR: readuntil! with a preallocated buffer and StringViews to wrap the buffer in a copy-free string: 671.451 μs (0 allocations: 0 bytes)
  • readuntil! with a preallocated buffer and StringViews, but without the low-level jl_readline_buf function: 18.419 ms (0 allocations: 0 bytes)
  • sum(length, eachline(io)) with a BufferedInputStream: 3.781 ms (31347 allocations: 1.97 MiB)
  • readuntil! with a preallocated buffer and StringViews on a BufferedInputStream: 1.144 ms (0 allocations: 0 bytes)

So, only about a 30% speed improvement by eliminating the String allocations. Decent but not overwhelming. Still satisfying to see 0 allocations, however. Also, note that the current speed of eachline relies heavily on the low-level jl_readuntil function, so it doesn't carry over to other types of streams like BufferedInputStream.

On the other hand, it is almost 30× slower if you use a plain IOStream but don't exploit the internal ios_t buffer via jl_readline_buf. That underscores why this functionality is hard to implement outside of Base.

If this PR is merged, it would be good to update BufferedStreams.jl to add a specialized readuntil! method for BufferedInputStream that just copies bytes in a big chunk directly from the buffer rather than the fallback of reading bytes one-by-one. (If you don't use memcpy, you lose.) Hopefully this will improve the speed even further.


Benchmark code

using StringViews, BenchmarkTools, BufferedStreams

io = open("HISTORY.md", "r")

@show sum(length, eachline(seekstart(io), keep=true))
@btime sum(length, eachline(seekstart($io), keep=true))

function doit!(io, buf, delim)
    s = 0
    while !eof(io)
        n = readuntil!(io, buf, delim)
        s += length(StringView(@view buf[1:n]))
    end
    return s
end

buf = Array{UInt8}(undef, 1024)
@show doit!(seekstart(io), buf, '\n')
@btime doit!(seekstart($io), $buf, '\n')

bio = BufferedInputStream(seekstart(io))

@show sum(length, eachline((seekstart(bio); bio), keep=true))
@btime sum(length, eachline((seekstart($bio); $bio), keep=true))

@show doit!((seekstart(bio); bio), buf, '\n')
@btime doit!((seekstart($bio); $bio), $buf, '\n')

close(bio)

@stevengj stevengj removed the needs docs Documentation for this change is required label Jan 15, 2023
@stevengj
Member Author

stevengj commented Jan 15, 2023

I added an optimized readuntil! for IOBuffer, and modified readuntil(::IOBuffer, ...) to use it. It seems to be a net win even for traditional eachline usage that allocates a string on each call. e.g.

iob = IOBuffer(read("HISTORY.md"))
@btime sum(length, eachline(seekstart($iob), keep=true))

previously gave 1.622 ms (23488 allocations: 1.96 MiB) and now gives 1.162 ms (23484 allocations: 1.96 MiB).

If I use an IOBuffer with readuntil! and a pre-allocated buffer with StringView as above, it gives 697.035 μs (0 allocations: 0 bytes).

One complication is that the optimal strategy for readuntil!(::IOBuffer, ...) depends on the length of the "line" (or other data) being read. For very short lines (< 20 bytes) it is better to have a single "manual" loop that simultaneously copies the data and checks for the delimiter. For longer lines, it is better to call findfirst followed by copyto! (i.e. memchr followed by memcpy). I chose to optimize it for longer lines. I didn't see a way to get both without having a Julia-native memcpy. (This is a win for HISTORY.md because the mean line length is about 52 bytes.)
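The two strategies can be sketched on plain byte vectors as follows. This is an illustration only, not the code in the PR: `copy_until_loop!` and `copy_until_search!` are hypothetical names, `findnext` is used in place of `findfirst` so the search can begin at an offset, and `dest` is assumed to be large enough.

```julia
# Strategy 1: a single loop that copies each byte and tests it for the delimiter.
function copy_until_loop!(dest::Vector{UInt8}, src::Vector{UInt8}, start::Int, delim::UInt8)
    n = 0
    for i in start:length(src)
        b = src[i]
        n += 1
        dest[n] = b
        b == delim && break
    end
    return n  # number of bytes copied, delimiter included if found
end

# Strategy 2: locate the delimiter first, then copy the whole span in one call.
function copy_until_search!(dest::Vector{UInt8}, src::Vector{UInt8}, start::Int, delim::UInt8)
    pos = findnext(==(delim), src, start)
    stop = pos === nothing ? length(src) : pos
    n = stop - start + 1
    copyto!(dest, 1, src, start, n)
    return n
end

src = Vector{UInt8}("hello\nworld")
dest1 = zeros(UInt8, 16)
dest2 = zeros(UInt8, 16)
n1 = copy_until_loop!(dest1, src, 1, UInt8('\n'))
n2 = copy_until_search!(dest2, src, 1, UInt8('\n'))
```

Both return the same count and fill `dest` with the same bytes; only the per-byte cost differs, which is why the better choice depends on line length.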

Of course, if you have all of your data in an in-memory buffer already, it is almost certainly even better to use the eachsplit iterator to loop over SubStrings, since that doesn't copy data at all.
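For example, a minimal sketch with made-up data (`eachsplit` requires Julia 1.8 or later):

```julia
text = "first\nsecond\nthird"

# Each piece is a SubString that shares memory with `text`; nothing is copied.
pieces = collect(eachsplit(text, '\n'))
total = sum(length, eachsplit(text, '\n'))
```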

@stevengj stevengj removed needs tests Unit tests are required for this change needs news A NEWS entry is required for this change labels Jan 15, 2023
@stevengj
Member Author

Should be ready for review.

@stevengj
Member Author

stevengj commented Jan 16, 2023

cc @rickbeeloo, author of the https://github.com/rickbeeloo/ViewReader package. On my machine, sum(length, ViewReader.eachlineV("HISTORY.md")) is currently around 600µs, a bit faster than readline! directly on the IOStream as above — probably because ViewReader implements its own buffered I/O with much larger buffers than IOStream.

Once we settle on an API, however, BufferedStreams.jl should be able to implement a specialized readline! method and get similar benefits. Then ViewReader (or a similar package) can become quite tiny, basically just combining BufferedStreams with StringViews and readuntil!.

@stevengj
Member Author

stevengj commented Jan 18, 2023

I was thinking about it some more, and arguably if you want maximum performance for an in-place eachline-like iteration, you want to read from the stream in large blocks (containing many lines) à la BufferedStreams and then not copy at all if possible: just return a StringView directly into the buffer. (Only a little data motion is required when you get to the end of the buffer and need to move the final line fragment back to the beginning to read more.) I think this is what ViewReader.jl does?

From that perspective, however, a readuntil! API is unnecessary? Though it may still be more convenient for some applications, especially if you want to read other data in between reading lines.
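A rough sketch of the block-buffered idea described above, under several assumptions: `foreach_line_view` and `blocksize` are hypothetical names, lines are passed to a callback as byte views rather than StringViews, and each block goes through a scratch vector for simplicity (a tuned version would read straight into the main buffer). It is not how ViewReader.jl or BufferedStreams.jl are implemented.

```julia
# Sketch: iterate over lines as views into one reusable block buffer.
# `foreach_line_view` and `blocksize` are hypothetical names for illustration.
function foreach_line_view(f, io::IO; blocksize::Int=4096)
    buf = Vector{UInt8}(undef, blocksize)    # holds pending line data
    chunk = Vector{UInt8}(undef, blocksize)  # scratch space for each read
    filled = 0                               # number of valid bytes in buf
    while true
        # A pending fragment that fills the buffer means the line is longer
        # than the buffer, so grow it before reading more.
        filled == length(buf) && resize!(buf, 2 * length(buf))
        room = min(length(buf) - filled, length(chunk))
        n = readbytes!(io, chunk, room)      # n == 0 signals end of stream
        copyto!(buf, filled + 1, chunk, 1, n)
        filled += n
        start = 1
        valid = view(buf, 1:filled)
        while true
            nl = findnext(==(0x0a), valid, start)
            nl === nothing && break
            f(view(buf, start:nl))           # line including its '\n'
            start = nl + 1
        end
        leftover = filled - start + 1
        if n == 0
            leftover > 0 && f(view(buf, start:filled))  # final unterminated line
            break
        end
        # Move the unfinished fragment to the front; earlier views are now stale.
        copyto!(buf, 1, buf, start, leftover)
        filled = leftover
    end
    return nothing
end

lines = String[]
foreach_line_view(v -> push!(lines, String(v)), IOBuffer("ab\ncdefgh\nij"); blocksize=4)
```

The callback must finish with each view before the next block is read, because the fragment move overwrites the front of the buffer. The example copies each view into a `String` only so the results can be inspected afterwards.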

@jakobnissen
Contributor

True, and indeed something like Rust's std::io::BufReader might be nice to have. However, how would you keep track of whether it's safe to shift data in the buffer, given that StringViews may hold a reference to any data in the buffer?

@rickbeeloo

rickbeeloo commented Jan 18, 2023

I also agree reading bigger chunks would do better, especially for those reading from HDDs

> Only a little data motion is required when you get to the end of the buffer, and need to move the final line fragment back to the beginning to read more

This works as long as the "to be finished line" is shorter than the currently allocated buffer. If not it might again not find the newline. That would require some extra logic to see if the newline is found yet and otherwise still increase the buffer and extend. Which, in the worst case would give a buffer ~2x the longest line in the file. I didn't really spend time optimizing that, instead, I allocated two buffers (each being the max line length - given by the user) then flip them around like you said and warn if no newline is found after

> Though it may still be more convenient for some applications, especially if you want to read other data in between reading lines

readuntil! will also still be increasingly faster for very long lines (not present in HISTORY.md)

@stevengj stevengj added the status:triage This should be discussed on a triage call label Feb 9, 2023
@stevengj
Member Author

stevengj commented Feb 9, 2023

Marking for triage since it would be good to get some feedback from core devs on whether this API is desired.

@stevengj
Member Author

stevengj commented Feb 10, 2023

I was thinking about this some more in connection with #48625, and I'm starting to feel that a better API would be:

readuntil(out::IO, in::IO, delim)

i.e. add an optional out::IO argument instead of a buffer. Advantages:

  • Includes functionality of a buffer, since you can pass a preallocated IOBuffer.
  • Greater functionality because now you can output to arbitrary streams.
  • Little new code since much of the code already uses IOBuffer internally; just needs to be refactored.
  • We already have an optimized ios_copyuntil function for copying between IOStreams. The new API would let us export this.

Philosophically, an IOBuffer is already essentially the way to do "in-place" string-like operations in Julia, so it makes a lot of sense to me to add ::IO output as an option to more string functions.

@stevengj stevengj changed the title in-place readuntil! readuntil(out::IO, in::IO, delim) Feb 11, 2023
@stevengj
Member Author

Updated to use the new readuntil(out::IO, in::IO, delim) API. Not tested yet and probably slightly broken, but should give a good sense of what the code will look like.

@fredrikekre
Member

fredrikekre commented Feb 12, 2023

There is already write(out::IO, in::IO), which is more or less readuntil(out, in, eof). Perhaps this should also be a write method? Edit: Although I guess what you are doing "until" is to read, so write(out, in, until=...) doesn't read so nicely.
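For reference, a small sketch of the existing stream-to-stream `write` mentioned above, using in-memory buffers and made-up data:

```julia
src = IOBuffer("line one\nline two\n")
dst = IOBuffer()

# Copies everything from the current position of `src` to its end.
nwritten = write(dst, src)
copied = String(take!(dst))
```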

@stevengj
Member Author

stevengj commented Feb 12, 2023

@fredrikekre, I think it's also more discoverable to stick to the convention that you can add an out::IO method as the first argument of a "string" function to make it write to a buffer instead of returning a string. That way you don't have to go hunting around for "what is the buffer equivalent of foo(...)", since it's always foo(io, ...).

For example, we already have join(...) and join(io, ...), and I think we should also add an io argument to replace (#48625), and probably to other functions as well. They can't all be methods of write or print.
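A minimal sketch of that convention using the existing `join` methods (the input values are made up):

```julia
parts = ["a", "b", "c"]

# String-returning form.
joined = join(parts, ", ")

# Same call with an IO as the first argument: writes to the stream instead.
io = IOBuffer()
join(io, parts, ", ")
written = String(take!(io))
```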

@JeffBezanson
Sponsor Member

This is great functionality, triage is 👍 . However, we do like the idea of calling it copy and/or copyuntil since it both reads and writes. This can be merged when a name is agreed on, but it seems to me with the code here we should also be able to get versions that (1) copy the entire stream and (2) copy a set number of bytes as well.

@JeffBezanson JeffBezanson removed the status:triage This should be discussed on a triage call label Feb 16, 2023
@Seelengrab
Contributor

Posting a reference to the IO blocking behavior issue for future reference #24526

@stevengj
Member Author

copyuntil seems fine to me.

@stevengj
Member Author

stevengj commented Jul 5, 2023

I'm having trouble seeing how the Asan segfault (in src/flisp/print.c) could be due to this PR.

(I added an additional parameter keep to ios_copyuntil, which is called by flisp, but by passing keep=1 the behavior should be identical to before.)

@vtjnash vtjnash merged commit c14d4bb into master Jul 6, 2023
1 check passed
@vtjnash vtjnash deleted the sgj/readuntil_inplace branch July 6, 2023 19:50
end
(eof(s) || len == out.maxsize) && break
len = min(2len + 64, out.maxsize)
resize!(d, len)
Sponsor Member

@vtjnash vtjnash Jul 6, 2023

This seems like it might not quite respect maxsize (not copying the last chunk if it exceeded maxsize during that read). Should it use ensureroom instead? The ensureroom call at the top also seems wrong, since if someone knew the out stream was already sized correctly for a smaller read, that will forcibly try to reallocate it.

Member Author

@stevengj stevengj Jul 9, 2023

I'm not sure what you mean by "not copying the last chunk". By the time it hits the len == out.maxsize branch, jl_readuntil_buf has already copied out.maxsize bytes into the buffer. The jl_readuntil_buf never reads a chunk exceeding maxsize, because the number of bytes that it reads is bounded by len - ptr + 1.

ensureroom seems like it has quite a few extra checks that aren't needed in the inner loop here given the ensureroom call at the top.

I guess it wouldn't hurt to change the ensureroom call at the top to something like ensureroom(out, isempty(out.data) ? 16 : 0)? Hmm, no, that wouldn't work either… I need at least ensureroom(out, 1) to be certain that the jl_readuntil_buf will read at least 1 byte if there is something to read, as otherwise the iszero(n) && break check is wrong. Probably it's fine to just do ensureroom(out, 1) here. (The main application of this method is probably to read repeatedly into the same seekstart(buf), as in the examples above, in which case ensureroom will do nothing … the buffer will already be as big as the largest line previously read.)

Sponsor Member

Ah, I guess that makes sense. The checks were a bit distant so it wasn't immediately obvious that was being ensured.

Yes, I think checking isempty first makes sense. I realized now that ensureroom already truncates the request to maxsize, so there is no issue with calling that.

Member Author

should be resolved by #50485

end
return out
end
readuntil(s::IO, delim::T; keep::Bool=false) where T =
_copyuntil(Vector{T}(), s, delim, keep)
readuntil(s::IO, delim::UInt8; keep::Bool=false) =
Sponsor Member

This overwrites the definition on line 525, giving a warning during sysimage build.

Contributor

Is there a way to catch those warnings and turn them into errors on CI when building the sysimage? I guess we don't want this kind of situations.

Member Author

@stevengj stevengj Jul 9, 2023

Will file a PR to fix this shortly. Filed #50485

Labels
domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" performance Must go faster