-
Notifications
You must be signed in to change notification settings - Fork 185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read to separator fixes #2961
Read to separator fixes #2961
Conversation
Thank you for your pull request and welcome to our community! To contribute, please sign the Oracle Contributor Agreement (OCA). To sign the OCA, please create an Oracle account and sign the OCA in Oracle's Contributor Agreement Application. When signing the OCA, please provide your GitHub username. After signing the OCA and getting an OCA approval from Oracle, this PR will be automatically updated. If you are an Oracle employee, please make sure that you are a member of the main Oracle GitHub organization, and your membership in this organization is public. |
1a125e6
to
097bee0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fix, this looks quite complicated.
@andrykonchin Could you review this PR as well, especially the bug fixes themselves, and integrate this PR?
I wonder why s.prepend(str)
vs appending to str
(already there before). It seems to make it much harder to reason about it.
This area indeed feels like it could benefit from a big refactor or even rewriting it from scratch, it's particularly hard to follow the logic and likely it optimizes performance for flat strings but since we have ropes we have different performance concerns. |
@nirvdrum Could you add a changelog entry, under 22.3 |
Yes, will do 👌. |
We can revisit that. I remember doing that years ago to avoid string allocations in the most common case. I think this most often gets called as part of The |
^ ah OK, then it'd be useful if you can add a comment explaining the rationale there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes looks good 👍 (except related to read_to_separator_with_limit
). I've left some comments.
src/main/ruby/truffleruby/core/io.rb
Outdated
wanted = limit | ||
|
||
next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose in this case (separator isn't found in @buffer
) we still should check content of str
and a case when separator is partially in str
and partially in @buffer
.
We cannot just yield
str
content.
src/main/ruby/truffleruby/core/io.rb
Outdated
next | ||
else | ||
str << s | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After str << s
we should probably adjust wanted
as well, e.g. wanted -= bytes_to_read
as far as wanted = @limit - str.bytesize
is supposed to always be true.
What if str
is longer than @limit
now?
src/main/ruby/truffleruby/core/io.rb
Outdated
str.clear | ||
last_scan_start = 0 | ||
|
||
yield res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could res
be longer than @limit
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It's confusing and contrary to the documentation for gets
with a limit value. Ruby will read enough points to complete a codepoint. That's why this method uses @buffer.read_to_char_boundary
. I didn't change any of that logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see. I meant that we ignore wanted
value and return all bytes in str
before the separator.
src/main/ruby/truffleruby/core/io.rb
Outdated
# will have the correct bytes. | ||
if offset < str.bytesize | ||
@buffer.put_back(str.byteslice(offset, str.bytesize - offset)) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After @buffer.put_back
we should probably adjust wanted
as well I suppose. I mean wanted += str.bytesize - offset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right that wanted
should be adjusted, but I think it needs to be reset to limit
since we've found a match. Putting the bytes back affects @buffer
, but that's state separately managed. I overlooked this with gets
because the method is never re-entered after the first string is yielded. I'll look at a spec for readlines
with a limit.
I looked into it and TruffleString optimizes for empty strings in an append/concat situation for either |
I've noticed 2 more issues related to the
def put_back(str)
length = str.bytesize
# A simple case, which is common and can be done efficiently
if @start >= length
# ...
else
@storage = @storage.prepend(str)
@total += length
@start = 0
@used += length
end
end |
Do we have any callers doing this? The only cases of If it's just a general design note, let's start a new issue for that discussion. A lot of this code only works under specific circumstances, but it's all internal-only. A lot of it came from Rubinius and hasn't been touched much. Some of it is written weirdly to optimize because we had very IO performance in the past. Just to keep this PR focused, I'd like to keep the comments focused on things to change in this PR. |
c2e4d6c
to
ce1577b
Compare
The previous code expected the entire delimiter to be present in any given read. If the separator was present in the IO source, but the bytes were split between multiple reads, the search for the separator would fail. Co-authored-by: Vinicius Stock <vinicius.stock@shopify.com>
…t's only used internally by TruffleRuby.
ce1577b
to
18e4ae0
Compare
I've removed the commit that was intended to fix reading with a multi-byte separator and a limit. @andrykonchin found a couple of issues, which indicated the existing specs weren't as extensive as I thought. I tried to write new specs. I'm pretty sure I fixed the problems, but while tracing the specs they didn't perform the reads and writes in the same order every time. That's a problem because the specs would pass if the read occurs after two writes, but fail if a read came between two writes. I'm not sure why, but the While I think I fixed the issues, I don't have confidence without a reliable test suite. Moreover, we haven't received any bug reports about multi-byte delimiters with a limit. I only attempted to fix it because the code is similar to multi-byte delimiters without a limit. Due to the risk involved and how difficult it is to follow, I think we should defer the work with a delimiter, while retaining the work without a delimiter. That can be addressed in a more comprehensive cleanup and hopefully be much easier to reason about. |
Agree, my bad. |
This PR fixes
IO::EachReader#read_to_separator
andIO::EachReader#read_to_separator_with_limit
with multi-byte delimiters. If the delimiter appears in the IO object being read from but it split over multiple reads, the old search logic would fail to find the delimiter. Confusingly, if the delimiter appeared in full later in the IO object, that second occurrence's byte position would be returned.I feel quite confident about the
IO::EachReader#read_to_separator
fix. This original issue was detected when working on getting the ruby-lsp test suite passing on TruffleRuby. That test suite uses a pipe to communicate with a subprocess running the Ruby LSP server. With this PR in place those tests now pass. The newgets
spec was based on the observed problem with that test suite. Our first attempt was insufficient because it failed to re-insert any bytes past the delimiter back into the read buffer. The test accounts for this situation.I'm less confident about the
IO::EachReader#read_to_separator_with_limit
changes. I based them on the changes forread_to_separator
and wrote a newgets
spec that initially failed, but now passes. However, I don't have any real life code using this call so I can't test against that. Unfortunately, the logic is a bit more involved than simply finding the delimiter or reading the first N bytes because Ruby will read beyond N bytes to ensure only complete codepoints are returned.Beyond that, I switched some internal calls over to primitives. And I did a minor code clean up of
IO::EachReader
. I think that code in general could benefit from a more comprehensive pass. I've left that for future work. I was aiming to keep this PR as bug fixes only. Each commit is logically standalone so we can selectively apply what's needed for now.Fixes #2938: IO reads with a delimiter broken for multi-byte delimiters.
/cc @vinistock