Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reverse iteration for eachline #42225

Merged
merged 24 commits into from
Dec 20, 2021
Merged

reverse iteration for eachline #42225

merged 24 commits into from
Dec 20, 2021

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Sep 12, 2021

This PR implements Iterators.Reverse iteration for the EachLine iterator, so that now e.g. last(eachline(io)) will work. (This seems to have been requested multiple times on discourse etcetera, and is nontrivial for users implement efficiently themselves. It also seems to be a pretty common question in Python, where the most common answers involve reading the whole file to get the last few lines., or executing an external tail program.)

For seekable streams, it works by reading the data from the end in 4kiB chunks. Performance seems comparable to forwards iteration.

For non-seekable streams, it falls back to reading the whole stream into memory. However, for last(eachline(...)) and take(Reverse(eachline(...)), n) it uses an algorithm (similar to how the Unix tail function works) that requires only bounded memory — it reads the file in a sequence of 4kiB chunks, but discards earlier chunks if they are not needed for the requested number of lines. it throws an exception when seekend or seek is called.

I noticed that first(r::Reverse) was calling last(r.itr), which fails for non-indexable iterators because last(itr) defaults to itr[end], so I changed it to use the default implementation (in abstractarrays.jl) which iterates once. Otherwise, you can't implement last(itr::MyType) by first(Iterators.Reverse(itr)). This is analogous to how last(itr, n) is already implemented. (It might be good to change the default last(itr) method, but that's a topic for another PR.)

@stevengj stevengj added domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" labels Sep 12, 2021
base/io.jl Outdated Show resolved Hide resolved
@tkf
Copy link
Member

tkf commented Sep 12, 2021

I don't think non-seekable IO can satisfy the last interface since it is defined to be O(1)

julia/base/abstractarray.jl

Lines 456 to 460 in 4805d54

last(coll)
Get the last element of an ordered collection, if it can be computed in O(1) time. This is
accomplished by calling [`lastindex`](@ref) to get the last index. Return the end
point of an [`AbstractRange`](@ref) even if it is empty.

A similar discussion on length: #35947

@stevengj
Copy link
Member Author

stevengj commented Sep 12, 2021

It would be easy to restrict this to seekable io and to throw an error otherwise, if desired.

Update: I've changed the PR to drop non-seekable support, since that seemed controversial. (I saved the non-seekable branch at https://github.com/stevengj/julia/tree/sgj/reverse-eachline-nonseekable.)

@stevengj
Copy link
Member Author

Unrelated CI error on freebsd64:

Error in testset LinearAlgebra/lu:
Error During Test at none:1
  Got exception outside of a @test
  ProcessExitedException(5)

@stevengj
Copy link
Member Author

stevengj commented Sep 28, 2021

Apparently unrelated error on freebsd64 (Error in testset LinearAlgebra/lapack from ProcessExitedException(6)) and unrelated error on win32:

Error in testset Test:
Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\stdlib\v1.8\Test\test\runtests.jl:777
  Expression: orig == Random.default_rng()
   Evaluated: Xoshiro(0xc1fb896c346f54ba, 0x3f32061aaae1bb3a, 0x6256794da49331ff, 0x85db582666fa8259) == TaskLocalRNG()
Error in testset Test:
Test Failed at C:\buildbot\worker-tabularasa\tester_win32\build\share\julia\stdlib\v1.8\Test\test\runtests.jl:778
  Expression: rand(orig) == rand()
   Evaluated: 0.544086357518677 == 0.6916330293666764

This PR should be ready for review?

base/io.jl Outdated Show resolved Hide resolved
base/io.jl Outdated Show resolved Hide resolved
@stevengj
Copy link
Member Author

stevengj commented Oct 1, 2021

Updated to handle CRLF line endings, with a test.

Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stevengj
Copy link
Member Author

stevengj commented Oct 1, 2021

(CI failures on freebsd64 and win32 are unrelated: ProcessExitedException in testset LinearAlgebra. Is there an issue tracking this? I'm seeing it on several PRs.)

@tkf
Copy link
Member

tkf commented Oct 1, 2021

Looks like #42415?

@stevengj
Copy link
Member Author

Rebased. Hopefully can be merged if green?

@stevengj stevengj added the status:merge me PR is reviewed. Merge when all tests are passing label Dec 20, 2021
@DilumAluthge DilumAluthge merged commit fda5769 into master Dec 20, 2021
@DilumAluthge DilumAluthge deleted the sgj/reverse-eachline branch December 20, 2021 17:52
@DilumAluthge DilumAluthge removed the status:merge me PR is reviewed. Merge when all tests are passing label Dec 20, 2021
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
@stevengj stevengj mentioned this pull request Jan 14, 2023
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants