-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Add an iterator implementation for String splitting #20688
Conversation
At a minimum, this patch is required to get this to bootstrap: diff --git a/base/fft/FFTW.jl b/base/fft/FFTW.jl
index d68e302c98..952700283c 100644
--- a/base/fft/FFTW.jl
+++ b/base/fft/FFTW.jl
@@ -20,7 +20,7 @@ export export_wisdom, import_wisdom, import_system_wisdom, forget_wisdom,
const libfftw = Base.libfftw_name
const libfftwf = Base.libfftwf_name
-const version = convert(VersionNumber, split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' '])[2])
+const version = convert(VersionNumber, collect(split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' ']))[2])
## Direction of FFT
diff --git a/base/markdown/GitHub/table.jl b/base/markdown/GitHub/table.jl
index dc31fad954..e87411df2c 100644
--- a/base/markdown/GitHub/table.jl
+++ b/base/markdown/GitHub/table.jl
@@ -8,7 +8,7 @@ end
function parserow(stream::IO)
withstream(stream) do
line = readline(stream)
- row = split(line, r"(?<!\\)\|")
+ row = collect(split(line, r"(?<!\\)\|"))
length(row) == 1 && return
row[1] == "" && shift!(row)
map!(x -> strip(replace(x, "\\|", "|")), row, row)
diff --git a/base/markdown/render/plain.jl b/base/markdown/render/plain.jl
index 01a3a8994d..f0ca89a7c2 100644
--- a/base/markdown/render/plain.jl
+++ b/base/markdown/render/plain.jl
@@ -56,7 +56,7 @@ end
function plain(io::IO, f::Footnote)
print(io, "[^", f.id, "]:")
s = sprint(plain, f.text)
- lines = split(rstrip(s), "\n")
+ lines = collect(split(rstrip(s), "\n"))
# Single line footnotes are printed on the same line as their label
# rather than taking up an additional line.
if length(lines) == 1
diff --git a/base/markdown/render/rst.jl b/base/markdown/render/rst.jl
index ca5661520b..697db58a65 100644
--- a/base/markdown/render/rst.jl
+++ b/base/markdown/render/rst.jl
@@ -60,7 +60,7 @@ end
function rst(io::IO, f::Footnote)
print(io, ".. [", f.id, "]")
s = sprint(rst, f.text)
- lines = split(rstrip(s), "\n")
+ lines = collect(split(rstrip(s), "\n"))
# Single line footnotes are printed on the same line as their label
# rather than taking up an additional line.
if length(lines) == 1
diff --git a/base/markdown/render/terminal/formatting.jl b/base/markdown/render/terminal/formatting.jl
index 1d14f1d60b..048113599a 100644
--- a/base/markdown/render/terminal/formatting.jl
+++ b/base/markdown/render/terminal/formatting.jl
@@ -54,8 +54,8 @@ function ansi_length(s)
replace(s, r"\e\[[0-9]+m", "") |> length
end
-words(s) = split(s, " ")
-lines(s) = split(s, "\n")
+words(s) = collect(split(s, " "))
+lines(s) = collect(split(s, "\n"))
# This could really be more efficient
function wrapped_lines(s::AbstractString; width = 80, i = 0)
diff --git a/base/path.jl b/base/path.jl
index 6b04246a5f..c65777814d 100644
--- a/base/path.jl
+++ b/base/path.jl
@@ -216,7 +216,7 @@ function normpath(path::String)
isabs = isabspath(path)
isdir = isdirpath(path)
drive, path = splitdrive(path)
- parts = split(path, path_separator_re)
+ parts = collect(split(path, path_separator_re))
filter!(x->!isempty(x) && x!=".", parts)
while true
clean = true
@@ -339,8 +339,8 @@ function relpath(path::String, startpath::String = ".")
curdir = "."
pardir = ".."
path == startpath && return curdir
- path_arr = split(abspath(path), path_separator_re)
- start_arr = split(abspath(startpath), path_separator_re)
+ path_arr = collect(split(abspath(path), path_separator_re))
+ start_arr = collect(split(abspath(startpath), path_separator_re))
i = 0
while i < min(length(path_arr), length(start_arr))
i += 1
diff --git a/base/socket.jl b/base/socket.jl
index 421e6fece7..022e34c96f 100644
--- a/base/socket.jl
+++ b/base/socket.jl
@@ -164,7 +164,7 @@ If the address is in octal or hexadecimal, convert it to decimal, otherwise remo
"""
function parse(::Type{IPv4}, str::AbstractString)
- fields = split(str,'.')
+ fields = collect(split(str,'.'))
i = 1
ret = 0
for f in fields
@@ -215,7 +215,7 @@ end
parseipv6fields(fields) = parseipv6fields(fields,8)
function parse(::Type{IPv6}, str::AbstractString)
- fields = split(str,':')
+ fields = collect(split(str,':'))
if length(fields) > 8
throw(ArgumentError("too many fields in IPv6 address"))
elseif length(fields) == 8
diff --git a/base/version.jl b/base/version.jl
index 956763ce27..54312f712b 100644
--- a/base/version.jl
+++ b/base/version.jl
@@ -79,7 +79,7 @@ const VERSION_REGEX = r"^
$"ix
function split_idents(s::AbstractString)
- idents = split(s, '.')
+ idents = collect(split(s, '.'))
ntuple(length(idents)) do i
ident = idents[i]
ismatch(r"^\d+$", ident) ? parse(Int, ident) : String(ident) The fact that so many calling locations need this not to be lazy is a bit of a strong indicator that changing this to be lazy is not a great idea, also this change is massively breaking since the iterator object doesn't have length methods or random access indexing. It might make more sense to do something like have a new |
13f8ea5
to
fc56929
Compare
|
||
iteratorsize(::Type{SplitIterator}) = SizeUnknown() | ||
iteratoreltype(::Type{SplitIterator}) = HasEltype() | ||
eltype(::SplitIterator) = SubString |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be eltype(::Type{SplitIterator})
@@ -248,6 +248,123 @@ julia> rpad("March",20) | |||
rpad(s, n::Integer, p=" ") = rpad(string(s),n,string(p)) | |||
cpad(s, n::Integer, p=" ") = rpad(lpad(s,div(n+strwidth(s),2),p),n,p) | |||
|
|||
immutable SplitIterator | |||
str::AbstractString | |||
splitter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there any algorithmic reason why eachsplit
is slower than split
? would parameterizing the SplitIterator
on the types of str
and splitter
remove that gap, and allow to define split(...) = collect(eachsplit(...))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This phenomena also seems to occur with the graphemes
vs. split(str, "")
which do nearly similar things. I've tried to maintain the invariants between the iterator version and the split
version. We'll see which invariants I've failed to satisfy when I'm able to debug these tests. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe regular split(str, "")
will split graphemes up. That might or might not be a good thing; I think it's not ideal.
immutable SplitIterator | ||
str::AbstractString | ||
splitter | ||
limit::Integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Int
is better here
|
||
function _eachsplit(str::AbstractString, splitter, limit::Integer, keep_empty::Bool) | ||
# Empty string splitter means you want to iterate over the characters | ||
splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regular split
does not have this behavior; it will happily split a grapheme into two parts. perhaps that's not desirable but the SplitIterator seems to also do that if splitting on part of a grapheme.
state.i > state.n || (iter.limit > 0 && state.s == iter.limit) | ||
end | ||
|
||
function next(iter::SplitIterator, state::SplitIteratorState) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the SplitIteratorState
is small enough that you could consider not mutating it and returning a new state, which has some advantages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was concerned about extra memory allocation with this approach given that it's already slower than split
. I'll run a tests to see whether this is worth the cost since performance seems to be the major concern here.
replaced by #39245 |
Per-request of #20603 and also addresses #20628. Simple timing benchmarks show that using and collecting from these iterators are slower and cause more memory allocation.
Here's a sample timing after warming up the JIT:
For a small string:
This performance issue also seems to afflict the
GraphemeIterator
which is constructed bygraphemes(::AbstractString)
which can be optimized by implementing it assplit(str, "")
rather than returning an iterator.Some tests are currently broken since downstream uses of
split
expect an array-like value wherelength
andgetindex
can be called.Looking for feedback before fixing these tests to see whether the maintainers-at-large still think that this abstraction is worth the performance cost or whether the unexpected allocations should be investigated.