Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add an iterator implementation for String splitting #20688

Closed

Conversation

Hydrotoast
Copy link

Per-request of #20603 and also addresses #20628. Simple timing benchmarks show that using and collecting from these iterators are slower and cause more memory allocation.

Here's a sample timing after warming up the JIT:

julia> str = randstring(1_000_000);

# without iterator
julia> @time split(str, "a");
  0.005106 seconds (16.20 k allocations: 762.641 KB)

# with iterator
julia> @time collect(split(str, "a"));
  0.012568 seconds (194.29 k allocations: 3.957 MiB)

For a small string:

julia> small_str = "how-now-brown-cow";

# without iterator
julia> @time collect(split(small_str, "-"));
  0.000012 seconds (12 allocations: 544 bytes)

# with iterator
julia> @time collect(split(small_str, "-"));
  0.000048 seconds (20 allocations: 768 bytes)

This performance issue also seems to afflict the GraphemeIterator which is constructed by graphemes(::AbstractString) which can be optimized by implementing it as split(str, "") rather than returning an iterator.

Some tests are currently broken since downstream uses of split expect an array-like value where length and getindex can be called.

Looking for feedback before fixing these tests to see whether the maintainers-at-large still think that this abstraction is worth the performance cost or whether the unexpected allocations should be investigated.

@Hydrotoast Hydrotoast changed the title Add an iterator implementation for String splitting WIP: Add an iterator implementation for String splitting Feb 20, 2017
@kshyatt kshyatt added the domain:strings "Strings!" label Feb 20, 2017
@StefanKarpinski
Copy link
Sponsor Member

At a minimum, this patch is required to get this to bootstrap:

diff --git a/base/fft/FFTW.jl b/base/fft/FFTW.jl
index d68e302c98..952700283c 100644
--- a/base/fft/FFTW.jl
+++ b/base/fft/FFTW.jl
@@ -20,7 +20,7 @@ export export_wisdom, import_wisdom, import_system_wisdom, forget_wisdom,
 const libfftw = Base.libfftw_name
 const libfftwf = Base.libfftwf_name

-const version = convert(VersionNumber, split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' '])[2])
+const version = convert(VersionNumber, collect(split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' ']))[2])

 ## Direction of FFT

diff --git a/base/markdown/GitHub/table.jl b/base/markdown/GitHub/table.jl
index dc31fad954..e87411df2c 100644
--- a/base/markdown/GitHub/table.jl
+++ b/base/markdown/GitHub/table.jl
@@ -8,7 +8,7 @@ end
 function parserow(stream::IO)
     withstream(stream) do
         line = readline(stream)
-        row = split(line, r"(?<!\\)\|")
+        row = collect(split(line, r"(?<!\\)\|"))
         length(row) == 1 && return
         row[1] == "" && shift!(row)
         map!(x -> strip(replace(x, "\\|", "|")), row, row)
diff --git a/base/markdown/render/plain.jl b/base/markdown/render/plain.jl
index 01a3a8994d..f0ca89a7c2 100644
--- a/base/markdown/render/plain.jl
+++ b/base/markdown/render/plain.jl
@@ -56,7 +56,7 @@ end
 function plain(io::IO, f::Footnote)
     print(io, "[^", f.id, "]:")
     s = sprint(plain, f.text)
-    lines = split(rstrip(s), "\n")
+    lines = collect(split(rstrip(s), "\n"))
     # Single line footnotes are printed on the same line as their label
     # rather than taking up an additional line.
     if length(lines) == 1
diff --git a/base/markdown/render/rst.jl b/base/markdown/render/rst.jl
index ca5661520b..697db58a65 100644
--- a/base/markdown/render/rst.jl
+++ b/base/markdown/render/rst.jl
@@ -60,7 +60,7 @@ end
 function rst(io::IO, f::Footnote)
     print(io, ".. [", f.id, "]")
     s = sprint(rst, f.text)
-    lines = split(rstrip(s), "\n")
+    lines = collect(split(rstrip(s), "\n"))
     # Single line footnotes are printed on the same line as their label
     # rather than taking up an additional line.
     if length(lines) == 1
diff --git a/base/markdown/render/terminal/formatting.jl b/base/markdown/render/terminal/formatting.jl
index 1d14f1d60b..048113599a 100644
--- a/base/markdown/render/terminal/formatting.jl
+++ b/base/markdown/render/terminal/formatting.jl
@@ -54,8 +54,8 @@ function ansi_length(s)
     replace(s, r"\e\[[0-9]+m", "") |> length
 end

-words(s) = split(s, " ")
-lines(s) = split(s, "\n")
+words(s) = collect(split(s, " "))
+lines(s) = collect(split(s, "\n"))

 # This could really be more efficient
 function wrapped_lines(s::AbstractString; width = 80, i = 0)
diff --git a/base/path.jl b/base/path.jl
index 6b04246a5f..c65777814d 100644
--- a/base/path.jl
+++ b/base/path.jl
@@ -216,7 +216,7 @@ function normpath(path::String)
     isabs = isabspath(path)
     isdir = isdirpath(path)
     drive, path = splitdrive(path)
-    parts = split(path, path_separator_re)
+    parts = collect(split(path, path_separator_re))
     filter!(x->!isempty(x) && x!=".", parts)
     while true
         clean = true
@@ -339,8 +339,8 @@ function relpath(path::String, startpath::String = ".")
     curdir = "."
     pardir = ".."
     path == startpath && return curdir
-    path_arr  = split(abspath(path),      path_separator_re)
-    start_arr = split(abspath(startpath), path_separator_re)
+    path_arr  = collect(split(abspath(path),      path_separator_re))
+    start_arr = collect(split(abspath(startpath), path_separator_re))
     i = 0
     while i < min(length(path_arr), length(start_arr))
         i += 1
diff --git a/base/socket.jl b/base/socket.jl
index 421e6fece7..022e34c96f 100644
--- a/base/socket.jl
+++ b/base/socket.jl
@@ -164,7 +164,7 @@ If the address is in octal or hexadecimal, convert it to decimal, otherwise remo
 """

 function parse(::Type{IPv4}, str::AbstractString)
-    fields = split(str,'.')
+    fields = collect(split(str,'.'))
     i = 1
     ret = 0
     for f in fields
@@ -215,7 +215,7 @@ end
 parseipv6fields(fields) = parseipv6fields(fields,8)

 function parse(::Type{IPv6}, str::AbstractString)
-    fields = split(str,':')
+    fields = collect(split(str,':'))
     if length(fields) > 8
         throw(ArgumentError("too many fields in IPv6 address"))
     elseif length(fields) == 8
diff --git a/base/version.jl b/base/version.jl
index 956763ce27..54312f712b 100644
--- a/base/version.jl
+++ b/base/version.jl
@@ -79,7 +79,7 @@ const VERSION_REGEX = r"^
 $"ix

 function split_idents(s::AbstractString)
-    idents = split(s, '.')
+    idents = collect(split(s, '.'))
     ntuple(length(idents)) do i
         ident = idents[i]
         ismatch(r"^\d+$", ident) ? parse(Int, ident) : String(ident)

The fact that so many calling locations need this not to be lazy is a bit of a strong indicator that changing this to be lazy is not a great idea, also this change is massively breaking since the iterator object doesn't have length methods or random access indexing. It might make more sense to do something like have a new eachsplit function that returns an iterator. Either that or produce an iterator when the thing you're splitting is an IO stream – that's a case where laziness is probably the right default. In fact that might be a good way to design these APIs in general: default to eagerness, but be lazy when the input is some kind of stream or iterator.

@Hydrotoast Hydrotoast force-pushed the gcc/string-split-iterator branch 2 times, most recently from 13f8ea5 to fc56929 Compare February 21, 2017 02:33

iteratorsize(::Type{SplitIterator}) = SizeUnknown()
iteratoreltype(::Type{SplitIterator}) = HasEltype()
eltype(::SplitIterator) = SubString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be eltype(::Type{SplitIterator})

@@ -248,6 +248,123 @@ julia> rpad("March",20)
rpad(s, n::Integer, p=" ") = rpad(string(s),n,string(p))
cpad(s, n::Integer, p=" ") = rpad(lpad(s,div(n+strwidth(s),2),p),n,p)

immutable SplitIterator
str::AbstractString
splitter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any algorithmic reason why eachsplit is slower than split? would parameterizing the SplitIterator on the types of str and splitter remove that gap, and allow to define split(...) = collect(eachsplit(...))?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This phenomena also seems to occur with the graphemes vs. split(str, "") which do nearly similar things. I've tried to maintain the invariants between the iterator version and the split version. We'll see which invariants I've failed to satisfy when I'm able to debug these tests. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe regular split(str, "") will split graphemes up. That might or might not be a good thing; I think it's not ideal.

immutable SplitIterator
str::AbstractString
splitter
limit::Integer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int is better here


function _eachsplit(str::AbstractString, splitter, limit::Integer, keep_empty::Bool)
# Empty string splitter means you want to iterate over the characters
splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regular split does not have this behavior; it will happily split a grapheme into two parts. perhaps that's not desirable but the SplitIterator seems to also do that if splitting on part of a grapheme.

state.i > state.n || (iter.limit > 0 && state.s == iter.limit)
end

function next(iter::SplitIterator, state::SplitIteratorState)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the SplitIteratorState is small enough that you could consider not mutating it and returning a new state, which has some advantages

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerned about extra memory allocation with this approach given that it's already slower than split. I'll run a tests to see whether this is worth the cost since performance seems to be the major concern here.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Apr 19, 2021

replaced by #39245

@vtjnash vtjnash closed this Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!"
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants