WIP: Add an iterator implementation for String splitting #20688

Hydrotoast · 2017-02-20T08:14:50Z

Per-request of #20603 and also addresses #20628. Simple timing benchmarks show that using and collecting from these iterators are slower and cause more memory allocation.

Here's a sample timing after warming up the JIT:

julia> str = randstring(1_000_000);

# without iterator
julia> @time split(str, "a");
  0.005106 seconds (16.20 k allocations: 762.641 KB)

# with iterator
julia> @time collect(split(str, "a"));
  0.012568 seconds (194.29 k allocations: 3.957 MiB)

For a small string:

julia> small_str = "how-now-brown-cow";

# without iterator
julia> @time collect(split(small_str, "-"));
  0.000012 seconds (12 allocations: 544 bytes)

# with iterator
julia> @time collect(split(small_str, "-"));
  0.000048 seconds (20 allocations: 768 bytes)

This performance issue also seems to afflict the GraphemeIterator which is constructed by graphemes(::AbstractString) which can be optimized by implementing it as split(str, "") rather than returning an iterator.

Some tests are currently broken since downstream uses of split expect an array-like value where length and getindex can be called.

Looking for feedback before fixing these tests to see whether the maintainers-at-large still think that this abstraction is worth the performance cost or whether the unexpected allocations should be investigated.

StefanKarpinski · 2017-02-20T20:16:55Z

At a minimum, this patch is required to get this to bootstrap:

diff --git a/base/fft/FFTW.jl b/base/fft/FFTW.jl
index d68e302c98..952700283c 100644
--- a/base/fft/FFTW.jl
+++ b/base/fft/FFTW.jl
@@ -20,7 +20,7 @@ export export_wisdom, import_wisdom, import_system_wisdom, forget_wisdom,
 const libfftw = Base.libfftw_name
 const libfftwf = Base.libfftwf_name

-const version = convert(VersionNumber, split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' '])[2])
+const version = convert(VersionNumber, collect(split(unsafe_string(cglobal((:fftw_version,Base.DFT.FFTW.libfftw), UInt8)), ['-', ' ']))[2])

 ## Direction of FFT

diff --git a/base/markdown/GitHub/table.jl b/base/markdown/GitHub/table.jl
index dc31fad954..e87411df2c 100644
--- a/base/markdown/GitHub/table.jl
+++ b/base/markdown/GitHub/table.jl
@@ -8,7 +8,7 @@ end
 function parserow(stream::IO)
     withstream(stream) do
         line = readline(stream)
-        row = split(line, r"(?<!\\)\|")
+        row = collect(split(line, r"(?<!\\)\|"))
         length(row) == 1 && return
         row[1] == "" && shift!(row)
         map!(x -> strip(replace(x, "\\|", "|")), row, row)
diff --git a/base/markdown/render/plain.jl b/base/markdown/render/plain.jl
index 01a3a8994d..f0ca89a7c2 100644
--- a/base/markdown/render/plain.jl
+++ b/base/markdown/render/plain.jl
@@ -56,7 +56,7 @@ end
 function plain(io::IO, f::Footnote)
     print(io, "[^", f.id, "]:")
     s = sprint(plain, f.text)
-    lines = split(rstrip(s), "\n")
+    lines = collect(split(rstrip(s), "\n"))
     # Single line footnotes are printed on the same line as their label
     # rather than taking up an additional line.
     if length(lines) == 1
diff --git a/base/markdown/render/rst.jl b/base/markdown/render/rst.jl
index ca5661520b..697db58a65 100644
--- a/base/markdown/render/rst.jl
+++ b/base/markdown/render/rst.jl
@@ -60,7 +60,7 @@ end
 function rst(io::IO, f::Footnote)
     print(io, ".. [", f.id, "]")
     s = sprint(rst, f.text)
-    lines = split(rstrip(s), "\n")
+    lines = collect(split(rstrip(s), "\n"))
     # Single line footnotes are printed on the same line as their label
     # rather than taking up an additional line.
     if length(lines) == 1
diff --git a/base/markdown/render/terminal/formatting.jl b/base/markdown/render/terminal/formatting.jl
index 1d14f1d60b..048113599a 100644
--- a/base/markdown/render/terminal/formatting.jl
+++ b/base/markdown/render/terminal/formatting.jl
@@ -54,8 +54,8 @@ function ansi_length(s)
     replace(s, r"\e\[[0-9]+m", "") |> length
 end

-words(s) = split(s, " ")
-lines(s) = split(s, "\n")
+words(s) = collect(split(s, " "))
+lines(s) = collect(split(s, "\n"))

 # This could really be more efficient
 function wrapped_lines(s::AbstractString; width = 80, i = 0)
diff --git a/base/path.jl b/base/path.jl
index 6b04246a5f..c65777814d 100644
--- a/base/path.jl
+++ b/base/path.jl
@@ -216,7 +216,7 @@ function normpath(path::String)
     isabs = isabspath(path)
     isdir = isdirpath(path)
     drive, path = splitdrive(path)
-    parts = split(path, path_separator_re)
+    parts = collect(split(path, path_separator_re))
     filter!(x->!isempty(x) && x!=".", parts)
     while true
         clean = true
@@ -339,8 +339,8 @@ function relpath(path::String, startpath::String = ".")
     curdir = "."
     pardir = ".."
     path == startpath && return curdir
-    path_arr  = split(abspath(path),      path_separator_re)
-    start_arr = split(abspath(startpath), path_separator_re)
+    path_arr  = collect(split(abspath(path),      path_separator_re))
+    start_arr = collect(split(abspath(startpath), path_separator_re))
     i = 0
     while i < min(length(path_arr), length(start_arr))
         i += 1
diff --git a/base/socket.jl b/base/socket.jl
index 421e6fece7..022e34c96f 100644
--- a/base/socket.jl
+++ b/base/socket.jl
@@ -164,7 +164,7 @@ If the address is in octal or hexadecimal, convert it to decimal, otherwise remo
 """

 function parse(::Type{IPv4}, str::AbstractString)
-    fields = split(str,'.')
+    fields = collect(split(str,'.'))
     i = 1
     ret = 0
     for f in fields
@@ -215,7 +215,7 @@ end
 parseipv6fields(fields) = parseipv6fields(fields,8)

 function parse(::Type{IPv6}, str::AbstractString)
-    fields = split(str,':')
+    fields = collect(split(str,':'))
     if length(fields) > 8
         throw(ArgumentError("too many fields in IPv6 address"))
     elseif length(fields) == 8
diff --git a/base/version.jl b/base/version.jl
index 956763ce27..54312f712b 100644
--- a/base/version.jl
+++ b/base/version.jl
@@ -79,7 +79,7 @@ const VERSION_REGEX = r"^
 $"ix

 function split_idents(s::AbstractString)
-    idents = split(s, '.')
+    idents = collect(split(s, '.'))
     ntuple(length(idents)) do i
         ident = idents[i]
         ismatch(r"^\d+$", ident) ? parse(Int, ident) : String(ident)

The fact that so many calling locations need this not to be lazy is a bit of a strong indicator that changing this to be lazy is not a great idea, also this change is massively breaking since the iterator object doesn't have length methods or random access indexing. It might make more sense to do something like have a new eachsplit function that returns an iterator. Either that or produce an iterator when the thing you're splitting is an IO stream – that's a case where laziness is probably the right default. In fact that might be a good way to design these APIs in general: default to eagerness, but be lazy when the input is some kind of stream or iterator.

TotalVerb · 2017-02-22T07:32:05Z

base/strings/util.jl

+
+iteratorsize(::Type{SplitIterator}) = SizeUnknown()
+iteratoreltype(::Type{SplitIterator}) = HasEltype()
+eltype(::SplitIterator) = SubString


should be eltype(::Type{SplitIterator})

TotalVerb · 2017-02-22T07:34:40Z

base/strings/util.jl

@@ -248,6 +248,123 @@ julia> rpad("March",20)
 rpad(s, n::Integer, p=" ") = rpad(string(s),n,string(p))
 cpad(s, n::Integer, p=" ") = rpad(lpad(s,div(n+strwidth(s),2),p),n,p)

+immutable SplitIterator
+    str::AbstractString
+    splitter


is there any algorithmic reason why eachsplit is slower than split? would parameterizing the SplitIterator on the types of str and splitter remove that gap, and allow to define split(...) = collect(eachsplit(...))?

This phenomena also seems to occur with the graphemes vs. split(str, "") which do nearly similar things. I've tried to maintain the invariants between the iterator version and the split version. We'll see which invariants I've failed to satisfy when I'm able to debug these tests. :)

I believe regular split(str, "") will split graphemes up. That might or might not be a good thing; I think it's not ideal.

TotalVerb · 2017-02-22T07:35:09Z

base/strings/util.jl

+immutable SplitIterator
+    str::AbstractString
+    splitter
+    limit::Integer


Int is better here

TotalVerb · 2017-02-22T07:37:31Z

base/strings/util.jl

+
+function _eachsplit(str::AbstractString, splitter, limit::Integer, keep_empty::Bool)
+    # Empty string splitter means you want to iterate over the characters
+    splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty)


regular split does not have this behavior; it will happily split a grapheme into two parts. perhaps that's not desirable but the SplitIterator seems to also do that if splitting on part of a grapheme.

TotalVerb · 2017-02-22T07:44:42Z

base/strings/util.jl

+  state.i > state.n || (iter.limit > 0 && state.s == iter.limit)
+end
+
+function next(iter::SplitIterator, state::SplitIteratorState)


the SplitIteratorState is small enough that you could consider not mutating it and returning a new state, which has some advantages

I was concerned about extra memory allocation with this approach given that it's already slower than split. I'll run a tests to see whether this is worth the cost since performance seems to be the major concern here.

vtjnash · 2021-04-19T16:14:25Z

replaced by #39245

Hydrotoast changed the title ~~Add an iterator implementation for String splitting~~ WIP: Add an iterator implementation for String splitting Feb 20, 2017

kshyatt added the strings "Strings!" label Feb 20, 2017

Add an iterator implementation for String splitting

fc56929

Hydrotoast force-pushed the gcc/string-split-iterator branch 2 times, most recently from 13f8ea5 to fc56929 Compare February 21, 2017 02:33

TotalVerb reviewed Feb 22, 2017

View reviewed changes

anaveragehuman mentioned this pull request Jan 14, 2021

RFC: eachsplit for iterative splitting #39245

Merged

vtjnash closed this Apr 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Add an iterator implementation for String splitting #20688

WIP: Add an iterator implementation for String splitting #20688

Hydrotoast commented Feb 20, 2017

StefanKarpinski commented Feb 20, 2017

TotalVerb Feb 22, 2017

TotalVerb Feb 22, 2017

Hydrotoast Feb 22, 2017

TotalVerb Feb 22, 2017

TotalVerb Feb 22, 2017

TotalVerb Feb 22, 2017

TotalVerb Feb 22, 2017

Hydrotoast Feb 22, 2017

vtjnash commented Apr 19, 2021

WIP: Add an iterator implementation for String splitting #20688

WIP: Add an iterator implementation for String splitting #20688

Conversation

Hydrotoast commented Feb 20, 2017

StefanKarpinski commented Feb 20, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vtjnash commented Apr 19, 2021