Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add an iterator implementation for String splitting #20688

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions base/exports.jl
Original file line number Diff line number Diff line change
Expand Up @@ -753,6 +753,7 @@ export
digits!,
dump,
eachmatch,
eachsplit,
endswith,
escape_string,
graphemes,
Expand Down
117 changes: 117 additions & 0 deletions base/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -248,6 +248,123 @@ julia> rpad("March",20)
rpad(s, n::Integer, p=" ") = rpad(string(s),n,string(p))
cpad(s, n::Integer, p=" ") = rpad(lpad(s,div(n+strwidth(s),2),p),n,p)

immutable SplitIterator
str::AbstractString
splitter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any algorithmic reason why eachsplit is slower than split? would parameterizing the SplitIterator on the types of str and splitter remove that gap, and allow to define split(...) = collect(eachsplit(...))?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This phenomena also seems to occur with the graphemes vs. split(str, "") which do nearly similar things. I've tried to maintain the invariants between the iterator version and the split version. We'll see which invariants I've failed to satisfy when I'm able to debug these tests. :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe regular split(str, "") will split graphemes up. That might or might not be a good thing; I think it's not ideal.

limit::Integer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Int is better here

keep_empty::Bool
end

iteratorsize(::Type{SplitIterator}) = SizeUnknown()
iteratoreltype(::Type{SplitIterator}) = HasEltype()
eltype(::SplitIterator) = SubString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be eltype(::Type{SplitIterator})


type SplitIteratorState
i::Int
j::Int
k::Int
n::Int
s::Int
end

"""
eachsplit(s::AbstractString, [chars]; limit::Integer=0, keep::Bool=true)

Return an iterator of substrings by splitting the given string on occurrences of the given
character delimiters, which may be specified in any of the formats allowed by `search`'s
second argument (i.e. a single character, collection of characters, string, or regular
expression). If `chars` is omitted, it defaults to the set of all space characters, and
`keep` is taken to be `false`. The two keyword arguments are optional: they are a
maximum size for the result and a flag determining whether empty fields should be kept in
the result.

This method is typically slower than `split`, but it does not preemptively allocate an
array.

```jldoctest
julia> a = "Ma.rch"
"Ma.rch"

julia> collect(eachsplit(a,"."))
2-element Array{SubString{String},1}:
"Ma"
"rch"
```
"""
function eachsplit(str::AbstractString, splitter; limit::Integer=0, keep::Bool=true)
_eachsplit(str, splitter, limit, keep)
end

eachsplit(str::AbstractString) = _eachsplit(_default_delims; limit=0, keep=false)

function _eachsplit(str::AbstractString, splitter, limit::Integer, keep_empty::Bool)
# Empty string splitter means you want to iterate over the characters
splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regular split does not have this behavior; it will happily split a grapheme into two parts. perhaps that's not desirable but the SplitIterator seems to also do that if splitting on part of a grapheme.

end

function start(iter::SplitIterator)
i = start(iter.str)
n = endof(iter.str)

r = search(iter.str, iter.splitter, i)
j, k = first(r), nextind(iter.str, last(r))

# Could not find the splitter in the string
if j == 0
j = k = nextind(iter.str, n)
end

# Eat the prefix that matches the splitter
while !iter.keep_empty && i == j && i <= n
i = k
r = search(iter.str, iter.splitter, i)
j, k = first(r), nextind(iter.str, last(r))

# Could not find the splitter in the string
if j == 0
j = k = nextind(iter.str, n)
end
end

SplitIteratorState(i, j, k, n, 0)
end

function done(iter::SplitIterator, state::SplitIteratorState)
state.i > state.n || (iter.limit > 0 && state.s == iter.limit)
end

function next(iter::SplitIterator, state::SplitIteratorState)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the SplitIteratorState is small enough that you could consider not mutating it and returning a new state, which has some advantages

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was concerned about extra memory allocation with this approach given that it's already slower than split. I'll run a tests to see whether this is worth the cost since performance seems to be the major concern here.

result = SubString(iter.str, state.i, prevind(iter.str, state.j))
# Move our iterator to the next position of a potential substring
state.i = state.k
state.s += 1

if done(iter, state)
return result, state
end

# Update the state to find the next end point, j, of the next substring
r = search(iter.str, iter.splitter, state.i)
state.j, state.k = first(r), nextind(iter.str, last(r))

if state.j == 0
state.j = state.k = nextind(iter.str, state.n)
end

while !iter.keep_empty && state.i == state.j && state.i <= state.n
state.i = state.k
r = search(iter.str, iter.splitter, state.i)
state.j, state.k = first(r), nextind(iter.str, last(r))

# Could not find the splitter in the string
if state.j == 0
state.j = state.k = nextind(iter.str, state.n)
end
end

result, state
end

# splitter can be a Char, Vector{Char}, AbstractString, Regex, ...
# any splitter that provides search(s::AbstractString, splitter)
split{T<:SubString}(str::T, splitter; limit::Integer=0, keep::Bool=true) = _split(str, splitter, limit, keep, T[])
Expand Down
41 changes: 41 additions & 0 deletions test/strings/util.jl
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,24 @@ end
#@test isequal(rsplit("a b c"), ["a","b","c"])
#@test isequal(rsplit("a b \t c\n"), ["a","b","c"])

@test isequal(collect(eachsplit("foo,bar,baz", 'x')), ["foo,bar,baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ',')), ["foo","bar","baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ",")), ["foo","bar","baz"])
@test isequal(collect(eachsplit("foo,bar,baz", r",")), ["foo","bar","baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ','; limit=0)), ["foo","bar","baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ','; limit=1)), ["foo,bar,baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ','; limit=2)), ["foo","bar,baz"])
@test isequal(collect(eachsplit("foo,bar,baz", ','; limit=3)), ["foo","bar","baz"])
@test isequal(collect(eachsplit("foo,bar", "o,b")), ["fo","ar"])

@test isequal(collect(eachsplit("", ',')), [""])
@test isequal(collect(eachsplit(",", ',')), ["",""])
@test isequal(collect(eachsplit(",,", ',')), ["","",""])
@test isequal(collect(eachsplit(",,", ','; limit=2)), [",",""])
@test isequal(collect(eachsplit("", ',' ; keep=false)), [])
@test isequal(collect(eachsplit(",", ',' ; keep=false)), [])
@test isequal(collect(eachsplit(",,", ','; keep=false)), [])

let str = "a.:.ba..:..cba.:.:.dcba.:."
@test isequal(split(str, ".:."), ["a","ba.",".cba",":.dcba",""])
@test isequal(split(str, ".:."; keep=false), ["a","ba.",".cba",":.dcba"])
Expand All @@ -93,6 +111,14 @@ let str = "a.:.ba..:..cba.:.:.dcba.:."
@test isequal(rsplit(str, ".:."; limit=4), ["a.:.ba.", ".cba.:", "dcba", ""])
@test isequal(rsplit(str, ".:."; limit=5), ["a", "ba.", ".cba.:", "dcba", ""])
@test isequal(rsplit(str, ".:."; limit=6), ["a", "ba.", ".cba.:", "dcba", ""])

@test isequal(collect(eachsplit(str, ".:.")), ["a","ba.",".cba.:","dcba",""])
@test isequal(collect(eachsplit(str, ".:."; keep=false)), ["a","ba.",".cba.:","dcba"])
@test isequal(collect(eachsplit(str, ".:."; limit=2)), ["a.:.ba..:..cba.:.:.dcba", ""])
@test isequal(collect(eachsplit(str, ".:."; limit=3)), ["a.:.ba..:..cba.:", "dcba", ""])
@test isequal(collect(eachsplit(str, ".:."; limit=4)), ["a.:.ba.", ".cba.:", "dcba", ""])
@test isequal(collect(eachsplit(str, ".:."; limit=5)), ["a", "ba.", ".cba.:", "dcba", ""])
@test isequal(collect(eachsplit(str, ".:."; limit=6)), ["a", "ba.", ".cba.:", "dcba", ""])
end

# zero-width splits
Expand All @@ -113,6 +139,21 @@ end
@test isequal(split("abcd", r"d+"), ["abc",""])
@test isequal(split("abcd", r"[ad]?"), ["","b","c",""])

@test isequal(collect(eachsplit("", "")), [""])
@test isequal(collect(eachsplit("", r"")), [""])
@test isequal(collect(eachsplit("abc", "")), ["a","b","c"])
@test isequal(collect(eachsplit("abc", r"")), ["a","b","c"])
@test isequal(collect(eachsplit("abcd", r"b?")), ["a","c","d"])
@test isequal(collect(eachsplit("abcd", r"b*")), ["a","c","d"])
@test isequal(collect(eachsplit("abcd", r"b+")), ["a","cd"])
@test isequal(collect(eachsplit("abcd", r"b?c?")), ["a","d"])
@test isequal(collect(eachsplit("abcd", r"[bc]?")), ["a","","d"])
@test isequal(collect(eachsplit("abcd", r"a*")), ["","b","c","d"])
@test isequal(collect(eachsplit("abcd", r"a+")), ["","bcd"])
@test isequal(collect(eachsplit("abcd", r"d*")), ["a","b","c",""])
@test isequal(collect(eachsplit("abcd", r"d+")), ["abc",""])
@test isequal(collect(eachsplit("abcd", r"[ad]?")), ["","b","c",""])

# replace
@test replace("\u2202", '*', '\0') == "\u2202"

Expand Down