WIP: Add an iterator implementation for String splitting #20688
@@ -753,6 +753,7 @@ export
    digits!,
    dump,
    eachmatch,
    eachsplit,
    endswith,
    escape_string,
    graphemes,

@@ -248,6 +248,123 @@ julia> rpad("March",20)
rpad(s, n::Integer, p=" ") = rpad(string(s),n,string(p))
cpad(s, n::Integer, p=" ") = rpad(lpad(s,div(n+strwidth(s),2),p),n,p)

immutable SplitIterator
    str::AbstractString
    splitter
    limit::Integer
    keep_empty::Bool
end

iteratorsize(::Type{SplitIterator}) = SizeUnknown()
iteratoreltype(::Type{SplitIterator}) = HasEltype()
eltype(::SplitIterator) = SubString

Review comment (truncated): should be

type SplitIteratorState
    i::Int
    j::Int
    k::Int
    n::Int
    s::Int
end

"""
    eachsplit(s::AbstractString, [chars]; limit::Integer=0, keep::Bool=true)

Return an iterator of substrings by splitting the given string on occurrences of the given
character delimiters, which may be specified in any of the formats allowed by `search`'s
second argument (i.e. a single character, collection of characters, string, or regular
expression). If `chars` is omitted, it defaults to the set of all space characters, and
`keep` is taken to be `false`. The two keyword arguments are optional: they are a
maximum size for the result and a flag determining whether empty fields should be kept in
the result.

This method is typically slower than `split`, but it does not preemptively allocate an
array.

```jldoctest
julia> a = "Ma.rch"
"Ma.rch"

julia> collect(eachsplit(a,"."))
2-element Array{SubString{String},1}:
 "Ma"
 "rch"
```
"""
function eachsplit(str::AbstractString, splitter; limit::Integer=0, keep::Bool=true)
    _eachsplit(str, splitter, limit, keep)
end

# Default: split on whitespace and drop empty fields.
eachsplit(str::AbstractString) = _eachsplit(str, _default_delims, 0, false)

function _eachsplit(str::AbstractString, splitter, limit::Integer, keep_empty::Bool)
    # Empty string splitter means you want to iterate over the characters
    splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty)
end

Review comment on the `graphemes(str)` line (truncated): regular

function start(iter::SplitIterator)
    i = start(iter.str)
    n = endof(iter.str)

    r = search(iter.str, iter.splitter, i)
    j, k = first(r), nextind(iter.str, last(r))

    # Could not find the splitter in the string
    if j == 0
        j = k = nextind(iter.str, n)
    end

    # Eat the prefix that matches the splitter
    while !iter.keep_empty && i == j && i <= n
        i = k
        r = search(iter.str, iter.splitter, i)
        j, k = first(r), nextind(iter.str, last(r))

        # Could not find the splitter in the string
        if j == 0
            j = k = nextind(iter.str, n)
        end
    end

    SplitIteratorState(i, j, k, n, 0)
end

function done(iter::SplitIterator, state::SplitIteratorState)
    state.i > state.n || (iter.limit > 0 && state.s == iter.limit)
end

Review comment on `next` (truncated): the

Review comment on `next` (truncated): I was concerned about extra memory allocation with this approach given that it's already slower than

function next(iter::SplitIterator, state::SplitIteratorState)
    result = SubString(iter.str, state.i, prevind(iter.str, state.j))
    # Move our iterator to the next position of a potential substring
    state.i = state.k
    state.s += 1

    if done(iter, state)
        return result, state
    end

    # Update the state to find the next end point, j, of the next substring
    r = search(iter.str, iter.splitter, state.i)
    state.j, state.k = first(r), nextind(iter.str, last(r))

    if state.j == 0
        state.j = state.k = nextind(iter.str, state.n)
    end

    while !iter.keep_empty && state.i == state.j && state.i <= state.n
        state.i = state.k
        r = search(iter.str, iter.splitter, state.i)
        state.j, state.k = first(r), nextind(iter.str, last(r))

        # Could not find the splitter in the string
        if state.j == 0
            state.j = state.k = nextind(iter.str, state.n)
        end
    end

    result, state
end

# splitter can be a Char, Vector{Char}, AbstractString, Regex, ...
# any splitter that provides search(s::AbstractString, splitter)
split{T<:SubString}(str::T, splitter; limit::Integer=0, keep::Bool=true) = _split(str, splitter, limit, keep, T[])

Is there any algorithmic reason why `eachsplit` is slower than `split`? Would parameterizing `SplitIterator` on the types of `str` and `splitter` remove that gap and allow defining `split(...) = collect(eachsplit(...))`?

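
To make the suggestion above concrete, here is a minimal sketch of a split iterator parameterized on the string and splitter types, with an eager version defined as `collect` of the lazy one. It is not the code in this PR: it assumes Julia 1.x (the `iterate` protocol and `findnext` in place of `start`/`next`/`done` and `search`), it assumes the splitter never matches an empty string, and the names `LazySplit`, `lazysplit`, and `eagersplit` are hypothetical. Whether this closes the performance gap is not something the sketch demonstrates.

```julia
# Hypothetical sketch for Julia 1.x; none of these names exist in Base or in this PR.
# Both fields that were abstractly typed in the PR are now type parameters,
# so the compiler can specialize `iterate` on the concrete string and splitter types.
struct LazySplit{S<:AbstractString,D}
    str::S
    splitter::D        # assumed: Char, String, or Regex (anything `findnext` accepts)
    limit::Int         # 0 means "no limit"
    keep_empty::Bool
end

lazysplit(str::AbstractString, splitter; limit::Integer=0, keep::Bool=true) =
    LazySplit(str, splitter, Int(limit), keep)

Base.IteratorSize(::Type{<:LazySplit}) = Base.SizeUnknown()
# Only claim a concrete element type for the `String` case.
Base.eltype(::Type{<:LazySplit{String}}) = SubString{String}

# State is (start index of the next field, number of fields emitted so far).
function Base.iterate(it::LazySplit, state=(1, 0))
    i, count = state
    str = it.str
    n = ncodeunits(str)
    i > n + 1 && return nothing                 # sentinel: the tail was already emitted
    # Stop searching once only one more field is allowed; the tail becomes that field.
    while !(it.limit > 0 && count == it.limit - 1)
        r = findnext(it.splitter, str, i)
        r === nothing && break
        j, k = first(r), nextind(str, last(r))
        k > j || throw(ArgumentError("empty delimiter matches are not handled in this sketch"))
        if it.keep_empty || i < j
            return SubString(str, i, prevind(str, j)), (k, count + 1)
        end
        i = k                                   # skip an empty field
    end
    if it.keep_empty || i <= n
        return SubString(str, i), (n + 2, count + 1)
    end
    return nothing
end

# The eager version falls out of the lazy one.
eagersplit(args...; kwargs...) = collect(lazysplit(args...; kwargs...))

println(eagersplit("a,b,,c", ","))
println(eagersplit("a,b,,c", ','; keep=false))
println(eagersplit("a,b,,c", ","; limit=2))
```

Because the state is an immutable tuple rather than a mutable object, each step returns a fresh state instead of updating one in place.
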
This phenomenon also seems to occur with `graphemes` vs. `split(str, "")`, which do nearly the same thing. I've tried to maintain the invariants between the iterator version and the `split` version. We'll see which invariants I've failed to satisfy when I'm able to debug these tests. :)

I believe regular `split(str, "")` will split graphemes up. That might or might not be a good thing; I think it's not ideal.
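
The difference raised in this last comment can be seen on a string containing a combining character. The snippet below is a small illustration, assuming Julia 1.x, where `graphemes` lives in the `Unicode` standard library; the variable names are only for illustration.

```julia
using Unicode  # standard library; provides `graphemes` in Julia 1.x

s = "e\u0301a"   # 'e' followed by a combining acute accent, then 'a'

by_character = split(s, "")            # one piece per character, so the accent is separated
by_grapheme  = collect(graphemes(s))   # the accent stays attached to its base letter

println(length(by_character))  # 3
println(length(by_grapheme))   # 2
```

So substituting `graphemes(str)` for an empty splitter, as `_eachsplit` does, would give results that differ from `split(str, "")` on such input.
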