add version of split(str) that returns iterator #20603

Closed
cossio opened this issue Feb 13, 2017 · 4 comments · Fixed by #39245
Labels
collections (data structures holding multiple items, e.g. sets), strings ("Strings!")

Comments

@cossio
Contributor

cossio commented Feb 13, 2017

I would like a version of split(str) that doesn't allocate the result array and instead returns an iterator over SubStrings.
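
For illustration, here is a small sketch of the difference being requested. It assumes Julia 1.8 or later, where Base provides `eachsplit`; that function postdates this request, and the variable names are only illustrative.

```julia
# Assumes Julia 1.8 or later, where Base provides `eachsplit`.
str = "a,b,c"

eager = split(str, ",")      # allocates a Vector{SubString{String}} up front
lazy = eachsplit(str, ",")   # lazy iterator; yields SubStrings on demand

# The pieces are produced one at a time, without building an array first.
for piece in lazy
    println(piece)
end
```

Collecting the lazy iterator should give the same pieces as `split`, so the two forms differ in when the work happens and what gets allocated, not in the result.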

@ararslan added the collections and strings labels on Feb 13, 2017
@Hydrotoast

Hydrotoast commented Feb 16, 2017

Here's a quick draft:

type SplitIterator                                                              
    str::AbstractString                                                         
    splitter                                                                    
    limit::Integer                                                              
    keep_empty::Bool                                                            
end                                                                             
                                                                                
type SplitIteratorState                                                         
    i::Int                                                                      
    j::Int                                                                      
    k::Int                                                                      
    n::Int                                                                      
    s::Int                                                                      
end                                                                             
                                                                                
function _split(str::AbstractString, splitter, limit::Integer, keep_empty::Bool) 
    # An empty splitter means you want to iterate over the graphemes of the string
    splitter == "" ? graphemes(str) : SplitIterator(str, splitter, limit, keep_empty)
end                                                                             
                                                                                
function start(iter::SplitIterator)                                             
    i = start(iter.str)                                                         
    n = endof(iter.str)                                                         
                                                                                
    r = search(iter.str, iter.splitter, i)                                      
    j, k = first(r), nextind(iter.str, last(r))                                 
                                                                                
    # Could not find the splitter in the string                                 
    if j == 0                                                                   
        j = k = nextind(iter.str, n)                                            
    end                                                                         
                                                                                
    # Eat the prefix that matches the splitter                                  
    while !iter.keep_empty && i == j && i <= n                                  
        i = k                                                                   
        r = search(iter.str, iter.splitter, i)                                  
        j, k = first(r), nextind(iter.str, last(r))                             
                                                                                
        # Could not find the splitter in the string                             
        if j == 0                                                               
            j = k = nextind(iter.str, n)                                        
        end                                                                     
    end                                                                         
                                                                                
    SplitIteratorState(i, j, k, n, 0)                                           
end     

function done(iter::SplitIterator, state::SplitIteratorState)
    state.i > state.n || state.s == iter.limit
end
                                                                                
function next(iter::SplitIterator, state::SplitIteratorState)                   
    result = SubString(iter.str, state.i, prevind(iter.str, state.j))           
    # Move our iterator to the next position of a potential substring
    state.i = state.k                                                           
    state.s += 1                                                                
                                                                                
    if done(iter, state)                                                        
        return result, state                                                    
    end                                                                         
                                                                                
    # Update the state to find the next end point, j, of the next substring
    r = search(iter.str, iter.splitter, state.i)                                
    state.j, state.k = first(r), nextind(iter.str, last(r))                     
                                                                                
    if state.j == 0                                                             
        state.j = state.k = nextind(iter.str, state.n)                          
    end                                                                         
                                                                                
    while !iter.keep_empty && state.i == state.j && state.i <= state.n          
        state.i = state.k                                                       
        r = search(iter.str, iter.splitter, state.i)                            
        state.j, state.k = first(r), nextind(iter.str, last(r))                 
                                                                                
        # Could not find the splitter in the string                             
        if state.j == 0                                                         
            state.j = state.k = nextind(iter.str, state.n)                      
        end                                                                     
    end                                                                         
                                                                                
    result, state                                                               
end

And an example usage:

julia> for x in _split("hello, world", ",", 4, true)
           println(x)
       end
hello
 world

julia> for x in _split("hello, world", "l", 3, false)
           println(x)
       end
he
o, wor
d

julia> for x in _split("hello, world", "l", 3, true)
           println(x)
       end
he

o, wor

The only problem right now is that collecting this into an array requires a length method. Once that is written, I can post some benchmarks against the current implementation.

@ScottPJones thoughts?

@TotalVerb
Contributor

TotalVerb commented Feb 17, 2017

You should define iteratorsize(::Type{SplitIterator}) = SizeUnknown() so that collect does not require a length method.
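
As a rough sketch of this suggestion in Julia 1.x terms: the trait is spelled `Base.IteratorSize` there (it was `iteratorsize` when this was written), and the `start`/`done`/`next` functions are replaced by a single `iterate` method. The `LazySplit` type below is hypothetical and much simpler than the draft above (no `limit` or `keep_empty` handling, and it assumes a non-empty delimiter); it only shows how declaring `SizeUnknown()` lets `collect` work without a `length` method.

```julia
# Hypothetical minimal lazy splitter for Julia 1.x; not the draft from this thread.
struct LazySplit
    str::String
    delim::String
    function LazySplit(str::AbstractString, delim::AbstractString)
        # Assumption: an empty delimiter is rejected rather than special-cased.
        isempty(delim) && throw(ArgumentError("delimiter must be non-empty"))
        return new(String(str), String(delim))
    end
end

# Declare that the number of pieces is not known up front,
# so `collect` grows the result instead of calling `length`.
Base.IteratorSize(::Type{LazySplit}) = Base.SizeUnknown()
Base.eltype(::Type{LazySplit}) = SubString{String}

# The state is the index where the next piece starts; `nothing` marks exhaustion.
function Base.iterate(it::LazySplit, i::Union{Int,Nothing} = firstindex(it.str))
    i === nothing && return nothing
    r = findnext(it.delim, it.str, i)
    if r === nothing
        # No further delimiter: the rest of the string is the last piece.
        return SubString(it.str, i, lastindex(it.str)), nothing
    end
    piece = SubString(it.str, i, prevind(it.str, first(r)))
    return piece, nextind(it.str, last(r))
end

pieces = collect(LazySplit("a,b,,c", ","))
```

Without the `Base.IteratorSize` definition, the default trait is `HasLength()`, and `collect` would try to call `length` on the iterator and fail.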

@TotalVerb
Contributor

Also, please open this as a PR so that it is easier to review.

@jakewilliami

We missed an opportunity to call this spliterate...
