Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default last in strutils.find to -1 #18173

Closed
wants to merge 6 commits into from
Closed

Conversation

n5m
Copy link
Contributor

@n5m n5m commented Jun 4, 2021

Refs #18127 (comment). The docs for find say that last represents an inclusive end index, but last=0 is not treated as the 0th index. This PR fixes that.

Future directions:

  1. Raise an IndexDefect if last < start, which will help with edge cases for find and rfind differ on empty needle #18128.
  2. last being restricted to union(-1, Natural) is a bit inconvenient. Is there room for something adjacent to Natural like this?
type
  # Natural* = range[0..high(int)]
  StringIndex* = range[-1..high(int)]

This would be like ssize_t and this type could also be used as the return type of find and rfind. Alternatively, we could change the paramter in find to

func find(..., last: Natural = high(Natural))

instead of what it is in the current state of this PR, which is

func find(..., last: int = -1)

@timotheecour timotheecour reopened this Jun 4, 2021
@timotheecour
Copy link
Member

timotheecour commented Jun 4, 2021

pre-existing API is horrible, conflating a valid value with a different meaning func find*(a: SkipTable, s, sub: string, start: Natural = 0, last = 0): int

note that this breaks code, and the resulting API is still bad because it doesn't use openArray; instead how about this:

introduce openArray[char] overloads and deprecate the procs that have start/last, and make sure that the simple find("abc", "ab") will pick the non-deprecated proc as follows:

func find*(s: openArray[char], sub: string): int = ...
func find*(s, sub: string, start: Natural, last = -1): int {.deprecated: "use overload with openArray".} = ...
func find*(s, sub: string, last: int): int  {.deprecated: "use overload with openArray".} = ...

(the deprecated procs, as i wrote it, have at least 1 non-optional param (start or last), so that the correct overload is picked if no scuh param is used)

more generally

=> just summarized this in nim-lang/RFCs#381

links

@n5m
Copy link
Contributor Author

n5m commented Jun 20, 2021

@timotheecour I've added new variants of find that take openArray, but there were a few issues with this.

First, before this change, a call to find("abc", "xyz", 3) would be interpreted as find("abc", "xyz", start=3), but with this change--even with the technique you posted above to disambiguate the overloaded methods--that call would get silently reinterpreted as find("abc", "xyz", last=3). This is definitely a breaking change. Should it be added to the CHANGELOG?

Second, the disadvantage of these methods taking openArrays is that the returned indices are based on the start of the slice, not on the start of the string. I imagine that most people will have to redo logic like

Nim/lib/pure/strutils.nim

Lines 1857 to 1863 in 58159fc

if index == -1:
return -1
else:
# This version of find returns the index starting from the beginning of the
# string, but the openArray version of find returns the index starting from
# the beginning of the slice.
return start + index
Is there a way to offer this directly out of strutils?

For example,

# This seems much more useful...
doAssert find("dexxxdex", "de", start=5) == 5
# ...than this one
doAssert find("dexxxdex".toOpenArray(5, 7), "de") == 0

@n5m n5m marked this pull request as ready for review June 20, 2021 22:54
@timotheecour
Copy link
Member

timotheecour commented Jun 20, 2021

First, before this change, a call to find("abc", "xyz", 3) would be interpreted as find("abc", "xyz", start=3),

indeed, tough problem, it's hard to satisfy both of these:

find(a, b) # uses new API
find(a, b, 3) # means start = 3
find(a, b, last = 3) # means last = 3

and we can't introduce this silent breaking change

I need to think more here...
The easiest option would be to introduce a new API named findSub that takes openArray instead of adding an overload to find.

That would also address your 2nd point although i kind of disagree with that 2nd point; toOpenArray simplifies the interface / complexity of the API; you pass a container that knows its bounds and don't have to care about whether it was a slice from a bigger container.

@Araq
Copy link
Member

Araq commented Jun 21, 2021

The easiest option would be to introduce a new API named findSub that takes openArray instead of adding an overload to find.

These new operations don't have to be in strutils anyway...

@timotheecour
Copy link
Member

timotheecour commented Jun 21, 2021

These new operations don't have to be in strutils anyway...

ok for strbasics.findSub that would have no "baggage"

@n5m
Copy link
Contributor Author

n5m commented Jun 21, 2021

Okay great, we'll leave all overloads of strutils.find untouched and add a new family of methods to strbasics. How about a family of overloads named strbasics.indexOf?

Apache Commons Lang is an example of another popular library that has addressed what I raised in my second point in #18173 (comment). They have two methods:

  1. indexOf(openArray[char], openArray[char]): int
  2. indexOf(openArray[char], openArray[char], Natural): int

I strongly believe this is the API we should strive for. Notably, there is no need for a last parameter in either of these methods because Java's String.substring(int, int) method wholly solves for that, just as how toOpenArray(string, int, int) does in Nim, though this is inverted for lastIndexOf. The only decision that has to be made is what to do if the Natural param exceeds the length of the haystack: we can either raise an error or treat that as if the haystack is "".

@n5m
Copy link
Contributor Author

n5m commented Jun 21, 2021

Another nice-to-have feature would be that the return types of these indexOf are made StringIndex instead of int. I mentioned this in #18173 (comment). Any objections to adding this into strbasics, too? With it, we could even rewrite indexOf(openArray[char], openArray[char], Natural): int to indexOf(openArray[char], openArray[char], StringIndex): StringIndex where an argument of -1 means "use the whole haystack openArray" and a return value of -1 means "not found."

@timotheecour
Copy link
Member

timotheecour commented Jun 21, 2021

How about a family of overloads named strbasics.indexOf?

ok for strbasics.indexOf; D also uses that name for this purpose (https://dlang.org/library/std/string/index_of.html#1)

ok for:

proc indexOf*(s: openArray[char], sub: openArray[char] | char): int

I don't think a start param is essential and it can always be added in future work in any case.
start may add a bit of convenience in some cases but the caller can do the offset themselves after passing toOpenArray(s, start, s.high) and has the drawback of having to specify "what happens with start out of range" etc, as well as having to add start to all API's involving string/seq's for consistency. start also makes the proc raise, whereas the above form doesn't raise.

The place where start is really needed is when the API needs some lookbehind (eg regex), which isn't the case here.

StringIndex

This would just add un-necessary complexity for users which then need to do conversion between int and StringIndex everytime this is used.

@n5m n5m marked this pull request as draft June 23, 2021 01:40
@n5m
Copy link
Contributor Author

n5m commented Jun 23, 2021

My plan going forward (in order):

Would it be useful to always have a naive implementation available (useful because the naive implementation requires no preprocessing)?

@timotheecour
Copy link
Member

timotheecour commented Jun 23, 2021

@n5m IMO all those comments you link to our outdated because we now use c_strstr since #17672; It would be IMO very low priority to try to come up with a nim-native implementation that tries to beat strstr, which is heavily optimized. What would be ok would be a thin wrapper around https://github.com/RaphaelJ/fast_strstr (BSD3 license should be fine; but some more research would be needed to find the best library available) which shows favorable performance against strstr [1], but that can be done transparently at a later time without changing user facing API. find can be performance critical so having good performance there is worthwhile.

For now though, I'd recommend just having a version of indexOf in strbasics that can be reused for strutils.find (avoiding duplication), it'd be mostly a matter of moving the existing implementation and using openArray inputs (and keeping the fact that we use c_strstr under the hood for c backend)

[1] precedent: dragonbox, schubfach

@n5m
Copy link
Contributor Author

n5m commented Jul 11, 2021

I ran into some issues here. You'll note that in my most recent push, I have this:

Nim/lib/std/strbasics.nim

Lines 149 to 151 in 09a0df2

const hasCStringBuiltin: bool = false
else:
const hasCStringBuiltin: bool = false

This is because although the following linse work in a standalone .nim file, it won't work when part of tstrbasics.nim:

const haystack: seq[char] = @['a', 'b', 'c']
doAssert haystack.indexOf(@['a', 'b', 'c']) == 0

Do you know why these line fail in a test but work outside of it?

Even more problematic is that if we force the use of c_strstr, and remove the following lines

Nim/lib/std/strbasics.nim

Lines 232 to 236 in 09a0df2

if result > haystack.len - needle.len:
# c_strstr will look all the way until a null byte is found, so
# we must ensure that the return value is inside the
# openArray-defined bounds of the strings
result = -1

then

const haystack: string = "0123456789ABCDEFGH"
echo haystack.toOpenArray(5, 9).indexOf("A")

prints "5" when it should print out "-1". This makes the case that perhaps c_strstr should not be used as part of indexOf(openArray[char], openArray[char]).

@n5m
Copy link
Contributor Author

n5m commented Aug 21, 2021

@timotheecour I'm still having trouble using the c_strstr in the openArray/openArray variant of indexOf because c_strstr requires the haystack and needle passed to it to be NULL-terminated. How can that work with an openArray?

Example:

func c_strstr(haystack, needle: cstring): cstring {.
  importc: "strstr", header: "<string.h>".}

func naiveIndexOf(haystack, needle: openArray[char]): int =
  # naive implementation, doesn't handle -1 case
  let found = c_strstr(haystack[0].unsafeAddr, needle[0].unsafeAddr)
  return cast[ByteAddress](found) -% cast[ByteAddress](haystack[0].unsafeAddr)

echo naiveIndexOf("abcdef", "e")                    # prints 4
echo naiveIndexOf("abcdef".toOpenArray(1, 2), "e")  # prints 3
$ nim --version
Nim Compiler Version 1.5.1 [Linux: amd64]
Compiled at 2021-08-21
Copyright (c) 2006-2021 by Andreas Rumpf

git hash: 5b26f2bd81d6fc7d48befbfb4fa3317f713af787

5b26f2b

@timotheecour
Copy link
Member

if needle occurs nowhere in haystack, NULL is returned;

can't you do something like:

if found == nil: result = -1
else: ...

@n5m
Copy link
Contributor Author

n5m commented Aug 22, 2021

Yes, but unfortunately there is a bigger obstacle. I'll trim down my example from #18173 (comment)

func naiveIndexOf(haystack, needle: openArray[char]): int =
  # naive implementation, doesn't handle -1 case
  let found = c_strstr(haystack[0].unsafeAddr, needle[0].unsafeAddr)
  return cast[ByteAddress](found) -% cast[ByteAddress](haystack[0].unsafeAddr)

echo naiveIndexOf("abcdef".toOpenArray(1, 2), "e")  # prints 3

This shows that c_strstr will continue looking past the openArray bounds, which could have a huge performance penalty.

n5m added 6 commits January 5, 2022 02:18
# This is the 1st commit message:

default last in strutils.find to -1

# This is the commit message nim-lang#2:

Revert "default last in strutils.find to -1"

This reverts commit 3a99556.

# This is the commit message nim-lang#3:

add openArray variable of find with SkipTable

# This is the commit message nim-lang#4:

add find method that accepts an openArray

# This is the commit message nim-lang#5:

add tests

# This is the commit message nim-lang#6:

Revert "add find method that accepts an openArray"

This reverts commit 2ab5351.

# This is the commit message nim-lang#7:

Revert "add openArray variable of find with SkipTable"

This reverts commit 58159fc.
fix CI tests on public nimble modules
fix koch docs generation
@ringabout
Copy link
Member

Succeeded by #19761.

@ringabout ringabout closed this Jul 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants