Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bytes, strings: add Cut #46336

Closed
rsc opened this issue May 24, 2021 · 47 comments
Closed

bytes, strings: add Cut #46336

rsc opened this issue May 24, 2021 · 47 comments

Comments

@rsc
Copy link
Contributor

rsc commented May 24, 2021

I propose to add to package strings:

// Cut cuts s around the first instance of sep,
// returning the text before and after sep.
// The found result reports whether sep appears in s.
// If sep does not appear in s, cut returns s, "", false.
func Cut(s, sep string) (before, after string, found bool) {
	if i := Index(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, "", false
}

I similarly propose to add to package bytes:

// Cut cuts s around the first instance of sep,
// returning the text before and after sep.
// The found result reports whether sep appears in s.
// If sep does not appear in s, cut returns s, "", false.
func Cut(s, sep []byte) (before, after []byte, found bool)

Cut is a single function that replaces and simplifies the overwhelming majority of the usage of four different standard library functions at once (Index, IndexByte, IndexRune, and SplitN). It has also been invented twice in the main repo. For these reasons, I propose to add Cut to the standard library.

These were previously proposed as part of #40135, which turned into quite the bikeshed discussion about compiler optimization and so on. There were also various other functions proposed, which made the discussion even more wide-ranging.

This issue is a replacement for #40135, to start a new and I hope more focused discussion. Thanks very much to @ainar-g for starting the discussion on #40135.

Edit, May 24: Renamed the third result to found to avoid confusion with the comma-ok form for map access, channel receive, and so on, which always return zero values with ok = false.

The Unreasonable Effectiveness of Cut

To attempt to cut off the bikeshedding this time, let me present data showing why Cut is worth adding.

Anecdotally, after the discussion on #40135 I copied the suggested implementation of Cut into every program that I wrote that could use it for a while, and that turned out to be just about every program I wrote that did anything with strings. That made me think there is something here.

To check that belief, I recently searched for uses of strings.Index, strings.IndexByte, or strings.IndexRune in the main repo and converted the ones that could use strings.Cut instead. Calling those all “Index” for the sake of discussion (any call with a constant sep should use Index anyway), there were:

  • 311 Index calls outside examples and testdata.
  • 20 should have been Contains
  • 2 should have been 1 call to IndexAny
  • 2 should have been 1 call to ContainsAny
  • 1 should have been TrimPrefix
  • 1 should have been HasSuffix

That leaves 285 calls. Of those, 221 were better written as Cut, leaving 64 that were not.
That is, 77% of Index calls are more clearly written using Cut. That's incredible!

A typical rewrite is to replace:

 	// The first equals separates the key from the value.
	eq := strings.IndexByte(rec, '=')
	if eq == -1 {
		return "", "", s, ErrHeader
	}
	k, v = rec[:eq], rec[eq+1:]

with

	// The first equals separates the key from the value.
	k, v, ok = strings.Cut(rec, "=")
	if !ok {
		return "", "", s, ErrHeader
	}

I have long believed that this line from the first version is an elegant Go idiom:

	k, v = rec[:eq], rec[eq+1:]

However, it is undeniably more elegant not to write that line at all, as in the second version. More complex examples became quite a lot more readable by working with strings instead of indexes. It seems to me that indexes are the equivalent of C pointers. Working with actual strings instead, as returned by Cut, lets the code raise the level of abstraction significantly.

As noted in #40135, Cut is very much like SplitN with n=2, but without the awkwardness and potential inefficiency of the slice. A typical rewrite is to replace:

elts := strings.SplitN(flag, "=", 2)
name := elts[0]
value := ""
if len(elts) == 2 {
	value = elts[1]
}

with

name, value, _ := strings.Cut(flag, "=")

In the discussion on #40135, the point was raised that maybe we could make SplitN with count 2 allocate the result slice on the caller's stack, so that the first version would be as efficient as the second. I am sure we can with mid-stack inlining, and we probably should to help existing code, but making SplitN with n=2 more efficient does not make the code any less awkward to write. The reason to adopt Cut is clarity of code, not efficiency.

Of the 33 calls to SplitN in the main repo (excluding examples), 24 were using count 2 and were more clearly written using Cut instead.

That is, 72% of SplitN calls are also more clearly written using Cut. That's also incredible!

Looking in my public Go corpus, I found 88,230 calls to strings.SplitN. Of these, 77,176 (87%) use a fixed count 2. I expect that essentially all of them would be more clearly written using Cut.

It is also worth noting that something like Cut had been reinvented in two different packages as an unexported function: golang.org/x/mod/sumdb/note's chop (7 uses) and net/url's split (4 uses). Clearly, slicing out the separator is an incredibly common thing to do with the result of strings.Index.

The conversions described here can be viewed in CL 322210.

As noted in #40135, Cut is similar to Python's str.partition(sep), which returns (before, sep, after) if sep is found and (str, "", "") if not. The boolean result seems far more idiomatic in Go, and it was used directly as a boolean in over half the Cut calls I introduced in the main repo. That is, the fact that str.partition is useful in Python is added evidence for Cut, but we need not adopt the Pythonic signature. (The more idiomatic Go signature was first suggested by @nightlyone.)

Again, Cut is a single function that replaces and simplifies the overwhelming majority of the usage of four different standard library functions at once (Index, IndexByte, IndexRune, and SplitN), and it has been invented twice in the main repo. That seems above the bar for the standard library.

Why not other related functions?

A handful of other functions were suggested in #40135 as well. Here is some more data addressing those.

LastCut. This function would be

// LastCut cuts s around the last instance of sep,
// returning the text before and after sep.
// The found result reports whether sep appears in s.
// If sep does not appear in s, cut returns s, "", false.
func LastCut(s, sep string) (before, after string, found bool) {
	if i := LastIndex(s, sep); i >= 0 {
		return s[:i], s[i+len(sep):], true
	}
	return s, "", false
}

There are 107 calls to strings.LastIndex and strings.LastIndexByte in the main repo (compared to 285 for Index/IndexByte/IndexRune). I expect a substantial fraction of them would be made simpler using LastCut but have not investigated except for looking at a couple dozen random examples. I considered including LastCut in this proposal, but it seemed easier to limit it to the single function Cut.

It may be worth adding LastCut as well.

Until. This function would be

// Until returns the prefix of s until the first instance of sep,
// or the whole string s if sep does not appear.
func Until(s, sep string) string {
	s, _, _ = Cut(s, sep)
	return s
}

Of the 261 calls to strings.Cut I ended up with in the main repo, 51 (20%) were of this form. Having the single-result form would further simplify uses by allowing it to appear inside other expressions. Like with LastCut, I considered including Until in the proposal but it seemed easier to limit it to the single function Cut.

It may be worth adding Until as well.

(Until was also suggested as PrefixUntil.)

SuffixAfter. This function was suggested as meaning

func SuffixAfter(s, sep string) string {
	_, s, _ = strings.LastCut(s, sep)
	return s
}

Similarly, we could consider a variant using Cut instead of LastCut. In the 261 calls to strings.Cut I ended up with in the main repo, only 6 (2%) were of the form _, x, _ = strings.Cut. So the SuffixAfter form using the result of Cut is definitely not worth adding. It is possible that the data would come out differently for LastCut, but in the absence of that evidence, it is not worth proposing to add SuffixAfter. (Adding the LastCut form would require first adding LastCut as well.)

SplitFirst. This function was suggested as meaning

func SplitFirst(s, sep string) (before, after string) {
	before, after, _ = strings.Cut(s, sep)
	return before, after
}

Of the 261 calls to Cut I added to the main repo, only 106 (41%) discard the final result. Although the result can be reconstructed as len(before) != len(s), it seems wrong to force the majority of uses to do that, especially since there are multiple results either way.

Why not other unrelated functions?

There was concern in #40135 even among those who thought Cut was worth adding about where the bar was for standard library additions and whether adding Cut meant opening the floodgates. I am comfortable saying that the bar is somewhere near “replaces and simplifies the overwhelming majority of the usage of four different standard library functions at once” and similarly confident that establishing such a bar would not open any floodgates.

@gopherbot gopherbot added this to the Proposal milestone May 24, 2021
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/322210 mentions this issue: all: use strings.Cut

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/322209 mentions this issue: strings: add Cut

@eZanmoto
Copy link

I like this idea. I have a question regarding the decision to return s in the false case, and wondering if there are many other cases in Go where meaningful data is returned with the negative case. I know that io.Reader and io.Writer can give meaningful returns in the case of io.EOF, but I'm not as used to if !ok having meaningful data, such as in the case of reflection, channel reads, map access, etc. While it's elegant in how it simplifies the existing usages, it did cause me to double-take when reading it at first, so I wanted to highlight this in case there may be others that encountered the same. Ignoring the ok value of a function also struck me as a bit surprising.

@rsc
Copy link
Contributor Author

rsc commented May 24, 2021

@eZanmoto Usually there is no value to return in a "negative" case. Here there is. Instead of thinking of the final bool as indicating whether the first two results exist at all, think of it as adding one more piece of information to the overall answer. That is, it's a different kind of bool from a map access, like math/big.Rat.Float64. Or think of it as similar to the values returned on range errors from strconv.ParseInt.

I renamed the result in the declarations above to found, to try to help with this surprise. I do expect at most call sites people will still call it ok.

@seebs
Copy link
Contributor

seebs commented May 24, 2021

I had an idea too bad not to share:

prefix, ok := haystack[:"."]

I think this constitutes a vote for Cut.

I do think it's correct to return s when found==false, because s is indeed the set of characters in s prior to the first instance of a thing which was not found in s.

@kylelemons
Copy link
Contributor

Hearty +1 from me.

I am also in favor of Until, and would be perfectly comfortable with the analogous LastCut and After.

@beoran
Copy link

beoran commented May 25, 2021

The idea of the Cut function is great, but seeing hat his will be both for string and byte arrays, perhaps this could be a generic function?

@rsc
Copy link
Contributor Author

rsc commented May 25, 2021

Generics are not really helpful for unifying string and []byte functions, and the separate strings and bytes packages will not be going away.

@benhoyt
Copy link
Contributor

benhoyt commented May 26, 2021

I definitely support adding this feature, and would have used it many times already. I remember really appreciating when they added str.partition in Python -- it avoided the split ... if len(parts) == 2 awkwardness, and as you've shown, often avoids an if at all.

However, I don't think the name Cut is perfect. At took me a couple of reads of the description to figure out what it did. "Where does it cut?" "Oh, it cuts into more than two pieces, and 'cuts out' the separator?" Either way, I don't think the doc comment should use the word in its definition: "Cut cuts ...". Maybe "Cut splits the string before and after the first instance of sep" or something.

My vote would be for the exact same signature, but with the name Partition. Longer but I think more explanatory. And I don't think that's just because I've used Python's version ... "partition" seems to evoke carefully separating into pieces and returning those pieces. You partition a country into carefully-defined states. A partition (albeit the noun) is also a very close word: it's the space or material in between two other things -- exactly the sense here. Cut is vague; you can cut things into any shape or size. If you get a cut, it's an often-messy abrasion on your skin -- no before, after, or partition.

@ainar-g
Copy link
Contributor

ainar-g commented May 26, 2021

Fwiw, as the author of the original issue I can say that Cut solves it for me. I have no objections to the name or the implementation.

@zephyrtronium
Copy link
Contributor

zephyrtronium commented May 26, 2021

My vote would be for the exact same signature, but with the name Partition. Longer but I think more explanatory. And I don't think that's just because I've used Python's version ... "partition" seems to evoke carefully separating into pieces and returning those pieces. You partition a country into carefully-defined states. A partition (albeit the noun) is also a very close word: it's the space or material in between two other things -- exactly the sense here. Cut is vague; you can cut things into any shape or size. If you get a cut, it's an often-messy abrasion on your skin -- no before, after, or partition.

Maybe pedantic, but as a mathematician, to me "partition" means a set of disjoint sets the union of which equals another set. Since Cut, as proposed, does not include the separator, it does not give a partition under that technical definition (adjusted for strings). Python's partition does include the separator in the returned value, so it does truly form a partition.

@akupila
Copy link

akupila commented May 26, 2021

When I first saw this proposal, I assumed Cut was related to the TrimXXX functions. I'm not suggesting changing the name, but I agree with what @benhoyt said regarding the doc comment.

@hherman1
Copy link

I think Cut is a really illustrative metaphor for the operation. It takes a second to get it, but then once you get it it’s completely unambiguous. Also it’s short, so not v taxing in the page. I’m pro cut!

@networkimprov
Copy link

networkimprov commented May 27, 2021

I don't think "Cut" captures the meaning; it implies "remove/drop". Options:

Split2 - associates with current Split ops
Cleave
Bisect
Divide
Separate
Segment

Since many string-search ops support a LastOp() variant, I think LastCut should be added too.

@hherman1
Copy link

Chop?

@benhoyt
Copy link
Contributor

benhoyt commented May 28, 2021

In case it's helpful to anyone, some evocative images (several of them food-related!) that I think capture the spirit of each word, with a comment on each:

  • Cut - certainly not the worst word, but I think it more conveys cutting into a shape or cutting a string into two
  • Chop - I think this usually means "into small pieces", which doesn't apply
  • Cleave - a funny word that means its own opposite; in the "cut" sense it's usually a very forceful operation
  • Divide - not a bad word, though probably too close to "division" in the numbers sense
  • Bisect - almost always means "into equal parts" which doesn't apply (and overloaded in CS)
  • Partition - still my vote, conveys separating into two not-necessary-equal parts, with something in between
  • Segment - not particularly fitting, usually connotes multiple segments like split
  • Separate - doesn't quite fit, usually means carefully evaluating components and putting into similar piles

@fzipp
Copy link
Contributor

fzipp commented May 28, 2021

I don't think "Cut" captures the meaning; it implies "remove/drop".

But "Cut" does remove/drop the "sep" part.

@nightlyone
Copy link
Contributor

nightlyone commented May 28, 2021

Rust calls this operation split_once

@sina-devel
Copy link

For more readability and compatibility with other functions (Split, SplitN, ...), it is better to name it SplitOnce.

@networkimprov
Copy link

networkimprov commented May 28, 2021

Since the proposed API is most like the current Split APIs...

SplitAt - emphasizes separator
SplitPair - emphasizes return values
SplitN2 - references SplitN(..., 2)
SplitOnce - emphasizes search iterations

@gazerro
Copy link
Contributor

gazerro commented May 29, 2021

Let me suggest SplitAround

// SplitAround splits s around the first instance of sep,
// ...
func SplitAround(s, sep string) (before, after string, found bool) {
	...
}

@rsc
Copy link
Contributor Author

rsc commented Jun 2, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@rsc
Copy link
Contributor Author

rsc commented Jul 21, 2021

I think we probably want to wait on LastCut until we have more compelling data.

@earthboundkid
Copy link
Contributor

This hasn't been merged yet. Would it be helpful for me to open a CL?

@rsc
Copy link
Contributor Author

rsc commented Sep 21, 2021

I was just about to send one. Thanks though.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/351711 mentions this issue: all: use bytes.Cut, strings.Cut

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/351710 mentions this issue: bytes, strings: add Cut

gopherbot pushed a commit that referenced this issue Oct 6, 2021
Many uses of Index/IndexByte/IndexRune/Split/SplitN
can be written more clearly using the new Cut functions.
Do that. Also rewrite to other functions if that's clearer.

For #46336.

Change-Id: I68d024716ace41a57a8bf74455c62279bde0f448
Reviewed-on: https://go-review.googlesource.com/c/go/+/351711
Trust: Russ Cox <rsc@golang.org>
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@kjgorman
Copy link
Contributor

kjgorman commented Oct 6, 2021

Should the godoc for bytes specify that: 'If sep does not appear in s, cut returns s, nil, false'? (Rather than ""?)

edit: I thought I was too lazy to sign the CLA and setup gerrit etc to send a one word patch, but in fact this only required a slow Sunday afternoon: https://go-review.googlesource.com/c/go/+/354969

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/369981 mentions this issue: doc/go1.18: mention bytes.Cut and strings.Cut

gopherbot pushed a commit that referenced this issue Dec 7, 2021
For #47694.

Updates #46336.

Change-Id: Ibbd058a1fd4d6b0aa38d3e8dc15b560d1e149f7e
Reviewed-on: https://go-review.googlesource.com/c/go/+/369981
Trust: Austin Clements <austin@google.com>
Reviewed-by: Jeremy Faller <jeremy@golang.org>
@danieljl
Copy link

danieljl commented Feb 3, 2022

A reasonable argument against the name Cut:

My only concern is that people will miss "Cut" in the docs, because it won't group with "Split". There are several SplitX APIs, and folks won't guess that there's another Split-like API elsewhere in the docs.

"Cut" is a clear, simple term, but there's already a "SplitX" precedent. I haven't heard a compelling reason to break with it.

was replied by:

We can certainly mention Cut in the Split docs.

And yet we haven't mentioned it. Besides, we should mention Cut not only in the doc of Split, but also in that of SplitAfter, SplitAfterN, SplitN, and the future Split___ functions.

Moreover, in cases when users open the bytes doc page and search the word "split", they will unlikely find Cut because the word "split" doesn't even exist in the Cut doc: "Cut slices s around ..."

Maybe we should have named it SplitOnce or similar, after all.

@earthboundkid
Copy link
Contributor

We’re in the Go 1.18 beta period. I think the ship has sailed on the name, but it’s not too late to send docs fixes.

@ianlancetaylor
Copy link
Member

ianlancetaylor commented Feb 3, 2022

Sent https://go.dev/cl/382954 for the docs. @danieljl Thanks for pointing that out.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/382954 mentions this issue: bytes, strings: mention Cut in docs for Split and SplitN

@cristaloleg

This comment was marked as resolved.

gopherbot pushed a commit that referenced this issue Feb 8, 2022
For #46336

Change-Id: Idc23302085e14e24d571f5995d6d33ca964a0021
Reviewed-on: https://go-review.googlesource.com/c/go/+/382954
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@rsc rsc moved this to Accepted in Proposals Aug 10, 2022
@rsc rsc added this to Proposals Aug 10, 2022
@rsc rsc removed this from Proposals Oct 19, 2022
@golang golang locked and limited conversation to collaborators Feb 8, 2023
@dmitshur dmitshur modified the milestones: Backlog, Go1.18 Jun 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests