More string concatenation operations? #1771

Closed
johnmyleswhite opened this issue Dec 16, 2012 · 49 comments

Comments

@johnmyleswhite
Member

I was a little surprised by the following:

julia> "a" * 'b'
no method *(ASCIIString,Char)

julia> 'b' * "a"
no method *(Char,ASCIIString)

Should these be added? I'm assuming that promoting Chars to strings would be a bad idea in general, but this case seems like one worth accounting for.

@toivoh
Contributor

toivoh commented Dec 16, 2012

I agree in principle. But then it feels like 'a'*'b' should be "ab", and not 9506?

@StefanKarpinski
Member

The String*Char operations definitely should be added and I do think that the concatenation of 'a' and 'b' should be "ab", but that example is actually giving me pause about using * for string concatenation at all :-\

@binarybana
Contributor

Don't lose faith in the mathematical purity Stefan! You were the one that originally convinced me when I found this old mailing list thread.

But on a more serious note, I was a bit confused when I first came from Python and found the inoperable "string " + "string." Perhaps we should throw a more useful warning ("try the '*' command instead. :)") instead of a method not found.

And if I wanted to multiply the ASCII value of two characters, I would think int('a') * int('b') to be more logical ... unless you wanted to stay within Uint8 space that is.

@toivoh
Contributor

toivoh commented Dec 16, 2012

I could definitely see 'a'*'b' being "ab". I guess it comes down to the question of whether you think of a Char primarily as a character or a number. I would personally say character myself.

@johnmyleswhite
Member Author

I personally see a Char as a character, so I have no trouble having any arithmetic operation over Char behave as if it were a string concatenation operation.

@StefanKarpinski
Member

For * there's no real problem since I'm unclear on why you'd be multiplying Char values anyway. However, subtraction of Char values definitely makes sense and is quite useful, and while addition of Char values is less obviously useful, it seems like something you might want to have by analogy (Char+Int=>Char makes more sense than Char+Char=>Char, however). But then + might lead to a clash with the idea of * and + forming a pattern ring where * means concatenation and + means alternation. That's what's giving me pause.

What might make sense is restricting arithmetic operations on Char values only to Char-Char=>Int, Char+Int=>Char and Int+Char=>Char. Then Char*Char can be safely defined as concatenation and Char+Char can be left undefined now but be potentially available as pattern alternation in the future.
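
For illustration, here is a minimal sketch of how the restricted set of Char operations proposed above would behave. It assumes a recent Julia 1.x release, where this happens to be the behaviour, and not the 2012 build discussed in this thread; the variable names are made up for the example.

```julia
# Assumes a recent Julia (1.x); the 2012 behaviour in this thread differed.
offset  = 'd' - 'a'      # Char - Char gives an Int
shifted = 'a' + 2        # Char + Int gives a Char
flipped = 2 + 'a'        # Int + Char gives a Char
joined  = 'a' * 'b'      # Char * Char concatenates into a String

# Char + Char is left undefined, so calling it raises a MethodError.
plus_defined = try
    'a' + 'b'
    true
catch err
    err isa MethodError ? false : rethrow()
end
```

Leaving Char+Char undefined is what keeps + free for some other meaning, such as pattern alternation, later on.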

@johnmyleswhite
Member Author

Where does + mean alternation?

@StefanKarpinski
Member

It doesn't at the moment, but it's by analogy to * for concatenation of patterns, with string literals as the base case. It's pretty common in parsing theory to talk about parsing semirings [1] and there are even some results relating matrix multiplication to CFG parsing [2].

[1] http://acl.ldc.upenn.edu/J/J99/J99-4004.pdf
[2] http://www.cs.cornell.edu/info/People/llee/papers/bmmcfl-jacm.pdf

@JeffBezanson
Member

I like this change. Having Char*Char multiply character codes is really silly. The more a Char acts like a 1-character string, the better.

@StefanKarpinski
Member

Ok, I'm sold then. Will change it.

StefanKarpinski added a commit that referenced this issue Dec 17, 2012
Spurred by #1771. The one I really wanted to get rid of was
Char+Char but that's used by iteration of Char ranges, e.g. here:

https://github.com/JuliaLang/julia/blob/63dd47153097c68643dd3462656a6c74358d04e3/base/intfuncs.jl#L220

In order to get rid of that usage, the way Range iteration works
would need to be changed. Currently it's a bit odd that 'a':'z'
is a Range of Chars, while 'a':1:'z' is a Range of Ints. Maybe
the step type of a Range shouldn't need to be the same as the
start type?

@diegozea
Contributor

I didn't know about alternation.
For people coming from Python, 'a' + 'b' makes sense.
I'm not used to thinking of * as concatenation.
'a' plus 'b' sounds like "ab", but 'a' times 'b' doesn't.
I admit I tried 'a'*4 in Julia's REPL, expecting it to give me "aaaa".
Maybe that's because of Python's * again...
What do you think about * between Chars producing an array of strings with the possible combinations?

'a'*'b'
["ab";"ba"]

@lsorber

lsorber commented Aug 21, 2013

Two arguments for using + to concatenate strings are that it's the notation that most languages use for that purpose, including MATLAB, and that the symbol can be interpreted as a direct sum symbol instead of a plus symbol. The direct sum of two vectors of lengths n and m, respectively, is a vector of length n+m, which is precisely the concatenation of the two vectors. The linear algebra analogue for string repetition would be the Kronecker product, though I think the ^ choice is a good one because it is less ambiguous than *.

@kmsquire
Member

See also #2301 for some leisurely reading on this subject.

@StefanKarpinski
Member

Our use of * for string concatenation is still one of the standard library features that I'm most uncomfortable with but I'm not ok with + either since it means addition, not direct sum – generic functions have a meaning, not just a symbol that can be interpreted at will, which is why we've moved away from puns like using | for command pipeline construction, since these puns mix many distinct meanings into a single generic function. Similarly, allowing + to mean both addition and direct sum would be mingling two distinct meanings into a single generic function. I worry that * as string concatenation is also muddling the meaning of * as multiplication, but at least when you view strings as a monoid, concatenation is "multiplication" of strings. It's the quotes that I felt compelled to put around that word that worry me.

@lsorber

lsorber commented Aug 21, 2013

@StefanKarpinski So essentially you're saying + should not be interpreted as a direct sum, yet * may be interpreted as a binary operator on monoids? The former is as much a sum as the latter is a multiplication.

@StefanKarpinski
Member

The direct sum is not an algebraic operation the way addition or multiplication are. String concatenation is the algebraic operation for the monoid of strings (under concatenation) and is commonly written mathematically as juxtaposition or multiplication. It does still seem slightly sketchy to me because it's not what one commonly thinks of as multiplication.
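
To make the monoid argument concrete, the sketch below checks the monoid laws for strings under *. It assumes a recent Julia 1.x release; the strings and variable names are arbitrary.

```julia
a, b, c = "foo", "bar", "baz"

# Associativity: grouping does not change the result.
assoc = (a * b) * c == a * (b * c)

# Identity element: the empty string, which `one` returns for strings.
identity_ok = one(String) == "" && a * "" == a && "" * a == a

# Unlike numeric multiplication, the operation is not commutative.
commutes = a * b == b * a

# Repetition is then repeated "multiplication", hence `^`.
power_ok = a^3 == a * a * a && a^0 == ""
```

The lack of commutativity is one reason the multiplicative notation fits better than the additive one, since + is conventionally reserved for commutative operations.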

@lsorber

lsorber commented Aug 22, 2013

AFAIK the monoid binary operator does not have a canonical representation: I have seen a star, a dot, an x and an empty circle used for this purpose. To me, it looks like you just choose to overload * with the implied meaning of your preference. It may as well have been a .. IMHO, this is no different from overloading + to mean a direct sum. Both are just ways of attaching a meaningful interpretation to the string concatenation symbol, whatever it may be. Neither is unambiguous, but at least + has the advantage of being the one that people will be most used to.

@StefanKarpinski
Member

I'm not sure why you're talking about this like I'm super gung-ho about using * and just forced it on everyone – I've repeatedly said that I'm uncomfortable about using * like this. However, using + is clearly worse than * from a mathematical correctness point of view – although general monoid operators don't have a canonical notation, string concatenation is consistently expressed as juxtaposition or multiplication in literature on the subject.

@StefanKarpinski
Member

One option for an entirely new operator would be ++ which is used for string and list concatenation in Haskell. That would leave string repetition without an operator, but that's probably ok since it's not all that common of an operation and it would be fine to write it out as repeat(str,n), assuming that's consistent with the meaning of @johnmyleswhite's recently added repeat operation.
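
As a small sketch of the repetition spellings mentioned here, assuming a recent Julia 1.x release (where both the function and the ^ operator exist for strings); the variable names are illustrative only.

```julia
s = "ab"

via_function = repeat(s, 3)    # named form: repeat(str, n)
via_operator = s^3             # operator form, by analogy with powers
char_repeat  = repeat('a', 4)  # repeating a Char yields a String
```

Both spellings give the same result, so dropping the operator would cost only brevity.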

@JeffBezanson
Member

I prefer the string function and/or the briefly-considered string juxtaposition syntax.

@StefanKarpinski
Member

I liked the juxtaposition syntax too but people found foo "" bar awkward and there isn't an operator version of it. But I'm still up for trying that.

@rizo

rizo commented Feb 24, 2014

Here is a comparison table from Wikipedia with different string concatenation syntaxes.

I personally prefer the ~ operator used in D. It is less ambiguous and just looks nice with strings.

"Hello, " ~ "World!"

@johnmyleswhite
Member Author

~ is already used in Julia.

@rizo

rizo commented Feb 24, 2014

Yes, but it is only used as a unary operator, so there are no conflicts.

@johnmyleswhite
Member Author

No, it's also a binary operator now and is parsed as a call to the @~ macro.

@StefanKarpinski
Member

Ah, I was not aware that change had already been made. But in either case, binary ~ is spoken for.

@StefanKarpinski
Member

@johnmyleswhite – is that being used in GLM, etc.?

@johnmyleswhite
Member Author

Yes, it's used in GLM and DataFrames provides a definition for @~.

@lindahua
Contributor

I am also not quite comfortable using * for concatenation. I understand the algebraic argument, but people usually don't think about strings from an algebraic perspective.

I am for a solution that uses a dedicated symbol for concatenation of sequences (vectors, strings, etc). If we run out of single-character symbols, we may consider compound symbols, such as ++, --, ~~, etc.

@JeffBezanson
Member

I suppose .. might be available.

@StefanKarpinski
Member

There was also the juxtaposition approach, which I rather liked.

@JeffBezanson
Member

That was nice, but a more generic sequence concatenation operator might be useful. Adding juxtaposition would give us four syntaxes for concatenating strings (string, *, $, and spaces).

@StefanKarpinski
Member

If we went with juxtaposition, we would remove * as concatenation. The string function is what you use if you need a named operator. I would argue that $ isn't really a concatenation operator, although I know you like to think of it as just a crappy concatenation syntax. I'm pretty sure the train has left on getting rid of $ – which is what we had proposed with the juxtaposition change. Just string and juxtaposition.

@daniel-levin

I'm going to advocate the relaxation of absolute mathematical correctness for one exception - the concatenation of strings. Please can we just use +? It's really common in other languages and does not violate the principle of least surprise.

Also, I guarantee that developers who are used to using + for string concatenation are going to take 10 seconds to define:

function +(a::String, b::String)
    string(a, b)
end

which can lead to inhomogeneous (in the sense that some people do this, others don't) code bases. I think this defeats the purpose of having 100% algebraically sound operators in the first place.

@StefanKarpinski
Member

I would consider ++ as a new operator. It seems like we're not going to use this for increment like C does, so it would be available for a Haskell-style string concatenation operator.

@ssfrr
Contributor

ssfrr commented Mar 12, 2014

Would it also be used for more general sequence concatenation?

@nalimilan
Member

++ has been proposed for Rust, because + (which they currently use) is associated with numeric interfaces. The suggestion apparently didn't get much interest: https://mail.mozilla.org/pipermail/rust-dev/2012-January/001282.html

Dart completely removed their string concatenation operator (it was +) some time ago. Rationale is here: http://news.dartlang.org/2012/06/vm-to-remove-support-for-string.html (I personally find this choice quite extreme.)

FWIW, my feeling is that using * is the worst solution since you have the problems associated with + (overloading a numeric operator) without the advantage of using a standard syntax. It would make sense to either use + or a dedicated string concatenation operator, among which ++ sounds like a good candidate. Now, the problem with a dedicated operator is that you would need to also add the equivalents of .* and *= to offer the same level of convenience (I swear I use the former often in R via paste()).

@StefanKarpinski
Member

Thanks for those links, @nalimilan – always helpful to see what other languages are doing.

@StefanKarpinski
Member

I managed to find a working version of the "dart puzzlers" link on the Wayback Machine:

http://web.archive.org/web/20130125055347/http://www.dartlang.org/articles/puzzlers/chapter-2.html

@StefanKarpinski
Member

TIL that Java translates unicode escapes before parsing. That's the worst thing I've heard in a long time.

@johnmyleswhite
Member Author

I find Dart's argument that people should be using IOBuffer instead of string concatenation pretty compelling.

@StefanKarpinski
Member

Yeah, it's a pretty solid argument to get rid of the string concatenation operator altogether. There are, however, situations where you're building up a small string and you just want to do str *= whatever even though it's inefficient.

@toivoh
Contributor

toivoh commented Mar 14, 2014

Yes, it should be easy to do string concatenation, at least when speed is not a concern.


@lindahua
Contributor

Using IOBuffer is definitely advisable when you are composing a large string from many smaller ones. However, string concatenation is still quite handy when you just want to concatenate a small number of strings (say two or three) in a not very performance-demanding setting (e.g. composing a figure title, etc).

That being said, I found that string( ... ) and string interpolation are convenient enough. So, I wouldn't have a big problem if people decide to ditch the string concatenation operator altogether.
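
The two situations described above could be sketched roughly as follows, assuming a recent Julia 1.x release. The word list and figure title are invented for the example, and `String(take!(io))` is the current spelling for extracting the buffer contents, not necessarily what was available in 2014.

```julia
words = ["alpha", "beta", "gamma"]

# Many pieces: write into one buffer, then extract the String once.
io = IOBuffer()
for (i, w) in enumerate(words)
    i > 1 && print(io, ", ")
    print(io, w)
end
buffered = String(take!(io))

# A few pieces: `string(...)` or interpolation is usually convenient enough.
title = string("Figure ", 3, ": ", words[1])
interpolated = "Figure $(3): $(words[1])"
```

The buffer avoids allocating a new intermediate string on every step, which is the cost that repeated `str *= piece` would pay.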

@johnmyleswhite
Member Author

I agree with @lindahua: we've already got string, which seems easy enough.

@JeffBezanson
Member

It's interesting to note that the way julia works, the performance of repeated application of a binary string concatenation operator is not really a problem, since we can define *(xs::String...) and concatenate a bunch of strings at once. Not too many languages have this.

But as a lisper, my preference for string(a, b, c, ...) is utterly predictable.
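
A brief sketch of the point about varargs, assuming a recent Julia 1.x release: a chain of * is parsed as a single call with all operands, so one method can receive every string at once. The variable names are illustrative.

```julia
# `a * b * c` parses as one call to `*` with three arguments,
# not as nested binary calls.
ex = Meta.parse("a * b * c")
nargs = length(ex.args) - 1    # the first entry is the function name :*

# Calling `*` directly with several strings concatenates them in one go.
joined = *("x", "y", "z")
```

Because the whole chain reaches a single method, the total length can in principle be computed up front instead of building intermediate strings pairwise.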

@carlobaldassi
Member

I have to say, it surely wouldn't be the end of the world, but I honestly fail to see what would be the advantage of removing * for strings. It's more concise and readable than string(a,b,...), gives you *= and ^ for free (I have used both and found them useful), and generally makes sense with respect to the abstract meaning of *. Also, I can't bring myself to believe it's such a big deal, as learning it is a matter of 0.2 seconds.

As an aside, I went through this thread again and I realized that, despite the general consensus, we still have Char*Char=Int64 (e.g. 'a'*'b'==9506) and no method String*Char.

@JeffBezanson
Member

Yes, I greatly dislike Char*Char; this was one of the big motivations to make Char non-numeric. (e.g. #5844)
One reason Char*Char might be there now is for ranges, but that should really be changed. I think we should try again to make Char not an Integer, and have OrdinalRange that also handles things like DateTime.

@carlobaldassi
Member

One reason Char*Char might be there now is for ranges

Where? I can't spot this in the code. I also tried removing *, + and div for Chars, and everything seems to work, make testall passes, Char ranges don't seem affected (at least I don't see any difference when doing some trivial tests).
