titlecase: chars not starting a word can be converted to lowercase #23393

rfourquet · 2017-08-22T17:12:19Z

First commit:

A second argument strict is added to titlecase to control
whether to convert those chars to lowercase.
This is useful e.g. for #23379 (REPL: implement Alt-{u,c,l} to change the case of the next word).
The one-arg version is deprecated, and will be equivalent to
the new behavior (strict=true) in the future.
This is to be compatible with the istitle function, so that
istitle(titlecase(s)) == true when s has at least 1 letter.
This is also how some languages (e.g. Python) implement it, and
is compatible with http://www.unicode.org/L2/L1999/99190.htm.

Second commit: "titlecase: all non-letters are considered word-separators"

The old behavior is deprecated. This PR is coupled with #23394 but independent in terms of working code.

StefanKarpinski · 2017-08-22T18:26:52Z

Can we just get rid of this function? Case manipulation is subtle and tricky and not something you want to have coupled with your language runtime version. Title case is worse since it depends not only on case changing but also on what characters are considered to separate words. This seems like a total morass that the standard library should not be getting into.

stevengj · 2017-08-22T18:51:17Z

The naive titlecase that we have now can be pretty useful for pretty printing, e.g. we use it in Base for generating the unicode table in the manual.

I agree that anything much more sophisticated in the way of case transformations should go into a package.

stevengj · 2017-08-22T18:54:04Z

As I argued in #19469, if you want the "strict" behavior you can always do titlecase(lowercase(s)), so the non-strict behavior is "strictly" more general.
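As a rough illustration of that equivalence, here is a minimal check written against the keyword form (`strict`) that this thread eventually settles on, so it assumes Julia 1.0 or later. The sample string is arbitrary and only ASCII input is exercised.

```julia
# Illustration only: assumes a Julia version where `titlecase` accepts a `strict` keyword.
s = "the JULIA programming language"

# Non-strict: only word-initial characters are changed.
nonstrict_result = titlecase(s; strict=false)

# Strict: the remaining characters of each word are also lowercased.
strict_result = titlecase(s; strict=true)

# Lowercasing first and then applying the non-strict version gives the strict result here.
composed_result = titlecase(lowercase(s); strict=false)

println(nonstrict_result)
println(strict_result)
println(composed_result == strict_result)
```

The comparison is only demonstrated for this one ASCII string; it is not a proof that the two forms agree for every Unicode input.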

StefanKarpinski · 2017-08-22T19:09:42Z

The trouble with public functions that are not the Right Way™ to do it is that people use them, and then we fall into a cycle of having to tell people to use the other implementation. This is exactly the problem with {read,write}dlm, which we keep having to tell people not to use in favor of some other CSV reader. (That situation is exacerbated by too many CSV readers and other data ecosystem fragmentation, but the point remains.) If the function is internally useful, we can have a non-exported simple version. That doesn't have this issue.

rfourquet · 2017-08-22T19:30:42Z

I rarely work with strings, mostly when hacking the REPL, but even there, I needed titlecase and friends to implement what is available in readline, i.e. changing the case, which I regularly miss in Julia. Even though the naive titlecase is strictly more general, I thought that if we have the function, it seems better to conform to what I perceived to be the usual behavior of other implementations (readline, Python, Emacs...). And it's not so sophisticated: the definition is one small sentence. If those functions are removed from Base, I hope that we can keep them at least in the LineEdit module...

> The naive titlecase that we have now can be pretty useful for pretty printing, e.g. we use it in Base for generating the unicode table in the manual.

I was not aware of that, but in the only instance I find in the /doc, it uses titlecase(lowercase(...)), so it shows at least that there is a case to be made that the "strict" version may be a good default.

rfourquet · 2017-12-13T14:56:40Z

What to do here? Either merge this (my vote), deprecate, or status quo... triage?

StefanKarpinski · 2017-12-13T16:38:09Z

Since this function is now part of the stdlib Unicode package it can technically be changed after 1.0, but I do think we should probably make some decision here. An important thing I realized recently is that all string functionality in Base should essentially be Unicode-version-independent. The new string overhaul does accomplish this, since it only depends on the basic mechanics of UTF-8, which aren't going to change. Any behavior that might change with a different Unicode version should go in the Unicode package so that programs can choose a Unicode version independent of their Julia version (even though we will by default ship Julia with support for a current version of Unicode).

stevengj · 2017-12-13T17:42:48Z

@StefanKarpinski, note that the parsing of Julia itself is Unicode version-dependent, since it depends on Unicode categories to determine what counts as an identifier.

StefanKarpinski · 2017-12-13T18:06:48Z

Fair enough, but that doesn't really affect string processing.

StefanKarpinski · 2017-12-14T19:45:39Z

This is now an issue for the Unicode module since it's not exported from Base, but it still should get resolved in short order.

StefanKarpinski · 2017-12-14T21:08:38Z

I'm marking this as 1.0 but note that it's "stdlib", so it does not block feature freeze or an alpha.

JeffBezanson · 2017-12-31T00:14:00Z

base/deprecated.jl

@@ -1708,6 +1708,9 @@ export hex2num
 # PR 23341
 @deprecate diagm(A::SparseMatrixCSC) spdiagm(sparsevec(A))

+# PR #23393
+@deprecate titlecase(s::AbstractString) titlecase(s, false, true)


This is tricky; we don't want to tell people to call the 3-argument version in their code.

JeffBezanson · 2017-12-31T00:15:20Z

+1, I think we should just do this (rebased appropriately of course).

Only small issue is how to deal with the old isspace behavior. We might want to just hard-break it.

rfourquet · 2018-01-03T10:47:00Z

There are 2 new arguments for compatibility (which could be turned into keyword arguments now):

strict: whether to convert chars not starting a word to lowercase
compat: whether only spaces define what a "word" is

If I understand correctly, you suggest 1) that we could just hard-break the compat behavior, and 2) that we don't want people to use the 3-arg version, i.e. strict, so do you mean to also hard-break it, or to keep the strict keyword for the deprecation period, but to have a custom depwarn message saying that the old behavior won't be supported anymore in next releases?

JeffBezanson · 2018-01-03T15:04:51Z

Maybe I misunderstood --- is the intent to permanently add a third argument, or only use it for the deprecation period? I guess it would be ok to permanently add the argument, but it should be called something involving "spaces" instead of "compat".

StefanKarpinski · 2018-01-03T18:05:28Z

There doesn't seem to be a problem statement anywhere that I can find so I'm having a hard time understanding what problem is being fixed here.

JeffBezanson · 2018-01-03T18:11:04Z

The problem is that our titlecase function only changes the first characters of words, and only considers spaces to be word separators. Other languages (following unicode recommendations) also lowercase non-initial word characters, and consider any non-letter to be a word separator.
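To make the difference concrete, here is a small sketch using the `wordsep` and `strict` keywords that are adopted later in this thread (so it assumes Julia 1.0 or later). Passing `wordsep=isspace, strict=false` is meant to approximate the old behavior described above; the input string is made up.

```julia
# Illustration only: assumes `titlecase` with the `wordsep` and `strict` keywords.
s = "hello-world fOO"

# Approximation of the old behavior: only spaces separate words,
# and characters that do not start a word are left untouched.
old_style = titlecase(s; wordsep=isspace, strict=false)

# New default: any non-letter separates words, and characters
# that do not start a word are converted to lowercase.
new_style = titlecase(s)

println(old_style)  # Hello-world FOO
println(new_style)  # Hello-World Foo
```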

rfourquet · 2018-01-04T07:14:27Z

> is the intent to permanently add a third argument, or only use it for the deprecation period?

The intent for the compat argument was to stay only for the deprecation period, but I think it wouldn't hurt to have it permanently, but then of course its name should be changed. I propose the following:

Instead of compat, we add a keyword argument wordsep::Function, a predicate indicating which characters are to be considered word separators. The old behavior would correspond to wordsep=isspace, and the new one to wordsep=!iscased. As suggested by Jeff, I propose to make the latter the default with a "hard break" (no deprecation).
Make strict another keyword argument (strict=true by default, meaning that chars not starting a word are converted to lowercase). I don't know whether to do a hard break for this one too, or to require the use of this keyword during the deprecation period (0.7).
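The two keywords proposed above can be sketched roughly as follows. This is a hypothetical re-implementation for illustration, not the code from this PR: the name `sketch_titlecase` is made up, and `!isletter` is used as a stand-in default because `iscased` is not exported.

```julia
# Hypothetical sketch of the proposed keyword interface; not the merged implementation.
# `wordsep` is a predicate on characters, `strict` controls lowercasing of non-initial chars.
function sketch_titlecase(s::AbstractString; wordsep::Function = !isletter, strict::Bool = true)
    io = IOBuffer()
    startword = true  # true when the next non-separator char begins a word
    for c in s
        if wordsep(c)
            print(io, c)  # separators are copied through unchanged
            startword = true
        else
            # The first char of a word is title-cased; the others are
            # lowercased only when `strict` is true.
            print(io, startword ? titlecase(c) : (strict ? lowercase(c) : c))
            startword = false
        end
    end
    return String(take!(io))
end

println(sketch_titlecase("a-a b-b"))                    # default: non-letters separate words
println(sketch_titlecase("a-a b-b"; wordsep=isspace))   # old notion of a word
println(sketch_titlecase("fOO bar"; strict=false))      # leave non-initial chars alone
```

Taking a predicate rather than a Boolean flag means callers can express the old space-only rule, the new non-letter rule, or anything else without further changes to the signature.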

StefanKarpinski · 2018-01-04T17:36:32Z

I like the wordsep keyword idea and agree that it should just be a hard break to doing what other languages do and what Unicode recommends; current uses are probably buggy anyway.

I still don't quite understand what strict would do. Is the idea that with strict=false this would only uppercase word-starting characters while with strict=true it would in addition lowercase non-word-starting characters? If so, the name strict isn't terribly evocative to me. What is the default behavior in other languages? If we leave titlecase doing the "non-strict" thing, then people can always do titlecase(lowercase(s)) and get the strict behavior. This doesn't seem like a case where performance is a major concern.

JeffBezanson · 2018-01-04T17:42:06Z

> What is the default behavior in other languages?

That has been answered at least twice in this thread. It's the new behavior implemented here, of lowercasing other characters.

StefanKarpinski · 2018-01-04T17:48:22Z

I would be fine with strict=true and just making a hard break here. Again, this seems as likely to not have been doing what someone wants as to have been doing what they want. I'm just not 100% sold on the name strict for the keyword.

rfourquet · 2018-01-06T10:15:50Z

> There doesn't seem to be a problem statement anywhere that I can find

While re-reading the OP a few days ago, I realized how bad it was, sorry for that!

> I'm just not 100% sold on the name strict for the keyword.

Me neither, I was hoping for a suggestion of a better name ;-)

StefanKarpinski · 2018-01-06T23:09:09Z

Let's just go with wordsep and strict and hard breaking behavior change here.

A keyword argument `strict` is added to `titlecase` to control whether chars not starting a word are converted to lowercase. The default value is `true`, which makes this change breaking. This is how some languages (e.g. Python) implement this function, and is compatible with http://www.unicode.org/L2/L1999/99190.htm.

rfourquet · 2018-01-08T12:38:16Z

> Let's just go with wordsep and strict and hard breaking behavior change here.

Rebased accordingly. Compared to the initial version, I unexported the new iscased function (should it be exported?)

StefanKarpinski · 2018-01-09T14:37:43Z

> I unexported the new iscased function (should it be exported?)

Let's just leave it for now. That would involve a whole discussion about what the best name is.

rfourquet force-pushed the rf/titlecase branch from d12bf25 to 7ff1b18 Compare August 22, 2017 17:18

StefanKarpinski added the triage This should be discussed on a triage call label Dec 13, 2017

StefanKarpinski added the stdlib Julia's standard library label Dec 14, 2017

StefanKarpinski removed the triage This should be discussed on a triage call label Dec 14, 2017

StefanKarpinski added this to the 1.0 milestone Dec 14, 2017

JeffBezanson reviewed Dec 31, 2017

rfourquet added 2 commits January 8, 2018 13:22

titlecase: all non-letters are considered word-separators

f94ab0a

rfourquet force-pushed the rf/titlecase branch from 7ff1b18 to f94ab0a Compare January 8, 2018 12:23

JeffBezanson added unicode Related to unicode characters and encodings strings "Strings!" labels Jan 8, 2018

JeffBezanson merged commit 8245356 into master Jan 8, 2018

JeffBezanson deleted the rf/titlecase branch January 8, 2018 20:50

stevengj mentioned this pull request Nov 25, 2020

titlecase(::String) should not break words inside graphemes #38575

Merged
