Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

titlecase: chars not starting a word can be converted to lowercase #23393

Merged
merged 2 commits into from
Jan 8, 2018

Conversation

rfourquet
Copy link
Member

@rfourquet rfourquet commented Aug 22, 2017

First commit:

Second commit: "titlecase: all non-letters are considered word-separators"

The old behavior is deprecated. This PR is coupled with #23394 but independant in terms of working code.

@StefanKarpinski
Copy link
Sponsor Member

Can we just get rid of this function? Case manipulation is subtle and tricky and not something you want to have coupled with your language runtime version. Title case is worse since it depends not only on case changing but also on what characters are considered to separate words. This seems like a total morass that the standard library should not be getting into.

@stevengj
Copy link
Member

The naive titlecase that we have now can be pretty useful for pretty printing, e.g. we use it in Base for generating the unicode table in the manual.

I agree that anything much more sophisticated in the way of case transformations should go into a package.

@stevengj
Copy link
Member

As I argued in #19469, if you want the "strict" behavior you can always do titlecase(lowercase(s)), so the non-strict behavior is "strictly" more general.

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Aug 22, 2017

The trouble with public functions that are not the Right Way™ to do it, is that people use them and then we fall into a cycle of having to tell people to use the other implementation. This exactly the problem with {read,write}dlm which we keep having to tell people not to use and use some other CSV reader. (That situation is exacerbated by too many CSV readers and other data ecosystem fragmentation, but the point remains.) If the function is internally useful, we can have an non-exported simple version. That doesn't have this issue.

@rfourquet
Copy link
Member Author

I rarely work with strings, mostly when hacking the REPL, but even there, I needed titlecase and friends to implement what is available in readline, i.e. changing the case, which I regularly miss in julia. Even though the naive titlecase is strictly more general, I thought that if we have the function, seems better to conform to what I perceived to be the usual behavior of other implementations (readline, python, emacs...). And it's not so sophisticated: the definition is one small sentence. If those functions are removed from base, I hope that we can keep them at least in the LineEdit module...

The naive titlecase that we have now can be pretty useful for pretty printing, e.g. we use it in Base for generating the unicode table in the manual.

I was not aware of that, but in the only instance I find in the /doc, it uses titlecase(lowercase(...)), so it shows at least that there is a case to be made that the "strict" version may be a good default.

@rfourquet
Copy link
Member Author

What to do here? Either merge this (my vote), deprecate, or status quo... triage?

@StefanKarpinski StefanKarpinski added the status:triage This should be discussed on a triage call label Dec 13, 2017
@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Dec 13, 2017

Since this function is now part of the stdlib Unicode package it can technically be changed after 1.0, but I do think we should probably make some decision here. An important thing I realized recently is that all string functionality in Base should essentially be Unicode-version-independent. The new string overhaul does accomplish this, since it only depends on the basic mechanics of UTF-8, which aren't going to change. Any behavior that might change with a different Unicode version should go in the Unicode package so that programs can choose a Unicode version independent of their Julia version (even though we will by default ship Julia with support for a current version of Unicode).

@stevengj
Copy link
Member

@StefanKarpinski, note that the parsing of Julia itself is Unicode version-dependent, since it depends on Unicode categories to determine what counts as an identifier.

@StefanKarpinski
Copy link
Sponsor Member

Fair enough, but that doesn't really affect string processing.

@StefanKarpinski
Copy link
Sponsor Member

This is now an issue for the Unicode module since it's not exported from Base, but it still should get resolved in short order.

@StefanKarpinski StefanKarpinski added the stdlib Julia's standard library label Dec 14, 2017
@StefanKarpinski
Copy link
Sponsor Member

I'm marking this as 1.0 but note that it's "stdlib", so it does not block feature freeze or an alpha.

@StefanKarpinski StefanKarpinski removed the status:triage This should be discussed on a triage call label Dec 14, 2017
@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Dec 14, 2017
@@ -1708,6 +1708,9 @@ export hex2num
# PR 23341
@deprecate diagm(A::SparseMatrixCSC) spdiagm(sparsevec(A))

# PR #23393
@deprecate titlecase(s::AbstractString) titlecase(s, false, true)
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is tricky; we don't want to tell people to call the 3-argument version in their code.

@JeffBezanson
Copy link
Sponsor Member

+1, I think we should just do this (rebased appropriately of course).

Only small issue is how to deal with the old isspace behavior. We might want to just hard-break it.

@rfourquet
Copy link
Member Author

There are 2 new arguments for compatibility (which could be turned into keyword arguments now):

  • strict: whether to convert to lowercase chars not starting a word
  • compat: whether only spaces define what a "word" is

If I understand correctly, you suggest 1) that we could just hard-break the compat behavior, and 2) that we don't want people to use the 3-arg version, i.e. strict, so do you mean to also hard-break it, or to keep the strict keyword for the deprecation period, but to have a custom depwarn message saying that the old behavior won't be supported anymore in next releases?

@JeffBezanson
Copy link
Sponsor Member

Maybe I misunderstood --- is the intent to permanently add a third argument, or only use it for the deprecation period? I guess it would be ok to permanently add the argument, but it should be called something involving "spaces" instead of "compat".

@StefanKarpinski
Copy link
Sponsor Member

There doesn't seem to be a problem statement anywhere that I can find so I'm having a hard time understanding what problem is being fixed here.

@JeffBezanson
Copy link
Sponsor Member

The problem is that our titlecase function only changes the first characters of words, and only considers spaces to be word separators. Other languages (following unicode recommendations) also lowercase non-initial word characters, and consider any non-letter to be a word separator.

@rfourquet
Copy link
Member Author

is the intent to permanently add a third argument, or only use it for the deprecation period?

The intent for the compat argument was to stay only for the deprecation period, but I think it wouldn't hurt to have it permanently, but then of course its name should be changed. I propose the following:

  • instead of compat, we add a keyword argument wordsep::Function, which is a predicate indicating which characters must be considered as a word separator. The old behavior would correspond to wordsep=isspace, and the new one to wordsep = !iscased. As suggested by Jeff, I propose to make the latter the default with a "hard-break" (no deprecation).
  • make strict another keyword argument (strict=true by default, meaning convert to lowercase chars not starting a word). I don't know whether to do a hard-break of this one, or to force to use this keyword during the deprecation period (0.7) .

@StefanKarpinski
Copy link
Sponsor Member

I like the wordsep keyword idea and agree that it should just be a hard break to doing what other languages do and what Unicode recommends current uses are probably buggy anyway.

I still don't quite understand what strict would do. Is the idea that with strict=false this would only uppercase word-starting characters while with strict=true it would in addition lowercase non-word-starting characters? If so, the name strict isn't terribly evocative to me. What is the default behavior in other languages? If we leave titlecase doing the "non-strict" thing, then people can always do titlecase(lowercase(s)) and get the strict behavior. This doesn't seem like a case where performance is a major concern.

@JeffBezanson
Copy link
Sponsor Member

What is the default behavior in other languages?

That has been answered at least twice in this thread. It's the new behavior implemented here, of lowercasing other characters.

@StefanKarpinski
Copy link
Sponsor Member

I would be fine with strict=true and just making a hard break here. Again, this seems as likely to not have been doing what someone wants as to have been doing what they want. I'm just not 100% sold on the name strict for the keyword.

@rfourquet
Copy link
Member Author

There doesn't seem to be a problem statement anywhere that I can find

While re-reading the OP few days ago, I realized how bad it was, sorry for that!

I'm just not 100% sold on the name strict for the keyword.

Me neither, I was hoping for a suggestion of a better name ;-)

@StefanKarpinski
Copy link
Sponsor Member

Let's just go with wordsep and strict and hard breaking behavior change here.

A keyword argument `strict` is added to `titlecase` to control
whether to convert those chars to lowercase. The default value
is `true`, which makes this change breaking.
This is how some languages (e.g. Python) implement this function,
and is compatible with http://www.unicode.org/L2/L1999/99190.htm.
@rfourquet
Copy link
Member Author

Let's just go with wordsep and strict and hard breaking behavior change here.

Rebased accordingly. Compared to the initial version, I unexported the new iscased function (should it be exported?)

@JeffBezanson JeffBezanson added domain:unicode Related to unicode characters and encodings domain:strings "Strings!" labels Jan 8, 2018
@JeffBezanson JeffBezanson merged commit 8245356 into master Jan 8, 2018
@JeffBezanson JeffBezanson deleted the rf/titlecase branch January 8, 2018 20:50
@StefanKarpinski
Copy link
Sponsor Member

I unexported the new iscased function (should it be exported?)

Let's just leave it for now. That would involve a whole discussion about what the best name is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" domain:unicode Related to unicode characters and encodings stdlib Julia's standard library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants