Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

titlecase(::String) should not break words inside graphemes #38575

Merged
merged 1 commit into from
Dec 1, 2020

Conversation

stevengj
Copy link
Member

As mentioned in #38574 (comment), at the very least we should not consider characters in the middle of graphemes to be word separators — this PR fixes titlecase(::String) to give titlecase("bôrked") == "Bôrked" instead of "BôRked".

Also, the documentation said that the default word separator is any "non-letter" character, but actually it was !iscased. I see no reason to not actually use !isletter here.

In the future, it would be better to follow the Unicode standard for word boundaries, and ideally have a "word" iterator much like our current grapheme iterator, but this requires support from utf8proc (JuliaStrings/utf8proc#208). But the behavior in this PR is at least closer to the standard.

This is technically not backward compatible, but I think it is unlikely to be a problem — the old behavior seems more like a bug to me.

cc @rfourquet, who last touched this function in #23393.

@stevengj stevengj added domain:unicode Related to unicode characters and encodings needs news A NEWS entry is required for this change kind:minor change Marginal behavior change acceptable for a minor release labels Nov 25, 2020
@JeffBezanson
Copy link
Sponsor Member

I think this qualifies as a bugfix.

@stevengj stevengj added kind:bugfix This change fixes an existing bug and removed kind:minor change Marginal behavior change acceptable for a minor release needs news A NEWS entry is required for this change labels Nov 30, 2020
@stevengj
Copy link
Member Author

Should be good to merge, then.

@rfourquet
Copy link
Member

I can't remember why I chose !iscased instead of !isletter, do you have an example where this makes a difference, which could also be added as a test? Otherwise, I don't know about the isgraphemebreak function, so I can't review that part.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2020

The other letters are mainly phonetic symbols and letters from other scripts, as far as I can tell, so I don't think they will make any difference one way or the other in any situation where you want titlecase in practice.

@JeffBezanson JeffBezanson merged commit 626542e into master Dec 1, 2020
@JeffBezanson JeffBezanson deleted the sgj/titlecase-grapheme branch December 1, 2020 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:unicode Related to unicode characters and encodings kind:bugfix This change fixes an existing bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants