Chinese numerals are not recognized by char::is_numeric #84056
Comments
It seems like Unicode classifies it as Lo (Letter, Other) with no numeric value. So I don't think this is a Rust-specific problem but rather a problem with the Unicode standard (if it is one at all). |
Unfortunately I think this would be a breaking change even if it is desirable. As you say, Unicode classifies them as "other letters, including syllables and ideographs" instead of one of the number categories (see Unicode categories). The Rust documentation says specifically that the code point must be in the Nd, Nl or No categories so any change to that would be breaking the current Rust API. The only way I see around that is to do one of the following:
|
Whilst the General_Category of these characters is indeed Lo, their Numeric_Type property is Numeric. Perhaps that's what should be inspected instead, although as @ChrisDenton says this would be a breaking change to Rust. |
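To make the category point above concrete, here is a minimal sketch of how char::is_numeric behaves today. The helper name rejected_by_is_numeric is made up for this illustration and is not part of std; the sample characters are chosen from the Nd, Nl, No and Lo categories.

```rust
/// Hypothetical helper (not part of std): returns the characters of `s`
/// that `char::is_numeric` rejects.
fn rejected_by_is_numeric(s: &str) -> Vec<char> {
    s.chars().filter(|c| !c.is_numeric()).collect()
}

fn main() {
    // '7' and '٣' are Nd, 'Ⅷ' (U+2167) is Nl, '¾' is No: all accepted.
    assert!(rejected_by_is_numeric("7٣Ⅷ¾").is_empty());

    // These Han numerals have General_Category Lo, so every one is rejected.
    let han = "一二三五十百千萬";
    assert_eq!(rejected_by_is_numeric(han).len(), han.chars().count());

    // As members of Lo they are reported as alphabetic instead.
    assert!(han.chars().all(char::is_alphabetic));

    println!("all checks passed");
}
```

The sketch only inspects what std exposes; it cannot read the Numeric_Type property, which std does not provide.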
Then I suppose this is probably more of a Rust-side issue than Unicode's problem.
I contacted them about this.
What would this method be called? I believe it would cause more confusion than do good. Yes, I suppose this will mean a breaking change, but is there not some kind of special exception for bugs? This is not a new feature but an important bug and correctness matters. I don't think we should leave this bug be just because it might break something. We should also consider how much of a breaking change this really is.
But really, I don't know what this could actually break except for some very specific cases. I believe it would rather fix more things than break things.
|
What's been done before is for a new function to be created with a similar name and the old function be deprecated. Perhaps In either case the new functionality should, I think, be based on DerivedNumericType.txt which also tells how they should be derived from other Unicode data. DerivedNumericValues lists each value with the numeric value property. |
If that's what the libs team decides for, how about
That sounds good. And perhaps that doesn't just fix Chinese numerals not being detected but other specific cases too. Let's wait for what the libs team thinks. |
In spite of being "breaking" changes, small fixes in the implementation of a type are permitted by RFCs 1105 and 1122, for the same reason that it would obviously behoove Rust to fix the compiler if rustc suddenly started solving 0 + 1 for -3. In saying such, I am not immediately weighing in on whether this is such an error in need of fixing, merely noting that in principle this could be deemed such a change. |
@rustbot claim It seems like everyone agrees that we either need a new API for this, or we need to change the existing one. It's for the libs team to decide which. In the meantime, there's some background work I can get started on. |
Now that the prerequisite work is done, nominating for T-libs-api consideration. Problem statement: Question: Should we (1) provide a new function for checking whether a character is a number that works for Han characters, possibly deprecating |
I vote for 2. |
cc @Manishearth @SimonSapin do either of you have an opinion on this? Are there other folks we should CC? One question I have is whether Unicode specifically has a recommendation for these sorts of APIs. I'm not aware of any. It might also make sense to do a survey of how other standard libraries implement "is numeric" predicates. |
For background information, see Chapter 4 of the Unicode standard. Specifically 4.5 and 4.6. Basically there are two relevant things here. Each code point in the Unicode database belongs to a General Category. They can only belong to one category. From the standard (Chapter 4.5):
Code points also have properties. Specifically the See DerivedNumericType.txt for the data. Quoting from the standard (Chapter 4.6):
Quick overview from browsing docs of a few languages: Rust
Python
.NET
Valid numbers are signified by the Unicode designation "Nd" (number, decimal digit), "Nl" (number, letter), "No" (number, other). [Note: I'm paraphrasing so as to remove a level of indirection] Go
|
Précis: I don't think we should change the existing API, and I don't consider the existing API a "bug" beyond perhaps a less ambiguous choice of naming having been possible. I'm open to adding a new one to choose between. I do not think deprecation is the right way unless we are adding two functions.

Prelude: Handling text in a cross-language way

Okay, so the main thing about text is that handling text cross-language is an incredibly hard problem. This is not due to Unicode; this is an intrinsic property of the vast conceptual diversity in text. Did you know that Unicode does not even attempt to define "character"? There's no single definition of the term that applies uniformly to all writing systems. More often than not what people call a "unicode problem" is actually just an intrinsic problem with trying to stuff this conceptual diversity into little boxes. In other words:
Typically my first reaction to almost every such question is "what are you actually trying to do?". With international text, and therefore with Unicode, often people are attempting to apply their intuitions from the writing systems they are familiar with and assuming concepts apply uniformly elsewhere. They're almost always wrong about that, and a bunch of work in teasing that out is to figure out what operation they are actually looking for. E.g. the operation "split a word into letters" does not make sense in general, but "split a word into letters for showing cursors to the user" or "split a word into letters for taking the first letter out for making acronyms" or "split a word into letters for backspace to work" do make more sense (and are different operations!). That's my reaction here as well. What do we actually want when we provide

Numbers in Chinese

(Going to try to keep the examples here in Mandarin, but my Cantonese is way better and I'm somewhat translating between the two, please pardon any mistakes)

In modern Chinese, you see numbers done in two ways. Either (western) Arabic numerals are used (e.g. "請給我55塊錢", "please give me fifty five bucks"), or Chinese numerals are used (e.g. "請給我五十五塊錢", where "五十五" is "five ten five", or "fifty five"). There's a bit of a distribution on what's used when. Chinese numerals tend to be used in sentences when counting stuff (e.g. 我有五本書 "I have five books"), sometimes for money (as above) and dates (e.g. 今天是五月五號 "today is May 5"). Western Arabic numerals tend to be used when talking about dates (e.g. 今天是5月5號 "today is May 5") and money, and almost always for phone numbers. It gets even wrinklier when you account for the fact that there's a separate character for saying "two" when you're counting stuff

Note that the numeral 55 in text would be read identically as 五十五 (i.e. not as the English word "fifty-five", but rather as "wǔshíwǔ" or "ng⁵ sap⁶ ng⁵" or whatever).
The system cycles around every myriad, so the number 555,555,555,555 would be 五千五百五十五億,五千五百五十五萬,五千五百五十五 (I have added the commas in for illustration, they are never used), using the same characters for 1000, 100, and 10 every cycle, but using new characters to mark every power of a myriad. We do something similar in English with powers of a thousand, e.g. 555,555,555 would be "five hundred fifty five million, five hundred fifty five thousand, five hundred and fifty five", reusing the same words for "hundred" and the "-ty" morpheme.

Note that when talking about years and phone numbers Chinese numerals tend to be used similarly to Western ones: The year 2022 is written as 二零二二年 ("two zero two two year") not 二千二十二年 ("two thousand twen-ty two year"), and similarly with phone numbers. This is because years and phone numbers are typically spoken as a sequence of digits rather than a single number. We do the same thing in English when we read out phone numbers, though we're kinda haphazard about years, often reading them as two blocks ("twenty twenty two" or "nineteen sixty-five").

Are they numerals?

Why am I saying all of this? Well, you might want to call 零/〇,一,二 (+ 兩?),三,四,五,六,七,八,九,百,千,萬,億,。。。"numerals" but one can easily argue that they are "words" (and "letters", if you're trying to shoehorn that concept to logographic writing systems) or "morphemes". In other words you can say that "五十五" is closer to saying "fifty-five" than it is to saying "55". The mere fact that for numbers greater than ten you are spelling out the word as spoken (rather than just writing a compact digit-by-digit representation) is a pretty clear indicator that these are words. Half the characters in the Chinese representation of "555,555,555,555" are not "5"! The fact that they can be written "as a sequence of digits" as done in years and phone numbers because that's how those are spoken aloud, further bolsters their status as letters, to me.
They are used in a way that character-for-character corresponds to the spoken word, either spelling out a number with tens and hundreds and stuff, or spelling out a sequence of digits without them, depending on what is needed in the situation.

An interesting example (credit kourge) is the idiom 一五一十, which is comprised purely of numerals, but is not a number; it is an idiom (specifically, a chengyu) meaning "in full detail". Note that this is distinct from a number having an idiomatic meaning (like "420" in English), this is a sequence of numerals that do not form a number that really are just words forming an idiom. Similar to saying "ten-four"/"10-4" in English, while it's comprised of numbers, "ten-four" is not a number.

No matter how you slice it, they are not just numerals, they are at the very least words/letters as well.

What's a "numeral"?

Here's a strawman set of consistent cross-language features for what are considered numerals:
When I say they are heterograms, I mean that they are pronounced differently in context. For example, "5" is pronounced "five", but "55" is not simply pronounced "five five". Nor is "5th" pronounced "five-th"¹. We use numerals as building blocks to denote words that have to be read as wholes, not as a sum of their parts.

This set of criteria would determine western (and eastern) Arabic numerals to be numerals, as well as Roman numerals, but not Chinese numerals. It gets a bit hard to apply to Japanese which is already chock-full of heterograms, but you can be more specific about the "due to their usage as numerals" to make it work.

But that's an illustrative strawman, to demonstrate the kind of work it takes to precisely define a concept of "numeral" cross language. As mentioned before, all of this depends on the use case.

What should we do here?

To me, I think

It might be worth adding a second function that handles the

I think the technically correct set of naming from a Unicode standpoint would probably be that

If we could go back and change things I think a potential avenue to have explored would have been to have two functions where neither has a shorter name (so they feel on equal footing) so people are forced to contend with this. I'm not actually sure if that's a good idea, mind you, I just think it would have been worth exploring further.

¹ In fact, this is one of the few cases in English where you get to see Japanese-style heterogram disambiguation: it's not even that the 5 in 5th is pronounced "fif" or something (how would you extend this explanation to "1st"?). The 5 in 5th is pronounced "fifth", and the th is a disambiguation mark with no pronunciation of its own, telling you how to pronounce the previous letter. The symbol "5" just happens to have a whole bunch of pronunciations in English, including "five", "fifth", "fifty", etc. This is similar to how in Japanese, 読む "read", pronounced "yomu", is not actually 読 "yo" + む "mu"; it is actually 読 "yomu" + む "telling you that the previous character should take the pronunciation ending in 'mu' instead of the three other pronunciations", since 読 has multiple pronunciations. |
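The two readings described in the comment above (a number spelled out with place markers that cycle every myriad, versus a plain sequence of digits as used for years and phone numbers) can be sketched roughly as follows. This is an illustration under stated assumptions, not a proposal for std: the function names digit, parse_spelled and parse_digit_sequence are made up here, only the traditional characters mentioned in the comment are handled, malformed input is not rejected, and there are no overflow checks.

```rust
/// Value of a single digit character (0-9), if it is one.
fn digit(c: char) -> Option<u64> {
    Some(match c {
        '零' | '〇' => 0,
        '一' => 1,
        '二' | '兩' => 2,
        '三' => 3,
        '四' => 4,
        '五' => 5,
        '六' => 6,
        '七' => 7,
        '八' => 8,
        '九' => 9,
        _ => return None,
    })
}

/// Reads a number spelled out with place markers, e.g. 五十五 is 55.
fn parse_spelled(s: &str) -> Option<u64> {
    let mut total: u64 = 0; // value of the completed myriad groups
    let mut section: u64 = 0; // value of the current group (below 10,000)
    let mut pending: u64 = 0; // digit still waiting for its place marker
    for c in s.chars() {
        if let Some(d) = digit(c) {
            pending = d;
            continue;
        }
        match c {
            // 十, 百 and 千 are reused inside every group.
            '十' | '百' | '千' => {
                let unit: u64 = match c {
                    '十' => 10,
                    '百' => 100,
                    _ => 1_000,
                };
                // A bare 十 means "ten", so a missing digit counts as one.
                section += pending.max(1) * unit;
                pending = 0;
            }
            // 萬 closes a group of ten-thousands.
            '萬' => {
                total += (section + pending) * 10_000;
                section = 0;
                pending = 0;
            }
            // 億 scales everything read so far by 10^8.
            '億' => {
                total = (total + section + pending) * 100_000_000;
                section = 0;
                pending = 0;
            }
            _ => return None,
        }
    }
    Some(total + section + pending)
}

/// Reads a digit-by-digit sequence, as used for years: 二零二二 is 2022.
fn parse_digit_sequence(s: &str) -> Option<u64> {
    s.chars().try_fold(0u64, |acc, c| Some(acc * 10 + digit(c)?))
}

fn main() {
    assert_eq!(parse_spelled("五十五"), Some(55));
    assert_eq!(
        parse_spelled("五千五百五十五億五千五百五十五萬五千五百五十五"),
        Some(555_555_555_555)
    );
    assert_eq!(parse_digit_sequence("二零二二"), Some(2022));
    println!("all checks passed");
}
```

Note that the same characters need two different parsers depending on how the text is meant to be read, which is the comment's point: whether a character "is numeric" cannot be decided without knowing the operation you actually want.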
Thanks @Manishearth. Probably the best and fastest response to a ping that I could ever hope for haha. What you say about focusing on actual use cases is exactly what I was hoping to see. :)
Given everything you said, I'm inclined to agree here. And absolutely in favor of better docs (perhaps even including some portion of your comment) giving folks more info to decide with.
Aye yeah I'd potentially be open to this but would definitely like to see some concrete use cases motivating it. (And ideally, those would become part of the docs for this new method.) |
Yeah I think the way to go about it is to ask around for a use case and then design such a function with a name appropriate for the use case. |
I wonder whether stdlib should simply provide a more generic method for querying Unicode properties, through which users can establish for themselves whether |
I'd rather that exist in separate crates; wanting to query Unicode properties is a pretty niche thing. To some extent, even these methods are kinda niche and Rust has them because people expect them, not because they are necessarily the only way such methods would make sense. |
One major disadvantage of this is that anything we add to std is going to end up in our binaries. We're already going to Herculean efforts to take up as little space as possible for the information we do have. Adding the ability to query any property would increase binary size, but would likely not be used by most applications. |
To further complicate the situation, consider the character "幺" (U+5E7A). It has the Numeric property and is listed in DerivedNumericType.txt. But it is used not only as a digit for "one" when reading phone numbers to prevent ambiguity ("幺幺零" for 110, the emergency line), but also as a surname and as an adjective meaning "least" ("幺妹" for youngest, "least aged", sister). This ambiguity is inherent in the language and cannot be resolved without context. People would be very surprised when they are hit by this. I propose an API with name like |
Everyday Chinese does not use these numerals to express prices; they are only for formal use, such as a year: 二〇二二年, which includes a different zero... Is this a numeral?
The problem is that then 百 (100), 千 (1000), 万 (10000) and so on also need to be recognized. Are 一百 (100) and 十万 (100,000) numeric? Yes... but they can't stand on their own. You can't say that '万' represents an actual numeral, as it needs to be preceded by a number. What if the financial numbers are used, the so-called capitalized numbers (大写)? This all introduces a lot more complexity for not much gain. These are the financial equivalent characters (大写):
Agree with @Manishearth and @inquisitivecrystal that this is better served by a crate than by the standard library, as this is pretty much like a timezone problem. You want to keep this outside of the language, as it is a very different kind of complexity and, I hate to say it, also subject to the whims of a culture. Note: I wasn't able to see the whole conversation on my phone. Sorry if some of this got duplicated. The comment is also not meant to be snarky, as timezones provide some great examples. Interpretations can change over time. |
I think we should hew as closely to Unicode concepts as possible here. Not necessarily because Unicode is always brilliant, but rather because it's standard, and also because in general we're unlikely to do any better than it does. And if people don't want full Unicode support bloating up the stdlib, then by all means leave it to an external library (though I'd love a potential future where I can treat libstd as though it were an ordinary crate with Cargo features, so I can say |
Raku might serve as an example for @bstrie's suggestion; it exposes Unicode properties as described here: https://www.codesections.com/blog/raku-unicode/ For the given example:
IMO this is the only sensible way to avoid having weird Unicode vs. language API inconsistencies, and it shifts the discussion from fuzzy cultural and philosophical questions to "just do what Unicode does". |
After seeing this issue I looked through other usages of is_numeric in public git repos and most of them I would consider to be bugs, mostly due to people mistaking Examples of the bugs I mentioned: |
I fundamentally disagree with Unicode's decision to classify code points as "numeric or not". It is, in general, impossible to determine if a code point (not even a grapheme cluster a.k.a. "perceived character"!) is used as a numeral without context, because human languages are very high up the Chomsky hierarchy, to put it mildly. They don't even need to look very far to realise this. Is "I" a numeral? Well it isn't at the beginning of my comment, but it is on my clock. ... And to fix this, they added new codepoints specifically for Roman numerals, except that for the past two thousand years people didn't treat them as characters separate from the ones in the Latin alphabet, so the new codepoints are entirely "made up" so to speak, and people rarely use them (centuries are written in Roman numerals in many European languages so Roman numerals come up quite frequently. I don't think we should tell the Parisians that they are typing it wrong....) As for Chinese, well, "一万" is a number (ten thousand) and both characters should be numerals, but "万一" (roughly "just in case") isn't a number, are the characters still numerals? And there's also the issue that we don't always have Chinese characters in Unicode, we have Unified CJK characters, the same character may always be a numeral in one language, but never in another. |
I think this suggests that the documentation for |
Maybe we should move this discussion into a separate issue. |
As @cbeuw points out, Unicode is not a reliable source. It provides information, but no conclusive answer. It is way more about context, completeness, etc. and this is all subject to a lot of nuance
even "万" by itself does not have a real meaning as it needs a preceding numeral to make it meaningful and correct.
This shows clearly why this would not work in a generalized way. It would make a lot more sense to allow people to use a library to solve their specific problem. Your example about Roman numerals is an interesting one. I can't recall a language that interprets this as a standard function off the top of my head, and the libraries I have seen are sometimes even mistaken. Even worse, Is |
To make it more fun, "千万" can mean either "ten(s) of million" or "please make certain". You need to go well into NLP territory to figure out which "千万" in "千万富翁千万不能乱花一千万元" ("millionaires must not waste ten million yuan") is a numeral. While abstractions are nice, there is a limit. We cannot abstract the unabstractable, or find a common ground where none exists. |
I am not a native speaker of Chinese (though my wife is). I think this is a question to ask a native Chinese speaker who has experience with programming: does this make sense to implement? I think the previous edit makes this clear:
|
I wonder if this could be attempted within a crate to figure out if there is or is not value through practical use for a more robust is_numeric, and then we could talk about merging it later. It does seem like both statements "this is intractable" and "a simpler approach that is tractable may still be useful" could both be true. A lot of the arguments made here also apply to concepts like "time", yet we do still attempt it. I suspect it will be difficult to fully understand what a useful implementation might even look like without prototypes that are actively used by native speakers. |
I am a native Chinese speaker. Before seeing this issue, I didn't trust |
Hey folks, I don't think it's necessary to discuss those facets of this issue at this time unless the libs team wants more discussion here. I did share my comment to a wider audience but that doesn't mean we need wider input here, let's let the libs team decide next steps (or if they want to get more feedback). This isn't an RFC, and a long discussion only makes more work for the team. If y'all have new and relevant points to bring up you should. (Also, as far as why Unicode has such categories in the first place: as with most Unicode properties, there's a decently consistent definition being used internally, and the property has uses in various Unicode algorithms, however Unicode properties aren't really intended to be used based on vibes which is why it feels weird that Unicode is attempting such a distinction in the first place) |
As a native Chinese speaker, I think most programmers have the assumption that |
That already isn't the case though, for example |
To add to that: if you want to check if a The naming of these functions (especially the My proposal: add an "1234".chars().all(|c| c.is_ascii_digit()) The Chinese numerals could be included in its own function, in order to support them through the stdlib. |
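A short sketch of the kind of bug being described: a check built on char::is_numeric accepts strings that integer parsing rejects, while a check built on is_ascii_digit does not. The helper names all_numeric and all_ascii_digits are made up for this illustration.

```rust
/// Hypothetical helper: true if `s` is non-empty and every char passes
/// `char::is_numeric` (Unicode categories Nd, Nl, No).
fn all_numeric(s: &str) -> bool {
    !s.is_empty() && s.chars().all(char::is_numeric)
}

/// Hypothetical helper: true if `s` is non-empty and consists only of
/// the ASCII digits 0-9.
fn all_ascii_digits(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| c.is_ascii_digit())
}

fn main() {
    for s in ["1234", "①②③", "¾", "٣"] {
        println!(
            "{:?}: is_numeric={} is_ascii_digit={} parses_as_u32={}",
            s,
            all_numeric(s),
            all_ascii_digits(s),
            s.parse::<u32>().is_ok()
        );
    }

    // Circled digits pass the is_numeric check but cannot be parsed.
    assert!(all_numeric("①②③"));
    assert!("①②③".parse::<u32>().is_err());

    // The ASCII check agrees with what the integer parser accepts here.
    assert!(all_ascii_digits("1234"));
    assert!(!all_ascii_digits("①②③"));
}
```

If the goal is "will this string parse as an integer", calling parse directly and handling the error is usually the more direct check.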
I wonder what exact use cases people are currently using |
I think that is essentially the conclusion that can be drawn from #84056 (comment). I.e. most users of And from what's been said above, I feel like having I do still think that the docs for |
Ah yes, thanks.
I agree. |
Based on @Manishearth's comment and subsequent discussion, we talked about this in today's libs-api meeting, and we agreed that the code shouldn't change here, but the documentation should tell people that this probably isn't what they want, and point to We'd welcome a documentation PR. |
…c, r=joshtriplett add more docs regarding ideographic numbers This was discussed in the last lib meeting and I try to avoid forgetting to open a PR because I think having some docs can help people. However, I think we need to discuss a little bit if this is enough or if we need to add more clarification? Maybe an example? Inspiration Source: rust-lang#84056 (comment) Including suggestion rust-lang#99626 (comment) my bad command git close the PR
I tried this code:
I expected it to evaluate to true. Instead, it evaluated to false.

I would expect at least 零/〇、一、二、三、四、五、六、七、八、九 (0-9) to be recognized. As for other numeral systems, like the Arabic numerals, after 9 the number wouldn't fit into a char anymore and thus can't be recognized, but with Chinese numerals, beyond 0-9 there are many other numbers represented with a single character too, like for example 10: 十, which could still be a char. I'm not sure whether this should be recognized, but perhaps it should.

There are also financial numbers and many others, see https://en.wikipedia.org/wiki/Chinese_numerals#Standard_numbers for a comprehensive list.

I've been told that the numerals are covered in the UnicodeData.txt file mentioned in the docs of char::is_numeric, but they are listed in the Lo category, which stands for Other Letter, and so Rust doesn't consider them numeric. That doesn't make sense to me because clearly they are numerals and not letters. Rust should probably either recognize (some parts of) this category as numerals or the numerals should be added manually.

Adding support for this would in turn also mean support for numerals of other East Asian languages, like Japanese and Hokkien.
Meta
This happens on the stable 1.51.0 channel and all others.