-
-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify JS cstring len #14184
Clarify JS cstring len #14184
Conversation
Reverted |
lib/system.nim
Outdated
@@ -691,7 +691,9 @@ proc len*(x: string): int {.magic: "LengthStr", noSideEffect.} | |||
|
|||
proc len*(x: cstring): int {.magic: "LengthStr", noSideEffect.} | |||
## Returns the length of a compatible string. This is sometimes | |||
## an O(n) operation. | |||
## an O(n) operation. On the JS backend this may count Unicode | |||
## code points instead of characters, similar to `unicode.runeLen |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think we should change js' behavior for len
(i have an RFC in the works...) to match other backends (including nim vm) but that's out of scope for this PR which at least tells the current truth.
How about this:
Returns the length of a compatible string. This is sometimes an O(n) operation.
On the JS backend this currently counts Unicode code points instead of bytes,
similar to `unicode.runeLen <unicode.html#runeLen%2Cstring>`_.
This specific js behavior might change in the future.
runnableExamples:
var a: cstring = "汉字"
when defined js:
doAssert a.len == 2
else:
doAssert a.len == 6
doAssert ($a).len == 6
doAssert "汉字".cstring.len == 6
const a2: cstring = "汉字"
doAssert a2.len == 6
(yes i know runnableExamples only runs in nim c currently but hopefully in future you'll be able to see docs for specific backends ; plus it documents exactly the current behavior)
- note: I'm avoiding
characters
in this context since its meaning is overloaded, and instead used bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MDN recommends using [...str].length
which is our equivalent to $str.len
, so I'm going to recommend that instead. It's also not Unicode code points, it's UTF-16, so it's not exactly like unicode.runeLen, that has to be removed. I don't think giving possibly backwards incompatible runnableExamples is a good idea, I think it's best not to give concrete code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think giving possibly backwards incompatible runnableExamples is a good idea, I think it's best not to give concrete code here.
but at least mention the behavior for nim js may change so people can't complain...
using [...str].length which is our equivalent to $str.len,
that's not what I'm seeing in following example which I also find very puzzling:
when defined js:
{.emit:"""
function getCharacterLength (str) {
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Unicode The string iterator that is used here iterates over characters, not mere code units
return [...str].length;
}
function fun(){
let a1 = "A\uD87E\uDC04Z"
let a2 = "汉字"
console.log(a1, "length: ", a1.length, getCharacterLength(a1));
console.log(a2, "length: ", a2.length, getCharacterLength(a2));
}
""".}
proc fun(){.importc.}
proc getCharacterLength(a: cstring): int {.importc.}
proc getLength(a: cstring): int {.importjs:"#.length".}
else:
proc getLength(a: cstring): int = a.len
import unicode
proc getCharacterLength(a: cstring): int =
a.`$`.runeLen
proc main()=
var strs = ["A\uD87E\uDC04Z".cstring, "汉字".cstring]
for a in strs:
echo ($a, a.len, ($a).len, a.getLength, a.getCharacterLength)
when defined js:
fun()
main()
nim c
("AZ", 8, 8, 8, 4)
("汉字", 6, 6, 6, 2)
nim js
("A������Z", 8, 20, 8, 8)
("汉字", 2, 6, 2, 2)
A你Z length: 4 3
汉字 length: 2 2
using [...str].length which is our equivalent to $str.len,
=> this seems wrong- there's at least a few issues at play here... EDIT: one of them is A\uD87E\uDC04Z string litteral gives incorrect byte sequence #14229
- i think the closest accurate description for nim js at runtime is number of UTF16 code units
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after factoring out this issue #14229 (which is a separate confounding issue) this is now clearer:
nim c -r -d:case5 $timn_D/tests/nim/all/t10687.nim
(s: "汉字", len: 6, slen: 6, getLength: 6, getCharacterLength: 2)
(s: "A你Z", len: 6, slen: 6, getLength: 6, getCharacterLength: 3)
nim js -r -d:case5 $timn_D/tests/nim/all/t10687.nim
(s: "汉字", len: 2, slen: 6, getLength: 2, getCharacterLength: 2)
(s: "A你Z", len: 4, slen: 6, getLength: 4, getCharacterLength: 3)
when true: # D20200504T224530
when defined js:
{.emit:"""
function getCharacterLength (str) {
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Unicode The string iterator that is used here iterates over characters, not mere code units
return [...str].length;
}
function getA3(){
return "A\uD87E\uDC04Z";
}
""".}
proc getCharacterLength(a: cstring): int {.importc.}
proc getLength(a: cstring): int {.importjs:"#.length".}
proc getA3(): cstring {.importc.}
else:
var s0 = [ 65'u8, 240, 175, 160, 132, 90, 0] # UTF8 bytes of "A\uD87E\uDC04Z"
var s = cast[cstring](s0[0].addr)
proc getA3(): cstring = s
proc getLength(a: cstring): int = a.len
import unicode
proc getCharacterLength(a: cstring): int =
a.`$`.runeLen
proc main()=
var strs = ["汉字".cstring, getA3()]
for a in strs:
echo (s: $a, len: a.len, slen: ($a).len, getLength: a.getLength, getCharacterLength: a.getCharacterLength)
main()
good move but I had some comments for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (just the nit in https://github.com/nim-lang/Nim/pull/14184/files#r419876049)
refs #10911