Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify JS cstring len #14184

Merged
merged 5 commits into from
May 5, 2020
Merged

Clarify JS cstring len #14184

merged 5 commits into from
May 5, 2020

Conversation

metagn
Copy link
Collaborator

@metagn metagn commented May 1, 2020

refs #10911

lib/system.nim Outdated Show resolved Hide resolved
@metagn metagn changed the title add unary sequtils.count, clarify JS cstring len Clarify JS cstring len May 5, 2020
@metagn
Copy link
Collaborator Author

metagn commented May 5, 2020

Reverted count, simple docs change now

lib/system.nim Outdated Show resolved Hide resolved
lib/system.nim Outdated
@@ -691,7 +691,9 @@ proc len*(x: string): int {.magic: "LengthStr", noSideEffect.}

proc len*(x: cstring): int {.magic: "LengthStr", noSideEffect.}
## Returns the length of a compatible string. This is sometimes
## an O(n) operation.
## an O(n) operation. On the JS backend this may count Unicode
## code points instead of characters, similar to `unicode.runeLen
Copy link
Member

@timotheecour timotheecour May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think we should change js' behavior for len (i have an RFC in the works...) to match other backends (including nim vm) but that's out of scope for this PR which at least tells the current truth.

How about this:

Returns the length of a compatible string. This is sometimes an O(n) operation.
On the JS backend this currently counts Unicode code points instead of bytes,
similar to `unicode.runeLen <unicode.html#runeLen%2Cstring>`_.
This specific js behavior might change in the future.

  runnableExamples:
    var a: cstring = "汉字"
    when defined js:
      doAssert a.len == 2
    else:
      doAssert a.len == 6
    doAssert ($a).len == 6
    doAssert "汉字".cstring.len == 6
    const a2: cstring = "汉字"
    doAssert a2.len == 6

(yes i know runnableExamples only runs in nim c currently but hopefully in future you'll be able to see docs for specific backends ; plus it documents exactly the current behavior)

  • note: I'm avoiding characters in this context since its meaning is overloaded, and instead used bytes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MDN recommends using [...str].length which is our equivalent to $str.len, so I'm going to recommend that instead. It's also not Unicode code points, it's UTF-16, so it's not exactly like unicode.runeLen, that has to be removed. I don't think giving possibly backwards incompatible runnableExamples is a good idea, I think it's best not to give concrete code here.

Copy link
Member

@timotheecour timotheecour May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think giving possibly backwards incompatible runnableExamples is a good idea, I think it's best not to give concrete code here.

but at least mention the behavior for nim js may change so people can't complain...

using [...str].length which is our equivalent to $str.len,

that's not what I'm seeing in following example which I also find very puzzling:

when defined js:
  {.emit:"""
  function getCharacterLength (str) {
    // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Unicode The string iterator that is used here iterates over characters, not mere code units
    return [...str].length;
  }
  function fun(){
    let a1 = "A\uD87E\uDC04Z"
    let a2 = "汉字"
    console.log(a1, "length: ", a1.length, getCharacterLength(a1));
    console.log(a2, "length: ", a2.length, getCharacterLength(a2));
  }
  """.}
  proc fun(){.importc.}
  proc getCharacterLength(a: cstring): int {.importc.}
  proc getLength(a: cstring): int {.importjs:"#.length".}

else:
  proc getLength(a: cstring): int = a.len
  import unicode
  proc getCharacterLength(a: cstring): int =
    a.`$`.runeLen

proc main()=
  var strs = ["A\uD87E\uDC04Z".cstring, "汉字".cstring]
  for a in strs:
    echo ($a, a.len, ($a).len, a.getLength, a.getCharacterLength)

  when defined js:
    fun()
main()
nim c
("AZ", 8, 8, 8, 4)
("汉字", 6, 6, 6, 2)

nim js
("A������Z", 8, 20, 8, 8)
("汉字", 2, 6, 2, 2)
A你Z length:  4 3
汉字 length:  2 2

Copy link
Member

@timotheecour timotheecour May 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after factoring out this issue #14229 (which is a separate confounding issue) this is now clearer:

nim c -r -d:case5 $timn_D/tests/nim/all/t10687.nim 
(s: "汉字", len: 6, slen: 6, getLength: 6, getCharacterLength: 2)
(s: "A你Z", len: 6, slen: 6, getLength: 6, getCharacterLength: 3)

nim js -r -d:case5 $timn_D/tests/nim/all/t10687.nim
(s: "汉字", len: 2, slen: 6, getLength: 2, getCharacterLength: 2)
(s: "A你Z", len: 4, slen: 6, getLength: 4, getCharacterLength: 3)
when true: # D20200504T224530
  when defined js:
    {.emit:"""
    function getCharacterLength (str) {
      // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length#Unicode The string iterator that is used here iterates over characters, not mere code units
      return [...str].length;
    }
    function getA3(){
      return "A\uD87E\uDC04Z";
    }
    """.}
    proc getCharacterLength(a: cstring): int {.importc.}
    proc getLength(a: cstring): int {.importjs:"#.length".}
    proc getA3(): cstring {.importc.}
  else:
    var s0 = [ 65'u8, 240, 175, 160, 132, 90, 0] # UTF8 bytes of "A\uD87E\uDC04Z"
    var s = cast[cstring](s0[0].addr)
    proc getA3(): cstring = s
    proc getLength(a: cstring): int = a.len
    import unicode
    proc getCharacterLength(a: cstring): int =
      a.`$`.runeLen
  
  proc main()=
    var strs = ["汉字".cstring, getA3()]
    for a in strs:
      echo (s: $a, len: a.len, slen: ($a).len, getLength: a.getLength, getCharacterLength: a.getCharacterLength)
  main()

lib/system.nim Outdated Show resolved Hide resolved
@timotheecour
Copy link
Member

Reverted count, simple docs change now

good move but I had some comments for count, please ping me when you re-open a draft PR on that

@timotheecour timotheecour mentioned this pull request May 5, 2020
1 task
lib/system.nim Outdated Show resolved Hide resolved
Copy link
Member

@timotheecour timotheecour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Araq Araq merged commit 6b7b5fb into nim-lang:devel May 5, 2020
@metagn metagn deleted the count branch May 5, 2020 08:36
EchoPouet pushed a commit to EchoPouet/Nim that referenced this pull request Jun 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants