Use Rf_charIsASCII for IS_ASCII instead of testing LEVELS on R >= 4.5 #6422

aitap · 2024-08-28T16:03:32Z

This part is split from #6420 (but includes #6420; can rebase if needed) because the R version containing Rf_charIsASCII is not currently released. The overall issue is #6180. Not yet sure how to change NEWS.md in case we're expecting data.table releases in the meantime.

Since data.table now depends on R >= 3.3, the backports are no longer needed. Moreover, MAYBE_SHARED is currently a function, while MAYBE_REFERENCED expands to !NO_REFERENCES (which is a function). In debugging output, show MAYBE_REFERENCED (NAMED > 0) instead of NAMED.

getCharCE appeared in R-2.7, making it possible to check for strings _marked_ as UTF-8 or Latin-1. There is no marking as ASCII, so fixing IS_ASCII will have to wait for R >= 4.5.

There's no explicit encoding code for ASCII, so use charIsASCII() ("eapi", expected to appear in R-4.5.0).
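To make concrete what an ASCII check amounts to at the byte level, here is a minimal sketch in plain C. It is not R's implementation of charIsASCII() and does not touch R's headers or the CHARSXP flag bits; the names bytes_are_ascii and bytes_are_ascii_demo are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not part of R or data.table: a NUL-terminated
 * string is pure ASCII when every byte is below 0x80. */
bool bytes_are_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; ++p) {
        if (*p >= 0x80) {
            return false;
        }
    }
    return true;
}

/* Small self-check showing the expected behaviour. */
void bytes_are_ascii_demo(void)
{
    assert(bytes_are_ascii("data.table"));
    assert(bytes_are_ascii(""));
    assert(!bytes_are_ascii("caf\xc3\xa9")); /* "café" encoded as UTF-8 */
}
```

A scan like this costs time proportional to the string length, which is one reason a flag or an API call is preferable in hot paths.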

github-actions · 2024-08-28T16:21:47Z

Generated via commit 090dc37

Download link for the artifact containing the test results: ↓ atime-results.zip

Task	Duration
R setup and installing dependencies	4 minutes and 44 seconds
Installing different package versions	8 minutes and 23 seconds
Running and plotting the test cases	2 minutes and 24 seconds

MichaelChirico · 2024-08-28T16:28:19Z

> Not yet sure how to change NEWS.md in case we're expecting data.table releases in the meantime.

Don't mind that, we maintainers will take care of it :)

MichaelChirico · 2024-08-28T16:40:24Z

set #6420 as the target of this PR to make the chain clearer

HughParsonage

Is it an issue that the pre-4.5.0 version returns 1 or 0 whereas the new version returns an Rboolean?

aitap · 2024-08-29T08:52:00Z

Good catch! It's a bit worse: LEVELS(x) & 64 returns 0 or 64. Here are the only users of IS_ASCII:

data.table.h: #define NEED2UTF8(s) !(IS_ASCII(s) || (s)==NA_STRING || IS_UTF8(s))
fwriteR.c: #define TO_NATIVE(s) (native && (s)!=NA_STRING && !IS_ASCII(s))
forder.c: if (!anynotutf8 && /*...*/ !IS_ASCII(s)) {

They all use the result of the macro in a boolean context (if / || / &&), so any non-zero result is interpreted as true.
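The point about boolean context can be checked with a small standalone sketch. Everything here is a hypothetical stand-in: old_is_ascii mimics a bit test that yields 0 or 64, new_is_ascii mimics a function returning a boolean, and the need2utf8_* functions only copy the shape of NEED2UTF8. None of it uses R's headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two flavours of the ASCII test.
 * `levels` plays the role of the flag bits of a string object. */
int old_is_ascii(int levels) { return levels & 64; }          /* 0 or 64 */
bool new_is_ascii(int levels) { return (levels & 64) != 0; }  /* false or true */

/* Same shape as NEED2UTF8: the ASCII test is only used for its truthiness. */
bool need2utf8_old(int levels, bool is_na, bool is_utf8)
{
    return !(old_is_ascii(levels) || is_na || is_utf8);
}

bool need2utf8_new(int levels, bool is_na, bool is_utf8)
{
    return !(new_is_ascii(levels) || is_na || is_utf8);
}

/* The raw values differ, but the boolean-context results agree. */
void boolean_context_demo(void)
{
    assert(old_is_ascii(64) == 64);
    assert(old_is_ascii(64) != 1);
    assert(new_is_ascii(64) == 1);
    assert(need2utf8_old(64, false, false) == need2utf8_new(64, false, false));
    assert(need2utf8_old(0, false, false) == need2utf8_new(0, false, false));
}
```

The two flavours would only diverge if some caller compared the result against 1 or TRUE directly, which none of the three listed users does.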

MichaelChirico · 2025-03-04T17:14:15Z

src/data.table.h

@@ -42,7 +42,11 @@
 /* we mean the encoding bits, not CE_NATIVE in a UTF-8 locale */
 #define IS_UTF8(x)  (getCharCE(x) == CE_UTF8)
 #define IS_LATIN(x) (getCharCE(x) == CE_LATIN1)
-#define IS_ASCII(x) (LEVELS(x) & 64) // API expected in R >= 4.5
+#if R_VERSION < R_Version(4, 5, 0) || R_SVN_REVISION < 86789


Keeping this here even though there's another (4,5,0) check above at L21

The R_SVN_REVISIONs are different

The IS_UTF8() and IS_LATIN() macros are also right here
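As a rough illustration of how a gate like the one in the hunk selects a branch at compile time, here is a standalone sketch. The R_Version formula matches the macro in R's Rversion.h, but the R_VERSION and R_SVN_REVISION values below are placeholders for a hypothetical build, and the branch contents are assumptions inferred from the PR title, since the hunk shown here is truncated. The names IS_ASCII_BACKEND, is_ascii_backend and version_gate_demo are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Same formula as the R_Version() macro in R's Rversion.h, redefined so the
 * sketch compiles without R installed. */
#ifndef R_Version
# define R_Version(v, p, s) (((v) * 65536) + ((p) * 256) + (s))
#endif

/* Hypothetical build: placeholder values, normally provided by Rversion.h. */
#ifndef R_VERSION
# define R_VERSION R_Version(4, 5, 0)
#endif
#ifndef R_SVN_REVISION
# define R_SVN_REVISION 86789
#endif

/* Assumed branch contents: labels only, the real macro bodies are not shown. */
#if R_VERSION < R_Version(4, 5, 0) || R_SVN_REVISION < 86789
# define IS_ASCII_BACKEND "LEVELS bit test"
#else
# define IS_ASCII_BACKEND "Rf_charIsASCII"
#endif

const char *is_ascii_backend(void) { return IS_ASCII_BACKEND; }

void version_gate_demo(void)
{
    /* Version numbers compare as plain integers. */
    assert(R_Version(4, 4, 3) < R_Version(4, 5, 0));
    /* With the placeholder values above, the new-API branch is selected. */
    assert(strcmp(is_ascii_backend(), "Rf_charIsASCII") == 0);
}
```

The extra revision check matters for development builds that already report version 4.5.0 but predate the revision in which the function appeared.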

MichaelChirico

I think this is ready to go now, too, right? Please merge if you agree. Thanks!

codecov · 2025-03-04T17:23:32Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.59%. Comparing base (93a5305) to head (090dc37).
Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #6422   +/-   ##
=======================================
  Coverage   98.59%   98.59%           
=======================================
  Files          79       79           
  Lines       14661    14661           
=======================================
  Hits        14455    14455           
  Misses        206      206


aitap added 4 commits August 28, 2024 16:36

Almost drop direct use of LEVELS

46dbfa9

getCharCE appeared in R-2.7, making it possible to check for strings _marked_ as UTF-8 or Latin-1. There is no marking as ASCII, so fixing IS_ASCII will have to wait for R >= 4.5.

NEWS entry for NAMED, LEVELS reduction

a524d14

Drop direct use of LEVELS in R >= 4.5

72cbd17

There's no explicit encoding code for ASCII, so use charIsASCII() ("eapi", expected to appear in R-4.5.0).

aitap requested review from HughParsonage and MichaelChirico as code owners August 28, 2024 16:03

MichaelChirico reviewed Aug 28, 2024

NEWS.md (outdated, resolved)

fine-tune NEWS

b409a40

MichaelChirico changed the base branch from master to nonapi_b_gone August 28, 2024 16:39

MichaelChirico mentioned this pull request Aug 28, 2024

Drop direct use of NAMED and LEVELS #6420

Merged

Base automatically changed from nonapi_b_gone to master August 28, 2024 16:42

HughParsonage reviewed Aug 29, 2024


MichaelChirico marked this pull request as draft September 12, 2024 03:55

MichaelChirico changed the title ~~Use Rf_isCharASCII for IS_ASCII instead of testing LEVELS on R >= 4.5~~ Use Rf_charIsASCII for IS_ASCII instead of testing LEVELS on R >= 4.5 Dec 9, 2024

MichaelChirico reviewed Mar 4, 2025

NEWS.md (outdated, resolved)

multiple PRs :)

01385c1

MichaelChirico reviewed Mar 4, 2025

src/data.table.h (outdated, resolved)

tag SVN revision

6a29ea9

MichaelChirico marked this pull request as ready for review March 4, 2025 17:11

Merge branch 'master' into R_4_5_isCharASCII

31e324b

MichaelChirico reviewed Mar 4, 2025

src/data.table.h (resolved)

TODO

9f5cc4f

MichaelChirico approved these changes Mar 4, 2025


fix merge

090dc37

aitap merged commit 6a8634e into master Mar 5, 2025
9 of 11 checks passed

aitap deleted the R_4_5_isCharASCII branch March 5, 2025 10:19
