Accented column names: current regression #1726

fabnicol · 2016-06-03T00:55:38Z

A regression has crept in some time after March 12 (sha1 c250e9f) and before current master branch code as of June 2nd.

It is related to accented (column) variable names, specifically when the syntax dt[ , accented_variable] is used, i.e. dt[ , Année].
Error message says the Année object is not found.
The bug does not show up when the alternative syntax dt[ , "Année", with = FALSE] is used or with non-accented variable names.

Platform: Windows 10, libraries built from source code using Rtools 3.3.0.1959; encoding is ISO-8859-1.
Edit: the bug shows up under Windows 7 too.

jangorecki · 2016-06-03T01:06:57Z

Could you provide a reproducible example? I tested on Ubuntu with the following code and could not reproduce the issue, so it may be Windows-specific. In any case, a reproducible example is important for addressing any issue.

library(data.table)
dt=data.table(Année=1)
dt[,Année]
#[1] 1

fabnicol · 2016-06-03T12:53:47Z

Partial solution: the issue comes from the "encoding" parameter of fread.
Minimal example:

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]
# Error in eval(expr, envir, enclos) : object 'Année' not found

B <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "unknown")
B[, Année]
# [1] 2011

It does not seem that "Latin-1" is a wrong value however:

data.table::fread(input="Année;Mois\n2011;1", sep=";",   encoding = "ISO-8859-1")        
# Error in data.table::fread(input = "Année;Mois\n2011;1", sep = ";", encoding = "ISO-8859-1") :     
#   Argument 'encoding' must be 'unknown', 'UTF-8' or 'Latin-1'.

Currently I've been circumventing the issue using "unknown".
This is obviously not ideal, as the encoding parameter looks faulty.

jangorecki · 2016-06-03T17:00:22Z

The fread manual is quite clear on the allowed values for that argument, so there is no point in trying ISO-8859-1. Latin-1 should work here: it works when reading the data, but afterwards that column cannot be accessed.
I'm able to reproduce it on Ubuntu using the latin1.txt data and a recent devel version.

library(data.table)
A = data.table::fread("https://github.com/Rdatatable/data.table/files/298049/latin1.txt", sep=";", encoding="Latin-1")
A[, Année]
#Error in eval(expr, envir, enclos) : object 'Année' not found 
Encoding(names(A))
#[1] "latin1"  "unknown"
sessionInfo()
#R version 3.3.0 (2016-05-03)
#Platform: x86_64-pc-linux-gnu (64-bit)
#Running under: Ubuntu 15.10
#
#locale:
# [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
# [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
# [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
# [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
# [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#
#attached base packages:
#[1] stats     graphics  grDevices utils     datasets  methods   base     
#
#other attached packages:
#[1] data.table_1.9.7
#
#loaded via a namespace (and not attached):
#[1] curl_0.9.7

Issue looks to be related to #1680

arunsrinivasan · 2016-06-03T19:39:48Z

Thanks for catching and reporting this. I've not looked at the code, but I think my assumption that mkCharLenCE() marks after checking iff the string's encoding matches desired encoding is incorrect. We'll need to add a check ourselves.
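
To illustrate what "marking without checking" means, here is a minimal base R sketch. It is only an R-level analogue using `Encoding<-`, not the C code in data.table that calls mkCharLenCE(), and the variable names are made up for the illustration. It shows that setting an encoding mark declares how the bytes should be read but neither converts nor validates them.

```r
# UTF-8 bytes for "Année": the é is the two bytes c3 a9.
utf8_bytes <- as.raw(c(0x41, 0x6e, 0x6e, 0xc3, 0xa9, 0x65))
s <- rawToChar(utf8_bytes)

# Declare (wrongly) that these bytes are Latin-1.
# The mark is set, but nothing is converted or checked.
Encoding(s) <- "latin1"
mark_after      <- Encoding(s)
bytes_unchanged <- identical(charToRaw(s), utf8_bytes)

# R now reads c3 and a9 as two separate Latin-1 characters (Ã and ©),
# so the string no longer compares equal to the intended name.
matches_intended <- s == "Ann\u00e9e"
matches_mojibake <- s == "Ann\u00c3\u00a9e"

print(c(mark = mark_after,
        bytes_unchanged = bytes_unchanged,
        matches_intended = matches_intended,
        matches_mojibake = matches_mojibake))
```

A mark that does not agree with the actual bytes therefore yields a name that a symbol typed in the session will not match, which is consistent with the "object not found" errors reported above.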

fabnicol · 2016-06-04T17:15:15Z

@arunsrinivasan
Correct assumption.
The regression was caused by the fix involving mkCharLenCE() at commit f91bba1 on April 27.
Retesting with the library built from source at the preceding commit d6f7959, my minimal test above gives the correct output for A[, Année].

arunsrinivasan · 2016-08-31T20:40:03Z

@fabnicol why did you close this?

fabnicol · 2018-09-16T23:27:31Z

Follow-up on this issue and related bug (in my opinion same cause).
Commit: faeae2e
Same issue with 1.11.4.

Reproducible examples:

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]
# Error in `[.data.table`(A, , Année) :
#   j (the 2nd argument inside [...]) is a single symbol but column name 'Année' is not found. Perhaps you intended DT[, ..Année]. This difference to data.frame is deliberate and explained in FAQ 1.1.

system("echo 'A[, Année][]' > a.R && iconv -f UTF-8 -t ISO-8859-1 a.R > b.R")
source("b.R", encoding="ISO-8859-1")
# same error message

fabnicol · 2018-09-16T23:31:28Z

Since this nagging issue seems to be undocumented and is very annoying for coders who do not work only in English, I would advise adding a prominent warning to the official documentation, to the effect that Latin-1 tables should not have non-ASCII column names (accented values in the rows are OK).
Below is a hack that may come in handy for users and give ideas to devs:

names(A) <- iconv(names(A), to = "UTF-8")
A[, Année]
# [1] 2011

I usually work around things in this (not ideal) way.
Pending deeper fixes, fread could be patched as follows:

fread_ <- function(...) {
  DT <- data.table::fread(...)
  # Convert Latin-1 marked column names to UTF-8 so that symbols in j resolve
  if (any(Encoding(names(DT)) == "latin1")) names(DT) <- iconv(names(DT), to = "UTF-8")
  DT
}

A <- fread_(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")
A[, Année]

yields the expected 2011
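
A possible variant of this workaround, sketched in base R only: `enc2utf8()` takes the marked encoding of each string into account, whereas `iconv()` called without `from` assumes the session's native encoding, as far as I understand the base R documentation. That could matter on a UTF-8 system such as the Ubuntu setup above. The helper name `names_to_utf8` is hypothetical, and the example uses a plain data.frame as a stand-in, so it makes no claim about how data.table itself behaves.

```r
# Hypothetical helper: re-encode all column names to UTF-8,
# honouring whatever encoding mark each name carries.
names_to_utf8 <- function(x) {
  names(x) <- enc2utf8(names(x))
  x
}

# Build a genuinely Latin-1 name from raw bytes ("Année" in ISO-8859-1),
# so the example does not depend on the encoding of this source file.
nm <- rawToChar(as.raw(c(0x41, 0x6e, 0x6e, 0xe9, 0x65)))
Encoding(nm) <- "latin1"

df <- data.frame(x = 2011, Mois = 1)   # stand-in for the table read by fread
names(df)[1] <- nm

df2 <- names_to_utf8(df)
print(Encoding(names(df)))    # marks before conversion
print(Encoding(names(df2)))   # marks after conversion
```

ASCII-only names such as Mois are left untouched, so the helper should be safe to apply unconditionally.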

jangorecki · 2020-05-26T18:50:00Z

I would say non-ascii names should be avoided in the first place, see #4351

fabnicol · 2020-05-26T23:29:47Z

> I would say non-ascii names should be avoided in the first place, see #4351

I rather disagree with this. The point of this issue is that up to commit c250e9f, accented column names were entirely OK. They are also OK with base R, at least for Western Latin-script languages of the ISO-8859-1x family. So this cannot be an R problem, contrary to what is written in the comments of issue #4351.

jangorecki · 2020-05-27T08:51:16Z

@fabnicol Thanks for following up. Could you test whether the following yields the expected results on the "OK" version?

dt[ , Année := 1L]
dt[ , "Année2" := 2L]

AFAIU non-ASCII names work in many places, but not in all.
In that case I would lean towards Tomas Kalibera's advice.

fabnicol · 2020-05-27T14:44:47Z

I have rerun my reproducible test from the post above with R version 4.0.0 (2020-04-24) -- "Arbor Day" under Windows 10.
The result is OK now: A[, Année] gives the expected value 2011.
Your assignment tests are OK too: A[, Année := 1L] yields 1, and changing the variable to Année2 makes no difference.
So it looks like the regression that appeared after commit c250e9f was cured somewhere along the way.
I would suggest closing the issue.

jangorecki · 2020-05-27T14:49:54Z

It seems that Année is UTF-8, but not ASCII. If you tried to use a non-UTF-8 string as a column name, you would run into trouble. If it is fixed, then before closing we should submit a unit test so that we are informed if the behavior changes.

fabnicol · 2020-05-27T15:01:54Z

Année cannot be ASCII, although it can be Latin-1 (i.e. ISO-8859-1) or UTF-8, as there are no accented vowels in the ASCII table.
In my fread example above, quoted from a 2016 post, you will have noticed that the string is imported as Latin-1, not UTF-8. So apparently the issue is closed. The above test stands as a unit test for me.
I've also tried with a real ISO-8859-1 csv input file, with or without encoding parameter, and it makes no difference under W10.
It would be interesting to precisely pinpoint which commit solved the issue just based on the simple above test.

fabnicol · 2020-08-18T20:18:34Z

Issue is currently closed as bug is now fixed with R 4.0.2

A <- data.table::fread(input="Année;Mois\n2011;1", sep=";", encoding = "Latin-1")

A[ , Année]
# [1] 2011

jangorecki · 2020-08-19T05:55:09Z

I think it makes sense to add a test for that. We can also skip that test on older versions of R.

fabnicol · 2020-08-19T16:50:32Z

An interesting side issue is that with the current R-devel-win branch for Windows UTF-8, the issue remains if the encoding of the table is Latin-1, but not if it is UTF-8.
This suggests that data.table relies on R's default system encoding for strings, whereas it should process the input considering both the system encoding and the value of the encoding parameter.
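
When comparing behaviour across systems, it may help to report the encoding mark and raw bytes of each column name alongside the session locale. The sketch below uses base R only; `describe_names` is a hypothetical helper, and the data.frame stands in for a table read with encoding = "Latin-1".

```r
# Hypothetical diagnostic: one row per column name, showing its
# encoding mark and its bytes in hexadecimal.
describe_names <- function(x) {
  nms <- names(x)
  data.frame(
    name  = enc2utf8(nms),
    mark  = Encoding(nms),
    bytes = vapply(nms, function(n) paste(charToRaw(n), collapse = " "),
                   character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE,
    row.names = NULL
  )
}

# Stand-in table with one Latin-1 marked name and one ASCII name.
nm <- rawToChar(as.raw(c(0x41, 0x6e, 0x6e, 0xe9, 0x65)))  # "Année" in ISO-8859-1
Encoding(nm) <- "latin1"
df <- data.frame(x = 2011, Mois = 1)
names(df)[1] <- nm

info <- describe_names(df)
print(info)

# Session context that is worth including in a bug report.
print(Sys.getlocale("LC_CTYPE"))
print(l10n_info()[["UTF-8"]])
```

Posting this output together with sessionInfo() would make it easier to tell whether a failure depends on the name's mark, on the session locale, or on both.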

arunsrinivasan added bug regression labels Jun 3, 2016

arunsrinivasan mentioned this issue Jun 22, 2016

robust encoding in fread (like 'fread("iconv -f ISO-8859-1 -t UTF-8 mytextfile.txt")') #1748

Closed

fabnicol closed this as completed Aug 31, 2016

arunsrinivasan reopened this Aug 31, 2016

MichaelChirico mentioned this issue Aug 17, 2019

Multibytes colnames in non-native encoding cause errors in complex j expression with keyby #3722

Open

jangorecki added the encoding issues related to Encoding label May 26, 2020

fabnicol closed this as completed Aug 18, 2020

jangorecki reopened this Aug 19, 2020
