Feature Request: finer control of `strip.white` in `fread`? Dealing with BOM #1465

MichaelChirico · 2015-12-11T04:19:02Z

I'm working with a series of files, one of which has the UTF-8 BOM marking the beginning of the file: 0xEF 0xBB 0xBF.

As noted here, the default behavior of read.csv is now to detect and delete the BOM. Unfortunately, for me at least, fread seems to have converted the three characters into a space.

Fortunately, strip.white removes this before returning the data.table; unfortunately, my file also has lots of important trailing white space, so I need to set strip.white = FALSE, negating this.

Here's a link to the file I'm working with (caveat clickor: it's a scary executable link, and also non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a US government website): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe

To see the BOM, run:

> r <- readBin("11STAFF.txt", raw(), file.info("11STAFF.txt")$size)
> r[1:10]
 [1] ef bb bf 30 30 30 30 36 37 31
> r[1] == as.raw(0xef)
[1] TRUE
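
Since the file is large, the same check can be done without loading all of it. The following is a minimal sketch in base R that reads only the first three bytes; `has_utf8_bom` is a hypothetical helper name, and the temporary file merely stands in for 11STAFF.txt.

```r
# Hypothetical helper: TRUE if the file starts with the UTF-8 BOM (EF BB BF).
has_utf8_bom <- function(path) {
  # Read at most three raw bytes from the start of the file.
  head_bytes <- readBin(path, what = "raw", n = 3L)
  identical(head_bytes, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Self-contained demo: a temporary file standing in for 11STAFF.txt.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671")), tmp)
has_utf8_bom(tmp)
```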

Here's some relevant output from fread with verbose = TRUE:

> fread("11STAFF.txt", sep = "^", header = FALSE, verbose = TRUE)
...
First 10 characters: 0000671

That is, it has treated the first 3 characters as being a space. With strip.white = TRUE, this space disappears in the output.

I compare this to the behavior of read.csv (also a nuisance to use because the file is on the large side):

> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel                Nancy           FW19554    2011R187  70 70  45880  21809            1  00070007030020530050KGKG1616N100              Abbotsford Sch Dist           Abbotsford Elementary         61010Clark County                  04PO Box A                      Abbotsford WI  54405-0901                                   510 W Hemlock St              Abbotsford WI  54405                                        Abbotsford       WI54405-0901Abbotsford       WI54405     715-223-4281      Gary Gunderson                                    NNN                                                  "

That is, read.csv seems to have deleted the BOM and kept the trailing white space. Just a shame that it's so slow.

For now, I've simply added deleting the BOM to my clean-up routine alluded to here, but it seems like fread should match the behavior of read.csv here.
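
A clean-up step of this kind might look roughly like the sketch below. It is an assumption about one possible approach, not the routine actually used: `drop_utf8_bom` is a hypothetical helper that works on raw bytes, so trailing white space is left untouched.

```r
# Hypothetical helper: return the bytes without a leading UTF-8 BOM, if any.
drop_utf8_bom <- function(bytes) {
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes[-(1:3)]
  } else {
    bytes
  }
}

# In-memory demo: BOM followed by a field with trailing spaces.
with_bom <- c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("000067182Abel   "))
clean <- drop_utf8_bom(with_bom)
rawToChar(clean)
```

The cleaned bytes could then be written to a temporary file with writeBin and passed to fread with strip.white = FALSE, at the cost of holding the whole file in memory once.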


arunsrinivasan · 2016-03-07T01:12:53Z

Fixed in commit 9fa61a9

arunsrinivasan closed this as completed Mar 7, 2016

arunsrinivasan added this to the v1.9.8 milestone Mar 7, 2016

arunsrinivasan added the fread label Mar 7, 2016

arunsrinivasan self-assigned this Mar 7, 2016

mattdowle mentioned this issue Mar 28, 2017

Improvements to BOM detection: #2084

Merged
