Feature Request: finer control of strip.white in fread? Dealing with BOM #1465

Closed
MichaelChirico opened this issue Dec 11, 2015 · 1 comment


@MichaelChirico
Member

I'm working with a series of files, one of which starts with the UTF-8 BOM, i.e. the three bytes 0xef 0xbb 0xbf.

As noted here, the default behavior of read.csv is now to detect and delete the BOM. Unfortunately, for me at least, fread seems to have converted the three BOM bytes into a space.

Fortunately, strip.white removes this before returning the data.table; unfortunately, my file also has lots of important trailing white space, so I need to set strip.white = FALSE, negating this.

Here's a link to the file I'm working with (caveat clickor: it's a scary executable link, and also non-trivial size, ~80 MB. For whatever reason they decided to "zip" the file with an executable. My only word of reassurance is that you can tell it's a US government website): http://lbstat.dpi.wi.gov/sites/default/files/imce/lbstat/exe/11STAFF.exe

To see the BOM, run:

> r <- readBin("11STAFF.txt", raw(), file.info("11STAFF.txt")$size)
> r[1:10]
 [1] ef bb bf 30 30 30 30 36 37 31
> r[1] == as.raw(0xef)
[1] TRUE
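
Reading the whole ~80 MB file just to inspect its first few bytes is more than the check needs. As a minimal sketch, the same test can be done on the first three bytes only. The helper name has_utf8_bom and the temporary demo files are hypothetical, not part of data.table or base R:

```r
# Hypothetical helper (not part of data.table or base R): TRUE when the
# file begins with the UTF-8 BOM bytes ef bb bf.
has_utf8_bom <- function(path) {
  # Read at most 3 bytes; a shorter file simply fails the comparison.
  first3 <- readBin(path, what = raw(), n = 3L)
  identical(first3, as.raw(c(0xef, 0xbb, 0xbf)))
}

# Self-contained demonstration on two small temporary files.
with_bom <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671\n")), with_bom)

without_bom <- tempfile(fileext = ".txt")
writeBin(charToRaw("0000671\n"), without_bom)

has_utf8_bom(with_bom)
has_utf8_bom(without_bom)
```

On the file from this report, the call would be has_utf8_bom("11STAFF.txt").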

Here's some relevant output from fread with verbose = TRUE:

> fread("11STAFF.txt", sep = "^", header = FALSE, verbose = TRUE)
...
First 10 characters: 0000671

That is, it has treated the first 3 bytes (the BOM) as being a space. With strip.white = TRUE, this space disappears in the output.

I compare this to the behavior of read.csv (also a nuisance to use because the file is on the large side):

> read.csv("11STAFF.txt", sep = "^", header = FALSE, stringsAsFactors = FALSE)$V1[1]
[1] "000067182Abel                Nancy           FW19554    2011R187  70 70  45880  21809            1  00070007030020530050KGKG1616N100              Abbotsford Sch Dist           Abbotsford Elementary         61010Clark County                  04PO Box A                      Abbotsford WI  54405-0901                                   510 W Hemlock St              Abbotsford WI  54405                                        Abbotsford       WI54405-0901Abbotsford       WI54405     715-223-4281      Gary Gunderson                                    NNN                                                  "

That is, read.csv seems to have deleted the BOM and kept the trailing white space. Just a shame that it's so slow.

For now, I've simply added deleting the BOM to my clean-up routine alluded to here, but it seems like fread should match the behavior of read.csv here.
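
For anyone hitting the same problem, here is one possible shape for such a work-around, sketched under assumptions: it is not necessarily the clean-up routine mentioned above. The helper name strip_utf8_bom and the demo file are hypothetical. It drops the three BOM bytes at the byte level and leaves everything else, including trailing white space, untouched, so the cleaned copy can then be read with strip.white = FALSE:

```r
# Hypothetical helper: copy `src` to `dest`, dropping a leading UTF-8 BOM
# (bytes ef bb bf) if present. All other bytes are written unchanged.
strip_utf8_bom <- function(src, dest) {
  bytes <- readBin(src, what = raw(), n = file.info(src)$size)
  bom <- as.raw(c(0xef, 0xbb, 0xbf))
  if (length(bytes) >= 3L && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  writeBin(bytes, dest)
  invisible(dest)
}

# Self-contained demonstration: a one-line file with a BOM and
# trailing white space that must survive the clean-up.
src <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xef, 0xbb, 0xbf)), charToRaw("0000671  \n")), src)

clean <- tempfile(fileext = ".txt")
strip_utf8_bom(src, clean)

# The cleaned copy starts directly with the data bytes.
cleaned_bytes <- readBin(clean, what = raw(), n = file.info(clean)$size)
rawToChar(cleaned_bytes)

# The cleaned file could then be read as in the report, e.g.
# fread(clean, sep = "^", header = FALSE, strip.white = FALSE)
```

This reads the whole file into memory once, which should be acceptable for a file of roughly 80 MB but is an assumption worth revisiting for much larger inputs.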

@arunsrinivasan
Member

Fixed in commit 9fa61a9
