Improvements to BOM detection: #2084

Merged
merged 3 commits into from
Apr 19, 2017
1 change: 1 addition & 0 deletions NEWS.md
@@ -15,6 +15,7 @@
* The ability to position `autostart` anywhere inside one of multiple tables in a single file is removed with warning. It used to search upwards from that line to find the start of the table based on a consistent number of columns. People appear to be using `skip="string"` or `skip=nrow` to find the header row exactly, which is retained and simpler. It was too difficult to retain search-upwards-autostart together with skipping blank lines, filling incomplete rows and parallelization. Varying format and height messy header info above the column names is still auto detected and auto skipped.
* `dec=','` is now implemented directly so there is no dependency on locale. The options `datatable.fread.dec.experiment` and `datatable.fread.dec.locale` have been removed.
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan, Hugh Parsonage, Mark Klik and more to add for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087), [#2091](https://github.com/Rdatatable/data.table/issues/2091), [#2107](https://github.com/Rdatatable/data.table/issues/2107), [fst#50](https://github.com/fstpackage/fst/issues/50#issuecomment-294287846)
* Detect the byte order mark (BOM) in GB-18030 and UTF-16 encodings; in verbose mode, print a message about BOM detection.

#### BUG FIXES

2 changes: 2 additions & 0 deletions inst/tests/gb18030.txt
@@ -0,0 +1,2 @@
�1�3x,y,z
,,
4 changes: 4 additions & 0 deletions inst/tests/tests.Rraw
@@ -7906,6 +7906,10 @@ if ("package:bit64" %in% search()) {

# fix for #1087 and #1465
test(1627, charToRaw(names(fread("issue_1087_utf8_bom.csv"))[1L]), as.raw(97L))
test(1627.1, names(fread("issue_1087_utf8_bom.csv", verbose=TRUE))[1L], "a", output="UTF-8 byte order mark EF BB BF found")
test(1627.2, names(fread("gb18030.txt", verbose=TRUE))[1L], "x", output="GB-18030 byte order mark 84 31 95 33 found", warning="GB-18030 encoding detected")
test(1627.3, fread("utf16le.txt"), error="File is encoded in UTF-16")
test(1627.4, fread("utf16be.txt"), error="File is encoded in UTF-16")

# uniqueN gains na.rm argument, #1455
set.seed(1L)
Binary file added inst/tests/utf16be.txt
Binary file not shown.
Binary file added inst/tests/utf16le.txt
Binary file not shown.
23 changes: 21 additions & 2 deletions src/fread.c
@@ -602,11 +602,30 @@ int freadMain(freadMainArgs args) {
}
double tMap = wallclock();

// ********************************************************************************************
// Check whether the file contains BOM (Byte Order Mark), and if yes strip it, modifying
// `mmp`. Also, presence of BOM allows us to reliably detect the file's encoding.
// See: https://en.wikipedia.org/wiki/Byte_order_mark
// See: issues #1087 and #1465
// ********************************************************************************************
if (fileSize >= 3 && memcmp(sof, "\xEF\xBB\xBF", 3) == 0) {
  sof += 3;
  // ienc = CE_UTF8;
  if (args.verbose) DTPRINT("UTF-8 byte order mark EF BB BF found at the start of the file and skipped.\n");
}
else if (fileSize >= 4 && memcmp(sof, "\x84\x31\x95\x33", 4) == 0) {
  sof += 4;
  // ienc = CE_GB18030;
  if (args.verbose) DTPRINT("GB-18030 byte order mark 84 31 95 33 found at the start of the file and skipped.\n");
  DTWARN("GB-18030 encoding detected, however fread() is unable to decode it. Some character fields may be garbled.\n");
}
else if (fileSize >= 2 && (memcmp(sof, "\xFE\xFF", 2) == 0 || memcmp(sof, "\xFF\xFE", 2) == 0)) {
  // Either 0xFE 0xFF or 0xFF 0xFE. Both orders are compared explicitly because a sum test
  // (sof[0] + sof[1]) also matches unrelated byte pairs where char is signed, e.g. '1' followed by 0xCC.
  STOP("File is encoded in UTF-16, this encoding is not supported by fread(). Please recode the file to UTF-8.");
}

// ********************************************************************************************
// Auto detect eol, first eol where there are two (i.e. CRLF)
// ********************************************************************************************
// take care of UTF8 BOM, #1087 and #1465
if (!memcmp(sof, "\xef\xbb\xbf", 3)) sof += 3;
ch = sof;
while (ch<eof && *ch!='\n' && *ch!='\r') {
if (*ch==quote) while(++ch<eof && *ch!=quote) {}; // (TODO unbounded to fix) allows protection of \n and \r inside column names
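For reviewers who want to exercise the BOM check outside `freadMain`, the logic in the `src/fread.c` hunk can be sketched as a standalone function. This is a minimal illustration, not code from the PR: the names `bom_kind` and `detect_bom` are hypothetical, and the sketch only classifies the leading bytes, leaving messages, warnings and errors to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only; neither exists in fread.c. */
typedef enum { BOM_NONE, BOM_UTF8, BOM_GB18030, BOM_UTF16 } bom_kind;

/* Classify the first bytes of buf (len bytes available).
   On return, *skip holds the length of the BOM that was recognised,
   or 0 when no BOM was found. */
static bom_kind detect_bom(const char *buf, size_t len, size_t *skip)
{
    *skip = 0;
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *skip = 3;                      /* UTF-8: EF BB BF */
        return BOM_UTF8;
    }
    if (len >= 4 && memcmp(buf, "\x84\x31\x95\x33", 4) == 0) {
        *skip = 4;                      /* GB-18030: 84 31 95 33 */
        return BOM_GB18030;
    }
    if (len >= 2 && (memcmp(buf, "\xFE\xFF", 2) == 0 ||
                     memcmp(buf, "\xFF\xFE", 2) == 0)) {
        *skip = 2;                      /* UTF-16, either byte order */
        return BOM_UTF16;
    }
    return BOM_NONE;
}
```

A caller would advance its start-of-file pointer by `*skip` for the UTF-8 and GB-18030 cases and stop with an error for `BOM_UTF16`, mirroring the behaviour the tests 1627.1 to 1627.4 expect. Using `memcmp` for every case keeps the result independent of whether `char` is signed on the target platform.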