[Request] Please include "\n" as additional default separator for parameter sep in fread() for improved backwards-compatibility #1073

holgerman · 2015-03-09T00:38:33Z

Hi,

Currently, the parameter sep in function fread defaults to the set [,\t |;:].
I suggest including "\n" as the final separator in the default set, as this might improve backwards compatibility of existing code with previous versions of data.table.
An example would be a file in which each line holds a single string, but some of the default sep characters occasionally appear within a string. Unless sep = "\n" is specified explicitly, this produces an error in 1.9.5 (but not in 1.9.4) because of the string "c:4" on line 3.

Here is an example:
(I am using data.table 1.9.5 devel from 2015-03-08; the txt file is available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0)

myfile = "/net/ifs1/san_projekte/projekte/genstat/09_nutzer/holger/39_dt_request//ex_150309.txt" # available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0


aa = fread(myfile, verbose = T)

## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ':'
## Detected 2 columns. Longest stretch was from line 3 to line 3
## Starting data input on line 3 (either column names or first row of data). First 10 characters: c:4

## Warning in fread(myfile, verbose = T): Starting data input on line 3 and
## discarded previous non-empty line: b

## Some fields on line 3 are not type character (or are empty). Treating as a data row and using default column names.
## Count of eol: 3 (including 1 at the end)
## Count of sep: 1
## nrow = MIN( nsep [1] / ncol [2] -1, neol [3] - nblank [1] ) = 1

## Error in fread(myfile, verbose = T): Expected sep (':') but new line, EOF (or other non printing character) ends field 0 when detecting types (   first): d

aa = fread(myfile, verbose = T, sep = "\n")

## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Using supplied sep '
## ' ... Deducing this is a single column input.
## Starting data input on line 1 (either column names or first row of data). First 10 characters: a
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 4 (including 1 at the end)
## Count of sep: 3
## ncol==1 so sep count ignored
## Type codes (   first 5 rows): 4
## Type codes: 4 (after applying colClasses and integer64)
## Type codes: 4 (after applying drop or select (if supplied)
## Allocating 1 column slots (1 - 0 dropped)
## Read 3 rows. Exactly what was estimated and allocated up front
##    0.000s ( 71%) Memory map (rerun may be quicker)
##    0.000s ( 13%) sep and header detection
##    0.000s (  3%) Count rows (wc -l)
##    0.000s (  6%) Column type detection (first, middle and last 5 rows)
##    0.000s (  3%) Allocation of 3x1 result (xMB) in RAM
##    0.000s (  2%) Reading data
##    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
##    0.000s (  0%) Coercing data already read in type bumps (if any)
##    0.000s (  2%) Changing na.strings to NA
##    0.000s        Total

aa

##      a
## 1:   b
## 2: c:4
## 3:   d
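
For anyone without access to the linked file, the input can probably be recreated locally. The following is a minimal sketch, assuming the file holds just the four lines a, b, c:4 and d with Windows (CRLF) line endings, as the verbose output above suggests. The temporary file and the variable names tmp and res are only for illustration, and whether the call without sep fails depends on the installed data.table version, so it is wrapped in tryCatch.

```r
library(data.table)

# Assumed file contents, inferred from the verbose output above:
# four lines (a, b, c:4, d) with CRLF line endings.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("a\r\nb\r\nc:4\r\nd\r\n"), tmp)

# Explicit sep = "\n": every line is read as one field of a single column.
aa <- fread(tmp, sep = "\n")
print(aa)

# Automatic sep detection: the result depends on the data.table version,
# so capture an error message instead of stopping.
res <- tryCatch(fread(tmp), error = function(e) conditionMessage(e))
print(res)
```

Comparing the two printed results on a given data.table version shows whether the ':' in "c:4" is still picked up as a separator there.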

sessionInfo()

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-suse-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.5 knitr_1.9       
## 
## loaded via a namespace (and not attached):
## [1] chron_2.3-45   evaluate_0.5.5 formatR_1.0    stringr_0.6.2 
## [5] tools_3.1.2

mattdowle · 2017-10-27T23:21:09Z

This should be fixed in dev once 7357a3a is merged. Work done since this issue was raised 2 years ago has significantly improved automatic format detection. I tried the Dropbox link, but it no longer works.

Please try dev and reattach the file if it still doesn't work. Thanks!

arunsrinivasan added the fread label Sep 4, 2015

arunsrinivasan closed this as completed Sep 17, 2015

arunsrinivasan reopened this Sep 17, 2015

st-pasha added the enhancement label Jul 7, 2017

st-pasha mentioned this issue Jul 7, 2017

Master task for fread bugs / proposals #2247

Closed

st-pasha mentioned this issue Aug 28, 2017

fread fails to read 1-column file #2322

Closed

mattdowle closed this as completed Oct 27, 2017
