[Request] Please include "\n" as additional default separator for parameter sep in fread() for improved backwards-compatibility #1073

Closed
holgerman opened this issue Mar 9, 2015 · 1 comment

@holgerman

Hi,

Currently, the parameter sep in function fread defaults to the set [,\t |;:]
I suggest including "\n" as the final separator in the default set, as this might improve backwards compatibility of existing code with previous versions of data.table.
An example would be a file where each line holds a single string, but occasionally one of the default sep characters is part of that string. Without explicitly specifying sep = "\n", this produces an error in 1.9.5 (but not in 1.9.4) because of the string "c:4" in line 3.

Here is an example:
(I am using the data.table 1.9.5 devel version from 8 March 2015; the txt file is available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0)

myfile = "/net/ifs1/san_projekte/projekte/genstat/09_nutzer/holger/39_dt_request//ex_150309.txt" # available at https://www.dropbox.com/s/y6cmkcza36c1qjn/ex_150309.txt?dl=0


aa = fread(myfile, verbose = T)

## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Detecting sep ... ':'
## Detected 2 columns. Longest stretch was from line 3 to line 3
## Starting data input on line 3 (either column names or first row of data). First 10 characters: c:4

## Warning in fread(myfile, verbose = T): Starting data input on line 3 and
## discarded previous non-empty line: b

## Some fields on line 3 are not type character (or are empty). Treating as a data row and using default column names.
## Count of eol: 3 (including 1 at the end)
## Count of sep: 1
## nrow = MIN( nsep [1] / ncol [2] -1, neol [3] - nblank [1] ) = 1

## Error in fread(myfile, verbose = T): Expected sep (':') but new line, EOF (or other non printing character) ends field 0 when detecting types (   first): d

aa = fread(myfile, verbose = T, sep = "\n")

## Input contains no \n. Taking this to be a filename to open
## File opened, filesize is 0.000000 GB.
## Memory mapping ... ok
## Detected eol as \r\n (CRLF) in that order, the Windows standard.
## Positioned on line 1 after skip or autostart
## This line is the autostart and not blank so searching up for the last non-blank ... line 1
## Using supplied sep '
## ' ... Deducing this is a single column input.
## Starting data input on line 1 (either column names or first row of data). First 10 characters: a
## All the fields on line 1 are character fields. Treating as the column names.
## Count of eol: 4 (including 1 at the end)
## Count of sep: 3
## ncol==1 so sep count ignored
## Type codes (   first 5 rows): 4
## Type codes: 4 (after applying colClasses and integer64)
## Type codes: 4 (after applying drop or select (if supplied)
## Allocating 1 column slots (1 - 0 dropped)
## Read 3 rows. Exactly what was estimated and allocated up front
##    0.000s ( 71%) Memory map (rerun may be quicker)
##    0.000s ( 13%) sep and header detection
##    0.000s (  3%) Count rows (wc -l)
##    0.000s (  6%) Column type detection (first, middle and last 5 rows)
##    0.000s (  3%) Allocation of 3x1 result (xMB) in RAM
##    0.000s (  2%) Reading data
##    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
##    0.000s (  0%) Coercing data already read in type bumps (if any)
##    0.000s (  2%) Changing na.strings to NA
##    0.000s        Total

aa

##      a
## 1:   b
## 2: c:4
## 3:   d
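
In case the linked file is unavailable, its contents can be approximated from the printed result above: four lines (a, b, c:4, d) with CRLF line endings. The following is only a sketch under that assumption; the temporary file name is hypothetical and the original file may differ in details such as trailing whitespace. The fread call is guarded because newer data.table versions may detect the format differently.

```r
# Assumed reconstruction of ex_150309.txt (inferred from the output, not the original file)
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")   # binary mode so "\r\n" is written verbatim on every platform
writeLines(c("a", "b", "c:4", "d"), con, sep = "\r\n")
close(con)

# Base R check that the file holds one string per line
lines_read <- readLines(tmp)
print(lines_read)

# Read as a single column by explicitly passing sep = "\n", as in the report
if (requireNamespace("data.table", quietly = TRUE)) {
  aa <- data.table::fread(tmp, sep = "\n")
  print(aa)
}
```

Calling fread(tmp) without sep on the affected version is what should trigger the separator auto-detection described above.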

sessionInfo()

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-suse-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.5 knitr_1.9       
## 
## loaded via a namespace (and not attached):
## [1] chron_2.3-45   evaluate_0.5.5 formatR_1.0    stringr_0.6.2 
## [5] tools_3.1.2
@mattdowle
Member

This should be fixed in dev now, when 7357a3a is merged. Work since this issue was raised 2 years ago has significantly improved automatic format detection. I tried the Dropbox link but it's no longer working.

Please try dev and reattach the file if it still doesn't work. Thanks!
