fread ignores first 2 lines of file with tabs after data in several lines #1416

Closed
tdhock opened this issue Oct 29, 2015 · 4 comments
@tdhock
Member

tdhock commented Oct 29, 2015

To see the problem please download https://gist.githubusercontent.com/tdhock/bc1d9c0a76e4df386199/raw/fe436fb41e42c9ba89d8b567abfca587d323aa36/iterations.txt

and then run

> data.table::fread("iterations.txt")
     V1       V2           V3     V4 V5
 1:   2 -3274352 63539392.790   23.6 NA
 2:   3 -2166076  1108276.003   34.5 NA
 3:   4 -2054893   111182.812   45.3 NA
 4:   5 -1992373    62520.297   56.0 NA
 5:   6 -1913422    78950.778   66.8 NA
 6:   7 -1802173   111249.167   77.7 NA
...

The problem seems to be that the first two lines do not have a trailing tab but all the rest of the lines do (screenshot from emacs which highlights all the tab characters):

screenshot-emacs-tab
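
The field-count mismatch described above can be checked without the original download. The following is a minimal sketch in base R, assuming a three-line stand-in file: the header text is taken from the read.table output in this report and the two data lines from the fread warnings quoted later in the thread. The names `lines`, `tmp`, `raw`, `n_fields` and `df` are illustrative only.

```r
# Miniature stand-in for iterations.txt (an assumption, not the real file):
# the header and first data line have no trailing tab, the next line does.
lines <- c(
  "Iteration\tEstimatedLogLikelihood\tChange\tTotalTimeSec",
  "1\t-66813744.730\t-\t12.3",
  "2\t-3274351.940\t63539392.790\t23.6\t"  # note the trailing tab
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Count tab-separated fields per line: number of tabs plus one.
raw <- readLines(tmp)
n_fields <- lengths(regmatches(raw, gregexpr("\t", raw, fixed = TRUE))) + 1L
print(n_fields)

# read.table's default sep = "" treats any run of whitespace as one separator,
# so the trailing tab does not create an extra field there.
df <- read.table(tmp, header = TRUE)
print(dim(df))
```

A strict tab-separated reader sees four fields on the first two lines and five afterwards, which is consistent with the different results from fread and read.table.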

I expected the output of read.table:

> head(read.table("toby/iterations/iterations.txt", header=TRUE))
  Iteration EstimatedLogLikelihood       Change TotalTimeSec
1         1              -66813745            -         12.3
2         2               -3274352 63539392.790         23.6
3         3               -2166076  1108276.003         34.5
4         4               -2054893   111182.812         45.3
5         5               -1992373    62520.297         56.0
6         6               -1913422    78950.778         66.8

Thanks for looking into this,

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_CA.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_CA.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] httr_0.6.1       tools_3.2.2      parallel_3.2.2   whisker_0.3-2   
 [5] RCurl_1.95-4.1   memoise_0.1      data.table_1.9.6 stringr_0.6.2   
 [9] digest_0.6.8     chron_2.3-45     devtools_1.5     evaluate_0.5.5  
> 
@MichaelChirico
Member

It seems the fill = TRUE argument (available since v1.9.8) takes care of this issue. Can you confirm, @tdhock?

@tdhock
Member Author

tdhock commented May 25, 2017

Now I get:

> fread("iterations.txt")
     V1       V2           V3     V4 V5
 1:   2 -3274352 63539392.790   23.6 NA
 2:   3 -2166076  1108276.003   34.5 NA
...
98:  99 -1515355        1.227 1090.5 NA
99: 100 -1515354        1.222 1102.0 NA
     V1       V2           V3     V4 V5
Warning message:
In fread("iterations.txt") :
  Starting data input on line 3 <<2	-3274351.940	63539392.790	23>> with 5 fields and discarding line 2 <<1	-66813744.730	-	12.3>> before it because it has a different number of fields (4).

Above, without fill=TRUE, I get a warning, which is definitely helpful. It would be even more helpful if it added something like "try fill=TRUE to read rather than drop these data." Would you consider adding that to the warning?
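
Until the warning mentions fill=TRUE, a caller can approximate the suggestion on their own side. The following is a minimal sketch, assuming a hypothetical wrapper `read_retry_fill` (not part of data.table) that retries with fill=TRUE whenever the first read warns. The `stub_reader` is a made-up stand-in so the example runs without the real file; in practice the default `data.table::fread` would be used.

```r
# Hypothetical wrapper (not a data.table function): read once, and if the
# reader emits any warning, discard that result and retry with fill = TRUE.
read_retry_fill <- function(path, reader = data.table::fread) {
  tryCatch(
    reader(path),
    warning = function(w) {
      message("Retrying with fill=TRUE after warning: ", conditionMessage(w))
      reader(path, fill = TRUE)
    }
  )
}

# Made-up stand-in for a reader that warns unless fill = TRUE.
stub_reader <- function(path, fill = FALSE) {
  if (!fill) warning("Expected 4 fields but found 5")
  data.frame(path = path, filled = fill)
}

res <- read_retry_fill("iterations.txt", reader = stub_reader)
print(res)
```

Note that this sketch retries on any warning, not only on field-count warnings, so a real version would probably want to inspect the message first.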

> fread("iterations.txt", fill=TRUE)
     Iteration EstimatedLogLikelihood       Change TotalTimeSec V5
  1:         1              -66813745            -         12.3 NA
  2:         2               -3274352 63539392.790         23.6 NA
...
 99:        99               -1515355        1.227       1090.5 NA
100:       100               -1515354        1.222       1102.0 NA
     Iteration EstimatedLogLikelihood       Change TotalTimeSec V5
> 

Above, with fill=TRUE, I get all of the data that I get from read.table, plus an extra column of missing values, which is fine with me.
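
If the extra all-NA column is unwanted, it can be dropped after reading. The following is a minimal sketch in base R, assuming a small data.frame `filled` that mimics the first two rows of the fill=TRUE result above; `drop_all_na_cols` is a hypothetical helper, not part of data.table.

```r
# Stand-in for the fill=TRUE result (values copied from the printed output).
filled <- data.frame(
  Iteration = c(1L, 2L),
  EstimatedLogLikelihood = c(-66813745, -3274352),
  Change = c("-", "63539392.790"),
  TotalTimeSec = c(12.3, 23.6),
  V5 = c(NA, NA)
)

# Hypothetical helper: keep only columns with at least one non-NA value.
drop_all_na_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

cleaned <- drop_all_na_cols(filled)
print(names(cleaned))
```

The helper is written for a plain data.frame; calling fread with data.table = FALSE returns one.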

This was with data.table 1.10.5.

> devtools::session_info()
Session info -------------------------------------------------------------------
 setting  value                       
 version  R version 3.3.3 (2017-03-06)
 system   i686, linux-gnu             
 ui       X11                         
 language en_US                       
 collate  en_US.UTF-8                 
 tz       posixrules                  
 date     2017-05-25                  

Packages -----------------------------------------------------------------------
 package       * version    date       source                               
 bitops          1.0-6      2013-08-17 CRAN (R 3.2.2)                       
 caTools         1.17.1     2014-09-10 CRAN (R 3.2.2)                       
 data.table    * 1.10.5     2017-04-21 local                                
 devtools        1.12.0     2016-12-05 CRAN (R 3.3.3)                       
 digest          0.6.12     2017-01-27 cran (@0.6.12)                       
 lattice       * 0.20-34    2016-09-06 CRAN (R 3.3.3)                       
 memoise         1.0.0      2016-01-29 CRAN (R 3.2.2)                       
 namedCapture  * 2017.01.15 2017-04-29 Github (tdhock/namedCapture@1da425b) 
 RColorBrewer  * 1.1-2      2014-12-07 CRAN (R 3.2.2)                       
 RCurl         * 1.96-0     2016-08-07 local                                
 requireGitHub   2017.03.16 2017-04-29 Github (tdhock/requireGitHub@5de2020)
 RJSONIO       * 1.3-0      2014-07-28 CRAN (R 3.2.2)                       
 RSelenium     * 1.3.6      2016-11-09 Github (ropensci/RSelenium@22f06b9)  
 withr           1.0.1      2016-02-04 CRAN (R 3.2.2)                       
 XML           * 3.99-0     2016-08-07 local                                
> 

@mattdowle mattdowle added this to the v1.10.6 milestone Jan 22, 2018
mattdowle added a commit that referenced this issue Jan 22, 2018
@mattdowle mattdowle modified the milestones: v1.11.0, v1.11.2 Apr 29, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 11, 2019
@tdhock
Member Author

tdhock commented Jul 27, 2019

This issue is still active, i.e. I am still getting an uninformative message. It would be much more helpful if the message told the user to try fill=TRUE.

> fread("iterations.txt")
    Iteration EstimatedLogLikelihood Change TotalTimeSec
 1:         1              -66813745      -         12.3
 Warning message:
 In fread("iterations.txt") :
   Stopped early on line 3. Expected 4 fields but found 5. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<2     -3274351.940    63539392.790    23.6    >>
 > fread("iterations.txt", fill=TRUE)
      Iteration EstimatedLogLikelihood       Change TotalTimeSec V5
   1:         1              -66813745            -         12.3 NA
   2:         2               -3274352 63539392.790         23.6 NA
...
  99:        99               -1515355        1.227       1090.5 NA
 100:       100               -1515354        1.222       1102.0 NA
      Iteration EstimatedLogLikelihood       Change TotalTimeSec V5
 > sessionInfo()
 R version 3.5.0 (2018-04-23)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: CentOS Linux 7 (Core)

 Matrix products: default
 BLAS: /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/r/3.5.0/lib64/R/lib/libR.so
 LAPACK: /cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/r/3.5.0/lib64/R/modules/lapack.so
 locale:
  [1] LC_CTYPE=fr_CA.UTF-8       LC_NUMERIC=C               LC_TIME=fr_CA.UTF-8
  [4] LC_COLLATE=fr_CA.UTF-8     LC_MONETARY=fr_CA.UTF-8    LC_MESSAGES=fr_CA.UTF-8
  [7] LC_PAPER=fr_CA.UTF-8       LC_NAME=C                  LC_ADDRESS=C
 [10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_CA.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] data.table_1.12.3       namedCapture_2019.02.21

 loaded via a namespace (and not attached):
  [1] pak_0.1.2         ps_1.3.0          prettyunits_1.0.2 withr_2.1.2       crayon_1.3.4
  [6] rprojroot_1.3-2   assertthat_0.2.1  R6_2.4.0          backports_1.1.4   magrittr_1.5
 [11] cli_1.1.0         curl_4.0          remotes_2.1.0     callr_3.3.1       tools_3.5.0
 [16] compiler_3.5.0    processx_3.4.1    pkgbuild_1.0.3
 > 

@tdhock
Member Author

tdhock commented Jul 27, 2019

Actually, never mind: the message does say fill=TRUE now, so I'm closing this.

@tdhock tdhock closed this as completed Jul 27, 2019