May be linked to #2243, but I'm using 1.12.2 which should presumably be free of that bug.
I'm trying to use the first 20 lines of a file to determine what style of a non-canonical csv I've encountered, but fread seems to be trying to gobble up the whole thing.
When reading a largish file (522 MB) with fread, the call shown under "Minimal reproducible example" hangs (R becomes unresponsive and cannot be killed from within RStudio).
The hang was also seen with fewer fread parameters supplied, e.g. fc <- fread(file = fileName, nrows = 20, blank.lines.skip = TRUE), and it was allowed to run overnight.
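For judging the style of the file from its head alone, the first 20 lines can be inspected with base R, without involving fread at all. The following is only a minimal sketch: `sniff_head` is a hypothetical helper name (not part of data.table), it assumes a comma as the candidate separator, and it splits naively without honouring quotes.

```r
# Hypothetical helper (not part of data.table): summarise the first n lines
# of a file so the CSV "style" can be judged before parsing it.
sniff_head <- function(path, n = 20L, sep = ",") {
  # readLines() stops after n lines, so the rest of the file is never read.
  lines <- readLines(path, n = n, warn = FALSE)
  data.frame(
    line     = seq_along(lines),
    # Naive split: ignores quoting; an empty line yields 0 fields and a
    # trailing separator does not add an extra (empty) field.
    n_fields = lengths(strsplit(lines, sep, fixed = TRUE)),
    blank    = !nzchar(trimws(lines)),
    stringsAsFactors = FALSE
  )
}
```

Calling `sniff_head(fileName)` would then show, per line, how many comma-separated fields there are and which lines are blank, which may be enough to tell a preamble such as `Site Name: ATS,11,30` apart from the data rows.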
Verbose output:
omp_get_num_procs()==8
R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
R_DATATABLE_NUM_THREADS==""
omp_get_thread_limit()==2147483647
omp_get_max_threads()==8
OMP_THREAD_LIMIT==""
OMP_NUM_THREADS==""
data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
RestoreAfterFork==true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 4 threads (omp_get_max_threads()=8, nth=4)
NAstrings = [<>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as integer
[02] Opening the file
Opening file ./SR18154_190430_102715.csv
File opened, size = 522MB (547386368 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Site Name: ATS,11,30>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep automatically ...
sep=',' with 3 fields using quote rule 0
sep=' ' with 6 fields using quote rule 0
Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<Site Name: ATS,11,30>>
Quote rule picked = 0
fill=true and the most number of columns found is 6
[07] Detect column types, good nrow estimate and whether first row is column names
Number of sampling jump points = 100 because nrow limit (20) supplied
Type codes (jump 000) : AAAAAA Quote rule 0
'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
=====
Sampled 21 rows (handled \n inside quoted fields) at 1 jump points
Bytes from first data row on line 1 to the end of last row: 547386138
Line length: mean=4.86 sd=12.64 min=0 max=42
Estimated number of rows: 547386138 / 4.86 = 112697147
Initial alloc = 123966861 rows (112697147 + 9%) using bytes/max(mean-2sd,min) clamped between [1.1estn, 2.0*estn]
=====
Alloc limited to lower nrows=20 passed in.
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 0 drop user overrides : AAAAAA
[10] Allocate memory for the datatable
Allocating 6 column slots (6 - 0 dropped) with 20 rows
[11] Read the data
jumps=[0..1), chunk_size=1048576, total_size=547386160
Note that step [11] never prints the expected "Read 20 rows x 6 columns xx.xxMB (xx bytes) from file in xx:xx.xxx wall clock time" line, and step [12] ("finalizing the datatable") never appears.
data.table::fread(paste(readr::read_lines("SR18154_190430_102715.csv",n_max=20),collapse ='\n'), sep=',', blank.lines.skip=TRUE, fill=TRUE) is a work-around for now
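The same work-around can be sketched without readr, using base R's `readLines()` and fread's `text` argument. This assumes the installed data.table accepts `text=` (recent versions such as 1.12.2 should); `read_head` is a hypothetical helper name, and `header = FALSE` is an assumption made here because the first line of this file is a preamble rather than column names.

```r
library(data.table)

# Hypothetical helper: parse only the first n lines of a file.
# Only those n lines are ever handed to fread, so the remainder of a large
# file is never scanned.
read_head <- function(path, n = 20L) {
  first_lines <- readLines(path, n = n, warn = FALSE)
  fread(text = first_lines, sep = ",", header = FALSE,
        fill = TRUE, blank.lines.skip = TRUE)
}
```

Usage would be along the lines of `fc <- read_head("SR18154_190430_102715.csv")`. The design choice is simply to cut the input down before fread sees it, instead of relying on `nrows` to stop the read early.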
The file was compressed with the cmix algorithm, so it is not truly a zip file. Neither standard Windows .zip nor .7z at maximum compression could shrink the file enough for GitHub to allow it as an attachment.
>cmix.exe -c SR18154_190430_102715.csv SR18154_190430_102715.cmixz
547386368 bytes -> 24316983 bytes in 1116045.37 s.
cross entropy: 0.355
cmix downloaded from http://www.byronknoll.com/cmix-v17-windows.zip (also available at GitHub link above)
I chose cmix because its compression ratio is generally much better than gz, bz2, lrzip, etc., but it takes an eternity to run (13 days, using a quad-core i7 desktop with SSD and 16 GB RAM).
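As a quick sanity check on those figures, the numbers printed by cmix can be turned into a compression ratio, a bits-per-input-byte figure, and a run time in days. This is only arithmetic on the values quoted in the cmix output; reading the reported "cross entropy" as compressed bits per input byte is an assumption, although the numbers appear to agree.

```r
# Figures copied from the cmix output quoted in this report.
original_bytes   <- 547386368
compressed_bytes <- 24316983
elapsed_seconds  <- 1116045.37

compression_ratio <- original_bytes / compressed_bytes      # input size / output size
bits_per_byte     <- 8 * compressed_bytes / original_bytes  # assumed meaning of "cross entropy"
elapsed_days      <- elapsed_seconds / 86400                # 86400 seconds per day

cat(sprintf("ratio: %.1f, bits per input byte: %.3f, days: %.1f\n",
            compression_ratio, bits_per_byte, elapsed_days))
```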
To decompress:
cmix.exe -d SR18154_190430_102715.cmix.zip SR18154_190430_102715.csv
I have no idea how the decompression speed compares to the compression speed. I've just started running the decompression on this end to test speed relative to compression. Edit: at current pace, it will take approximately 11 days on the same setup. So, on the order of the same pace as compression.
# Minimal reproducible example
fc <- fread(file = fileName, nrows = 20, blank.lines.skip = TRUE, fill=TRUE, check.names=TRUE, verbose=TRUE, showProgress=interactive())
# Output of sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.4.0 dplyr_0.7.8 purrr_0.2.5 readr_1.3.0 tidyr_0.8.2
[9] tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1 tools_3.5.0 jsonlite_1.6 nlme_3.1-137
[10] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 rlang_0.3.4 cli_1.0.1 rstudioapi_0.10 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[19] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2 grid_3.5.0 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0 readxl_1.3.1
[28] modelr_0.1.2 magrittr_1.5 backports_1.1.4 scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.1
[37] munsell_0.5.0 broom_0.5.1 crayon_1.3.4
Using R version 3.5.0 (2018-04-23); RStudio Version 1.2.1335