
fread hangs for large file #3634

Open
MPagel opened this issue Jun 7, 2019 · 2 comments

MPagel commented Jun 7, 2019

# Minimal reproducible example

May be linked to #2243, but I'm using 1.12.2, which should presumably be free of that bug.

I'm trying to use the first 20 lines of a file to determine which style of non-canonical CSV I've encountered, but fread seems to be trying to gobble up the whole thing.

When freading a largish file (522 MB), the following call hangs (it renders R unresponsive and not killable from within RStudio):

```r
fc <- fread(file = fileName, nrows = 20, blank.lines.skip = TRUE, fill = TRUE,
            check.names = TRUE, verbose = TRUE, showProgress = interactive())
```

The hang was also seen with fewer fread parameters supplied, e.g.

```r
fc <- fread(file = fileName, nrows = 20, blank.lines.skip = TRUE)
```

and was left to run overnight without completing.
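On a system with a `head` utility on the PATH (not stock Windows, where this was observed, though Rtools or Git for Windows provide one), fread's `cmd=` argument may be a way to sidestep mapping the whole file. A minimal sketch, untested against this particular file:

```r
library(data.table)

# Pipe only the first 20 lines into fread so the 522 MB file is never
# memory-mapped in full. Assumes a 'head' binary is available on the PATH.
fc <- fread(cmd = "head -n 20 SR18154_190430_102715.csv",
            sep = ",", blank.lines.skip = TRUE, fill = TRUE)
```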

Verbose output:

```
omp_get_num_procs()==8
R_DATATABLE_NUM_PROCS_PERCENT=="" (default 50)
R_DATATABLE_NUM_THREADS==""
omp_get_thread_limit()==2147483647
omp_get_max_threads()==8
OMP_THREAD_LIMIT==""
OMP_NUM_THREADS==""
data.table is using 4 threads. This is set on startup, and by setDTthreads(). See ?setDTthreads.
RestoreAfterFork==true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 4 threads (omp_get_max_threads()=8, nth=4)
  NAstrings = [<>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ./SR18154_190430_102715.csv
  File opened, size = 522MB (547386368 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<Site Name: ATS,11,30>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=','  with 3 fields using quote rule 0
  sep=' '  with 6 fields using quote rule 0
  Detected 3 columns on line 1. This line is either column names or first data row. Line starts as: <<Site Name: ATS,11,30>>
  Quote rule picked = 0
  fill=true and the most number of columns found is 6
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because nrow limit (20) supplied
  Type codes (jump 000) : AAAAAA  Quote rule 0
  'header' determined to be false because there are some number columns and those columns do not have a string field at the top of them
  =====
  Sampled 21 rows (handled \n inside quoted fields) at 1 jump points
  Bytes from first data row on line 1 to the end of last row: 547386138
  Line length: mean=4.86 sd=12.64 min=0 max=42
  Estimated number of rows: 547386138 / 4.86 = 112697147
  Initial alloc = 123966861 rows (112697147 + 9%) using bytes/max(mean-2sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
  Alloc limited to lower nrows=20 passed in.
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 0 drop user overrides : AAAAAA
[10] Allocate memory for the datatable
  Allocating 6 column slots (6 - 0 dropped) with 20 rows
[11] Read the data
  jumps=[0..1), chunk_size=1048576, total_size=547386160
```

Note that there is no "Read 20 rows x 6 columns xx.xxMB (xx bytes) from file in xx:xx.xxx wall clock time" line for step [11], and step [12] ("finalizing the datatable") never appears.
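For anyone without the 522 MB original (a compressed copy is attached in a comment below), the following sketch generates a file of roughly the shape implied by the verbose output above: short ragged rows of 3-6 fields with a mean line length near 5 bytes. The row count and values are made up, and whether a file this small reproduces the hang is untested:

```r
library(data.table)

# Generate a ragged CSV resembling the sampled shape reported by fread:
# 3-6 comma-separated fields per row, very short lines. The original file
# had an estimated ~112M rows; n here is far smaller.
tmp <- tempfile(fileext = ".csv")
set.seed(1)
n <- 1e6
widths <- sample(3:6, n, replace = TRUE)
rows <- vapply(widths, function(w)
  paste(sample(0:9, w, replace = TRUE), collapse = ","), character(1))
writeLines(c("Site Name: ATS,11,30", rows), tmp)

# The call pattern that hung for the reporter:
fc <- fread(file = tmp, nrows = 20, blank.lines.skip = TRUE, fill = TRUE)
```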

# Output of sessionInfo()

```
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8.1 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.2 lubridate_1.7.4 forcats_0.3.0 stringr_1.4.0 dplyr_0.7.8 purrr_0.2.5 readr_1.3.0 tidyr_0.8.2
[9] tibble_2.1.1 ggplot2_3.1.0 tidyverse_1.2.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 cellranger_1.1.0 pillar_1.3.1 compiler_3.5.0 plyr_1.8.4 bindr_0.1.1 tools_3.5.0 jsonlite_1.6 nlme_3.1-137
[10] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 rlang_0.3.4 cli_1.0.1 rstudioapi_0.10 haven_2.0.0 bindrcpp_0.2.2 withr_2.1.2
[19] xml2_1.2.0 httr_1.4.0 generics_0.0.2 hms_0.4.2 grid_3.5.0 tidyselect_0.2.5 glue_1.3.0 R6_2.3.0 readxl_1.3.1
[28] modelr_0.1.2 magrittr_1.5 backports_1.1.4 scales_1.0.0 rvest_0.3.2 assertthat_0.2.0 colorspace_1.4-1 stringi_1.4.3 lazyeval_0.2.1
[37] munsell_0.5.0 broom_0.5.1 crayon_1.3.4
```

Using R version 3.5.0 (2018-04-23); RStudio Version 1.2.1335


MPagel commented Jun 7, 2019

The following is a workaround for now:

```r
data.table::fread(paste(readr::read_lines("SR18154_190430_102715.csv", n_max = 20),
                        collapse = '\n'),
                  sep = ',', blank.lines.skip = TRUE, fill = TRUE)
```
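A dependency-free variant of the same idea is sketched below; it relies on fread's `text=` argument (present in 1.12.2) and on `readLines` stopping after n lines regardless of file size:

```r
# Read only the first 20 lines with base R, then hand the text to fread.
head_txt <- paste(readLines("SR18154_190430_102715.csv", n = 20), collapse = "\n")
fc <- data.table::fread(text = head_txt, sep = ",",
                        blank.lines.skip = TRUE, fill = TRUE)
```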


MPagel commented Jul 11, 2019

Uploading a compressed version of the file that was problematic.

SR18154_190430_102715.cmix.zip

The file was compressed with the cmix algorithm (so it is not truly a zip file; neither standard Windows .zip nor .7z maximum compression could shrink the file enough for GitHub to allow attachment).

zip file size: 83 MiB
7z file size: 38 MiB
cmix file size: 24 MiB

Console output from cmix:

```
>cmix.exe -c SR18154_190430_102715.csv SR18154_190430_102715.cmixz
547386368 bytes -> 24316983 bytes in 1116045.37 s.
cross entropy: 0.355
```

cmix was downloaded from http://www.byronknoll.com/cmix-v17-windows.zip (also available at the GitHub link above).
I chose cmix because its compression ratio is generally much better than gz, bz2, lrzip, etc., but it takes an eternity to run: 13 days on a quad-core i7 desktop with an SSD and 16 GB RAM.

To decompress:

```
cmix.exe -d SR18154_190430_102715.cmix.zip SR18154_190430_102715.csv
```

I have no idea how the decompression speed compares to the compression speed; I've just started running the decompression on this end to test. Edit: at the current pace it will take approximately 11 days on the same setup, so roughly the same pace as compression.
