
[R-Forge #4931] Support file connections for fread #561

Open
Tracked by #3189
arunsrinivasan opened this issue Jun 8, 2014 · 17 comments
Labels
feature request · fread · top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: Chris Neff; Assigned to: Nobody; R-Forge link

I use a corporate internal networked file system for much of my data, so I often need to call read.csv with a file connection. fread doesn't support this yet.

Namely I would like the following to work:

f = file("~/path/to/file.csv")

dt = fread(f)
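
In the meantime, a possible workaround (a sketch, assuming a data.table version recent enough to have fread's text= argument) is to read the connection into memory first and pass the lines to fread; note this loads the whole file at once, so it doesn't help the chunked-read use cases discussed further down.

f = file("~/path/to/file.csv")
dt = fread(text = readLines(f))
close(f)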

@xiaodaigh

This would be a pretty awesome feature that I am after as well!!

@susannasupalla

I agree, this would be great.

@zachmayer

👍 I need this!

@fredguinog

An interesting use case for this feature would be reading chunks from the CSV file and passing them to workers that compute additive and/or semi-additive metrics. I have a working example using read.table (read.csv2) to deal with a CSV file that doesn't fit in the available memory (assuming all fields are needed in the process). However, it is still not possible to use all workers efficiently given the slow nature of read.table. I have great expectations for fread with a file connection as input.

library(doSNOW)
library(data.table)
library(iterators)
library(parallel)

# (SOCK - Windows - mem copy-on-call - slower)
# (FORK - Linux - mem copy-on-write - faster)
cl <- makeCluster(detectCores(logical=FALSE), type="SOCK")
registerDoSNOW(cl)

chunkSize = 250000
conn = file("FILE_BIGGER_THAN_AVAILABLE_MEMORY.csv", "r")
header = scan(conn, what=character(), sep=';', nlines=1)

it <- iter(function() {
  tryCatch({
    # EXCELLENT OPPORTUNITY TO TEST FREAD'S FILE CONNECTION FEATURE
    chunk = read.csv2(file=conn, header=FALSE, nrows=chunkSize)
    colnames(chunk) = header
    setDT(chunk)
    return(chunk)
  }, error=function(e) {
    # READ.TABLE THROWS ERRORS WHEN A READ IS MADE AFTER EOF
    stop("StopIteration", call. = FALSE)
  })
})

somefun <- function(dt) {
  aggreg = dt[,
    list(Obs=.N),
    by=list(CATEGORICAL_VARIABLE_A)
  ]
  return(aggreg)
}

allaggreg <- foreach(slice=it, .packages='data.table', .combine='rbind', .inorder=FALSE) %dopar% {
  somefun(slice)
}

setkey(allaggreg, CATEGORICAL_VARIABLE_A)
finalaggreg = allaggreg[,
  list(Obs=sum(Obs, na.rm=TRUE)),
  by=list(CATEGORICAL_VARIABLE_A)
]

close(conn)
stopCluster(cl)

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 8, 2016
@mauriciocramos

+1. I wish I could fread(bzfile("file.csv")).

@clarkfitzg

+1. I'd like to process stdin using data.table with Apache Hive / Hadoop. Here's what I currently do:

stream_in = file("stdin")
open(stream_in)

queue = read.table(stream_in, nrows = rows_per_chunk, colClasses = input_classes
    , col.names = input_cols, na.strings = "\\N")

# Then incrementally refresh and process queue

@malcook

malcook commented Mar 18, 2018

@mauriciocramos - does this not work for your case:

fread("bunzip2 file.csv")

or

fread(sprintf("bunzip2 -c %s", "file.csv"))
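
In more recent data.table versions the same idea can be expressed explicitly with the cmd= argument rather than relying on fread to detect a shell command (a sketch, assuming file.csv is bzip2-compressed):

fread(cmd = "bunzip2 -c file.csv")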

@louvetg

louvetg commented Apr 20, 2018

I would like this too!

I have a huge CSV file (about 7 GB) and little RAM (about 8 GB). I read the file in chunks, using a for loop with the skip and nrows parameters, to extract some features. While this is very efficient at the beginning, it becomes very slow towards the end, because each chunk is found by scanning from the start of the file. I would like to remember where I was in the file rather than searching for the start of each chunk from the beginning every time.

I think using a connection could help.

I hope that was clear; thank you in advance.
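
A minimal sketch of the skip/nrows pattern described above (the file name, chunk size and feature extraction are hypothetical); because skip= has to re-scan the file from the start, chunk i costs roughly i times as much as chunk 1, which is why the later chunks are so slow:

library(data.table)
chunk_size = 1e6
i = 0
repeat {
  chunk = tryCatch(
    fread("huge.csv", skip = 1 + i * chunk_size, nrows = chunk_size, header = FALSE),
    error = function(e) NULL)  # fread errors once skip points past the end of the file
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... extract features from 'chunk' here ...
  i = i + 1
}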

@clarkfitzg

@st-pasha made several relevant points to this issue in #1721, for example:

Generally, the fread algorithm likes to see the whole file in order to properly detect types, number of columns, etc. Also, it sometimes needs several passes to get the result right.

Based on those points, I actually don't think data.table needs to support general file connections or chunking. Sure, it would be convenient, but probably not worth the future trouble. There are already several existing solutions:

  • Dump to a temporary file.
  • Basic read.table is actually reasonably efficient if we can specify column types and number of rows ahead of time.
  • The iotools package already offers fast stream-processing reads.

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@mattdowle mattdowle added this to the 1.11.6 milestone Aug 23, 2018
@mattdowle mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@mattdowle mattdowle removed this from the 1.12.2 milestone Jan 14, 2019
@MichaelChirico
Member

a note to explore after implementing: use textConnection to handle input like

fread('a,b,c
1,2,3
4,5,6')

instead of writing it to disk, as is done now, if I'm not mistaken.

@MichaelChirico
Member

Curious whether it would be enough for this FR if fread simply handled the logic of spilling the connection to disk and then reading it.

At a glance I think that implementation wouldn't satisfy the "chunked read" use case; am I missing anything else?
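
A minimal sketch of that spill-to-disk approach, assuming an arbitrary readable connection (the helper name and 8 MB block size are hypothetical):

fread_connection <- function(con, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  if (!isOpen(con)) {
    open(con, "rb")
    on.exit(close(con), add = TRUE)
  }
  out <- file(tmp, "wb")
  # stream the connection to a temporary file in 8 MB blocks
  while (length(bytes <- readBin(con, "raw", 8e6)) > 0) {
    writeBin(bytes, out)
  }
  close(out)
  data.table::fread(tmp, ...)
}

This would make e.g. fread_connection(bzfile("file.csv")) work, at the cost of one extra pass through the disk.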

@rfaelens

@MichaelChirico This would not satisfy my use case. The whole point of using large bzip'd files is precisely to avoid spilling to (slow) disk.

If fread indeed needs multiple passes and seeks, then either it should use seek() and the like, or this FR should be closed as WONTFIX.

@ieiwk

ieiwk commented Mar 5, 2021

I hope the feature would enable reading a few lines of the file at a time and doing something with those lines much faster than is possible with readLines.

@MichaelChirico
Member

reading a few lines of file at a time

If it's just a few lines, readLines should be fine (especially using n= argument and/or passing a connection rather than a file name to get incremental reads)... could you elaborate your use case / why readLines won't suffice?
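
A minimal sketch of the incremental pattern being suggested (the file name and n= are hypothetical); because the connection stays open, each readLines call continues where the previous one stopped instead of re-reading the file:

con <- file("big.txt", "r")
repeat {
  lines <- readLines(con, n = 10000L)
  if (length(lines) == 0L) break
  # ... process this batch of lines ...
}
close(con)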

@ieiwk

ieiwk commented Mar 5, 2021

reading a few lines of file at a time

If it's just a few lines, readLines should be fine (especially using n= argument and/or passing a connection rather than a file name to get incremental reads)... could you elaborate your use case / why readLines won't suffice?

For example, I want to go through a gz file, which is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz and is more than 2 GB. If I use readLines to read it from pipe('gzip -cd gene2accession.gz', 'r'), the bottleneck is R: the gzip process uses less than 10% of one CPU thread. For comparison, on the same task Perl uses ~80% CPU, while pigz uses ~170% CPU.

@ieiwk

ieiwk commented Mar 5, 2021

Or, could fread be made able to process a raw vector vec1 containing file content read by readBin(con1, 'raw', 1e6)? readBin is super fast, and one can quickly find the line separators via vec1 == charToRaw('\n'). The last line is probably incomplete, and that incomplete line can be cut off and prepended to the next readBin output.
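
A minimal sketch of that readBin-based chunking (the connection, file name and block size are hypothetical); the incomplete last line of each block is carried over and prepended to the next block:

con1 <- file("big.csv", "rb")
carry <- raw(0)
newline <- charToRaw("\n")
repeat {
  block <- readBin(con1, "raw", 1e6)
  if (length(block) == 0) break
  buf <- c(carry, block)
  nl <- which(buf == newline)
  if (length(nl) == 0) { carry <- buf; next }   # no complete line yet; keep accumulating
  cut <- nl[length(nl)]                         # end of the last complete line
  carry <- if (cut < length(buf)) buf[(cut + 1):length(buf)] else raw(0)
  lines <- strsplit(rawToChar(buf[1:cut]), "\n", fixed = TRUE)[[1]]
  # ... process 'lines', e.g. fread(text = lines) ...
}
close(con1)
# anything left in 'carry' at this point is a final unterminated line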

@greg-minshall

+1. This would be great for fwrite also. Cheers.
