
[R-Forge #4931] Support file connections for fread #561

Open
Tracked by #3189
arunsrinivasan opened this issue Jun 8, 2014 · 17 comments
Labels
feature request · fread · top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: Chris Neff; Assigned to: Nobody; R-Forge link

I use a corporate internal networked file system for much of my data, so I often need to call read.csv with a file connection. fread doesn't support this yet.

Namely I would like the following to work:

f = file("~/path/to/file.csv")

dt = fread(f)
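
In the meantime, a possible workaround (a sketch, assuming a data.table version recent enough to have fread's text= argument) is to read the connection into memory first and pass the lines to fread; note this loads the whole file at once, so it doesn't help the chunked-read use cases discussed further down.

f = file("~/path/to/file.csv")
dt = fread(text = readLines(f))
close(f)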

@xiaodaigh

This would be a pretty awesome feature that I am after as well!!

@susannasupalla

I agree, this would be great.

@zachmayer

👍 I need this!

@fredguinog

An interesting use case for this feature would be reading chunks from the CSV file and passing them to workers that compute additive and/or semi-additive metrics. I have a working example using read.table (read.csv2) to deal with a CSV file that doesn't fit in the available memory (assuming all fields are needed in the process). However, it is still not possible to use all workers efficiently given the slow nature of read.table. I have great expectations for fread with a file connection as input.

library(doSNOW)
library(data.table)
library(iterators)
library(parallel)

# (SOCK - Windows - mem copy-on-call - slower)
# (FORK - Linux - mem copy-on-write - faster)
cl <- makeCluster(detectCores(logical=FALSE), type="SOCK")
registerDoSNOW(cl)

chunkSize = 250000
conn = file("FILE_BIGGER_THAN_AVAILABLE_MEMORY.csv", "r")
header = scan(conn, what=character(), sep=';', nlines=1)

it <- iter(function() {
  tryCatch({
    # EXCELLENT OPPORTUNITY TO TEST FREAD'S FILE CONNECTION FEATURE
    chunk = read.csv2(file=conn, header=FALSE, nrows=chunkSize)
    colnames(chunk) = header
    setDT(chunk)
    return(chunk)
  }, error=function(e) {
    # READ.TABLE THROWS ERRORS WHEN A READ IS MADE AFTER EOF
    stop("StopIteration", call. = FALSE)
  })
})

somefun <- function(dt) {
  aggreg = dt[,
    list(Obs=.N),
    by=list(CATEGORICAL_VARIABLE_A)
  ]
  return(aggreg)
}

allaggreg <- foreach(slice=it, .packages='data.table', .combine='rbind', .inorder=FALSE) %dopar% {
  somefun(slice)
}

setkey(allaggreg, CATEGORICAL_VARIABLE_A)
finalaggreg = allaggreg[,
  list(Obs=sum(Obs, na.rm=TRUE)),
  by=list(CATEGORICAL_VARIABLE_A)
]

close(conn)
stopCluster(cl)

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 8, 2016
@mauriciocramos

+1. I wish I could fread(bzfile("file.csv")).

@clarkfitzg

+1. I'd like to process stdin using data.table with Apache Hive / Hadoop. Here's what I currently do:

stream_in = file("stdin")
open(stream_in)

queue = read.table(stream_in, nrows = rows_per_chunk, colClasses = input_classes
    , col.names = input_cols, na.strings = "\\N")

# Then incrementally refresh and process queue

@malcook

malcook commented Mar 18, 2018

@mauriciocramos - does this not work for your case:

fread("bunzip2 file.csv")

or

fread(sprintf("bunzip2 -c %s", "file.csv"))
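
In more recent data.table versions the same idea can be expressed explicitly with the cmd= argument rather than relying on fread to detect a shell command (a sketch, assuming file.csv is bzip2-compressed):

fread(cmd = "bunzip2 -c file.csv")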

@louvetg

louvetg commented Apr 20, 2018

I would like this too!

I have a huge CSV file (about 7 GB) and little RAM (about 8 GB). I read the file in chunks, using a for loop with the skip and nrows parameters, to extract some features. While this is very efficient at the beginning, it becomes very slow towards the end, because each chunk is found by scanning from the start of the file. I would like to remember where I was in the file rather than searching for the start of each chunk from the beginning every time.

I think using a connection could help.

I hope that was clear; thank you in advance.
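
A minimal sketch of the skip/nrows pattern described above (the file name, chunk size and feature extraction are hypothetical); because skip= has to re-scan the file from the start, chunk i costs roughly i times as much as chunk 1, which is why the later chunks are so slow:

library(data.table)
chunk_size = 1e6
i = 0
repeat {
  chunk = tryCatch(
    fread("huge.csv", skip = 1 + i * chunk_size, nrows = chunk_size, header = FALSE),
    error = function(e) NULL)  # fread errors once skip points past the end of the file
  if (is.null(chunk) || nrow(chunk) == 0) break
  # ... extract features from 'chunk' here ...
  i = i + 1
}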

@clarkfitzg

@st-pasha made several relevant points to this issue in #1721, for example:

Generally, the fread algorithm likes to see the whole file in order to properly detect types, number of columns, etc. Also, it sometimes needs several passes to get the result right.

Based on those points, I actually don't think data.table needs to support general file connections or chunking. Sure, it would be convenient, but probably not worth the future trouble. There are already several existing solutions:

  • Dump to a temporary file.
  • Basic read.table is actually reasonably efficient if we can specify column types and number of rows ahead of time.
  • The iotools package already offers fast stream-processing reads.

@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@mattdowle mattdowle added this to the 1.11.6 milestone Aug 23, 2018
@mattdowle mattdowle modified the milestones: 1.11.6, 1.12.0 Sep 20, 2018
@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@mattdowle mattdowle removed this from the 1.12.2 milestone Jan 14, 2019
@MichaelChirico
Member

a note to explore after implementing: use textConnection to handle input like

fread('a,b,c
1,2,3
4,5,6')

instead of writing it to disk, as is done now, if I'm not mistaken.

@MichaelChirico
Member

Curious whether it would be enough for this FR if fread simply handled the logic of spilling the connection to disk and then reading it.

At a glance I think that implementation wouldn't satisfy the "chunked read" use case; am I missing anything else?
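
A minimal sketch of that spill-to-disk approach, assuming an arbitrary readable connection (the helper name and 8 MB block size are hypothetical):

fread_connection <- function(con, ...) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  if (!isOpen(con)) {
    open(con, "rb")
    on.exit(close(con), add = TRUE)
  }
  out <- file(tmp, "wb")
  # stream the connection to a temporary file in 8 MB blocks
  while (length(bytes <- readBin(con, "raw", 8e6)) > 0) {
    writeBin(bytes, out)
  }
  close(out)
  data.table::fread(tmp, ...)
}

This would make e.g. fread_connection(bzfile("file.csv")) work, at the cost of one extra pass through the disk.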

@rfaelens

@MichaelChirico This would not satisfy my use case. The whole point of using large bzip'd files is precisely to avoid spilling to (slow) disk.

If fread indeed needs multiple passes and seeks, then either it should use seek() and the like, or this FR should be closed as WONTFIX.

@ieiwk

ieiwk commented Mar 5, 2021

I hope the feature would enable reading a few lines of the file at a time and doing something with those lines much faster than is possible with readLines.

@MichaelChirico
Member

reading a few lines of file at a time

If it's just a few lines, readLines should be fine (especially using n= argument and/or passing a connection rather than a file name to get incremental reads)... could you elaborate your use case / why readLines won't suffice?
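
A minimal sketch of the incremental pattern being suggested (the file name and n= are hypothetical); because the connection stays open, each readLines call continues where the previous one stopped instead of re-reading the file:

con <- file("big.txt", "r")
repeat {
  lines <- readLines(con, n = 10000L)
  if (length(lines) == 0L) break
  # ... process this batch of lines ...
}
close(con)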

@ieiwk

ieiwk commented Mar 5, 2021

reading a few lines of file at a time

If it's just a few lines, readLines should be fine (especially using n= argument and/or passing a connection rather than a file name to get incremental reads)... could you elaborate your use case / why readLines won't suffice?

For example, I want to go through a gz file, which is from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz and is more than 2 GB. If I use readLines to read it from pipe('gzip -cd gene2accession.gz', 'r'), the bottleneck is R: the gzip process uses less than 10% of one CPU thread. For comparison, on the same task Perl uses ~80% CPU, while pigz uses ~170% CPU.

@ieiwk

ieiwk commented Mar 5, 2021

Or, could fread be made able to process a raw vector vec1 containing file content read by readBin(con1, 'raw', 1e6)? readBin is super fast, and one can quickly find the line separators via vec1 == charToRaw('\n'). The last line is probably incomplete, and that incomplete line can be cut off and prepended to the next readBin output.
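
A minimal sketch of that readBin-based chunking (the connection, file name and block size are hypothetical); the incomplete last line of each block is carried over and prepended to the next block:

con1 <- file("big.csv", "rb")
carry <- raw(0)
newline <- charToRaw("\n")
repeat {
  block <- readBin(con1, "raw", 1e6)
  if (length(block) == 0) break
  buf <- c(carry, block)
  nl <- which(buf == newline)
  if (length(nl) == 0) { carry <- buf; next }   # no complete line yet; keep accumulating
  cut <- nl[length(nl)]                         # end of the last complete line
  carry <- if (cut < length(buf)) buf[(cut + 1):length(buf)] else raw(0)
  lines <- strsplit(rawToChar(buf[1:cut]), "\n", fixed = TRUE)[[1]]
  # ... process 'lines', e.g. fread(text = lines) ...
}
close(con1)
# anything left in 'carry' at this point is a final unterminated line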

@greg-minshall

+1. This would be great for fwrite also. Cheers.
