
[R-Forge #2605] add filtering option to fread so it can load less than all rows #583

Open
Tracked by #3189
arunsrinivasan opened this issue Jun 8, 2014 · 18 comments
Labels
feature request, fread, top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: stat quant; Assigned to: Nobody; R-Forge link

Discussed on the data.table mailing list.

Proposed interface: fread(input, chunk.nrows=10000, chunk.filter = <anything acceptable as i in DT[i]>), where the filter could be grep() or any expression of column names.
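For illustration only, a rough sketch of how the requested chunked read-and-filter could be emulated today with repeated fread(skip=, nrows=) calls. chunk_fread is a hypothetical helper, not an existing or proposed data.table function, and a real implementation would also need to pin colClasses across chunks:

library(data.table)

chunk_fread <- function(input, chunk.nrows = 10000L, chunk.filter) {
  e <- substitute(chunk.filter)           # capture the i-style filter expression
  cols <- names(fread(input, nrows = 0L)) # column names from the header
  out <- list()
  skip <- 1L                              # data rows start after the header line
  repeat {
    chunk <- tryCatch(
      fread(input, skip = skip, nrows = chunk.nrows,
            header = FALSE, col.names = cols),
      error = function(err) data.table()) # treat reading past EOF as "no more rows"
    if (nrow(chunk) == 0L) break
    keep <- which(eval(e, envir = chunk, enclos = parent.frame()))
    if (length(keep)) out[[length(out) + 1L]] <- chunk[keep]
    skip <- skip + nrow(chunk)
  }
  rbindlist(out)
}

# e.g. chunk_fread("input.csv", chunk.nrows = 1e5, chunk.filter = Value >= 1.3)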

@MichaelChirico
Member

MichaelChirico commented Jun 17, 2015

I'm pretty sure this is the same as what I had in mind recently, but let me elaborate with an example:

library(data.table)
# a large table on disk...
read_dt = data.table(
  id  = sample(10, 1e7, TRUE),
  var = rnorm(1e7)
)
fwrite(read_dt, file = "dt_to_read.csv")
# ...and a smaller in-memory table that only uses a subset of those ids
main_dt = data.table(
  id   = sample(8, 1e5, TRUE),
  var2 = rnorm(1e5)
)

I'm working with main_dt but want to pull in some matching (based on id) info from read_dt; currently, I need to do something like this:

relevant_read_dt = fread("dt_to_read.csv")[id %in% main_dt[ , unique(id)]]

This is inefficient because I have to read all of read_dt (especially painful as the number of columns of read_dt grows), then immediately discard ~20% of its rows.

An approach like this:

relevant_read_dt = fread("dt_to_read.csv", row.select = id %in% main_dt[ , unique(id)])

would only require: 1) reading id from "dt_to_read.csv", 2) evaluating the logical expression id %in% main_dt[ , unique(id)] to get the row numbers to keep, and 3) fread-ing only those rows.
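For what it's worth, steps 1) and 2) can already be expressed today with select=; step 3) is the missing piece this request is about. A sketch under that reading (row.select is the hypothetical argument, not an existing one):

# steps 1) and 2): read only the id column and work out which rows we want
keep <- fread("dt_to_read.csv", select = "id")[, which(id %in% unique(main_dt$id))]
# step 3) is the hypothetical part: something like
#   relevant_read_dt <- fread("dt_to_read.csv", row.select = keep)
# is what fread cannot do yet.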

@skanskan

skanskan commented Jul 22, 2016

Hello.
Has the filtering been implemented yet?
For example, I want to read a very big csv file with 4 columns (Value, XXX, YYY, ZZZ) and keep only the lines where Value >= 1.3.

I could do it in two steps (first read the whole file, then filter), but that is slower and could fail if the file doesn't fit in memory.

I don't know if we are talking about the same thing or if I misunderstood it. Something like:
fread("file", Value >= 1.3)
Regards.
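For reference, a minimal sketch of how this particular filter can be pushed to the shell instead of fread. It assumes a Unix-like system with awk on the PATH, a data.table version with the cmd= argument, and that Value is the first column of a hypothetical file.csv:

library(data.table)
# keep the header (NR==1) plus rows whose first field is >= 1.3
res <- fread(cmd = "awk -F',' 'NR==1 || $1 >= 1.3' file.csv")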

@VinceLYO

VinceLYO commented Sep 6, 2017

Any update for those of us stuck on Windows? :D

Update: my bad, Cygwin works perfectly on Windows, as said above.
Good installation tutorial here:
Restart R, and you're good to go!

To avoid including the header as a data line while still getting the column names, you can write something like this:

library(data.table)
fichier = "iris.txt"

# keep the column names from the header
cols <- names(fread(fichier, nrows = 0L, sep = ","))

# load a random sample of the file, excluding the header
df <- fread(paste("tail -n+2", fichier, "| shuf -n 15"),
            sep = ",",
            header = FALSE,
            col.names = cols,
            colClasses = list(character = which(cols == "class"))) # set the classes of your columns

Thanks to @thoera for the help!

Regards.

@VinceLYO

UPDATE 2:

After some tests, I found that the solution I proposed doesn't actually work from R.
The command tail -n+2 fichier.txt | shuf -n 15 works in a shell console, but not in R through fread:
it returns the header as a data line (at a random position, of course).

This can be reproduced with the iris dataset and the following code:

setwd("path")
test <- fread("tail -n+2 IRIS.csv | shuf -n 149",
              sep = ",", header = FALSE)

You can also try sed 1d IRIS.csv | shuf -n 149: same result.

Does fread handle pipes and shell commands more complicated than a single instruction?

Thanks

Vincent.
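For anyone landing here later: a sketch of the same idea using the explicit cmd= argument available in newer data.table versions, reading the header separately so it cannot leak into the sample (assumes a Unix-like shell with tail and shuf available):

library(data.table)
cols <- names(fread("IRIS.csv", nrows = 0L))   # header only
test <- fread(cmd = "tail -n+2 IRIS.csv | shuf -n 149",
              sep = ",", header = FALSE, col.names = cols)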

@MichaelChirico
Member

To be updated

https://stackoverflow.com/questions/47172355

@jangorecki
Member

jangorecki commented Apr 15, 2020

For those who are bumping this issue, be sure to upvote the first post as well. AFAIK nobody is currently working on implementing this; if anyone would like to, we would be happy to assign them to this issue.
I will clean up this thread a little.


Regarding the FR itself: I don't think it makes sense to introduce a new mechanism for filtering csv files directly. That would be a lot of effort and maintenance, and grep already works pretty well. What could eventually be low-hanging fruit is to examine the filter expression and guess which columns it needs, read only those columns in full, apply the filter using the currently implemented algorithms with which=TRUE, and then re-read the csv keeping only the lines selected by which(). That could be implemented entirely in R (I'm not sure about skipping lines) and might not be very efficient, but it should reduce the peak memory required.
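To make that concrete, a rough sketch of such a two-pass approach. fread_filtered is a hypothetical helper, it assumes a Unix-like system with awk available, and the line-skipping in the second pass is delegated to awk since fread itself cannot skip arbitrary rows:

library(data.table)

fread_filtered <- function(file, filter_expr) {
  e <- substitute(filter_expr)
  # 1) guess which columns the filter needs and read only those in full
  cols_needed <- intersect(all.vars(e), names(fread(file, nrows = 0L)))
  filter_cols <- fread(file, select = cols_needed)
  # 2) evaluate the filter on those columns to get the data-row numbers to keep
  idx <- which(eval(e, envir = filter_cols, enclos = parent.frame()))
  if (length(idx) == 0L) return(fread(file, nrows = 0L))  # nothing matched
  # 3) re-read the file, keeping the header plus only the selected rows
  idx_file <- tempfile()
  writeLines(as.character(idx), idx_file)
  fread(cmd = sprintf(
    "awk 'NR==FNR { keep[$1]; next } FNR==1 || (FNR-1) in keep' %s %s",
    shQuote(idx_file), shQuote(file)))
}

# e.g. fread_filtered("file.csv", Value >= 1.3)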

@MichaelChirico
Member

See here: https://stackoverflow.com/a/62240442/3576984

grep / awk don't have the benefit of fread's automatic parallelism, so they can be quite slow compared to fread.

@MichaelChirico added the top request (One of our most-requested issues) label and removed the High label on Jun 7, 2020
@MichaelChirico
Member

"grep works pretty well"

One shortcoming of many shell + fread approaches that I don't see raised yet: the first row may be lost, so we won't get nice column names unless we're extra careful, e.g.

fwrite(as.data.table(mtcars, keep.rownames="name"), tmp <- tempfile())
fread(paste("grep -F 'Merc'", tmp))
#             V1   V2 V3    V4  V5   V6   V7   V8 V9 V10 V11 V12
# 1:   Merc 240D 24.4  4 146.7  62 3.69 3.19 20.0  1   0   4   2
# 2:    Merc 230 22.8  4 140.8  95 3.92 3.15 22.9  1   0   4   2
# 3:    Merc 280 19.2  6 167.6 123 3.92 3.44 18.3  1   0   4   4
# 4:   Merc 280C 17.8  6 167.6 123 3.92 3.44 18.9  1   0   4   4
# 5:  Merc 450SE 16.4  8 275.8 180 3.07 4.07 17.4  0   0   3   3
# 6:  Merc 450SL 17.3  8 275.8 180 3.07 3.73 17.6  0   0   3   3
# 7: Merc 450SLC 15.2  8 275.8 180 3.07 3.78 18.0  0   0   3   3

Would it be worth adding an argument to fread that would work around this somehow? That would surely require a lot less development work than filtering. Mostly a question of design.
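For reference, the manual workaround today is the nrows=0 trick from earlier in the thread; a sketch reusing tmp from the example above and assuming grep is available:

cols <- names(fread(tmp, nrows = 0L))
fread(cmd = paste("grep -F 'Merc'", tmp), header = FALSE, col.names = cols)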

@MichaelChirico
Member

MichaelChirico commented Feb 12, 2021

How about using col.names = "/path/to/file" for this?

The only overlap with current usage is for one-column files; it should be safe to check file.exists(col.names) to distinguish the two cases.

Related: #4029, #4686. fread with nrows=0 might be a nice way to implement this (otherwise rely on readLines or scan to get the first line...).
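Purely to illustrate the proposed dispatch (this is not existing fread behaviour), the check could look something like:

# hypothetical: inside fread(), before col.names is applied
if (is.character(col.names) && length(col.names) == 1L && file.exists(col.names)) {
  col.names <- names(fread(col.names, nrows = 0L))
}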

@luisvalenzuelar

luisvalenzuelar commented Feb 23, 2024

It's interesting that neither Python nor Stata nor other R functions like readr's read_csv have managed to include this option. In any case, benchmarking suggests that chunked or SQL-backed readers like read_csv_chunked or read.csv.sql do better than system-based approaches (grep/awk/etc.), which are in any case far from intuitive for most users. Maybe fread could allow for such pre-loading options, which might still be faster than subsetting after the fact, i.e. A[B].
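For comparison, a sketch of the chunked-reader route mentioned above using readr::read_csv_chunked with a filtering callback; the file name and Value column are assumptions for illustration:

library(readr)
# read in chunks, keeping only rows with Value >= 1.3 from each chunk
res <- read_csv_chunked(
  "file.csv",
  callback = DataFrameCallback$new(function(chunk, pos) subset(chunk, Value >= 1.3)),
  chunk_size = 100000
)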

@tdhock tdhock changed the title [R-Forge #2605] add filtering option to fread so it can load part of a file [R-Forge #2605] add filtering option to fread so it can load less than all rows Feb 23, 2024