
[R-Forge #2605] add filtering option to fread so it can load less than all rows #583

Open
Tracked by #3189
arunsrinivasan opened this issue Jun 8, 2014 · 18 comments
Labels
feature request, fread, top request (One of our most-requested issues)

Comments

@arunsrinivasan
Member

Submitted by: stat quant; Assigned to: Nobody; R-Forge link

Discussed on the data.table mailing list.

Proposed interface: fread(input, chunk.nrows=10000, chunk.filter = <anything acceptable as i in DT[i]>), where the filter could be grep() or any expression of column names.
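For illustration only, a rough sketch of how the requested chunked read-and-filter could be emulated today with repeated fread(skip=, nrows=) calls. chunk_fread is a hypothetical helper, not an existing or proposed data.table function, and a real implementation would also need to pin colClasses across chunks:

library(data.table)

chunk_fread <- function(input, chunk.nrows = 10000L, chunk.filter) {
  e <- substitute(chunk.filter)           # capture the i-style filter expression
  cols <- names(fread(input, nrows = 0L)) # column names from the header
  out <- list()
  skip <- 1L                              # data rows start after the header line
  repeat {
    chunk <- tryCatch(
      fread(input, skip = skip, nrows = chunk.nrows,
            header = FALSE, col.names = cols),
      error = function(err) data.table()) # treat reading past EOF as "no more rows"
    if (nrow(chunk) == 0L) break
    keep <- which(eval(e, envir = chunk, enclos = parent.frame()))
    if (length(keep)) out[[length(out) + 1L]] <- chunk[keep]
    skip <- skip + nrow(chunk)
  }
  rbindlist(out)
}

# e.g. chunk_fread("input.csv", chunk.nrows = 1e5, chunk.filter = Value >= 1.3)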

@MichaelChirico
Member

MichaelChirico commented Jun 17, 2015

I'm pretty sure this is the same as what I had in mind recently, but let me elaborate with an example:

library(data.table)
# a large table on disk...
read_dt = data.table(
  id  = sample(10, 1e7, TRUE),
  var = rnorm(1e7)
)
fwrite(read_dt, file = "dt_to_read.csv")
# ...and a smaller in-memory table that only uses a subset of those ids
main_dt = data.table(
  id   = sample(8, 1e5, TRUE),
  var2 = rnorm(1e5)
)

I'm working with main_dt but want to pull in some matching (based on id) info from read_dt; currently, I need to do something like this:

relevant_read_dt = fread("dt_to_read.csv")[id %in% main_dt[ , unique(id)]]

This is inefficient because I have to read all of read_dt (especially painful as the number of columns of read_dt grows), then immediately discard ~20% of its rows.

An approach like this:

relevant_read_dt = fread("dt_to_read.csv", row.select = id %in% main_dt[ , unique(id)])

would only require: 1) reading id from "dt_to_read.csv", 2) evaluating the logical expression id %in% main_dt[ , unique(id)] to get the row numbers to keep, and 3) fread-ing only those rows.
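For what it's worth, steps 1) and 2) can already be expressed today with select=; step 3) is the missing piece this request is about. A sketch under that reading (row.select is the hypothetical argument, not an existing one):

# steps 1) and 2): read only the id column and work out which rows we want
keep <- fread("dt_to_read.csv", select = "id")[, which(id %in% unique(main_dt$id))]
# step 3) is the hypothetical part: something like
#   relevant_read_dt <- fread("dt_to_read.csv", row.select = keep)
# is what fread cannot do yet.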

@skanskan

skanskan commented Jul 22, 2016

Hello.
Has the filtering been implemented yet?
For example, I want to read a very big csv file with 4 columns (Value, XXX, YYY, ZZZ) and keep only the lines where Value >= 1.3.

I could do it in two steps (first read the whole file, then filter), but that is slower and could fail if the file doesn't fit in memory.

I don't know if we are talking about the same thing or if I misunderstood it. Something like:
fread("file", Value >= 1.3)
Regards.
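For reference, a minimal sketch of how this particular filter can be pushed to the shell instead of fread. It assumes a Unix-like system with awk on the PATH, a data.table version with the cmd= argument, and that Value is the first column of a hypothetical file.csv:

library(data.table)
# keep the header (NR==1) plus rows whose first field is >= 1.3
res <- fread(cmd = "awk -F',' 'NR==1 || $1 >= 1.3' file.csv")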

@VinceLYO

VinceLYO commented Sep 6, 2017

Any update for those of us stuck on Windows? :D

Update: my bad, Cygwin works perfectly on Windows, as said above.
Good installation tutorial here:
Restart R, and you're good to go!

To avoid including the header as a data line while still getting the column names, you can write something like this:

library(data.table)
fichier = "iris.txt"

# keep the column names from the header
cols <- names(fread(fichier, nrows = 0L, sep = ","))

# load a random sample of the file, excluding the header
df <- fread(paste("tail -n+2", fichier, "| shuf -n 15"),
            sep = ",",
            header = FALSE,
            col.names = cols,
            colClasses = list(character = which(cols == "class"))) # set the classes of your columns

Thanks to @thoera for the help!

Regards.

@VinceLYO

UPDATE 2:

After some tests, I found that the solution I proposed doesn't actually work from R.
The command tail -n+2 fichier.txt | shuf -n 15 works in a shell console, but not in R through fread:
it returns the header as a data line (at a random position, of course).

This can be reproduced with the iris dataset and the following code:

setwd("path")
test <- fread("tail -n+2 IRIS.csv | shuf -n 149",
              sep = ",", header = FALSE)

You can also try sed 1d IRIS.csv | shuf -n 149: same result.

Does fread handle pipes and shell commands more complicated than a single instruction?

Thanks

Vincent.
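For anyone landing here later: a sketch of the same idea using the explicit cmd= argument available in newer data.table versions, reading the header separately so it cannot leak into the sample (assumes a Unix-like shell with tail and shuf available):

library(data.table)
cols <- names(fread("IRIS.csv", nrows = 0L))   # header only
test <- fread(cmd = "tail -n+2 IRIS.csv | shuf -n 149",
              sep = ",", header = FALSE, col.names = cols)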

@MichaelChirico
Member

To be updated

https://stackoverflow.com/questions/47172355

@jangorecki
Member

jangorecki commented Apr 15, 2020

For those who are bumping this issue, be sure to upvote the first post as well. AFAIK nobody is currently working on implementing this; if anyone would like to, we would be happy to assign them to this issue.
I will clean up this thread a little.


Regarding the FR itself: I don't think it makes sense to introduce a new mechanism for filtering csv files directly. That would be a lot of effort and maintenance, and grep already works pretty well. What could eventually be low-hanging fruit is to examine the filter expression and guess which columns it needs, read only those columns in full, apply the filter using the currently implemented algorithms with which=TRUE, and then re-read the csv keeping only the lines selected by which(). That could be implemented entirely in R (I'm not sure about skipping lines) and might not be very efficient, but it should reduce the peak memory required.
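To make that concrete, a rough sketch of such a two-pass approach. fread_filtered is a hypothetical helper, it assumes a Unix-like system with awk available, and the line-skipping in the second pass is delegated to awk since fread itself cannot skip arbitrary rows:

library(data.table)

fread_filtered <- function(file, filter_expr) {
  e <- substitute(filter_expr)
  # 1) guess which columns the filter needs and read only those in full
  cols_needed <- intersect(all.vars(e), names(fread(file, nrows = 0L)))
  filter_cols <- fread(file, select = cols_needed)
  # 2) evaluate the filter on those columns to get the data-row numbers to keep
  idx <- which(eval(e, envir = filter_cols, enclos = parent.frame()))
  if (length(idx) == 0L) return(fread(file, nrows = 0L))  # nothing matched
  # 3) re-read the file, keeping the header plus only the selected rows
  idx_file <- tempfile()
  writeLines(as.character(idx), idx_file)
  fread(cmd = sprintf(
    "awk 'NR==FNR { keep[$1]; next } FNR==1 || (FNR-1) in keep' %s %s",
    shQuote(idx_file), shQuote(file)))
}

# e.g. fread_filtered("file.csv", Value >= 1.3)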

@MichaelChirico
Member

See here: https://stackoverflow.com/a/62240442/3576984

grep / awk don't have the benefit of fread's automatic parallelism, so they can be quite slow compared to fread.

@MichaelChirico added the top request (One of our most-requested issues) label and removed the High label on Jun 7, 2020
@MichaelChirico
Member

"grep works pretty well"

One shortcoming of many shell + fread approaches that I don't see raised yet: the first row may be lost, so we won't get nice column names unless we're extra careful, e.g.

fwrite(as.data.table(mtcars, keep.rownames="name"), tmp <- tempfile())
fread(paste("grep -F 'Merc'", tmp))
#             V1   V2 V3    V4  V5   V6   V7   V8 V9 V10 V11 V12
# 1:   Merc 240D 24.4  4 146.7  62 3.69 3.19 20.0  1   0   4   2
# 2:    Merc 230 22.8  4 140.8  95 3.92 3.15 22.9  1   0   4   2
# 3:    Merc 280 19.2  6 167.6 123 3.92 3.44 18.3  1   0   4   4
# 4:   Merc 280C 17.8  6 167.6 123 3.92 3.44 18.9  1   0   4   4
# 5:  Merc 450SE 16.4  8 275.8 180 3.07 4.07 17.4  0   0   3   3
# 6:  Merc 450SL 17.3  8 275.8 180 3.07 3.73 17.6  0   0   3   3
# 7: Merc 450SLC 15.2  8 275.8 180 3.07 3.78 18.0  0   0   3   3

Would it be worth adding an argument to fread that would work around this somehow? That would surely require a lot less development work than filtering. Mostly a question of design.
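For reference, the manual workaround today is the nrows=0 trick from earlier in the thread; a sketch reusing tmp from the example above and assuming grep is available:

cols <- names(fread(tmp, nrows = 0L))
fread(cmd = paste("grep -F 'Merc'", tmp), header = FALSE, col.names = cols)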

@MichaelChirico
Member

MichaelChirico commented Feb 12, 2021

How about using col.names = "/path/to/file" for this?

The only overlap with current usage is for one-column files; it should be safe to check file.exists(col.names) to distinguish the two cases.

Related: #4029, #4686. fread with nrows=0 might be a nice way to implement this (otherwise rely on readLines or scan to get the first line...).
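Purely to illustrate the proposed dispatch (this is not existing fread behaviour), the check could look something like:

# hypothetical: inside fread(), before col.names is applied
if (is.character(col.names) && length(col.names) == 1L && file.exists(col.names)) {
  col.names <- names(fread(col.names, nrows = 0L))
}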

@luisvalenzuelar

luisvalenzuelar commented Feb 23, 2024

It's interesting that neither Python nor Stata nor other R functions like readr's read_csv have managed to include this option. In any case, benchmarking suggests that chunked or SQL-backed readers like read_csv_chunked or read.csv.sql do better than system-based approaches (grep/awk/etc.), which are in any case far from intuitive for most users. Maybe fread could allow for such pre-loading options, which might still be faster than subsetting after the fact, i.e. A[B].
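For comparison, a sketch of the chunked-reader route mentioned above using readr::read_csv_chunked with a filtering callback; the file name and Value column are assumptions for illustration:

library(readr)
# read in chunks, keeping only rows with Value >= 1.3 from each chunk
res <- read_csv_chunked(
  "file.csv",
  callback = DataFrameCallback$new(function(chunk, pos) subset(chunk, Value >= 1.3)),
  chunk_size = 100000
)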

@tdhock tdhock changed the title [R-Forge #2605] add filtering option to fread so it can load part of a file [R-Forge #2605] add filtering option to fread so it can load less than all rows Feb 23, 2024