fread feather #2026

Open

mattdowle opened this issue Feb 15, 2017 · 9 comments
@mattdowle
Member

As suggested here, to avoid needing to use or wrap with setDT:
https://twitter.com/bennetvoorhees/status/830070242659414016
(I guess that rio returns a data.frame or tibble, so making fread do it directly is perhaps clearer, since people use fread to get a data.table.)
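
For context, the workaround the tweet alludes to looks like this (a minimal sketch; "data.feather" is a hypothetical file name):

library(data.table)
library(arrow)

# today: read with arrow, then convert to data.table by reference
dt <- setDT(arrow::read_feather("data.feather"))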

@mattdowle mattdowle added this to the v1.10.6 milestone Feb 15, 2017
@mattdowle mattdowle changed the title from "fread read feather directly" to "fread feather" Feb 15, 2017
@mattdowle mattdowle modified the milestones: v1.10.6, Candidate Mar 30, 2018
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico
Member

I think a separate package like rio is better suited for this -- minimally, a package of simple wrappers to translate between data.table and other non-CSV formats (e.g. Parquet, #2505). It may be worth building such a package as an add-on within the Rdatatable org, but not in core data.table. A sketch of what one such wrapper could look like follows.
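
A hypothetical sketch of one such wrapper (fread_auto and the extension dispatch are invented for illustration, not an existing API):

library(data.table)

fread_auto <- function(path, ...) {
  # dispatch on file extension; fall back to fread for delimited text
  ext <- tolower(tools::file_ext(path))
  ans <- switch(ext,
    feather = arrow::read_feather(path, ...),
    parquet = arrow::read_parquet(path, ...),
    fread(path, ...)
  )
  setDT(ans)[]  # convert by reference and return a data.table
}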

Of course feel free to re-open if you think otherwise @mattdowle :)

@mattdowle
Member Author

Yes, I agree that in general rio is a better place for this. But the wrinkle is that rio returns a data.frame, so it's an inconvenience to users who like fread returning a data.table by default. I got the impression that was the gist motivating Bennet's tweet.

@mattdowle mattdowle reopened this Oct 7, 2019
@DrMaphuse

I would really love to see more support for this. We find ourselves increasingly moving away from R simply because our preferred data-frame package (data.table) requires a setDT call before we can work with parquet files.

The way things are now, freading .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data science infrastructure, this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.
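
For anyone wanting to reproduce the comparison, a rough sketch (file names are hypothetical and assume the same table stored in both formats; timings vary by data and machine):

library(data.table)
library(arrow)

# illustrative benchmark: fread on CSV vs. read_parquet followed by setDT
system.time(dt_csv     <- fread("big.csv"))
system.time(dt_parquet <- setDT(arrow::read_parquet("big.parquet")))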

@grantmcdermott
Contributor

I'm in a similar boat to @DrMaphuse. Our stack is entirely parquet+arrow based, and data.table sits a little awkwardly alongside it. Given the architectural differences in how the data are represented in memory, my guess is that some conversion penalty (whether setDT or otherwise) is unavoidable.

OTOH, it would be great to be able to use data.table's [i, j, by] syntax on arrow tables, i.e. as an alternative to the current dplyr frontend (see the sketch below). This would let you keep the out-of-memory features of arrow (efficient subsetting, most obviously) and reduce the cognitive overhead of switching syntaxes once you do bring a dataset into memory. It probably requires a separate (arrow.table?) package, though.
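
For now, the closest equivalent keeps the lazy arrow/dplyr pipeline and converts at the end. A minimal sketch, assuming a parquet dataset at a hypothetical path:

library(arrow)
library(dplyr)
library(data.table)

# subset out of memory via arrow's dplyr frontend, then materialize
dt <- open_dataset("data/sales_parquet") |>
  filter(year == 2022) |>
  collect() |>   # brings only the filtered subset into memory as a tibble
  setDT()        # converts to data.table by reference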

@shrektan
Member

shrektan commented Jan 30, 2023

> I would really love to see more support for this. We find ourselves increasingly moving away from R simply because our preferred data-frame package (data.table) requires a setDT call before we can work with parquet files.
>
> The way things are now, freading .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data science infrastructure, this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.

I haven't used parquet before. When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming? Otherwise I'm confused: once the data has been read into memory, R has already converted it to SEXPs, and setDT() should be very fast since it only does some bookkeeping on metadata.
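
For reference, on a plain in-memory data.frame, setDT() is typically nearly free, since it converts by reference instead of copying. A small self-contained check:

library(data.table)

df <- data.frame(x = rnorm(1e7), y = sample(letters, 1e7, replace = TRUE))
system.time(setDT(df))  # near-zero: only class and some metadata change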

@DrMaphuse

DrMaphuse commented Jan 30, 2023

> When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming?

No, the data is already in a data.frame when setDT is applied, and that step is indeed slow, at least compared to fread or to just reading a parquet file into a data.frame.

BUT setDT isn't always necessary, and this is where it gets interesting:

When a data.table is written to disk with write_parquet or write_feather, it can be read back with read_parquet or read_feather and is instantly recognized as a data.table:

> library(data.table); library(arrow); library(magrittr)  # %>% from magrittr
> test_dt <- data.table(c(1, 2, 3, 4))
> test_df <- data.frame(c(1, 2, 3, 4))
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE
> test_dt %>% write_parquet(x = ., 'test_dt.parquet')
> test_df %>% write_parquet(x = ., 'test_df.parquet')
> test_dt <- read_parquet('test_dt.parquet')
> test_df <- read_parquet('test_df.parquet')
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE

However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference. It would be super cool if we could use this to generate data.table-compatible parquet files with other tools.

@ben-schwen
Member

@DrMaphuse could you profile which part of setDT takes time? I don't know the details of read_parquet, but it could use lazy indexing similar to vroom.

Not using setDT ends up being pretty much the same as just manually setting the class to data.table: the object is missing data.table's over-allocation and self-reference. You can see this by calling data.table:::truelength(test_dt) or data.table:::selfrefok(test_dt).
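
Continuing from the earlier round-trip example, a quick way to see the difference (a sketch; exact return values depend on the data.table version):

> test_dt <- arrow::read_parquet('test_dt.parquet')
> data.table:::truelength(test_dt)  # no over-allocated column slots yet
> data.table:::selfrefok(test_dt)   # self-reference not set
> setDT(test_dt)                    # repairs both, by reference
> data.table:::selfrefok(test_dt)   # now ok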

@DrMaphuse

I tried profiling, but I'm not too familiar with the profiler in RStudio. Does this answer your question?
[screenshot of RStudio profiler output]

@eitsupi
Contributor

eitsupi commented Feb 2, 2023

> However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference.

This is simply arrow writing back the R attributes when converting from arrow::Table to data.frame. The attributes are stored as metadata in the Parquet or Arrow file.

The same result can be reproduced by setting the metadata manually, as follows.

> tbl <- data.frame(x = c(1, 2)) |> arrow::arrow_table()

> tbl$metadata$r$attributes$class <- c("data.table", "data.frame")

> arrow::write_parquet(tbl, "test.parquet")

> library(data.table)
data.table 1.14.6 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com

> arrow::read_parquet("test.parquet")
   x
1: 1
2: 2

You can check the metadata of this file with pyarrow, for example.

>>> import pyarrow.parquet
>>> md = pyarrow.parquet.read_metadata("test.parquet")
>>> md.metadata
{b'ARROW:schema': b'/////4gBAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAABQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAADAAAAAEAAAByAAAA4gAAAEEKMwoyNjI2NTgKMTk3ODg4CjUKVVRGLTgKNTMxCjIKNTMxCjEKMTYKMgoyNjIxNTMKMTAKZGF0YS50YWJsZQoyNjIxNTMKMTAKZGF0YS5mcmFtZQoxMDI2CjEKMjYyMTUzCjUKbmFtZXMKMTYKMQoyNjIxNTMKNQpjbGFzcwoyNTQKNTMxCjEKMjU0CjEwMjYKNTExCjE2CjEKMjYyMTUzCjEKeAoyNTQKMTAyNgo1MTEKMTYKMgoyNjIxNTMKMTAKYXR0cmlidXRlcwoyNjIxNTMKNwpjb2x1bW5zCjI1NAoAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEDEAAAABgAAAAEAAAAAAAAAAEAAAB4AAYACAAGAAYAAAAAAAIAAAAAAA==', b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n2\n531\n1\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n5\nclass\n254\n531\n1\n254\n1026\n511\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\n2\n262153\n10\nattributes\n262153\n7\ncolumns\n254\n'}
