fread feather #2026

Open

mattdowle opened this issue Feb 15, 2017 · 9 comments
@mattdowle
Member

As suggested here, to avoid needing to use or wrap with setDT:
https://twitter.com/bennetvoorhees/status/830070242659414016
(I guess that rio returns a data.frame or tibble, so making fread do it directly is perhaps clearer, since people use fread to get a data.table.)
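
For context, the workaround the tweet alludes to looks like this (a minimal sketch; "data.feather" is a hypothetical file name):

library(data.table)
library(arrow)

# today: read with arrow, then convert to data.table by reference
dt <- setDT(arrow::read_feather("data.feather"))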

@mattdowle mattdowle added this to the v1.10.6 milestone Feb 15, 2017
@mattdowle mattdowle changed the title from "fread read feather directly" to "fread feather" Feb 15, 2017
@mattdowle mattdowle modified the milestones: v1.10.6, Candidate Mar 30, 2018
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@MichaelChirico
Member

I think a separate package like rio is better suited for this -- minimally, a package of simple wrappers to translate between data.table and other non-CSV formats (e.g. Parquet, #2505). It may be worth building such a package as an add-on within the Rdatatable org, but not in core data.table. A sketch of what one such wrapper could look like follows.
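
A hypothetical sketch of one such wrapper (fread_auto and the extension dispatch are invented for illustration, not an existing API):

library(data.table)

fread_auto <- function(path, ...) {
  # dispatch on file extension; fall back to fread for delimited text
  ext <- tolower(tools::file_ext(path))
  ans <- switch(ext,
    feather = arrow::read_feather(path, ...),
    parquet = arrow::read_parquet(path, ...),
    fread(path, ...)
  )
  setDT(ans)[]  # convert by reference and return a data.table
}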

Of course feel free to re-open if you think otherwise @mattdowle :)

@mattdowle
Member Author

Yes, I agree that in general rio is a better place for this. But the wrinkle is that rio returns a data.frame, so it's an inconvenience to users who like fread returning a data.table by default. I got the impression that was the gist motivating Bennet's tweet.

@mattdowle mattdowle reopened this Oct 7, 2019
@DrMaphuse

I would really love to see more support for this. We find ourselves increasingly moving away from R simply because our preferred data-frame package (data.table) requires a setDT call before we can work with parquet files.

The way things are now, freading .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data science infrastructure, this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.
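
For anyone wanting to reproduce the comparison, a rough sketch (file names are hypothetical and assume the same table stored in both formats; timings vary by data and machine):

library(data.table)
library(arrow)

# illustrative benchmark: fread on CSV vs. read_parquet followed by setDT
system.time(dt_csv     <- fread("big.csv"))
system.time(dt_parquet <- setDT(arrow::read_parquet("big.parquet")))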

@grantmcdermott
Contributor

I'm in a similar boat to @DrMaphuse. Our stack is entirely parquet+arrow based, and data.table sits a little awkwardly alongside it. Given the architectural differences in how the data are represented in memory, my guess is that some conversion penalty (whether setDT or otherwise) is unavoidable.

OTOH, it would be great to be able to use data.table's [i, j, by] syntax on arrow tables, i.e. as an alternative to the current dplyr frontend (see the sketch below). This would let you keep the out-of-memory features of arrow (efficient subsetting, most obviously) and reduce the cognitive overhead of switching syntaxes once you do bring a dataset into memory. It probably requires a separate (arrow.table?) package, though.
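
For now, the closest equivalent keeps the lazy arrow/dplyr pipeline and converts at the end. A minimal sketch, assuming a parquet dataset at a hypothetical path:

library(arrow)
library(dplyr)
library(data.table)

# subset out of memory via arrow's dplyr frontend, then materialize
dt <- open_dataset("data/sales_parquet") |>
  filter(year == 2022) |>
  collect() |>   # brings only the filtered subset into memory as a tibble
  setDT()        # converts to data.table by reference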

@shrektan
Member

shrektan commented Jan 30, 2023

> I would really love to see more support for this. We find ourselves increasingly moving away from R simply because our preferred data-frame package (data.table) requires a setDT call before we can work with parquet files.
>
> The way things are now, freading .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data science infrastructure, this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.

I haven't used parquet before. When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming? Otherwise I'm confused: once the data has been read into memory, R has already converted it to SEXPs, and setDT() should be very fast since it only does some bookkeeping on metadata.
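
For reference, on a plain in-memory data.frame, setDT() is typically nearly free, since it converts by reference instead of copying. A small self-contained check:

library(data.table)

df <- data.frame(x = rnorm(1e7), y = sample(letters, 1e7, replace = TRUE))
system.time(setDT(df))  # near-zero: only class and some metadata change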

@DrMaphuse

DrMaphuse commented Jan 30, 2023

> When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming?

No, the data is already in a data.frame when setDT is applied, and that step is indeed slow, at least compared to fread or to just reading a parquet file into a data.frame.

BUT setDT isn't always necessary, and this is where it gets interesting:

When a data.table is written to disk with write_parquet or write_feather, it can be read back with read_parquet or read_feather and is instantly recognized as a data.table:

> library(data.table); library(arrow); library(magrittr)  # %>% from magrittr
> test_dt <- data.table(c(1, 2, 3, 4))
> test_df <- data.frame(c(1, 2, 3, 4))
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE
> test_dt %>% write_parquet(x = ., 'test_dt.parquet')
> test_df %>% write_parquet(x = ., 'test_df.parquet')
> test_dt <- read_parquet('test_dt.parquet')
> test_df <- read_parquet('test_df.parquet')
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE

However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference. It would be super cool if we could use this to generate data.table-compatible parquet files with other tools.

@ben-schwen
Member

@DrMaphuse could you profile which part of setDT takes time? I don't know the details of read_parquet, but it could use lazy indexing similar to vroom.

Not using setDT ends up being pretty much the same as just manually setting the class to data.table: the object is missing data.table's over-allocation and self-reference. You can see this by calling data.table:::truelength(test_dt) or data.table:::selfrefok(test_dt).
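
Continuing from the earlier round-trip example, a quick way to see the difference (a sketch; exact return values depend on the data.table version):

> test_dt <- arrow::read_parquet('test_dt.parquet')
> data.table:::truelength(test_dt)  # no over-allocated column slots yet
> data.table:::selfrefok(test_dt)   # self-reference not set
> setDT(test_dt)                    # repairs both, by reference
> data.table:::selfrefok(test_dt)   # now ok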

@DrMaphuse

I tried profiling, but I'm not too familiar with the profiler in RStudio. Does this answer your question?
[screenshot of RStudio profiler output]

@eitsupi
Contributor

eitsupi commented Feb 2, 2023

> However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference.

This is simply arrow writing back the R attributes when converting from arrow::Table to data.frame. The attributes are stored as metadata in the Parquet or Arrow file.

The same result can be reproduced by setting the metadata manually, as follows.

> tbl <- data.frame(x = c(1, 2)) |> arrow::arrow_table()

> tbl$metadata$r$attributes$class <- c("data.table", "data.frame")

> arrow::write_parquet(tbl, "test.parquet")

> library(data.table)
data.table 1.14.6 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com

> arrow::read_parquet("test.parquet")
   x
1: 1
2: 2

You can check the metadata of this file with pyarrow, for example.

>>> import pyarrow.parquet
>>> md = pyarrow.parquet.read_metadata("test.parquet")
>>> md.metadata
{b'ARROW:schema': b'/////4gBAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAABQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAADAAAAAEAAAByAAAA4gAAAEEKMwoyNjI2NTgKMTk3ODg4CjUKVVRGLTgKNTMxCjIKNTMxCjEKMTYKMgoyNjIxNTMKMTAKZGF0YS50YWJsZQoyNjIxNTMKMTAKZGF0YS5mcmFtZQoxMDI2CjEKMjYyMTUzCjUKbmFtZXMKMTYKMQoyNjIxNTMKNQpjbGFzcwoyNTQKNTMxCjEKMjU0CjEwMjYKNTExCjE2CjEKMjYyMTUzCjEKeAoyNTQKMTAyNgo1MTEKMTYKMgoyNjIxNTMKMTAKYXR0cmlidXRlcwoyNjIxNTMKNwpjb2x1bW5zCjI1NAoAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEDEAAAABgAAAAEAAAAAAAAAAEAAAB4AAYACAAGAAYAAAAAAAIAAAAAAA==', b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n2\n531\n1\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n5\nclass\n254\n531\n1\n254\n1026\n511\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\n2\n262153\n10\nattributes\n262153\n7\ncolumns\n254\n'}
