Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unclear class for argument and return value of DT #5559

Open
OfekShilon opened this issue Dec 10, 2022 · 5 comments
Open

Unclear class for argument and return value of DT #5559

OfekShilon opened this issue Dec 10, 2022 · 5 comments
Labels

Comments

@OfekShilon
Copy link
Contributor

OfekShilon commented Dec 10, 2022

While experimenting with a potential fix for #5286 this test of DT surfaced:

  D = copy(mtcars)
...
  test(2212.02, EVAL("D |> DT(,.SD)"), mtcars)

This line in [.data.table :

if (!is.data.table(ans)) setattr(ans, "class", c("data.table","data.frame"))  # DF |> DT(,.SD[...]) .SD should be data.table, test 2212.013

Seems to indicate that the desired behavior is to return a data.table. However ans isn't the variable that is returned directly, later on the code restores the class to data.frame before returning:

The returned object has class data.frame but still has data.table attributes (namely .internal.selfref).

Before suggesting fixes I believe some design decisions need to be explicitly discussed:

  1. Currently DT operates freely on non-data-tables. The only discussion I could find of it is this single comment, and I'm not sure this is the best design. [.data.table, as the name implies, is a method of data.table. It might accidently work in some (even most) scenarios on data.frames, but is a hefty piece of code that was designed to use data.table attributes and trying to open it would mean an avalanche of bugs and a world of pain. In general, as a user I don't expect methods of one class to operate on another - on the contrary, I expect the class system to help me by isolating their respective functionalities.
    In particular, rejecting non-data.table's would make shallow adds .internal.selfref to an input data.frame #5286 (and probably others, both current and future) moot.
  2. Inside [.data.table and other package methods I'd suggest using setDF or setDT instead of setting class directly - as setDF cleans up data.table-only attributes, that manual-class-set misses.
@jangorecki jangorecki added this to the 1.14.7 milestone Dec 10, 2022
@cthombor
Copy link

cthombor commented Dec 14, 2022

I hesitate to comment here, because I'm a newbie to data.table, to R's S3 class system, and to RStudio. But... maybe my perspective will be helpful to the core devteam of this package?

I wholeheartedly agree with #OfekShilon: I too "... don't expect methods of one class to operate on another - on the contrary, I expect the class system to help me by isolating their respective functionalities."

I tentatively suggest that the data.table development effort adopt, as a long-term goal, that it become impossible to silently promote a data.frame object to a data.table object -- except by using DT and perhaps fread(). I'm reluctantly including fread() here, because any future CRAN package which defines an fread() method will become hazardous for use in conjunction with data.table. For backwards compatibility, all existing automagic promotions of data.frame objects to data.table objects should (I suppose!) be allowed with a warning.

I'd also suggest the Introduction to data.table vignette be amended, so that it include a warning against manipulating data.table objects in interpretive environments which don't include the data.table library. I'm moderately confident that such careless manipulation was how I had once managed to create a data.table object with a corrupted set of object-level attributes: my best guess is that I had done this by mistakenly invoking an undecorated save() method (via command shell in RStudio) on a data.table object that my codebase had created. Fortunately -- and I'm very grateful for this -- my subsequent load() of this serialised object threw a warning which informed me of a dangling reference in its object-level attribute list.

I hope it is clear that the above are only suggestions from a newbie to R who isn't your usual sort of newbie -- because I have extensive experience with other languages, I prefer to develop in a strongly-typed language, I am happy to "hack" throwaway code in a weakly-typed scripting language, and my only prior experience with statistical programming was way back in the early 1990s -- when I used S and emacs! I'm hugely impressed with the current status of R. After my initial foray into emacs/ess, RStudio was a huge relief... although I'm still struggling with the basics e.g. I can't be bothered to figure out how to ensure that data.table is loaded by default in my project environment. (Surely I shouldn't be hacking .Renviron but instead should be invoking some method in usethis... but what is it called? I'm confident I could figure it out within a half-hour, maybe even faster -- but instead I find myself writing this overlong comment ;-)

Thanks to the lovely documentation by Hadley Wickham, and so many years of well-directed behind-the-scenes work by core R developers over the years, I'm now making great progress on my rev of vote_2.3.2!

@OfekShilon
Copy link
Contributor Author

OfekShilon commented Dec 15, 2022

@cthombor If I understand your intention correctly there is already an attempt to give something like "a warning against manipulating data.table objects in interpretive environments which don't include the data.table library" - check cedta, which is checked in the beginning of most (all?) data.table method calls.

This is a different matter but I do feel there is some commonality - maybe the underlying bias is to try and help everyone with everything they try to do with data.tables. Which is very generous, but often leads to fragile design.

Anyway you'd probably draw better attention for your problem if you open a new issue with concrete examples. Can you give examples of silent promotions of data.frames to data.tables?

@cthombor
Copy link

@OfekShilon thanks for your response. I have no idea whether there are any silent promotions of data.frames to data.tables; but their possible existence is one of the (many!) gaps in my knowledge which makes it difficult for me to discover the root cause(s) of the invalid .internal.selfref warnings I have triggered while developing.

Another gap in my knowledge: does data.table export a method for testing the .internal.selfref sentinel? AFAIK the answer is "no". Section 5.13 of Writing R Extensions tells me that it's not something that can be written in R. I'm asking because it'd not be very onerous for me to find a root cause for a warning -- and to develop an MRE if it turns out to be a novel cause -- by pepper-potting my code with stopifnot(valid.internal.selfref(DT)) calls along the path to the operation on DT which produced the warning message.

And yes I believe we're on the same page with the difficulties (both to users and to its devs!) of supporting use-cases for data.table that are outside what (I now, belatedly) understand to be its central set of use-cases -- the analysis of very large datasets of well-established provenance. I am now almost certain that a data.table is not, and can never be, an appropriate structure for the collection and archival storage of experimental data. The secondary factors of an experimental dataset must be reliably recorded and preserved in archival storage. During analysis -- especially if there's a conveniently-accessible archival store of experimental datasets -- there's no strong data-integrity requirement on the secondary factors which have been copied into the object-level attributes of a data.table which contains a copy of the archival record. That's my current thinking anyway: that the tidyverse might "someday" include a data-collection package that's distinct from data.table. Rather orthogonal really: for analysis of experimental data in the tidyverse you want the experimental units indexing the rows, and the primary factors indexing the columns. Given R's column-major storage: for the collection of experimental data you want the primary factors indexing the rows and the experimental units indexing the columns!

@cthombor
Copy link

cthombor commented Dec 17, 2022

@OfekShilon thanks for the pointer to cedta. However I won't expect every base method which could copy (or corrupt) a data.table to include a cedta() call; so the existence of cedta() doesn't reduce the hazard of my foolishly manipulating a data.table with a base method which copies it and therefore puts the integrity of my DT at risk. But... returning to the issue at hand -- yes I strongly support your idea of removing (what I understand to be) the silent (even if only temporary!) promotion of a data.frame object to a data.table object.

The cheat sheet for DT lists only two ways to create a data.table. Converting from a data.frame is easy-as... so I guess the body of that DT test could "easily" be revised to EVAL("D |> setDT() |> DT(,.SD)"), mtcars) but... I'm back to wondering about the feasibility of a transition in which the silent promotion you had identified in [.data.table, and of any other silent promotions which may exist in the data.table codebase, are deprecated with a warning.

Warnings can be very confusing to a newbie like me -- indeed quite distressing, if it's unclear what one can do to reliably avoid triggering such warnings in the future -- but I doubt any of my stumblings with .SD syntax and semantics would have included an attempt to use a data.frame as a value for .SD, so deprecating this particular promotion with a warning wouldn't have bothered me.

Hmmm... maybe a search for is.data.table in the data.table codebase would put a quick upper-bound on the number of its methods which can -- at least under some conditions -- silently promote an R object to a data.table?

@OfekShilon OfekShilon changed the title Unclear class for return value of DT Unclear class for argument and return value of DT Dec 18, 2022
@OfekShilon
Copy link
Contributor Author

OfekShilon commented Dec 18, 2022

@cthombor

Another gap in my knowledge: does data.table export a method for testing the .internal.selfref sentinel? AFAIK the answer is "no". Section 5.13 of Writing R Extensions tells me that it's not something that can be written in R. I'm asking because it'd not be very onerous for me to find a root cause for a warning -- and to develop an MRE if it turns out to be a novel cause -- by pepper-potting my code with stopifnot(valid.internal.selfref(DT)) calls along the path to the operation on DT which produced the warning message.

You can use the non-exported data.table:::selfrefok.
data.table does many things that can't be done in vanilla R by invoking C code. The key one in this context is address.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants