Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

names<- retain key and option to turn new warning on #5133

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

mattdowle
Copy link
Member

@mattdowle mattdowle commented Sep 3, 2021

Follow up to #5084
Related to #5085
Closes #5125
Closes #5126
Closes #5127
Closes #5128

The existing names<- test didn't check retaining key, fixed. This PR will fail until WIP removed.

@mattdowle mattdowle added the WIP label Sep 3, 2021
@mattdowle mattdowle added this to the 1.14.1 milestone Sep 3, 2021
@ben-schwen

This comment was marked as resolved.

@mattdowle

This comment was marked as resolved.

@ben-schwen
Copy link
Member

ben-schwen commented Sep 3, 2021

@mattdowle retaining key is possible with defining

`colnames<-.data.table` = `names<-.data.table` = function(x, value) {
  if (!is.null(value) && length(names(x))!=0) setnames(x, value)
  else { setattr(x, "names", value); invisible(x) }
}
colnames.data.table = names.data.table = function(x) copy(attr(x, "names"))

and registering them as S3method.

I can push to this branch with a build which passes all tests but makes the following changes to tests.Rraw

#test 1967.72 changed to output = "i.a has same type (integer) as x.a. No coercion needed.")

I just didn't want to hijack your PR without asking first ;)

@mattdowle
Copy link
Member Author

mattdowle commented Sep 3, 2021

@ben-schwen Thanks for looking and asking. We already have names<- and S3method for it. colnames<- reaches names<- too. Did you see names<- in data.table.R: it's very similar to yours. The problem is the shallow() in names<- which now (in dev) doesn't retain the key because R's copy via its *tmp* (R's asterix not mine) is a copy that selfref detects. We could remove the shallow(), but it's the shallow() that is reinstating over-allocation after R's copy via *tmp* removed the over-allocation. If we did remove shallow() (or alloc.col which is really what that's doing) then if the caller, who is data.table-ware, subsequently does a := adding a new column by reference, then that subsequent := is going to the issue the over-allocation-full warning :

Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.

So I'm thinking we have to continue to re-over-allocate inside names<- as it has been doing, for backwards compatibility.
Using cedta() in names<- was the way I had been thinking, together with adding an option to turn on a new warning so that folk (at least data.table-aware-folk) can find and change their names<- to setnames. I can turn it on locally, rerun all revdeps, and notify them all with plenty of notice, so that it can be turned on by default in a year or two.

There is no known way to avoid R's *tmp* copy and that was one reason the set* family of functions was created. That copy happens in R internals before our method is entered.

@ben-schwen
Copy link
Member

Thanks for the explanation. Removing shallow will also lead to segfaulting subsequent set() calls so removing it from names<- does not seem an option. Maybe we should even add the following testcases to counter the segfault?

test(492.2, DT[, C := 1], data.table(A=1:3, B=4:6, C=1, key="A"))
test(492.3, set(DT, j="D", value = 2), data.table(A=1:3, B=4:6, C=1, D=2, key="A"))

@mattdowle
Copy link
Member Author

mattdowle commented Sep 4, 2021

Yes, good point. Adding those tests good idea. Thanks.

It seems that R's *tmp* isn't a copy actually when all the names are changed. It's only a copy when a subset of names are changed. So the tests need to cover those two cases. IIUC.

You can see it happening using debug("names<-.data.table") and looking at address(x) upon entry. Or adding a first line to it to print(address(x)).

DT = data.table(a=1:3, b=4:6)
names(DT)[1] = "A"    # R's *tmp* is a copy;  address(x) != address(DT)
names(DT) = c("P", "Q")  # R's *tmp* isn't a copy; address(x) == address(DT)

DT = data.table(a=1:3)
names(DT)[1] = "A"    # R's *tmp* isn't a copy (all names changed as it's just one column)

If I got that right, it feels to me as though R's internals go through an extra step or function call when a subset of names is changed, to construct the new names vector to pass to the method, and that extra step is where the copy comes from. Strange. Just a guess. Would be interesting to look at R's internals on this to see if that's the case. Maybe doing so would reveal a way to avoid the copy, or something else unexpected.

int protecti=0;
const int n = (isNull(cols) ? length(dt) : length(cols)) + (nSpare<0 ? allocColOpt() : nSpare);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could move const int l = isNull(cols) ? LENGTH(dt) : length(cols); up from line from line 180 and already use it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Ben. I have to admit I struggled in this PR so it's great to have more eyes on it helping out. Will come back to this line.

const bool verbose = LOGICAL(verboseArg)[0];
if (length(nSpareArg)!=1 || (!isInteger(nSpareArg) && !isReal(nSpareArg)))
error(_("%s should be a single number but it is length %d and type '%s'"), "getOption('datatable.alloccol') (or an argument with this default)", length(nSpareArg), type2char(TYPEOF(nSpareArg)));
int nSpare = isInteger(nSpareArg) ? INTEGER(nSpareArg)[0] : (int)REAL(nSpareArg)[0];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't work for bit64 integer such as in alloc.col(DT, as.integer64(100)) (quite edgy though)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. We probably have this problem throughout many arguments in many functions then? On the one hand I agree it's edgy but now you've raised it, ugh, yep. Maybe we need a new helper to deal with this and sweep the whole codebase in a separate PR for int/real arguments that could be int64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coerceAs helper was added to handle this kind of coercion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see coerceAs defined in wrappers.R but it doesn't seem in use at R level yet. Yes it would be good to start using it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIR I added R wrapper for easier tests coverage.

@MichaelChirico
Copy link
Member

Hey Matt, given the subtlety of the issues here, it would help to add a verbal summary of the PR to the main commit. Right now it's a bit scattered among your responses, the code comments, and the referenced issues -- it would be easier to follow the PR in this case if you could provide such an overview of the reasoning.

@mattdowle
Copy link
Member Author

Yes I see. But if I reach that stage to make it easy for you, then it will be done, and I won't need any help. Kind of the very point is that it is scattered and hard to follow, and that's what I need help with. Perhaps you could ask on specific lines for example.

@MichaelChirico
Copy link
Member

if I reach that stage to make it easy for you, then it will be done

That makes sense, but even an encapsulation of that in the main post would be helpful, along the lines of

Here's a draft version, my thinking so far is [...], but I don't have much confidence because of [...]

Something to communicate the current status of the PR in words & not just code... it will be an added friction for you as author but will substantially help the reviewer.

@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Sep 9, 2023

For test.data.table(), I see 28 errors while running on either RStudio or RGui for this branch. I can work on resolving the issues but I'm not sure if there are problems outside of the test suite. Likewise, the first failing test I see does not have the expected value that I would think and ideally it wouldn't have a key anymore (I've included it below). I also see why this PR needs to add an option for the warning because some of the 28 errors are due to warning.

As for better downstream protection, could the indices be columns instead of attributes? That way any reordering or subset would be protected and still be useable by data.table internals and we could still hide them when using print(). That would allow us to revisit the 5084 solution, but

I'm also not sure if this PR is a direct result of #5084 or is independent of it.

Test 104.2 error
# 104.2
library(data.table)
dt <- data.table(A = rep(1:3, each=4), B = rep(11:14, each=3), C = rep(21:22, 6), key = "A,B")
dt1 <- dt2 <- dt
dt1[,"A"] <- 3L;
## test(104.2, dt[, transform(.SD, B=sum(C)), by="A"], data.table(A=rep(1:3,each=4), B=86L, C=21:22, key="A"))
## While test expects A to have 1:3, dt1 was modified which impacts dt. 
dt
##    A  B  C
##  1: 3 11 21
##  2: 3 11 22
##  3: 3 11 21
##  4: 3 12 22
##  5: 3 12 21
##  6: 3 12 22
##  7: 3 13 21
##  8: 3 13 22
##  9: 3 13 21
## 10: 3 14 22
## 11: 3 14 21
## 12: 3 14 22

@jangorecki

This comment was marked as resolved.

Comment on lines +212 to +214
SEXP copy(SEXP x)
{
return INHERITS(x, char_datatable) ? _shallow(x, R_NilValue, -1) : duplicate(x);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is responsible for 10 failed tests. Previously, copy() only returned duplicate(x). An additional flag to guarantee copying or maybe even a different function may be helpful, e.g. SEXP copy(SEXP x, bool doCopy) { return doCopy || !INHERITS(x, char_datatable) ? duplicate(x) : _shallow(x, R_NilValue, -1);}

One way the tests fail is that the shallow copy returned does not always copy when assigned correctly when i is provided. E.g., copy(dt)[, x := new_val] would not affect the original dt whereas copy(dt)[1L, x:= new_val] would affect the original dt and it would be modified. These examples would be helpful to have a flag to guarantee a copy - here are the 8 tests that fail (980.20, 982.20, 1419.03, 1419.05, 1419.07, 1419.10, 2068.20, 2233.36).

A second way the tests failed is that now [<- behaves more similarly to dt[, x:=newval] in that other objects pointing to the same address gets similar change when the replacement vector is a different length (i.e., assigning a scalar to a column that is more than 1 value). There are 2 tests like this (104.2 and 113 )

Just as interesting to note is that when reverting copy back to simply return duplicate(x); there are 3 new errors produced that aren't produced in the original PR (510, 2016.6, 2049.4). The first and last test are due to #5084 detecting invalid self ref. Specifically, making <- assignments and then subsequent assignment calls don't build on one another. e.g., dt = data.table(x = 1L, y = 1L); key(dt) <- "x"; dt[, y:=2L]; results in data.table(x = 1L, y = 1L). In that example, the first test would typically be correct but then the next text would expect the data.table to be modified but it is still the original.

2016.6, is also related to the invalid self ref, but the cause is because the data.table is loaded instead of modifying data.tables in a non-canonical way. I'll look at this more next weekend.

## Tests 980.20,  982.20, 1419.03, 1419.05, 1419.07, 1419.10, 2068.20, 2233.36
dt = data.table(x = 1L)
copy(dt)[, x:= x+1L]
dt
## x
## <int>
##   1:     1

copy(dt)[1L, x:=x+1L] ## affects original dt. Any subsequent calls that assume unmodified dt cause errors
dt
## x
## <int>
##   1:     2
## Tests 104.20  113.00
dt = data.table(a = 1:2)
dt1 <- dt
dt1[, "A"] <- 3L ## if this were the same length as initial vector, a copy would be made and match original tests
dt
##        a
##    <int>
## 1:     3
## 2:     3
## Reverting copy behavior to always duplicate **change to PR code on local**
## Tests 510, 2049.4
dt = data.table(x = 1L, y = 1L)
key(dt) <- "x"
dt[, y:= 2L] ## Warning about invalid ref
dt           ## No change
## Key: <x>
##        x     y
##    <int> <int>
## 1:     1     1

## Similar issue when loading data from storage
## This still modified to return duplicate copy
## 2016.6
DT = data.table(a=1:3)
save(list="DT", file=tt<-tempfile())
rm(DT)
name = load(tt)
DT[, a:=4L] ## warning due to invalid self ref.
DT
##        a
##    <int>
## 1:     1
## 2:     2
## 3:     3

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for digging into this

whereas copy(dt)[1L, x:= new_val] would affect the original dt and it would be modified.

This indeed seems to be unexpected behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests 492.* to ensure about that

@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
@jangorecki
Copy link
Member

This issue is currently main blocker for releasing master branch to CRAN. If anyone would like to take over this PR you are welcome and encouraged to do so.

@jangorecki jangorecki mentioned this pull request Dec 10, 2023
@MichaelChirico
Copy link
Member

I still am a bit lost on this PR, can anyone at least confirm that we have good tests in place such that once tests go green the PR is ready to go?

@jangorecki
Copy link
Member

This issue came up from revdeps testing so even if current tests won't be enough then revdeps will be enough.

@MichaelChirico
Copy link
Member

This issue came up from revdeps testing so even if current tests won't be enough then revdeps will be enough.

I guess in this case yes, where our main concern is un-breaking regressed functionality. but in general it's much preferable to have minimal tests written by authors who understand the issue more deeply.

@jangorecki
Copy link
Member

jangorecki commented Dec 18, 2023

I don't think merging to master is good at that point. IMO better to make PR complete, and then merge to master. Eventually there might have been an interference in recent changes in the master that could impact the code in not very predictable way. Then it is not only finishing PR but also resolving new issue that could have sneaked in during merge.

@ColeMiller1
Copy link
Contributor

#5125 gives the greatest insight of the issue from Matt outside of the notes in this PR:

Emailed chicane's maintainer today, below.
3 other revdeps out of 1,111 break that I haven't looked at yet.
If there is a way to make names()<- continue to retain the key, then that would be better. On first glance I can't think of a way, but I'll have another go. Probably more <- methods are affected too.
In the meantime I figured it was best to get the change request to the revdep since setnames() is better regardless.

Jan - could we roll back #5084 and plan to release 1.15.0? 5084 had breaking changes and I am not sure we can resolve the changes. Alternatively, it may be better to include a way for users to opt in with options while we figure this out.

To me, it seemed like there was interest from Matt to not have DTs be more respectful of modification by reference, mainly based on his interest with #4978 and 5084. But even in this PR, there are non-idiomatic items (e.g., dt$col <- {new_vals} that I don't really know what the expected behavior is.

@jangorecki
Copy link
Member

Rolling back is some idea, but problem that PR fixed was quite severe, silently returning wrong answer. Moreover it may not be limited to dplyr only but to whoever subsets DT without [ dispatch. There is huge amount of work already invested in resolving that, and I am worrying if we roll it back, it will slowly get abandoned. Now feels the best time to work that out. I cannot imagine documentation entry that warns about subset avoiding [ that would not sound scary to 95% of readers.

@ColeMiller1
Copy link
Contributor

That is fair - it might still be good to have a general timeline (e.g., 3 months) where if this still isn't resolved, other options are looked at.

Do you have thoughts on failed test 104.2? The existing test and behavior in 1.14.10 expect this:

library(data.table)
dt = data.table(A = 1:2, key = "A")
dt1 <- dt

dt1[, "A"] <- 3L
dt1
#>    A
#> 1: 3
#> 2: 3

## original DT unaffected
dt
#>    A
#> 1: 1
#> 2: 2

In this PR, the original DT is also updated on the dt1[, "A"] <- 3L assignment which causes a failed test. I have found that since many of the failed tests are related to non-idiomatic data.table usage, that I am not really sure what the expected results should be. If anything, I would have guessed that the new behavior is how it would have behaved all along.

@MichaelChirico
Copy link
Member

In this PR, the original DT is also updated on the dt1[, "A"] <- 3L assignment which causes a failed test. I have found that since many of the failed tests are related to non-idiomatic data.table usage, that I am not really sure what the expected results should be. If anything, I would have guessed that the new behavior is how it would have behaved all along.

I think this is what Matt was getting at with cedta(). If !cedta(), the calling code might be running dt1[, "A"] <- 3L without knowing that the [<-.data.table is being dispatched. E.g., this could be in a package that has only checked is.data.frame() and so runs the same code on data.frame as data.table.

If cedta(), the author should know better, and we should throw a warning that we can later upgrade to an error. Otherwise, we should try to do what [<-.data.frame would do, rather than introduce reference semantics to code that is not aware of it (could have bad implications).

@MichaelChirico
Copy link
Member

Reading a bit more carefully, I see [<-.data.table already has a !cedta() branch:

data.table/R/data.table.R

Lines 2139 to 2146 in 6782251

"[<-.data.table" = function (x, i, j, value) {
# [<- is provided for consistency, but := is preferred as it allows by group and by reference to subsets of columns
# with no copy of the (very large, say 10GB) columns at all. := is like an UPDATE in SQL and we like and want two symbols to change.
if (!cedta()) {
x = if (nargs()<4L) `[<-.data.frame`(x, i, value=value)
else `[<-.data.frame`(x, i, j, value)
return(setalloccol(x)) # over-allocate (again). Avoid all this by using :=.
}

In fact, several methods do:

data.table/R/data.table.R

Lines 2202 to 2206 in 6782251

"$<-.data.table" = function(x, name, value) {
if (!cedta()) {
ans = `$<-.data.frame`(x, name, value)
return(setalloccol(ans)) # over-allocate (again)
}

data.table/R/data.table.R

Lines 2234 to 2239 in 6782251

dimnames.data.table = function(x) {
if (!cedta()) {
if (!inherits(x, "data.frame"))
stopf("data.table inherits from data.frame (from v1.5), but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table?")
return(`dimnames.data.frame`(x))
}

data.table/R/data.table.R

Lines 2243 to 2245 in 6782251

"dimnames<-.data.table" = function (x, value) # so that can do colnames(dt)=<..> as well as names(dt)=<..>
{
if (!cedta()) return(`dimnames<-.data.frame`(x,value)) # nocov ; will drop key but names<-.data.table (below) is more common usage and does retain the key

data.table/R/data.table.R

Lines 2265 to 2269 in 6782251

within.data.table = function (data, expr, ...)
# basically within.list but retains key (if any)
# will be slower than using := or a regular query (see ?within for further info).
{
if (!cedta()) return(NextMethod()) # nocov

data.table/R/data.table.R

Lines 2290 to 2293 in 6782251

transform.data.table = function (`_data`, ...)
# basically transform.data.frame with data.table instead of data.frame, and retains key
{
if (!cedta()) return(NextMethod()) # nocov

names<-.data.table is thus a bit of an outlier for not having such a branch:

data.table/R/data.table.R

Lines 2253 to 2263 in 6782251

"names<-.data.table" = function(x,value)
{
# When non data.table aware packages change names, we'd like to maintain the key.
# If call is names(DT)[2]="newname", R will call this names<-.data.table function (notice no i) with 'value' already prepared to be same length as ncol
x = shallow(x) # `names<-` should not modify by reference. Related to #1015, #476 and #825. Needed for R v3.1.0+. TO DO: revisit
if (is.null(value))
setattr(x,"names",NULL) # e.g. plyr::melt() calls base::unname()
else
setnames(x,value)
x # this returned value is now shallow copied by R 3.1.0 via *tmp*. A very welcome change.
}

MichaelChirico added a commit that referenced this pull request Dec 21, 2023
@jangorecki jangorecki removed this from the 1.15.0 milestone Dec 22, 2023
MichaelChirico added a commit that referenced this pull request Dec 22, 2023
* names<- retains sorting attributes

* Add tests from #5133

* Add colnames<- method, setalloccol() output to enable set()
@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

revdep maditr affected revdep getDTeval affected revdep CornerstoneR affected revdep chicane affected
5 participants