-
Notifications
You must be signed in to change notification settings - Fork 997
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expand nomatch to accept any value (control fills, rolls, omits during joins) #857
Comments
nomatch
to accept any value - control fills, rolls, omits all in one placenomatch
to accept any value (control fills, rolls, omits during joins)
Merging with #940 as it is the same feature.
We have 2 dts: library(data.table)
dict <- data.table(symbol = c("a","b","c"), ratio = c(5,10,15), key = "symbol")
dt <- data.table(symbol = c("a","b","c","d"), value = rnorm(4)) Which we want join: dict[dt][,list(symbol, new_value = ratio * value)] Consider the case when we expect to have match on all data in the join. new_dt <- dict[dt]
if(any(is.na(new_dt$ratio))) stop("missing data in the reference dictionary") Which makes us unable to make chaining in such case. dict[dt,nomatch=quote(stop("missing data in the reference dictionary"))
][,list(symbol, new_value = ratio * value)] just to stop process and raise the error in case of any nomatch. |
Added to
|
I think that second option is a bit revolutionary and potentially going to "piss" a lot of people who either don't read documentation/warnings or will have to rewrite huge code chunks... |
It's bad enough that this will break a lot of code, but now we'd have to type even more characters for this very common operation? Uhh... I don't like just saying I dislike smth without presenting a viable alternative, but the only thing I can think of is a new argument, which sucks in its own little way :-\ |
On extra characters, I guess you're talking about "roll" arguments? I'm not quite certain what are the extra characters otherwise.. And, I am not sure about the "roll" part, hence the "??". The idea is to first implement # previously
X[Y, roll = Inf, rollends=TRUE]
# after
X[Y, nomatch = LOCF(Inf, TRUE)] # default case can be just nomatch = "locf" Is this what you find hard? For one, I find it straightforward, as "roll" actually is used to "fill in" missing values. Second, the number of arguments that are mutually exclusive in On breaking code.. like I said before, in this version (1.9.6), we'll have a global argument which will be set to current behaviour and will issue a warning that I think this is a hugely welcoming change, similar to |
I'm talking about typing That said I totally understand the need for filling with other numbers/throwing an error, I'm just not particularly happy with this implementation. Actually the root of my annoyance comes from |
Taking into account what Gsee mentioned about having a release as fallback in case changes are too major (and possibly affecting current code too much), I've changed it to 1.9.8. I agree, purely on consistency, that This means, this'll be default in 2.0.0 (depending on what Matt feels as well). |
Ok, if |
As for breaking the code. The sooner we have it as an option to better tested it will be, so if you are not in rush with 1.9.6 (sad) it can be worth to put it there. But for the breaking changes - moving from option to default - the best will be 2.0.0 release. If we consider adding new argument to PS. do you drop the |
If such breaking change is going to happened I think we should put some info "There is a breaking change coming in next stable release to CRAN (2.0.0) related to `nomatch` argument. That affects joins and binary search filtering. To keep the current behavior of `nomatch` in 2.0.0+ recode missing `nomatch` to `nomatch=NA` and `nomatch=0` to `nomatch=NULL` / missing. Alternatively use `options('datatable.oldnomatch'=TRUE)`." Maybe we can also provide an option |
adding that this feature would make a lot of problems simpler and that breaking existing code with |
Meanwhile, there are people writing production code today that need the current functionality of nomatch=0L and every time they write a new line of code that includes it, they're increasing their workload in the future if the behavior is changed. Regardless of whether the behavior of nomatch=0 is ever changed, it would be nice to be able to start using nomatch=NULL, because at least we know that once that is implemented, it's not likely to break in the foreseeable future. |
@gsee that's a good idea, but I do think you're exaggerating claims of an increased workload... |
The point is that I don't like writing a line of code when I know it's going to break in the future. If I had a way to write code such that I wouldn't have to fix it, I'd prefer to do that. |
@gsee totally agree. Many times when I'm using |
Just adding my two-cents to highlight that, as @jangorecki knows, this is an issue that commonly comes up on StackExchange, etc. The |
Could you provide an example on how to use this nomatch roll method, please? |
Yep: The desire is to specify fill values in Rather than changing the meaning of
This would allow the ability of filling with 0 rather than NA, without needing to change the meaning of Btw, the reason for the default being nomatch=NA (outer join) is for statistical safety. I found that in the majority of cases in my day to day work, I expected there to be data available and if there wasn't, then I wanted to measure how much was missing or at least be tripped up with the NAs afterwards. It's just too easy to miss incorrect joins due to missing data when the default is inner join. Inner join to me means ignore missing data silently which doesn't feel robust. If you expect missing data and that's ok, then I like to see nomatch=NULL in the code to remind myself that missing data is being dropped silently. I also like the nomatch="error" option idea too. In other words, I'm not keen on changing the default behaviour away from |
Very good question about filling multiple columns, and brilliant idea to address it retaining To include also |
nomatch
to accept any value (control fills, rolls, omits during joins)
it seems like there are 2 discrete ideas that are being combined in a single parameter:
granted, if you do an inner join, you don't need fill values, but adding new parameters would get you around the transition and backward compatibility quite easily: add the join.type could be implemented now, with warnings and backward compatibility would be a simple thunk to the new parameter. this would also leave open the possibility for supporting |
To summarize current requirements # `nomatch` argument value on the left, translation on the right
NA -> list(a=NA, b=NA)
0L -> NULL
"locf" -> list(a=LOCF(), b=LOCF())
"stop" -> list(a=stop(), b=stop())
NULL -> "[remove non matching rows]"
list(0L) -> list(a=0L, b=0L) # fill with zeros
list(a=NA, b=LOCF()) # fill with a column with NA, and b with LOCF
list(a=NA, b=stop()) # fill a column is irrelevant becase we will raise exception anyway
# this should be supported as well, so `nomatch` can be more easily passed as a parameter
list(a=NA, b=quote(LOCF()))
list(a=NA, b=quote(stop()))
# of we could also provide LOCF() function, that crease a specification object, rather than a symbol that we handle in unevaluated expression.
list(a=NA, b=LOCF()) rolling join options # some extra guessing of user intent
list(0L, "missing") # fill numeric columns with 0, character columns with "missing"
list(NA, price=LOCF()) # fill NA, other than price column which should be rolled Assuming we will implement an argument that controls join type (
note that there is an idea of provided two kinds of |
Please consider the following feature request.
Currently,
nomatch
accepts only NA or 0:By expanding it to accept wider range of values, one can perform for example
na.fill
operation during the join operation, not after, lending to faster code. Related SOProposed behavior:
Alternatively, this could be achieved by introducing new argument
fill
, accepting any value:[.data.table(..., fill=NA, ...)
jangorecki: adding some features to scope
The text was updated successfully, but these errors were encountered: