Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

Open
danielkrizian opened this issue Oct 5, 2014 · 25 comments
Labels
feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins top request One of our most-requested issues

Comments

@danielkrizian
Copy link

danielkrizian commented Oct 5, 2014

Please consider the following feature request.

Currently, nomatch accepts only NA or 0:

DT[i, nomatch = NULL]
# Error in if (!is.na(nomatch) && nomatch != 0L) stop("nomatch must either be NA or 0, or (ideally) NA_integer_ or 0L") : 

By expanding it to accept wider range of values, one can perform for example na.fill operation during the join operation, not after, lending to faster code. Related SO

Proposed behavior:

DT[i, nomatch = any_character_or_real]     # fill with supplied value; could be 0, "foo", even NA ...
DT[i, nomatch = NA]                        # fill with NA. Special case of above. Same behavior as today.
DT[i, nomatch = 0]                         # fill with zeroes. Special case of above. 
DT[i, nomatch = NULL]                      # omit non-matched rows. Equivalent to today's nomatch = 0
DT[i, nomatch = locf]                      # reserved symbol. Equivalent to roll=+Inf
DT[i, nomatch = nocb]                      # reserved symbol. Equivalent to roll=-Inf

Alternatively, this could be achieved by introducing new argument fill, accepting any value:
[.data.table(..., fill=NA, ...)


jangorecki: adding some features to scope

  • rollequal=TRUE|FALSE
nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled
@danielkrizian danielkrizian changed the title Expand nomatch to accept any value - control fills, rolls, omits all in one place Expand nomatch to accept any value (control fills, rolls, omits during joins) Oct 5, 2014
@jangorecki
Copy link
Member

Merging with #940 as it is the same feature.
Extending Daniel's examples for an prepared error message in case of nomatch.

data.table join nomatch=quote(stop())

We have 2 dts:

library(data.table)
dict <- data.table(symbol = c("a","b","c"), ratio = c(5,10,15), key = "symbol")
dt <- data.table(symbol = c("a","b","c","d"), value = rnorm(4))

Which we want join:

dict[dt][,list(symbol, new_value = ratio * value)]

Consider the case when we expect to have match on all data in the join.
In case of nomatch we want to raise error.
Currently it is possible using:

new_dt <- dict[dt]
if(any(is.na(new_dt$ratio))) stop("missing data in the reference dictionary")

Which makes us unable to make chaining in such case.
If we could use something like this:

dict[dt,nomatch=quote(stop("missing data in the reference dictionary"))
     ][,list(symbol, new_value = ratio * value)]

just to stop process and raise the error in case of any nomatch.

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Apr 23, 2015
@arunsrinivasan
Copy link
Member

Added to 1.9.6. Will add a global option and issue a warning message so that transition can be made in the next major release. Setting the global option will allow users to use the new implementation already.

nomatch = NA - current behaviour (and default).
nomatch = 0 - fills with 0 instead of removing non-matched rows.
nomatch = NULL - current behaviour of nomatch = 0.
nomatch = LOCF(val, rollends=TRUE/FALSE) and nomatch = NOCB(val, rollends=TRUE/FALSE) - could potentially replace roll and rollends arguments??
nomatch = val - should type convert appropriately and use val for all columns.

@DavidArenburg
Copy link
Member

I think that second option is a bit revolutionary and potentially going to "piss" a lot of people who either don't read documentation/warnings or will have to rewrite huge code chunks...

@eantonya
Copy link
Contributor

It's bad enough that this will break a lot of code, but now we'd have to type even more characters for this very common operation? Uhh...

I don't like just saying I dislike smth without presenting a viable alternative, but the only thing I can think of is a new argument, which sucks in its own little way :-\

@arunsrinivasan
Copy link
Member

On extra characters, I guess you're talking about "roll" arguments? I'm not quite certain what are the extra characters otherwise.. And, I am not sure about the "roll" part, hence the "??". The idea is to first implement nomatch = <any_value> first.

# previously
X[Y, roll = Inf, rollends=TRUE]
# after
X[Y, nomatch = LOCF(Inf, TRUE)] # default case can be just nomatch = "locf"

Is this what you find hard?

For one, I find it straightforward, as "roll" actually is used to "fill in" missing values. Second, the number of arguments that are mutually exclusive in [.data.table are one too many, and this helps address that.. (by removing roll and rollends as separate arguments). Right now, using roll=TRUE and nomatch = 0L returns the rolled result, which I find confusing. Providing nomatch = roll is explicit in that it asks the results to be rolled in the event of a no match.

On breaking code.. like I said before, in this version (1.9.6), we'll have a global argument which will be set to current behaviour and will issue a warning that nomatch = 0L should be replaced to nomatch = NULL and the current behaviour will be deprecated in the next version (which is plenty time), and we'll add a new vignette to explain the benefits. There will also be an option to switch off this warning. (or at least this is what I've in mind right now). @mattdowle thoughts?

I think this is a hugely welcoming change, similar to by = .EACHI (which you advocated btw, @eddi, which also would have resulted in lot of code being broken). The point is to bring in as much consistency as possible, by transitioning smoothly (and by giving ample time for users to shift their code), and this kind of "major" change shouldn't happen a lot (it's a lot of effort for us as well), but it's essential here.

@eantonya
Copy link
Contributor

I'm talking about typing NULL instead of 0, I didn't even think about the roll. For me, nomatch=0 has been a major annoyance from the very beginning, and I'm not too happy about it becoming even more annoying.

That said I totally understand the need for filling with other numbers/throwing an error, I'm just not particularly happy with this implementation.

Actually the root of my annoyance comes from nomatch=NA being the default, instead of nomatch=NULL/0. Since we're talking about a breaking change, can we please discuss making NULL/0 the default? I'll make a short "pro" argument: it's the default for base merge and ime by far the more common type of a join/merge.

@arunsrinivasan arunsrinivasan modified the milestones: v1.9.8, v1.9.6 Apr 23, 2015
@arunsrinivasan
Copy link
Member

Taking into account what Gsee mentioned about having a release as fallback in case changes are too major (and possibly affecting current code too much), I've changed it to 1.9.8. 1.9.6 will keep things as expected for people who don't want their code to be affected (and they need not upgrade if they wish so).

I agree, purely on consistency, that nomatch = NULL should be the default. IMO, we'll have to do that as well. The default should be an inner join.

This means, this'll be default in 2.0.0 (depending on what Matt feels as well).

@eantonya
Copy link
Contributor

Ok, if nomatch=NULL is made the default, then I have no issues, and the change would be great.

@jangorecki
Copy link
Member

As for breaking the code. The sooner we have it as an option to better tested it will be, so if you are not in rush with 1.9.6 (sad) it can be worth to put it there. But for the breaking changes - moving from option to default - the best will be 2.0.0 release.

If we consider adding new argument to [ to it is worth to review other open issues which would like to get own new arguments. Maybe they can be nicely combined. The one very similar would be #614, or maybe also #830 would need one - this can be combined with roll maybe?

PS. do you drop the nomatch=quote(stop("my error")) FR at all?

@jangorecki
Copy link
Member

If such breaking change is going to happened I think we should put some info packageStartupMessage yet before 1.9.8 release - example one below.

"There is a breaking change coming in next stable release to CRAN (2.0.0) related to `nomatch` argument. That affects joins and binary search filtering. To keep the current behavior of `nomatch` in 2.0.0+ recode missing `nomatch` to `nomatch=NA` and `nomatch=0` to `nomatch=NULL` / missing. Alternatively use `options('datatable.oldnomatch'=TRUE)`."

Maybe we can also provide an option datatable.newnomatch which could makes new behavior available in 1.9.8? This can make switch easier.

@MichaelChirico
Copy link
Member

adding that this feature would make a lot of problems simpler and that breaking existing code with nomatch = 0 is worth it in the long run as I've always thought that choice of notation was strange anyway.

@gsee
Copy link

gsee commented Jan 29, 2016

Meanwhile, there are people writing production code today that need the current functionality of nomatch=0L and every time they write a new line of code that includes it, they're increasing their workload in the future if the behavior is changed. Regardless of whether the behavior of nomatch=0 is ever changed, it would be nice to be able to start using nomatch=NULL, because at least we know that once that is implemented, it's not likely to break in the foreseeable future.

@MichaelChirico
Copy link
Member

@gsee that's a good idea, but I do think you're exaggerating claims of an increased workload... ctrl+F will still only need to be pressed once (or grep could be used to do all R scripts at once)

@gsee
Copy link

gsee commented Jan 29, 2016

The point is that I don't like writing a line of code when I know it's going to break in the future. If I had a way to write code such that I wouldn't have to fix it, I'd prefer to do that.

@jangorecki
Copy link
Member

@gsee totally agree. Many times when I'm using nomatch I'm concerned if I can write it in a way it could be supported in future. One option is to use nomatch = if(packageVersion("data.table") >= "2.0.0") NULL else 0 but I'm not sure if it is recommended way.
Another option is to not upgrade your production environment. If everything works as expected then I would say it is a good practice.
Having good test coverage for own package would easily detect most of the issues when migration to new version of a dependency. Giving API for future changes, as the one written above, can help package maintainers. Additionally pointing what test they could include in their test workflow.

@mekkim
Copy link

mekkim commented Feb 3, 2016

Just adding my two-cents to highlight that, as @jangorecki knows, this is an issue that commonly comes up on StackExchange, etc. The nomatch=0 syntax always confused me (nomatch=null always seemed more logical). Here's a current thread with a very basic case. Thanks for your great work and this upcoming feature, however it is implemented!

@mattdowle
Copy link
Member

@skanskan
Copy link

Could you provide an example on how to use this nomatch roll method, please?
For example, how can I do this
zoo::na.locf(c(NA, 1,2,NA, NA, 1, NA))
using just data.table?

@mattdowle
Copy link
Member

mattdowle commented Oct 4, 2018

Yep: nomatch=NULL good and agree with Jan's PR #3095.

The desire is to specify fill values in nomatch=, right? So nomatch=0L will eventually mean use 0L instead of NA to fill the missing rows. But what about differing column types? There might be a mixture of character columns and numeric columns to fill and what should match=0L mean for the character columns?

Rather than changing the meaning of nomatch=0L, how about allowing a new list value.

nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled

This would allow the ability of filling with 0 rather than NA, without needing to change the meaning of nomatch=0L and the associated breakage.

Btw, the reason for the default being nomatch=NA (outer join) is for statistical safety. I found that in the majority of cases in my day to day work, I expected there to be data available and if there wasn't, then I wanted to measure how much was missing or at least be tripped up with the NAs afterwards. It's just too easy to miss incorrect joins due to missing data when the default is inner join. Inner join to me means ignore missing data silently which doesn't feel robust. If you expect missing data and that's ok, then I like to see nomatch=NULL in the code to remind myself that missing data is being dropped silently. I also like the nomatch="error" option idea too. In other words, I'm not keen on changing the default behaviour away from nomatch=NA since there's a good reason for it.

@jangorecki
Copy link
Member

jangorecki commented Oct 5, 2018

Very good question about filling multiple columns, and brilliant idea to address it retaining nomatch=0L backward compatible at the same time.

To include also rollequal=TRUE|FALSE #857

@mattdowle mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019
@jangorecki jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019
@MichaelChirico MichaelChirico changed the title Expand nomatch to accept any value (control fills, rolls, omits during joins) Expand nomatch to accept any value (control fills, rolls, omits during joins) Feb 19, 2019
@ethanbsmith
Copy link
Contributor

it seems like there are 2 discrete ideas that are being combined in a single parameter:

  1. the nature of the join (inner or right)
  2. the fill values to use for a left join

granted, if you do an inner join, you don't need fill values, but adding new parameters would get you around the transition and backward compatibility quite easily:

add join.type as a new parameter (currently only allowing inner, right)
add join.fill as a new parameter taking a list of name/value pars as described in Matt's Oct4th post.

the join.type could be implemented now, with warnings and backward compatibility would be a simple thunk to the new parameter.

this would also leave open the possibility for supporting full join in the future, or possibly even the sacrilegious left join ;)

@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019
@jangorecki jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Aug 4, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@jangorecki
Copy link
Member

jangorecki commented Apr 10, 2020

To summarize current requirements

# `nomatch` argument value on the left, translation on the right
NA -> list(a=NA, b=NA)
0L -> NULL
"locf" -> list(a=LOCF(), b=LOCF())
"stop" -> list(a=stop(), b=stop())

NULL -> "[remove non matching rows]"
list(0L) -> list(a=0L, b=0L) # fill with zeros
list(a=NA, b=LOCF()) # fill with a column with NA, and b with LOCF
list(a=NA, b=stop()) # fill a column is irrelevant becase we will raise exception anyway

# this should be supported as well, so `nomatch` can be more easily passed as a parameter
list(a=NA, b=quote(LOCF()))
list(a=NA, b=quote(stop()))
# of we could also provide LOCF() function, that crease a specification object, rather than a symbol that we handle in unevaluated expression.
list(a=NA, b=LOCF())

rolling join options rollends and rollequal (#579) are whole tables join specification, rather than per-column specification. Not sure if we want to have them as argument to be specified in LOCF(rollends, rollequal) as then it would be defined per-column.

# some extra guessing of user intent
list(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
list(NA, price=LOCF())  # fill NA, other than price column which should be rolled

Assuming we will implement an argument that controls join type ("inner","left","right","full" #3946) then nomatch will be provided automatically

left, right, full -> NA
inner -> NULL

note that there is an idea of provided two kinds of nomatch, one for LHS table, and another for RHS, which would control if join is inner, left, right, full: #614 (comment)

@jangorecki jangorecki added top request One of our most-requested issues and removed High labels Jun 8, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@mattdowle mattdowle removed this from the 1.14.1 milestone Aug 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins top request One of our most-requested issues
Projects
None yet
Development

No branches or pull requests