Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

danielkrizian · 2014-10-05T15:28:07Z

Please consider the following feature request.

Currently, nomatch accepts only NA or 0:

DT[i, nomatch = NULL]
# Error in if (!is.na(nomatch) && nomatch != 0L) stop("nomatch must either be NA or 0, or (ideally) NA_integer_ or 0L") :

By expanding it to accept wider range of values, one can perform for example na.fill operation during the join operation, not after, lending to faster code. Related SO

Proposed behavior:

DT[i, nomatch = any_character_or_real]     # fill with supplied value; could be 0, "foo", even NA ...
DT[i, nomatch = NA]                        # fill with NA. Special case of above. Same behavior as today.
DT[i, nomatch = 0]                         # fill with zeroes. Special case of above. 
DT[i, nomatch = NULL]                      # omit non-matched rows. Equivalent to today's nomatch = 0
DT[i, nomatch = locf]                      # reserved symbol. Equivalent to roll=+Inf
DT[i, nomatch = nocb]                      # reserved symbol. Equivalent to roll=-Inf

Alternatively, this could be achieved by introducing new argument fill, accepting any value:
[.data.table(..., fill=NA, ...)

jangorecki: adding some features to scope

rollequal=TRUE|FALSE

nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled

The text was updated successfully, but these errors were encountered:

jangorecki · 2014-12-13T10:11:33Z

Merging with #940 as it is the same feature.
Extending Daniel's examples for an prepared error message in case of nomatch.

data.table join nomatch=quote(stop())

We have 2 dts:

library(data.table)
dict <- data.table(symbol = c("a","b","c"), ratio = c(5,10,15), key = "symbol")
dt <- data.table(symbol = c("a","b","c","d"), value = rnorm(4))

Which we want join:

dict[dt][,list(symbol, new_value = ratio * value)]

Consider the case when we expect to have match on all data in the join.
In case of nomatch we want to raise error.
Currently it is possible using:

new_dt <- dict[dt]
if(any(is.na(new_dt$ratio))) stop("missing data in the reference dictionary")

Which makes us unable to make chaining in such case.
If we could use something like this:

dict[dt,nomatch=quote(stop("missing data in the reference dictionary"))
     ][,list(symbol, new_value = ratio * value)]

just to stop process and raise the error in case of any nomatch.

arunsrinivasan · 2015-04-23T10:04:11Z

Added to 1.9.6. Will add a global option and issue a warning message so that transition can be made in the next major release. Setting the global option will allow users to use the new implementation already.

nomatch = NA - current behaviour (and default).
nomatch = 0 - fills with 0 instead of removing non-matched rows.
nomatch = NULL - current behaviour of nomatch = 0.
nomatch = LOCF(val, rollends=TRUE/FALSE) and nomatch = NOCB(val, rollends=TRUE/FALSE) - could potentially replace roll and rollends arguments??
nomatch = val - should type convert appropriately and use val for all columns.

DavidArenburg · 2015-04-23T10:08:23Z

I think that second option is a bit revolutionary and potentially going to "piss" a lot of people who either don't read documentation/warnings or will have to rewrite huge code chunks...

eantonya · 2015-04-23T10:32:01Z

It's bad enough that this will break a lot of code, but now we'd have to type even more characters for this very common operation? Uhh...

I don't like just saying I dislike smth without presenting a viable alternative, but the only thing I can think of is a new argument, which sucks in its own little way :-\

arunsrinivasan · 2015-04-23T11:50:56Z

On extra characters, I guess you're talking about "roll" arguments? I'm not quite certain what are the extra characters otherwise.. And, I am not sure about the "roll" part, hence the "??". The idea is to first implement nomatch = <any_value> first.

# previously
X[Y, roll = Inf, rollends=TRUE]
# after
X[Y, nomatch = LOCF(Inf, TRUE)] # default case can be just nomatch = "locf"

Is this what you find hard?

For one, I find it straightforward, as "roll" actually is used to "fill in" missing values. Second, the number of arguments that are mutually exclusive in [.data.table are one too many, and this helps address that.. (by removing roll and rollends as separate arguments). Right now, using roll=TRUE and nomatch = 0L returns the rolled result, which I find confusing. Providing nomatch = roll is explicit in that it asks the results to be rolled in the event of a no match.

On breaking code.. like I said before, in this version (1.9.6), we'll have a global argument which will be set to current behaviour and will issue a warning that nomatch = 0L should be replaced to nomatch = NULL and the current behaviour will be deprecated in the next version (which is plenty time), and we'll add a new vignette to explain the benefits. There will also be an option to switch off this warning. (or at least this is what I've in mind right now). @mattdowle thoughts?

I think this is a hugely welcoming change, similar to by = .EACHI (which you advocated btw, @eddi, which also would have resulted in lot of code being broken). The point is to bring in as much consistency as possible, by transitioning smoothly (and by giving ample time for users to shift their code), and this kind of "major" change shouldn't happen a lot (it's a lot of effort for us as well), but it's essential here.

eantonya · 2015-04-23T13:13:27Z

I'm talking about typing NULL instead of 0, I didn't even think about the roll. For me, nomatch=0 has been a major annoyance from the very beginning, and I'm not too happy about it becoming even more annoying.

That said I totally understand the need for filling with other numbers/throwing an error, I'm just not particularly happy with this implementation.

Actually the root of my annoyance comes from nomatch=NA being the default, instead of nomatch=NULL/0. Since we're talking about a breaking change, can we please discuss making NULL/0 the default? I'll make a short "pro" argument: it's the default for base merge and ime by far the more common type of a join/merge.

arunsrinivasan · 2015-04-23T13:36:39Z

Taking into account what Gsee mentioned about having a release as fallback in case changes are too major (and possibly affecting current code too much), I've changed it to 1.9.8. 1.9.6 will keep things as expected for people who don't want their code to be affected (and they need not upgrade if they wish so).

I agree, purely on consistency, that nomatch = NULL should be the default. IMO, we'll have to do that as well. The default should be an inner join.

This means, this'll be default in 2.0.0 (depending on what Matt feels as well).

eantonya · 2015-04-23T13:57:26Z

Ok, if nomatch=NULL is made the default, then I have no issues, and the change would be great.

jangorecki · 2015-04-23T16:47:58Z

As for breaking the code. The sooner we have it as an option to better tested it will be, so if you are not in rush with 1.9.6 (sad) it can be worth to put it there. But for the breaking changes - moving from option to default - the best will be 2.0.0 release.

If we consider adding new argument to [ to it is worth to review other open issues which would like to get own new arguments. Maybe they can be nicely combined. The one very similar would be #614, or maybe also #830 would need one - this can be combined with roll maybe?

PS. do you drop the nomatch=quote(stop("my error")) FR at all?

jangorecki · 2015-09-27T22:58:53Z

If such breaking change is going to happened I think we should put some info packageStartupMessage yet before 1.9.8 release - example one below.

"There is a breaking change coming in next stable release to CRAN (2.0.0) related to `nomatch` argument. That affects joins and binary search filtering. To keep the current behavior of `nomatch` in 2.0.0+ recode missing `nomatch` to `nomatch=NA` and `nomatch=0` to `nomatch=NULL` / missing. Alternatively use `options('datatable.oldnomatch'=TRUE)`."

Maybe we can also provide an option datatable.newnomatch which could makes new behavior available in 1.9.8? This can make switch easier.

MichaelChirico · 2016-01-27T00:00:20Z

adding that this feature would make a lot of problems simpler and that breaking existing code with nomatch = 0 is worth it in the long run as I've always thought that choice of notation was strange anyway.

gsee · 2016-01-29T12:51:15Z

Meanwhile, there are people writing production code today that need the current functionality of nomatch=0L and every time they write a new line of code that includes it, they're increasing their workload in the future if the behavior is changed. Regardless of whether the behavior of nomatch=0 is ever changed, it would be nice to be able to start using nomatch=NULL, because at least we know that once that is implemented, it's not likely to break in the foreseeable future.

MichaelChirico · 2016-01-29T12:57:57Z

@gsee that's a good idea, but I do think you're exaggerating claims of an increased workload... ctrl+F will still only need to be pressed once (or grep could be used to do all R scripts at once)

gsee · 2016-01-29T13:18:43Z

The point is that I don't like writing a line of code when I know it's going to break in the future. If I had a way to write code such that I wouldn't have to fix it, I'd prefer to do that.

jangorecki · 2016-01-29T15:05:51Z

@gsee totally agree. Many times when I'm using nomatch I'm concerned if I can write it in a way it could be supported in future. One option is to use nomatch = if(packageVersion("data.table") >= "2.0.0") NULL else 0 but I'm not sure if it is recommended way.
Another option is to not upgrade your production environment. If everything works as expected then I would say it is a good practice.
Having good test coverage for own package would easily detect most of the issues when migration to new version of a dependency. Giving API for future changes, as the one written above, can help package maintainers. Additionally pointing what test they could include in their test workflow.

mekkim · 2016-02-03T21:02:23Z

Just adding my two-cents to highlight that, as @jangorecki knows, this is an issue that commonly comes up on StackExchange, etc. The nomatch=0 syntax always confused me (nomatch=null always seemed more logical). Here's a current thread with a very basic case. Thanks for your great work and this upcoming feature, however it is implemented!

mattdowle · 2016-02-03T22:05:31Z

To update when done:
http://stackoverflow.com/questions/35186694/r-left-outer-join-with-0-fill-instead-of-na-while-preserving-valid-nas-in-left

skanskan · 2017-07-11T10:06:55Z

Could you provide an example on how to use this nomatch roll method, please?
For example, how can I do this
zoo::na.locf(c(NA, 1,2,NA, NA, 1, NA))
using just data.table?

mattdowle · 2018-10-04T23:44:46Z

Yep: nomatch=NULL good and agree with Jan's PR #3095.

The desire is to specify fill values in nomatch=, right? So nomatch=0L will eventually mean use 0L instead of NA to fill the missing rows. But what about differing column types? There might be a mixture of character columns and numeric columns to fill and what should match=0L mean for the character columns?

Rather than changing the meaning of nomatch=0L, how about allowing a new list value.

nomatch=.(0L)    # fill numeric columns with 0,  character columns with NA
nomatch=.(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
nomatch=.(NA, price=LOCF())  # fill NA, other than price column which should be rolled

This would allow the ability of filling with 0 rather than NA, without needing to change the meaning of nomatch=0L and the associated breakage.

Btw, the reason for the default being nomatch=NA (outer join) is for statistical safety. I found that in the majority of cases in my day to day work, I expected there to be data available and if there wasn't, then I wanted to measure how much was missing or at least be tripped up with the NAs afterwards. It's just too easy to miss incorrect joins due to missing data when the default is inner join. Inner join to me means ignore missing data silently which doesn't feel robust. If you expect missing data and that's ok, then I like to see nomatch=NULL in the code to remind myself that missing data is being dropped silently. I also like the nomatch="error" option idea too. In other words, I'm not keen on changing the default behaviour away from nomatch=NA since there's a good reason for it.

jangorecki · 2018-10-05T05:14:06Z

Very good question about filling multiple columns, and brilliant idea to address it retaining nomatch=0L backward compatible at the same time.

To include also rollequal=TRUE|FALSE #857

ethanbsmith · 2019-05-14T20:28:40Z

it seems like there are 2 discrete ideas that are being combined in a single parameter:

the nature of the join (inner or right)
the fill values to use for a left join

granted, if you do an inner join, you don't need fill values, but adding new parameters would get you around the transition and backward compatibility quite easily:

add join.type as a new parameter (currently only allowing inner, right)
add join.fill as a new parameter taking a list of name/value pars as described in Matt's Oct4th post.

the join.type could be implemented now, with warnings and backward compatibility would be a simple thunk to the new parameter.

this would also leave open the possibility for supporting full join in the future, or possibly even the sacrilegious left join ;)

jangorecki · 2020-04-10T12:51:13Z

To summarize current requirements

# `nomatch` argument value on the left, translation on the right
NA -> list(a=NA, b=NA)
0L -> NULL
"locf" -> list(a=LOCF(), b=LOCF())
"stop" -> list(a=stop(), b=stop())

NULL -> "[remove non matching rows]"
list(0L) -> list(a=0L, b=0L) # fill with zeros
list(a=NA, b=LOCF()) # fill with a column with NA, and b with LOCF
list(a=NA, b=stop()) # fill a column is irrelevant becase we will raise exception anyway

# this should be supported as well, so `nomatch` can be more easily passed as a parameter
list(a=NA, b=quote(LOCF()))
list(a=NA, b=quote(stop()))
# of we could also provide LOCF() function, that crease a specification object, rather than a symbol that we handle in unevaluated expression.
list(a=NA, b=LOCF())

rolling join options rollends and rollequal (#579) are whole tables join specification, rather than per-column specification. Not sure if we want to have them as argument to be specified in LOCF(rollends, rollequal) as then it would be defined per-column.

# some extra guessing of user intent
list(0L, "missing")  # fill numeric columns with 0, character columns with "missing"
list(NA, price=LOCF())  # fill NA, other than price column which should be rolled

Assuming we will implement an argument that controls join type ("inner","left","right","full" #3946) then nomatch will be provided automatically

left, right, full -> NA
inner -> NULL

note that there is an idea of provided two kinds of nomatch, one for LHS table, and another for RHS, which would control if join is inner, left, right, full: #614 (comment)

danielkrizian changed the title ~~Expand nomatch to accept any value - control fills, rolls, omits all in one place~~ Expand nomatch to accept any value (control fills, rolls, omits during joins) Oct 5, 2014

arunsrinivasan added the feature request label Nov 10, 2014

jangorecki mentioned this issue Nov 10, 2014

[Request] data.table join nomatch=quote(stop()) #940

Closed

arunsrinivasan added this to the v1.9.6 milestone Apr 23, 2015

arunsrinivasan added the High label Apr 23, 2015

arunsrinivasan modified the milestones: v1.9.8, v1.9.6 Apr 23, 2015

arunsrinivasan mentioned this issue Sep 17, 2015

Strange behavior for filtering in data.table with NA values #1252

Closed

jangorecki mentioned this issue Nov 28, 2015

[R-Forge #748] Allow zeros (or any other value) to be used as fill rather than just NA #675

Closed

mattdowle modified the milestones: v2.0.0, v1.9.8 Sep 14, 2016

franknarf1 mentioned this issue Apr 21, 2017

New symbol to count rows matched in X[Y] #2124

Open

mattdowle added this to the 1.12.0 milestone Oct 4, 2018

mattdowle mentioned this issue Oct 4, 2018

Both columns for rolling and non-equi joins #3093

Open

4 tasks

mattdowle pushed a commit that referenced this issue Oct 6, 2018

inner join now available with nomatch=NULL also, towards #857 (#3095)

a6ac844

mattdowle modified the milestones: 1.12.0, 1.12.2 Jan 6, 2019

jangorecki modified the milestones: 1.12.2, 1.12.4 Jan 24, 2019

MichaelChirico changed the title ~~Expand nomatch to accept any value (control fills, rolls, omits during joins)~~ Expand nomatch to accept any value (control fills, rolls, omits during joins) Feb 19, 2019

MichaelChirico mentioned this issue Feb 19, 2019

Master list of most-requested issues #3189

Open

75 tasks

jangorecki mentioned this issue Mar 6, 2019

Add rollequal=TRUE|FALSE #579

Open

franknarf1 mentioned this issue Mar 11, 2019

With by=.EACHI and unmatched i, can we set nomatch= to get .SD[0] instead of .SD[NA]? #3452

Open

MichaelChirico mentioned this issue Jul 2, 2019

Resuming work on dtplyr #3641

Closed

asardaes mentioned this issue Jul 6, 2019

Required data.table arguments for the filtering verbs asardaes/table.express#17

Closed

jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019

jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Aug 4, 2019

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

jangorecki mentioned this issue Apr 10, 2020

join vignette #2181

Closed

zeomal mentioned this issue Apr 25, 2020

Create datatable-joins.Rmd for Joins vignette #4398

Closed

jangorecki added top request One of our most-requested issues and removed High labels Jun 8, 2020

myoung3 mentioned this issue Jun 30, 2020

by = .EACHI and summing over the empty set returning NA #4582

Open

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

mattdowle removed this from the 1.14.1 milestone Aug 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

danielkrizian commented Oct 5, 2014 •

edited by jangorecki

Loading

jangorecki commented Dec 13, 2014

arunsrinivasan commented Apr 23, 2015

DavidArenburg commented Apr 23, 2015

eantonya commented Apr 23, 2015

arunsrinivasan commented Apr 23, 2015

eantonya commented Apr 23, 2015

arunsrinivasan commented Apr 23, 2015

eantonya commented Apr 23, 2015

jangorecki commented Apr 23, 2015

jangorecki commented Sep 27, 2015

MichaelChirico commented Jan 27, 2016

gsee commented Jan 29, 2016

MichaelChirico commented Jan 29, 2016

gsee commented Jan 29, 2016

jangorecki commented Jan 29, 2016

mekkim commented Feb 3, 2016

mattdowle commented Feb 3, 2016

skanskan commented Jul 11, 2017

mattdowle commented Oct 4, 2018 •

edited

Loading

jangorecki commented Oct 5, 2018 •

edited

Loading

ethanbsmith commented May 14, 2019

jangorecki commented Apr 10, 2020 •

edited

Loading

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

Expand nomatch to accept any value (control fills, rolls, omits during joins) #857

Comments

danielkrizian commented Oct 5, 2014 • edited by jangorecki Loading

jangorecki commented Dec 13, 2014

arunsrinivasan commented Apr 23, 2015

DavidArenburg commented Apr 23, 2015

eantonya commented Apr 23, 2015

arunsrinivasan commented Apr 23, 2015

eantonya commented Apr 23, 2015

arunsrinivasan commented Apr 23, 2015

eantonya commented Apr 23, 2015

jangorecki commented Apr 23, 2015

jangorecki commented Sep 27, 2015

MichaelChirico commented Jan 27, 2016

gsee commented Jan 29, 2016

MichaelChirico commented Jan 29, 2016

gsee commented Jan 29, 2016

jangorecki commented Jan 29, 2016

mekkim commented Feb 3, 2016

mattdowle commented Feb 3, 2016

skanskan commented Jul 11, 2017

mattdowle commented Oct 4, 2018 • edited Loading

jangorecki commented Oct 5, 2018 • edited Loading

ethanbsmith commented May 14, 2019

jangorecki commented Apr 10, 2020 • edited Loading

danielkrizian commented Oct 5, 2014 •

edited by jangorecki

Loading

mattdowle commented Oct 4, 2018 •

edited

Loading

jangorecki commented Oct 5, 2018 •

edited

Loading

jangorecki commented Apr 10, 2020 •

edited

Loading