Commit

Implemented anywhere() and %anywhere%, convenient fn for range join. closes #679.
arunsrinivasan committed Apr 7, 2016
1 parent 672e6fd commit 80ccd2f
Showing 5 changed files with 50 additions and 11 deletions.
2 changes: 1 addition & 1 deletion NAMESPACE
@@ -7,7 +7,7 @@ exportClasses(data.table, IDate, ITime)

export(data.table, tables, setkey, setkeyv, key, "key<-", haskey, CJ, SJ, copy)
export(set2key, set2keyv, key2, setindex, setindexv, indices)
export(as.data.table,is.data.table,test.data.table,last,like,"%like%",between,"%between%")
export(as.data.table,is.data.table,test.data.table,last,like,"%like%",between,"%between%",anywhere,"%anywhere%")
export(timetaken)
export(truelength, alloc.col, ":=")
export(setattr, setnames, setcolorder, set, setDT, setDF)
13 changes: 12 additions & 1 deletion R/between.R
@@ -1,4 +1,4 @@

# is x[i] in between lower[i] and upper[i] ?
between <- function(x,lower,upper,incbounds=TRUE) {
    if(incbounds) x>=lower & x<=upper
    else x>lower & x<upper
@@ -7,3 +7,14 @@ between <- function(x,lower,upper,incbounds=TRUE) {
# %between% is vectorised, #534.
"%between%" <- function(x,y) between(x,y[[1]],y[[2]],incbounds=TRUE)
# If we want non inclusive bounds with %between%, just +1 to the left, and -1 to the right (assuming integers)

# is x[i] found anywhere within [lower, upper] range?
anywhere <- function(x,lower,upper,incbounds=TRUE) {
    query = setDT(list(x=x, ans=rep(FALSE, length(x))))
    subject = setDT(list(l=lower, u=upper))
    on = if (incbounds) c("x>=l", "x<=u") else c("x>l", "x<u")
    query[subject, ans := TRUE, on=on]
    query$ans
}

"%anywhere%" <- function(x,y) anywhere(x,y[[1L]],y[[2L]],incbounds=TRUE)
2 changes: 2 additions & 0 deletions README.md
@@ -84,6 +84,8 @@

33. `%between%` is vectorised which means we can now do: `DT[x %between% list(y,z)]` which is equivalent to `DT[x >= y & x <= z]`, [#534](https://github.com/Rdatatable/data.table/issues/534). Thanks @MicheleCarriero for filing the issue and the idea.

34. New functions `anywhere()` and `%anywhere%` are exported. `between()` answers the question: *"Is x[i] present in between `lower[i]` and `upper[i]`?"*. `anywhere()` on the other hand answers the question: *"Is x[i] present in any of the intervals specified by `lower, upper`?"*. This makes use of the recently implemented `non-equi` join to provide a convenient function to perform a *range join* [#679](https://github.com/Rdatatable/data.table/issues/679).
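
The difference between the two questions can be illustrated with a minimal base-R sketch. This is not the data.table implementation (which uses a non-equi join); the names `between_sketch` and `anywhere_sketch` are hypothetical and exist only for this illustration. The data mirror the values used in the tests for #679.

```r
# Illustration only: base-R stand-ins for the semantics described above.
# between_sketch / anywhere_sketch are hypothetical names, not data.table API.

# Element-wise: is x[i] within [lower[i], upper[i]]?
between_sketch <- function(x, lower, upper, incbounds = TRUE) {
  if (incbounds) x >= lower & x <= upper else x > lower & x < upper
}

# Any-interval: is x[i] within at least one of the (lower[j], upper[j]) pairs?
anywhere_sketch <- function(x, lower, upper, incbounds = TRUE) {
  stopifnot(length(lower) == length(upper))
  vapply(x, function(v) {
    if (incbounds) any(v >= lower & v <= upper) else any(v > lower & v < upper)
  }, logical(1))
}

a     <- c(8, 3, 10, 7, -10)
start <- 1:5
end   <- 6:10

between_sketch(a, start, end)                      # FALSE TRUE FALSE TRUE FALSE
anywhere_sketch(a, start, end)                     # TRUE TRUE TRUE TRUE FALSE
anywhere_sketch(a, start, end, incbounds = FALSE)  # TRUE TRUE FALSE TRUE FALSE
```

The sketch scans every interval for every element, so it is only meant to show the expected results, not the performance characteristics of the join-based version.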

#### BUG FIXES

1. Now compiles and runs on IBM AIX gcc. Thanks to Vinh Nguyen for investigation and testing, [#1351](https://github.com/Rdatatable/data.table/issues/1351).
7 changes: 7 additions & 0 deletions inst/tests/tests.Rraw
@@ -8673,6 +8673,13 @@ local({
fwrite_expect_error(1658.24, function(f) {fwrite(data.table(a=1)[NULL,], f)})
})

# tests for #679, anywhere()
dt = data.table(a=c(8,3,10,7,-10), val=runif(5))
range = data.table(start = 1:5, end = 6:10)
test(1659.1, dt[a %anywhere% range], dt[1:4])
test(1659.2, dt[anywhere(a, range$start, range$end)], dt[1:4])
test(1659.3, dt[anywhere(a, range$start, range$end, incbounds=FALSE)], dt[c(1,2,4)])

##########################

# TODO: Tests involving GForce functions needs to be run with optimisation level 1 and 2, so that both functions are tested all the time.
37 changes: 28 additions & 9 deletions man/between.Rd
@@ -1,32 +1,51 @@
\name{between}
\alias{between}
\alias{\%between\%}
\title{ Convenience function for range subset logic. }
\alias{anywhere}
\alias{\%anywhere\%}
\title{ Convenience functions for range subsets. }
\description{
Intended for use in \code{i} in \code{[.data.table}. From \code{v1.9.8}, \code{between} is vectorised.
Intended for use in \code{i} in \code{[.data.table}.

\code{between} answers the question: Is \code{x[i]} in between \code{lower[i]} and \code{upper[i]}? \code{lower} and \code{upper} are recycled if their lengths differ from \code{length(x)}. This is equivalent to \code{x >= lower & x <= upper} when \code{incbounds=TRUE}, and to \code{x > lower & x < upper} when \code{FALSE}.

\code{anywhere}, on the other hand, answers the question: Is \code{x[i]} in between \emph{any of the intervals} specified by \code{lower, upper}? There is no need for recycling here. A \code{non-equi} join is performed internally in this case to determine if \code{x[i]} is in between \emph{any} of the intervals in \code{lower, upper}.
}
\usage{
between(x,lower,upper,incbounds=TRUE)
x \%between\% y
anywhere(x,lower,upper,incbounds=TRUE)
x \%anywhere\% y
}
\arguments{
\item{x}{ Any orderable vector, i.e., those with relevant methods for \code{`<=`}, such as \code{numeric}, \code{character}, \code{Date}, ... }
\item{lower}{ Lower range bound. Usually of length=\code{1} or \code{length(x)}.}
\item{upper}{ Upper range bound. Usually of same length as \code{lower}.}
\item{lower}{ Lower range bound. Must be of same length as \code{upper}. Recycled to \code{length(x)} in case of \code{between}.}
\item{upper}{ Upper range bound. Must be of same length as \code{lower}. Recycled to \code{length(x)} in case of \code{between}.}
\item{y}{ A length-2 \code{vector} or \code{list}, with \code{y[[1]]} interpreted as \code{lower} and \code{y[[2]]} as \code{upper}.}
\item{incbounds}{ \code{TRUE} means inclusive bounds, i.e., [lower,upper]. \code{FALSE} means exclusive bounds, i.e., (lower,upper). }
}
% \details{
% }
\details{
When \code{lower} and \code{upper} are length-1 vectors, \code{between} and \code{anywhere} are the same. In that case, \code{anywhere} is likely to be faster since it uses \emph{binary search} based \code{non-equi} join instead of \code{vector scan} as in the case of \code{between}.
}
\value{
Logical vector of the same length as \code{x} with value \code{TRUE} for those that lie within the specified range.
}
\note{ Current implementation does not make use of ordered keys. \code{incbounds} is set to \code{TRUE} for the infix notation \code{\%between\%}. }
\seealso{ \code{\link{data.table}}, \code{\link{like}} }
\examples{
DT = data.table(x=1:5, y=6:10, z=c(5:1))
DT[y \%between\% c(7,9)]
X = data.table(a=1:5, b=6:10, c=c(5:1))
X[b \%between\% c(7,9)]
X[between(b, 7, 9)] # same as above
# NEW feature in v1.9.8, vectorised between
DT[z \%between\% list(x,y)]
X[c \%between\% list(a,b)]
X[between(c, a, b)] # same as above
X[between(c, a, b, incbounds=FALSE)] # open interval

# anywhere()
Y = data.table(a=c(8,3,10,7,-10), val=runif(5))
range = data.table(start = 1:5, end = 6:10)
Y[a \%anywhere\% range]
Y[anywhere(a, range$start, range$end)] # same as above
Y[anywhere(a, range$start, range$end, incbounds=FALSE)] # open interval
}
\keyword{ data }
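
The claim in the details section, that the two functions agree when `lower` and `upper` are length-1 vectors, can be checked with a small base-R sketch. It only mimics the documented semantics and does not call data.table; the variable names are chosen for this illustration.

```r
# Illustration only: base-R stand-ins for the documented semantics.
x <- c(2, 7, 9, 12)

# Single interval [7, 9]: element-wise and any-interval checks coincide.
lower <- 7
upper <- 9
elementwise  <- x >= lower & x <= upper
any_interval <- vapply(x, function(v) any(v >= lower & v <= upper), logical(1))
identical(elementwise, any_interval)  # TRUE

# Two intervals [1, 3] and [11, 13]: the any-interval check tests each
# element against both ranges, so the equivalence above no longer applies.
lower2 <- c(1, 11)
upper2 <- c(3, 13)
any_interval2 <- vapply(x, function(v) any(v >= lower2 & v <= upper2), logical(1))
any_interval2  # TRUE FALSE FALSE TRUE
```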
