Updated ?setops and added NEWS entry, #547.
arunsrinivasan committed Mar 6, 2016
1 parent dd566e6 commit 9c93779
Showing 2 changed files with 18 additions and 38 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -62,6 +62,8 @@

23. Use of `mult='first'` or `mult='last'` when `i` argument is *logical/numeric* now provides a warning that `mult` argument is ignored, [#1295](https://github.com/Rdatatable/data.table/issues/1295). Thanks to @nkurz.

24. Fast set operations `fsetdiff`, `fintersect`, `funion` and `fsetequal` for data.tables are now implemented, [#547](https://github.com/Rdatatable/data.table/issues/547).

#### BUG FIXES

1. Now compiles and runs on IBM AIX gcc. Thanks to Vinh Nguyen for investigation and testing, [#1351](https://github.com/Rdatatable/data.table/issues/1351).
54 changes: 16 additions & 38 deletions man/setops.Rd
@@ -9,9 +9,11 @@
\alias{funion}
\alias{setequal}
\alias{fsetequal}
\title{ Set operators for a pair of data tables }
\title{ Set operations for data tables }
\description{
Fast set operators for dealing with datasets as relation algebra sets. Use \code{all = TRUE} to keep duplicate rows and process datasets unique rows quantities. Unlikely in SQL, data.table functions will retain order of rows.
Similar to base R's set functions \code{union}, \code{intersect}, \code{setdiff} and \code{setequal}, but for \code{data.table}s. The additional \code{all} argument controls whether and how duplicate rows are returned. \code{bit64::integer64} is also supported.
Unlike SQL, the data.table functions retain the order of rows in the result.
}
\usage{
fintersect(x, y, all = FALSE)
@@ -20,47 +22,23 @@ funion(x, y, all = FALSE)
fsetequal(x, y)
}
\arguments{
\item{x}{data.table.}
\item{y}{data.table.}
\item{all}{logical, decides how to handle duplicate entries, maps to operators known as \emph{INTERSECT ALL}, \emph{EXCEPT ALL}, \emph{UNION ALL}, see details below.}
\item{x}{A data.table.}
\item{y}{A data.table.}
\item{all}{Logical. Default is \code{FALSE} and removes duplicate rows on the result. When \code{TRUE}, if there are \code{xn} copies of a particular row in \code{x} and \code{yn} copies of the same row in \code{y}, then:
\code{fintersect} will return \code{min(xn, yn)} copies of that row.
\code{fsetdiff} will return \code{max(0, xn-yn)} copies of that row.
\code{funion} will return \code{xn+yn} copies of that row.}
}
\details{
Set operators are internally performing joins or forcing uniqueness. That means set operators will not work on datasets having columns not supported in join or unique, e.g. \emph{list} or \emph{complex} columns. The exception is \emph{UNION ALL} case, \code{funion(..., all=TRUE)}, will support \emph{list} type columns. The \emph{bigint} data type \code{\link[bit64]{integer64}} is supported for all set operators.
}
\section{Fast intersect}{
\code{fintersect} will produce intersection between two datasets.
If \code{all = FALSE} the distinct dataset will be returned, otherwise \code{min(x.N, y.N)} copies of each unique row.
Taking the most basic example:
\preformatted{
x = data.table(rep("a", 5))
y = data.table(rep("a", 2))
fintersect(x, y) # 1 row
fintersect(x, y, all=TRUE) # 2 rows
}
}
\section{Fast setdiff}{
\code{fsetdiff} will produce difference between two datasets.
If \code{all = FALSE} the distinct dataset will be returned, otherwise \code{max(0, x.N - y.N)} copies of each unique row.
Taking the most basic example:
\preformatted{
x = data.table(rep("a", 5))
y = data.table(rep("a", 2))
fsetdiff(x, y) # 0 rows
fsetdiff(x, y, all=TRUE) # 3 rows
}
}
\section{Fast union}{
\code{funion} will union two datasets, it is a wrapper on \code{\link{rbindlist}}.
If \code{all = FALSE} (default) then it will return distinct dataset.
}
\section{Fast setequal}{
\code{fsetequal} will tests if two datasets are equal in content, so ignoring row order and attributes.
It is a wrapper on \code{\link{all.equal.data.table}} to return logical scalar always.
Columns of type \code{complex} and \code{list} are not supported except for \code{funion}.
}
\value{
data.table, for \code{fsetequal} a logical scalar.
A data.table in case of \code{fintersect}, \code{funion} and \code{fsetdiff}. Logical \code{TRUE} or \code{FALSE} for \code{fsetequal}.
}
\seealso{ \code{\link{rbindlist}}, \code{\link{all.equal.data.table}} }
\seealso{ \code{\link{data.table}}, \code{\link{rbindlist}}, \code{\link{all.equal.data.table}} }
\references{
\url{https://db.apache.org/derby/papers/Intersect-design.html}
}
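
The duplicate-handling rules described for the \code{all} argument can be checked with a small sketch. It reuses the toy data from the examples this commit removes (five copies of a row in \code{x}, two in \code{y}); the explicit column name \code{V1} and the \code{n_*} variable names are assumptions made for illustration only.

```r
library(data.table)

# Toy inputs: the same row appears xn = 5 times in x and yn = 2 times in y.
x <- data.table(V1 = rep("a", 5))
y <- data.table(V1 = rep("a", 2))

# all = FALSE (the default) removes duplicate rows from the result.
n_intersect <- nrow(fintersect(x, y))               # 1 distinct row
n_setdiff   <- nrow(fsetdiff(x, y))                 # 0 rows: "a" occurs in y
n_union     <- nrow(funion(x, y))                   # 1 distinct row

# all = TRUE keeps duplicates according to the documented counts.
n_intersect_all <- nrow(fintersect(x, y, all = TRUE))  # min(5, 2)    = 2
n_setdiff_all   <- nrow(fsetdiff(x, y, all = TRUE))    # max(0, 5-2)  = 3
n_union_all     <- nrow(funion(x, y, all = TRUE))      # 5 + 2        = 7

# fsetequal ignores row order and returns a single logical value.
same_content <- fsetequal(data.table(V1 = 1:3), data.table(V1 = 3:1))
```

The row counts follow directly from the \code{min(xn, yn)}, \code{max(0, xn-yn)} and \code{xn+yn} rules in the argument description; the \code{fsetequal} call only demonstrates that row order does not affect the comparison.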
