Overlap Death Match!
Intersect/overlap of genomic data (possibly from bed/bedGraph files) is implemented by
- data.table::foverlaps in R.
- GenomicRanges::findOverlaps in R.
- bedtools intersect command line program.
See demo.Rterm for the terminal output during my talk.
Recent versions of these packages are all pretty fast; see slides for
details. The only big winner is data.table::fread, which is much
faster than read.table or rtracklayer::import for reading big
bed/bedGraph files.
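As a rough illustration of reading such a file with fread, here is a minimal sketch. The file contents are made up for the example and written to a temporary path, and the column name coverage is an assumption (bedGraph files have no header row, so the names must be supplied by hand).

```r
library(data.table)

# Write a tiny, made-up bedGraph-like file to a temporary path.
bed.file <- tempfile(fileext = ".bedGraph")
writeLines(c(
  "chr1\t0\t10\t1.5",
  "chr1\t10\t25\t0",
  "chr2\t5\t8\t3"),
  bed.file)

# bedGraph has no header row, so supply the column names explicitly.
# "coverage" is just an illustrative name for the fourth column.
bed.dt <- fread(
  bed.file,
  header = FALSE,
  sep = "\t",
  col.names = c("chrom", "chromStart", "chromEnd", "coverage"))

str(bed.dt)
```

The result is a data.table, so it can be passed on to foverlaps once the coordinates have been adjusted.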
They all give the correct results, if used correctly. The only issue
is that chromStart is 0-based and chromEnd is 1-based in bed/bedGraph
files (the intervals are half-open), whereas findOverlaps and
foverlaps treat both ends of an interval as inclusive, so you need to
use chromStart+1 to get correct results in R. More specifically, if
you read a bed file into R as a data.frame with columns chrom,
chromStart, chromEnd, you need to use
IRanges(chromStart+1L, chromEnd) or
data.table(chromStart=chromStart+1L, chromEnd) as input to
findOverlaps/foverlaps.
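To see why the +1 matters, here is a small sketch using data.table::foverlaps on two made-up adjacent intervals that share no bases. The helper names to_one_based and count_overlaps are hypothetical, introduced only for this example.

```r
library(data.table)

# Two adjacent bed intervals (made-up data): the first covers bases
# 1-10 and the second covers bases 11-20, so they share no bases.
x <- data.table(chrom = "chr1", chromStart = 0L, chromEnd = 10L)
y <- data.table(chrom = "chr1", chromStart = 10L, chromEnd = 20L)

# Hypothetical helper: shift chromStart so both ends are 1-based.
to_one_based <- function(dt) {
  data.table(
    chrom = dt$chrom,
    chromStart = dt$chromStart + 1L,
    chromEnd = dt$chromEnd)
}

# Hypothetical helper: number of overlapping (x, y) pairs.
count_overlaps <- function(x, y) {
  y <- copy(y)                            # avoid keying the caller's table
  setkey(y, chrom, chromStart, chromEnd)  # foverlaps requires a keyed y
  nrow(foverlaps(
    x, y,
    by.x = c("chrom", "chromStart", "chromEnd"),
    nomatch = 0L))                        # drop rows of x with no match
}

raw.count <- count_overlaps(x, y)
shifted.count <- count_overlaps(to_one_based(x), to_one_based(y))
print(c(raw = raw.count, shifted = shifted.count))
```

With the raw bed coordinates, foverlaps reports one spurious overlap because it treats position 10 as belonging to both intervals; after the shift it reports none.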
The bedGraph files are big, so I did not put them online anywhere, which makes it impossible to re-do the timings in TF.benchmark.RData.
However, a subset of the data is available: