Releases: brycelelbach/cub_historical_2019_2020
CUB 1.5.0
Summary
CUB 1.5.0 introduces segmented sort and reduction primitives.
New Features:
- Segmented device-wide sort and reduction primitives.
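For illustration, a minimal sketch of a segmented reduction using CUB's usual two-phase temporary-storage idiom (the buffer names and error handling below are assumptions, not part of this release):

```cpp
#include <cub/cub.cuh>

// Sum each segment of d_in into d_out; d_offsets holds num_segments + 1 offsets.
void segmented_sums(const int *d_in, int *d_out,
                    const int *d_offsets, int num_segments)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // First call with a null workspace pointer only queries the required size.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Second call performs the segmented reduction.
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaFree(d_temp_storage);
}
```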
Bug Fixes:
- #36: `cub::ThreadLoad` generates compiler errors when loading from pointer-to-const.
- #29: `cub::DeviceRadixSort::SortKeys<bool>` yields compiler errors.
- #26: Misaligned address after `cub::DeviceRadixSort::SortKeys`.
- #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
- Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
- Fix for small radix sorting problems where 0 temporary bytes were required and the user's code invoked `malloc(0)` on systems where that returns `NULL`; CUB assumed the user was still asking for the storage size and never ran the sort.
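The fix matters because of CUB's size-query convention: a null `d_temp_storage` pointer means "report the required bytes", so an allocator that hands back `NULL` for a zero-byte request used to look like a second size query. A hedged sketch of the calling pattern in question (names are illustrative):

```cpp
#include <cub/cub.cuh>

void sort_small_problem(const int *d_keys_in, int *d_keys_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Size query: d_temp_storage == nullptr, so nothing is sorted yet.
    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);

    // For tiny problems temp_storage_bytes may be 0, and a zero-byte allocation
    // (like malloc(0) on some systems) may yield a null pointer. Pre-1.5.0 CUB
    // misread that null pointer as another size query and silently skipped the sort.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_keys_out, num_items);
    cudaFree(d_temp_storage);
}
```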
CUB 1.4.1
Summary
CUB 1.4.1 is a minor release.
Enhancements
- Allow `cub::DeviceRadixSort` and `cub::BlockRadixSort` on bool types.
Bug Fixes
- Fix minor CUDA 7.0 performance regressions in `cub::DeviceScan` and `cub::DeviceReduceByKey`.
- Remove requirement for callers to define the `CUB_CDP` macro when invoking CUB device-wide routines using CUDA Dynamic Parallelism.
- Fix headers not being included in the proper order (or missing includes) for some block-wide functions.
CUB 1.4.0
Summary
CUB 1.4.0 adds `cub::DeviceSpmv`, `cub::DeviceRunLengthEncode::NonTrivialRuns`, improves `cub::DeviceHistogram`, and introduces support for SM5x (Maxwell) GPUs.
New Features:
- `cub::DeviceSpmv` methods for multiplying sparse matrices by dense vectors, load-balanced using a merge-based parallel decomposition (see the sketch below).
- `cub::DeviceRadixSort` sorting entry-points that always return the sorted output into the specified buffer, as opposed to the `cub::DoubleBuffer` in which it could end up in either buffer.
- `cub::DeviceRunLengthEncode::NonTrivialRuns` for finding the starting offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in a given sequence. Useful for top-down partitioning algorithms like MSD sorting of very-large keys.
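A minimal sketch of the SpMV entry point (y = A * x for a CSR matrix); the buffer names are assumptions and error checking is omitted:

```cpp
#include <cub/cub.cuh>

void csr_spmv(float *d_values, int *d_row_offsets, int *d_column_indices,
              float *d_vector_x, float *d_vector_y,
              int num_rows, int num_cols, int num_nonzeros)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Size query, then allocation, then the actual multiply.
    cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
                           d_values, d_row_offsets, d_column_indices,
                           d_vector_x, d_vector_y,
                           num_rows, num_cols, num_nonzeros);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    cub::DeviceSpmv::CsrMV(d_temp_storage, temp_storage_bytes,
                           d_values, d_row_offsets, d_column_indices,
                           d_vector_x, d_vector_y,
                           num_rows, num_cols, num_nonzeros);
    cudaFree(d_temp_storage);
}
```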
Other Enhancements
- Support and performance tuning for SM5x (Maxwell) GPUs.
- Updated cub::DeviceHistogram implementation that provides the same "histogram-even" and "histogram-range" functionality as IPP/NPP. Provides extremely fast and, perhaps more importantly, very uniform performance response across diverse real-world datasets, including pathological (homogeneous) sample distributions.
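A sketch of the even-binning functionality as it is spelled in later CUB releases (`cub::DeviceHistogram::HistogramEven`); the 8-bin float configuration is illustrative only:

```cpp
#include <cub/cub.cuh>

// Bin float samples into 8 evenly spaced bins over [0, 256).
void histogram_even(const float *d_samples, int num_samples, int *d_histogram)
{
    int   num_levels  = 9;       // 9 bin boundaries -> 8 bins
    float lower_level = 0.0f;
    float upper_level = 256.0f;

    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                        d_samples, d_histogram, num_levels,
                                        lower_level, upper_level, num_samples);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    cub::DeviceHistogram::HistogramEven(d_temp_storage, temp_storage_bytes,
                                        d_samples, d_histogram, num_levels,
                                        lower_level, upper_level, num_samples);
    cudaFree(d_temp_storage);
}
```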
CUB 1.3.2
Summary
CUB 1.3.2 is a minor release.
Bug Fixes
- Fix `cub::DeviceReduce` where reductions of small problems (small enough to only dispatch a single thread block) would run in the default stream (stream zero) regardless of whether an alternate stream was specified.
CUB 1.3.1
Summary
CUB 1.3.1 is a minor release.
Bug Fixes
- Workaround for a benign WAW race warning reported by cuda-memcheck in `cub::BlockScan` specialized for the `BLOCK_SCAN_WARP_SCANS` algorithm.
- Fix bug in `cub::DeviceRadixSort` where the algorithm may sort more key bits than the caller specified (up to the nearest radix digit).
- Fix for ~3% `cub::DeviceRadixSort` performance regression on SM2x (Fermi) and SM3x (Kepler) GPUs.
CUB 1.3.0
Summary
CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide primitives and adds an enhanced version of `cub::WarpScan`.
Breaking Changes
- CUB's collective (block-wide, warp-wide) primitives underwent a minor interface refactoring:
  - To provide the appropriate support for multidimensional thread blocks, the interfaces for collective classes are now template-parameterized by X, Y, and Z block dimensions (with `BLOCK_DIM_Y` and `BLOCK_DIM_Z` being optional, and `BLOCK_DIM_X` replacing `BLOCK_THREADS`). Furthermore, the constructors that accept remapped linear thread-identifiers have been removed: all primitives now assume a row-major thread-ranking for multidimensional thread blocks. (See the sketch below.)
  - To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
- Most CUB programs having typical 1D usage should not require any changes to accommodate these updates.
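A hedged sketch of the refactored interface: a block-wide exclusive prefix sum in a kernel launched with a 2D thread block of `dim3(32, 4)`. The kernel and buffer names are illustrative, not part of this release:

```cpp
#include <cub/cub.cuh>

__global__ void ExclusiveSum2DKernel(int *d_data)
{
    // BLOCK_DIM_X = 32, scan algorithm, BLOCK_DIM_Y = 4 (BLOCK_DIM_Z defaults to 1).
    using BlockScanT = cub::BlockScan<int, 32, cub::BLOCK_SCAN_WARP_SCANS, 4>;

    __shared__ typename BlockScanT::TempStorage temp_storage;

    // Row-major thread ranking is assumed for multidimensional blocks.
    int linear_tid = threadIdx.y * blockDim.x + threadIdx.x;
    int tile_base  = blockIdx.x * 32 * 4;

    int item = d_data[tile_base + linear_tid];

    int exclusive_prefix;
    BlockScanT(temp_storage).ExclusiveSum(item, exclusive_prefix);

    d_data[tile_base + linear_tid] = exclusive_prefix;
}
```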
New Features
- Added "combination" `cub::WarpScan` methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
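A hedged sketch of the combination interface, assuming the `Scan` overload that writes both outputs and a kernel launched with one 32-thread warp (names are illustrative):

```cpp
#include <cub/cub.cuh>

__global__ void WarpScanBothKernel(int *d_in, int *d_incl, int *d_excl)
{
    using WarpScanT = cub::WarpScan<int>;
    __shared__ typename WarpScanT::TempStorage temp_storage;

    int input = d_in[threadIdx.x];
    int inclusive, exclusive;

    // A single pass produces both the inclusive and the exclusive prefix sum.
    WarpScanT(temp_storage).Scan(input, inclusive, exclusive, cub::Sum());

    d_incl[threadIdx.x] = inclusive;
    d_excl[threadIdx.x] = exclusive;
}
```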
Bug Fixes
- Fix for bug in `cub::WarpScan` (which affected `cub::BlockScan` and `cub::DeviceScan`) where incorrect results (e.g., NAN) would often be returned when parameterized for floating-point types (fp32, fp64).
- Workaround for ptxas error when compiling with the -G flag on Linux (for debug instrumentation).
- Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute-capability: the compiler could not tell which memory space was being used by collective operations and was mistakenly using global ops instead of shared ops.
CUB 1.2.3
Summary
CUB 1.2.3 is a minor release.
Bug Fixes
- Fixed access violation bug in `cub::DeviceReduce::ReduceByKey` for non-primitive value types.
- Fixed code-snippet bug in `ArgIndexInputIteratorT` documentation.
CUB 1.2.2
Summary
CUB 1.2.2 adds a new variant of `cub::BlockReduce` and MSVC project solutions for examples.
New Features
- MSVC project solutions for device-wide and block-wide examples
- New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).
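A hedged sketch of selecting the commutative-only variant, assuming the `BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY` algorithm identifier; the 128-thread configuration and names are illustrative:

```cpp
#include <cub/cub.cuh>

// 128-thread block summing one item per thread.
__global__ void BlockSumKernel(int *d_in, int *d_block_sums)
{
    using BlockReduceT =
        cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;

    __shared__ typename BlockReduceT::TempStorage temp_storage;

    int item = d_in[blockIdx.x * 128 + threadIdx.x];

    // Sum() is valid here because addition is commutative.
    int block_sum = BlockReduceT(temp_storage).Sum(item);

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = block_sum;
}
```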
Bug Fixes
- Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.
CUB 1.2.0
Summary
CUB 1.2.0 adds `cub::DeviceReduce::ReduceByKey` and `cub::DeviceReduce::RunLengthEncode` and support for CUDA 6.0.
New Features
- `cub::DeviceReduce::ReduceByKey`.
- `cub::DeviceReduce::RunLengthEncode`.
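A minimal sketch of reduce-by-key (segmented sum where runs of equal adjacent keys collapse to one output pair), written against the entry point as it appears in later CUB releases; buffer names are assumptions and error checking is omitted:

```cpp
#include <cub/cub.cuh>

void reduce_by_key(int *d_keys_in, int *d_values_in,
                   int *d_unique_out, int *d_aggregates_out,
                   int *d_num_runs_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out,
                                   d_values_in, d_aggregates_out,
                                   d_num_runs_out, cub::Sum(), num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes,
                                   d_keys_in, d_unique_out,
                                   d_values_in, d_aggregates_out,
                                   d_num_runs_out, cub::Sum(), num_items);
    cudaFree(d_temp_storage);
}
```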
Other Enhancements
- Improved `cub::DeviceScan`, `cub::DeviceSelect`, and `cub::DevicePartition` performance.
- Documentation and testing:
  - Added performance-portability plots for many device-wide primitives.
  - Explained the iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
- Revised the operation of temporary tile status bookkeeping for `cub::DeviceScan` (and similar) to be safe for current code run on future platforms (now uses proper fences).
Bug Fixes
- Fix `cub::DeviceScan` bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status.
- Fix `cub::BlockScan` bug where certain exclusive scans on custom data types for the `BLOCK_SCAN_WARP_SCANS` variant would return incorrect results for the first thread in the block.
- Added workaround to make `cub::TexRefInputIteratorT` work with CUDA 6.0.
CUB 1.1.1
Summary
CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, `cub::DeviceSelect`, `cub::DevicePartition`, `cub::Shuffle*`, and `cub::MaxSmOccupancy`. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.
Breaking Changes
- Refactored block-wide I/O (`cub::BlockLoad` and `cub::BlockStore`), removing cache-modifiers from their interfaces. `cub::CacheModifiedInputIterator` and `cub::CacheModifiedOutputIterator` should now be used with `cub::BlockLoad` and `cub::BlockStore` to effect that behavior. (See the sketch below.)
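A hedged sketch of the new division of labor, written against the modern spelling of these types: the cache modifier lives in the iterator, and `cub::BlockLoad` consumes it like any other input iterator. The tile sizes and names are illustrative:

```cpp
#include <cub/cub.cuh>

// 128 threads, 4 items per thread, streaming (cache-global) loads.
__global__ void BlockLoadCGKernel(int *d_in)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int TILE_ITEMS       = BLOCK_THREADS * ITEMS_PER_THREAD;

    using InputItr   = cub::CacheModifiedInputIterator<cub::LOAD_CG, int>;
    using BlockLoadT = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                      cub::BLOCK_LOAD_WARP_TRANSPOSE>;

    __shared__ typename BlockLoadT::TempStorage temp_storage;

    int items[ITEMS_PER_THREAD];
    BlockLoadT(temp_storage).Load(InputItr(d_in + blockIdx.x * TILE_ITEMS), items);

    // ... consume items ...
}
```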
New Features
- `cub::TexObjInputIterator`, `cub::TexRefInputIterator`, `cub::CacheModifiedInputIterator`, and `cub::CacheModifiedOutputIterator` types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
- Descending sorting for `cub::DeviceRadixSort` and `cub::BlockRadixSort`.
- Min, max, arg-min, and arg-max operators for `cub::DeviceReduce`.
- `cub::DeviceSelect` (select-unique, select-if, and select-flagged); see the sketch below.
- `cub::DevicePartition` (partition-if, partition-flagged).
- Generic `cub::ShuffleUp`, `cub::ShuffleDown`, and `cub::ShuffleIndex` for warp-wide communication of arbitrary data types (SM3x and up).
- `cub::MaxSmOccupancy` for accurately determining SM occupancy for any given kernel function pointer.
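A minimal sketch of select-if, which keeps only the items satisfying a user predicate; the predicate, buffer names, and lack of error checking below are assumptions for illustration:

```cpp
#include <cub/cub.cuh>

struct IsPositive
{
    __host__ __device__ bool operator()(const int &x) const { return x > 0; }
};

void select_positive(int *d_in, int *d_out, int *d_num_selected_out, int num_items)
{
    void  *d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes,
                          d_in, d_out, d_num_selected_out, num_items, IsPositive());
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes,
                          d_in, d_out, d_num_selected_out, num_items, IsPositive());
    cudaFree(d_temp_storage);
}
```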
Other Enhancements
- Improved `cub::DeviceScan` and `cub::DeviceRadixSort` performance for older GPUs (SM1x to SM3x).
- Renamed device-wide `stream_synchronous` param to `debug_synchronous` to avoid confusion about usage.
- Documentation improvements:
  - Added simple examples of device-wide methods.
  - Improved doxygen documentation and example snippets.
- Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).
Bug Fixes
- Fix miscellaneous `cub::DeviceScan`, `cub::BlockScan`, `cub::DeviceReduce`, and `cub::BlockReduce` bugs when operating on non-primitive types on older SM1x architectures.
- SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
- For `cub::WarpScan`-based scans, not all threads in the first warp were entering the prefix callback functor.
- `cub::DeviceRadixSort` had a race condition with key-value pairs for pre-SM35 architectures.
- `cub::DeviceRadixSort` bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
- `cub::BlockDiscontinuity` failed to compile for types other than `int32_t`/`uint32_t`.
- CUDA Dynamic Parallelism (CDP, i.e., device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.