Releases: brycelelbach/cub_historical_2019_2020

CUB 1.5.0

19 May 08:37

CUB 1.5.0 introduces segmented sort and reduction primitives.

New Features:

  • Segmented variants of the device-wide sort and reduction primitives.
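
As a rough illustration of what a segmented reduction computes, here is a host-side C++ reference. It is not CUB code and does not use the CUB interface: the function name `SegmentedSum` and the offsets layout (one begin offset per segment plus a final end offset) are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference for a segmented sum (illustrative only, not CUB code).
// Segment i covers items [offsets[i], offsets[i + 1]), so `offsets` holds
// num_segments + 1 entries. This layout is an assumption of the sketch.
std::vector<int> SegmentedSum(const std::vector<int>& items,
                              const std::vector<int>& offsets) {
    std::vector<int> sums;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
        // An empty segment (offsets[i] == offsets[i + 1]) yields the identity, 0.
        sums.push_back(std::accumulate(items.begin() + offsets[i],
                                       items.begin() + offsets[i + 1], 0));
    }
    return sums;
}
```

Each segment is reduced independently, so one call handles many small problems at once, including empty segments.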

Bug Fixes:

  • #36: cub::ThreadLoad generates compiler errors when loading from pointer-to-const.
  • #29: cub::DeviceRadixSort::SortKeys<bool> yields compiler errors.
  • #26: Misaligned address after cub::DeviceRadixSort::SortKeys.
  • #25: Fix for incorrect results and crashes when radix sorting 0-length problems.
  • Fix CUDA 7.5 issues on SM52 GPUs with SHFL-based warp-scan and warp-reduction on non-primitive data types (e.g. user-defined structs).
  • Fix small radix sorting problems where 0 temporary bytes were required and user code invoked malloc(0) on systems where that returns NULL. CUB interpreted the NULL pointer as a second request for the temporary storage size and did not run the sort.
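
The malloc(0) issue in the last item comes from the two-phase calling convention, where a null temporary-storage pointer means "report the required size". The sketch below mimics that convention with a hypothetical `MockDeviceSort` stand-in (not a CUB function) and shows one caller-side guard, `RunSortSafely`, also hypothetical. It illustrates the ambiguity only; it is not how CUB itself resolved the bug.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a routine that uses the two-phase convention:
// a null d_temp_storage means "report the required size and do no work".
// Returns true only if the (pretend) sort actually ran.
bool MockDeviceSort(void* d_temp_storage, std::size_t& temp_storage_bytes,
                    bool needs_scratch) {
    if (d_temp_storage == nullptr) {
        temp_storage_bytes = needs_scratch ? 256 : 0;  // size query only
        return false;
    }
    return true;  // the real work would happen here
}

// Caller-side guard: never pass a null pointer on the second call,
// even when the query reported that zero bytes are required.
bool RunSortSafely(bool needs_scratch) {
    std::size_t bytes = 0;
    MockDeviceSort(nullptr, bytes, needs_scratch);             // phase 1: query
    void* scratch = std::malloc(bytes > 0 ? bytes : 1);        // avoid malloc(0)
    if (scratch == nullptr) return false;
    bool ran = MockDeviceSort(scratch, bytes, needs_scratch);  // phase 2: run
    std::free(scratch);
    return ran;
}
```

Without the guard, a platform whose malloc(0) returns NULL would turn the second call into another size query.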

CUB 1.4.1

19 May 08:32

Summary

CUB 1.4.1 is a minor release.

Enhancements

  • Allow cub::DeviceRadixSort and cub::BlockRadixSort on bool types.

Bug Fixes

  • Fix minor CUDA 7.0 performance regressions in cub::DeviceScan and cub::DeviceReduceByKey.
  • Remove the requirement for callers to define the CUB_CDP macro when invoking CUB device-wide routines using CUDA dynamic parallelism.
  • Fix headers not being included in the proper order (or missing includes) for some block-wide functions.

CUB 1.4.0

19 May 08:32

Summary

CUB 1.4.0 adds cub::DeviceSpmv and cub::DeviceRunLengthEncode::NonTrivialRuns, improves cub::DeviceHistogram, and introduces support for SM5x (Maxwell) GPUs.

New Features:

  • cub::DeviceSpmv methods for multiplying sparse matrices by dense vectors, load-balanced using a merge-based parallel decomposition.
  • cub::DeviceRadixSort sorting entry-points that always write the sorted output to the specified output buffer, in contrast to the cub::DoubleBuffer variants, where the sorted output may end up in either buffer.
  • cub::DeviceRunLengthEncode::NonTrivialRuns for finding the starting offsets and lengths of all non-trivial runs (i.e., length > 1) of keys in a given sequence. Useful for top-down partitioning algorithms like MSD sorting of very-large keys.
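
To make the "non-trivial runs" idea concrete, the following host-side C++ sketch computes the same kind of result sequentially. The function name and the return type (a vector of offset/length pairs) are choices made for this illustration and do not mirror the device-wide interface.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Host-side reference (illustrative only): returns the (offset, length) of
// every run of consecutive equal keys whose length is greater than 1.
std::vector<std::pair<std::size_t, std::size_t>>
NonTrivialRuns(const std::vector<int>& keys) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    std::size_t start = 0;  // first index of the current run
    for (std::size_t i = 1; i <= keys.size(); ++i) {
        // A run ends at the end of the input or where the key changes.
        if (i == keys.size() || keys[i] != keys[start]) {
            if (i - start > 1) runs.emplace_back(start, i - start);
            start = i;
        }
    }
    return runs;
}
```

For the keys 0, 2, 2, 9, 5, 5, 5, 8 this reports two runs: one starting at offset 1 with length 2, and one starting at offset 4 with length 3.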

Other Enhancements

  • Support and performance tuning for SM5x (Maxwell) GPUs.
  • Updated cub::DeviceHistogram implementation that provides the same "histogram-even" and "histogram-range" functionality as IPP/NPP. Provides extremely fast and, perhaps more importantly, very uniform performance response across diverse real-world datasets, including pathological (homogeneous) sample distributions.
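
A minimal host-side sketch of the "histogram-even" idea follows: equal-width bins over a half-open range, with out-of-range samples ignored. The function name and parameters are assumptions for illustration, and the sequential loop says nothing about how the device implementation achieves its performance.

```cpp
#include <vector>

// Host-side reference (illustrative only). num_levels boundaries define
// num_levels - 1 equal-width bins over [lower, upper). Assumes num_levels >= 2
// and upper > lower. Samples outside the range are not counted.
std::vector<int> HistogramEven(const std::vector<float>& samples,
                               int num_levels, float lower, float upper) {
    const int num_bins = num_levels - 1;
    std::vector<int> histogram(num_bins, 0);
    const float bin_width = (upper - lower) / num_bins;
    for (float s : samples) {
        if (s < lower || s >= upper) continue;
        int bin = static_cast<int>((s - lower) / bin_width);
        if (bin >= num_bins) bin = num_bins - 1;  // guard against rounding at the top edge
        ++histogram[bin];
    }
    return histogram;
}
```

"Histogram-range" differs in that the caller supplies the bin boundaries explicitly instead of deriving them from a width.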

CUB 1.3.2

19 May 08:30

Summary

CUB 1.3.2 is a minor release.

Bug Fixes

  • Fix cub::DeviceReduce where reductions of small problems (small enough to only dispatch a single thread block) would run in the default stream (stream zero) regardless of whether an alternate stream was specified.

CUB 1.3.1

19 May 08:29

Summary

CUB 1.3.1 is a minor release.

Bug Fixes

  • Workaround for a benign WAW race warning reported by cuda-memcheck in cub::BlockScan when specialized for the BLOCK_SCAN_WARP_SCANS algorithm.
  • Fix bug in cub::DeviceRadixSort where the algorithm may sort more key bits than the caller specified (up to the nearest radix digit).
  • Fix for ~3% cub::DeviceRadixSort performance regression on SM2x (Fermi) and SM3x (Kepler) GPUs.
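
The key-bits fix above concerns sorts restricted to a sub-range of key bits. As a reference for the intended behavior, this host-side sketch models such a sort with a stable sort on the extracted bit field. The names `ExtractBits` and `SortKeyBits` and the half-open `[begin_bit, end_bit)` convention are assumptions of the sketch, and `std::stable_sort` stands in for the radix passes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Extracts bits [begin_bit, end_bit) of a key.
// Assumes 0 <= begin_bit < end_bit <= 32.
std::uint32_t ExtractBits(std::uint32_t key, int begin_bit, int end_bit) {
    const int num_bits = end_bit - begin_bit;
    const std::uint32_t mask = (num_bits >= 32)
        ? ~std::uint32_t{0}
        : ((std::uint32_t{1} << num_bits) - 1u);
    return (key >> begin_bit) & mask;
}

// Host-side reference (illustrative only): orders keys by the selected bits
// and nothing else. Bits outside the range must not influence the result,
// and ties keep their input order.
void SortKeyBits(std::vector<std::uint32_t>& keys, int begin_bit, int end_bit) {
    std::stable_sort(keys.begin(), keys.end(),
                     [=](std::uint32_t a, std::uint32_t b) {
                         return ExtractBits(a, begin_bit, end_bit) <
                                ExtractBits(b, begin_bit, end_bit);
                     });
}
```

Sorting 0xA, 0x6, 0x9, 0x5 on bits [0, 2) gives 0x9, 0x5, 0xA, 0x6: the upper bits are ignored, which is what a sort that spills into extra key bits would get wrong.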

CUB 1.3.0

19 May 08:27

Summary

CUB 1.3.0 improves how thread blocks are expressed in block- and warp-wide primitives and adds an enhanced version of cub::WarpScan.

Breaking Changes

  • CUB's collective (block-wide, warp-wide) primitives underwent a minor interface refactoring:
    • To provide the appropriate support for multidimensional thread blocks, the interfaces for collective classes are now template-parameterized by X, Y, and Z block dimensions (with BLOCK_DIM_Y and BLOCK_DIM_Z being optional, and BLOCK_DIM_X replacing BLOCK_THREADS). Furthermore, the constructors that accept remapped linear thread-identifiers have been removed: all primitives now assume a row-major thread-ranking for multidimensional thread blocks.
    • To allow the host program (compiled by the host-pass) to accurately determine the device-specific storage requirements for a given collective (compiled for each device-pass), the interfaces for collective classes are now (optionally) template-parameterized by the desired PTX compute capability. This is useful when aliasing collective storage to shared memory that has been allocated dynamically by the host at the kernel call site.
    • Most CUB programs having typical 1D usage should not require any changes to accommodate these updates.
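
The row-major thread ranking mentioned above can be written as a one-line formula. The helper below is a plain C++ sketch with a hypothetical name; in a kernel the coordinates would come from the thread index and the extents from the block dimensions.

```cpp
// Row-major linear rank of thread (x, y, z) in a block with extents
// dim_x by dim_y by dim_z: x varies fastest, then y, then z.
// dim_z is not needed to compute the rank. (Hypothetical helper name.)
constexpr int RowMajorRank(int x, int y, int z, int dim_x, int dim_y) {
    return (z * dim_y + y) * dim_x + x;
}
```

For a 1D block (y = z = 0) the rank is simply x, which is why typical 1D programs are unaffected.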

New Features

  • Added "combination" cub::WarpScan methods for efficiently computing both inclusive and exclusive prefix scans (and sums).
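
The appeal of a "combination" scan is that the inclusive and exclusive results differ only by whether the current item has been folded in, so both fall out of a single pass. A sequential host-side sketch for prefix sums, with names chosen for this illustration:

```cpp
#include <vector>

// Both scan results for the same input (names are illustrative).
struct ScanResult {
    std::vector<int> inclusive;
    std::vector<int> exclusive;
};

// Host-side reference: computes inclusive and exclusive prefix sums together.
ScanResult ScanBoth(const std::vector<int>& input) {
    ScanResult result;
    int running = 0;
    for (int value : input) {
        result.exclusive.push_back(running);  // sum of all earlier items
        running += value;
        result.inclusive.push_back(running);  // sum including this item
    }
    return result;
}
```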

Bug Fixes

  • Fix for bug in cub::WarpScan (which affected cub::BlockScan and cub::DeviceScan) where incorrect results (e.g., NaN) would often be returned when parameterized for floating-point types (fp32, fp64).
  • Workaround for ptxas error when compiling with the -G flag on Linux (for debug instrumentation).
  • Fixes for certain scan scenarios using custom scan operators where code compiled for SM1x is run on newer GPUs of higher compute capability: the compiler could not tell which memory space was being used by collective operations and was mistakenly using global ops instead of shared ops.

CUB 1.2.3

19 May 07:39

Summary

CUB 1.2.3 is a minor release.

Bug Fixes

  • Fixed access violation bug in cub::DeviceReduce::ReduceByKey for non-primitive value types.
  • Fixed code-snippet bug in ArgIndexInputIteratorT documentation.

CUB 1.2.2

19 May 07:38

Summary

CUB 1.2.2 adds a new variant of cub::BlockReduce and MSVC project solutions for examples.

New Features

  • MSVC project solutions for device-wide and block-wide examples.
  • New algorithmic variant of cub::BlockReduce for improved performance when using commutative operators (e.g., numeric addition).

Bug Fixes

  • Inclusion of Thrust headers in a certain order prevented CUB device-wide primitives from working properly.

CUB 1.2.0

19 May 07:29

Summary

CUB 1.2.0 adds cub::DeviceReduce::ReduceByKey and cub::DeviceReduce::RunLengthEncode and support for CUDA 6.0.

New Features

  • cub::DeviceReduce::ReduceByKey.
  • cub::DeviceReduce::RunLengthEncode.
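
Both new primitives operate on runs of consecutive equal keys. The host-side sketch below shows the semantics for integer keys and summation, and expresses run-length encoding as a reduce-by-key over a vector of ones. The struct and function signatures are assumptions for illustration, not the device-wide interfaces.

```cpp
#include <cstddef>
#include <vector>

// Result of a by-key reduction (names are illustrative).
struct ByKeyResult {
    std::vector<int> unique_keys;  // one entry per run of equal keys
    std::vector<int> aggregates;   // reduction of the values in that run
};

// Host-side reference: sums the values of each run of consecutive equal keys.
// Assumes keys and values have the same length.
ByKeyResult ReduceByKey(const std::vector<int>& keys,
                        const std::vector<int>& values) {
    ByKeyResult out;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || keys[i] != keys[i - 1]) {
            out.unique_keys.push_back(keys[i]);  // a new run starts here
            out.aggregates.push_back(values[i]);
        } else {
            out.aggregates.back() += values[i];  // extend the current run
        }
    }
    return out;
}

// Run-length encoding is a reduce-by-key where every value is 1.
ByKeyResult RunLengthEncode(const std::vector<int>& keys) {
    return ReduceByKey(keys, std::vector<int>(keys.size(), 1));
}
```

Note that equal keys separated by a different key form separate runs; no sorting is implied.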

Other Enhancements

  • Improved cub::DeviceScan, cub::DeviceSelect, cub::DevicePartition performance.
  • Documentation and testing:
    • Added performance-portability plots for many device-wide primitives.
    • Explained iterator (in)compatibilities with CUDA 5.0 (and older) and Thrust 1.6 (and older).
  • Revised the operation of temporary tile status bookkeeping for cub::DeviceScan (and similar) to be safe for current code run on future platforms (now uses proper fences).

Bug Fixes

  • Fix cub::DeviceScan bug where Windows alignment disagreements between host and device regarding user-defined data types would corrupt tile status.
  • Fix cub::BlockScan bug where certain exclusive scans on custom data types for the BLOCK_SCAN_WARP_SCANS variant would return incorrect results for the first thread in the block.
  • Added workaround to make cub::TexRefInputIteratorT work with CUDA 6.0.

CUB 1.1.1

19 May 07:29

Summary

CUB 1.1.1 introduces texture and cache modifier iterators, descending sorting, cub::DeviceSelect, cub::DevicePartition, cub::Shuffle*, and cub::MaxSmOccupancy. Additionally, scan and sort performance for older GPUs has been improved and many bugs have been fixed.

Breaking Changes

  • Refactored block-wide I/O (cub::BlockLoad and cub::BlockStore), removing cache-modifiers from their interfaces. cub::CacheModifiedInputIterator and cub::CacheModifiedOutputIterator should now be used with cub::BlockLoad and cub::BlockStore to effect that behavior.

New Features

  • cub::TexObjInputIterator, cub::TexRefInputIterator, cub::CacheModifiedInputIterator, and cub::CacheModifiedOutputIterator types for loading & storing arbitrary types through the cache hierarchy. They are compatible with Thrust.
  • Descending sorting for cub::DeviceRadixSort and cub::BlockRadixSort.
  • Min, max, arg-min, and arg-max operators for cub::DeviceReduce.
  • cub::DeviceSelect (select-unique, select-if, and select-flagged).
  • cub::DevicePartition (partition-if, partition-flagged).
  • Generic cub::ShuffleUp, cub::ShuffleDown, and cub::ShuffleIndex for warp-wide communication of arbitrary data types (SM3x and up).
  • cub::MaxSmOccupancy for accurately determining SM occupancy for any given kernel function pointer.
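
As one example of the selection primitives listed above, "select-flagged" keeps the items whose corresponding flag is set and preserves their relative order. A sequential host-side sketch, with an illustrative signature that does not mirror the device-wide interface:

```cpp
#include <cstddef>
#include <vector>

// Host-side reference (illustrative only): copies input[i] to the output
// when flags[i] is non-zero, keeping the original relative order.
// Assumes input and flags have the same length.
std::vector<int> SelectFlagged(const std::vector<int>& input,
                               const std::vector<char>& flags) {
    std::vector<int> selected;
    for (std::size_t i = 0; i < input.size(); ++i) {
        if (flags[i]) selected.push_back(input[i]);
    }
    return selected;
}
```

"Select-if" has the same shape with a predicate in place of the flag array.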

Other Enhancements

  • Improved cub::DeviceScan and cub::DeviceRadixSort performance for older GPUs (SM1x to SM3x).
  • Renamed device-wide stream_synchronous param to debug_synchronous to avoid confusion about usage.
  • Documentation improvements:
    • Added simple examples of device-wide methods.
    • Improved doxygen documentation and example snippets.
  • Improved test coverage to include up to 21,000 kernel variants and 851,000 unit tests (per architecture, per platform).

Bug Fixes

  • Fix miscellaneous cub::DeviceScan, cub::BlockScan, cub::DeviceReduce, and cub::BlockReduce bugs when operating on non-primitive types on older (SM1x) architectures.
  • SHFL-based scans and reductions produced incorrect results for multi-word types (size > 4B) on Linux.
  • For cub::WarpScan-based scans, not all threads in the first warp were entering the prefix callback functor.
  • cub::DeviceRadixSort had a race condition with key-value pairs for pre-SM35 architectures.
  • cub::DeviceRadixSort bitfield-extract behavior with long keys on 64-bit Linux was incorrect.
  • cub::BlockDiscontinuity failed to compile for types other than int32_t/uint32_t.
  • CUDA Dynamic Parallelism (CDP, i.e. device-callable) versions of device-wide methods now report the same temporary storage allocation size requirement as their host-callable counterparts.