Releases: ROCm/Tensile
V4.7.0 Performance improvements, bug fixes, add assembly hpa_hgemm, initial source hpa_igemm
Features
- add dot2 instructions for fp16/fp32 hpa_hgemm on gfx906
- initial i8/i32 hpa_igemm
- enable fractional loads
- enable precise bounds check
V4.6.0 Performance improvements, Bug fixes, add source hpa_hgemm
Features
- Merge gfx906 code into gfx900/gfx803 code
- Tune hgemm and sgemm for Resnet50 on gfx906
- Add source hpa_hgemm
- Use precise bounds check when possible
- Tested on ROCm 1.9
Make rocBLAS buildable with pre-ROCm 1.9 compilers
A small incremental release to make rocBLAS v14.2.5 buildable with pre-ROCm 1.9 compilers.
V4.5.0 Performance improvements, Bug fixes, add hpa_hgemm
Features
- add support for vega20
- add hpa_hgemm assembly and source
- tuning for sgemm and hgemm
- bug fixes for sgemm and hgemm small sizes
- use SGPR for alpha and beta
V4.4.0 Performance Improvements and Bug Fixes
Features
- Support Global Split U for half and double
- Support Local Split U for half and hpa
- Fix beta for hpa
- Add AssertFree0ElementMultiple requirement and runtime launch check
- Intercept solution selection logic and call hgemm HIP kernel when summation index or first free index is odd
- correct reordered_schedules fallback for hgemm
- disable PreciseBoundsCheck
- update rocblas_hgemm_asm_full.yaml to call source with VW=2 for m,n,k <= 32
- update rocblas_hgemm_asm_full.yaml to call source with VW=1 for m,n,k == 1
- Use alternating sign in random init for half
- use hipGetDevice in place of hipCtxGetDevice
- use _Float16 in place of __fp16
- add device to llvm_fma_v2f16
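The odd-size fallback and the launch check listed for V4.4.0 can be pictured with a small sketch. This is only an illustration of the rule as worded in the notes, not Tensile's actual code: the function names `choose_hgemm_path` and `free0_multiple_ok` are hypothetical, and treating the first free index as `m` and the summation index as `k` is an assumption.

```python
def free0_multiple_ok(size_free0: int, multiple: int) -> bool:
    """Hypothetical launch-time check in the spirit of AssertFree0ElementMultiple:
    the size of the first free index must be a multiple of `multiple`."""
    if multiple < 1:
        raise ValueError("multiple must be >= 1")
    return size_free0 % multiple == 0


def choose_hgemm_path(m: int, n: int, k: int) -> str:
    """Pick a kernel path for hgemm (hypothetical helper).

    Assumption: the first free index is m and the summation index is k.
    If either one is odd, fall back to the HIP source kernel;
    otherwise the assembly kernel is eligible. n does not affect the choice.
    """
    if m % 2 == 1 or k % 2 == 1:
        return "hip_source"
    return "assembly"
```

Under these assumptions, a problem with even `m` and `k` stays on the assembly path regardless of `n`, while any odd `m` or `k` is routed to the source kernel.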
V4.3.0 Performance Improvements and Bug Fixes
Features
- source kernels for k<=128 to fix stride_b=0, batch_count > 1
- __hfma no longer needed
- Modify default handling for LdsPad, if -1, only pad the TLU=0 cases
- Combine second-to-last MAC iter into common loop
- Reset local pointers at iteration based on PrefetchLocalRead
- Multi-thread kernel writing, giving a 3X-4X build speedup
- Support -1 default LdsPad (matches VectorWidth)
- refactor .yaml files
- Optimize overhang calculation
- use glvw in overhang calculation
- Enable CodeFromFiles
- Feature detect invalid kernel
- Change order to better match write batching reclaim algorithm
- Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
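The two LdsPad items in the V4.3.0 list can be read together as one default-resolution rule. The sketch below shows that reading only; `resolve_lds_pad` is a hypothetical helper, not a Tensile function, and the assumption that TLU=1 cases get zero padding under the -1 default is an inference from "only pad the TLU=0 cases".

```python
def resolve_lds_pad(lds_pad: int, vector_width: int, tlu: bool) -> int:
    """Resolve an LdsPad setting to a concrete padding value (hypothetical helper).

    Assumed reading of the release notes:
      * lds_pad == -1 means "use the default";
      * the default pads by VectorWidth, but only when TLU is 0;
      * any other value is taken as an explicit padding.
    """
    if lds_pad == -1:
        return 0 if tlu else vector_width
    return lds_pad
```

For example, with `vector_width=4` the default would resolve to a pad of 4 when TLU is 0 and to no padding when TLU is 1, while an explicit `lds_pad` is passed through unchanged.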
V4.2.0 Performance improvements
Features
- Fractional global capability
- Additional ResNet sizes
- Round up for half vgprs
- Initial code for PersistentKernel (disabled)
- Feature inner unroll2
- Enable BufferStore and buffer_atomic_cmpswap for GSU>1
V4.1.1 Performance improvements
Features
- Support LSHL_ADD
- Vectorize the store-C path
- Enable DirectToLds for half
- Fix sync with DirectToLds when PrefetchLocalRead=0
- Optimize solution merging using lookup
- Align MAC blocks when using half datatype
- Add mi25 Device 6860 to vega10
- Train for DataInitTypeBeta: 0
- Add ResNet1x1 to Exact sizes
v4.0.2 Performance improvements and initial mixed precision support
Features
- Initial mixed precision support
- Performance Improvements
  - Use Buffer Load for global reads (saves registers, reduces instruction count)
  - Support DirectToLds (saves registers, reduces latency)
  - Reduce global read offset vgprs (saves registers)
  - Use Buffer Store for global stores (reduces instruction count)
  - Optimize global store address calculation (reduces instruction count)
  - Support LdsPad to reduce LDS write bank conflicts
- Improve debug for assembly path (asserts, state dump, init LDS)
v3.6.0 Hgemm and thread safety fix
Features:
- Hgemm
  - assembly for gfx900, source for gfx803
Bug fixes:
- Additional thread safety fix for solution lookup and module storage