Releases · ROCm/Tensile

Support Global Split U for half and double
Support Local Split U for half and hpa
Fix beta for hpa
Add AssertFree0ElementMultiple requirement and runtime launch check
Intercept solution selection logic and call hgemm HIP kernel when summation index or first free index is odd
correct reordered_schedules fallback for hgemm
disable PreciseBoundsCheck
update rocblas_hgemm_asm_full.yaml to call source with VW=2 for m,n,k <= 32
update rocblas_hgemm_asm_full.yaml to call source with VW=1 for m,n,k == 1
Use alternating sign in random init for half
use hipGetDevice in place of hipCtxGetDevice
use _Float16 in place of __fp16
add device to llvm_fma_v2f16
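
The fallback rule in the notes above (use the HIP source kernel when the summation index or first free index is odd) can be sketched roughly as follows. This is a minimal illustration, not Tensile's actual selection code: the function names, the `"assembly"`/`"source"` labels, and the mapping of `m` to the first free index and `k` to the summation index are all assumptions made for the example.

```python
def sizes_are_multiples(free0_size, summation_size, multiple=2):
    """Return True when both sizes are divisible by `multiple`.

    Mirrors the idea behind an AssertFree0ElementMultiple-style
    requirement; the parameter names here are hypothetical.
    """
    return free0_size % multiple == 0 and summation_size % multiple == 0


def select_hgemm_kernel(m, n, k):
    """Pick a kernel family for an hgemm problem of size m x n x k.

    Assumption: m is the first free index and k is the summation index.
    n does not take part in this particular check.
    """
    if sizes_are_multiples(m, k):
        return "assembly"
    # Odd first free index or odd summation index: fall back to source.
    return "source"
```

Under these assumptions, `select_hgemm_kernel(63, 64, 64)` and `select_hgemm_kernel(64, 64, 7)` both fall back to the source kernel, while an odd `n` alone does not trigger the fallback.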

28 Jun 21:36

amcamd

v4.3.0

510b8e2

V4.3.0 Performance Improvements and Bug Fixes

Features

source kernels for k<=128 to fix stride_b=0, batch_count > 1
__hfma no longer needed
Modify default handling for LdsPad, if -1, only pad the TLU=0 cases
Combine second-to-last MAC iter into common loop
Reset local pointers at iteration based on PrefetchLocalRead
Multi-thread the kernel writing, provides 3X-4X speedup for build
Support -1 default LdsPad (matches VectorWidth)
refactor .yaml files
Optimize overhang calculation
use glvw in overhang calculation
Enable CodeFromFiles
Feature detect invalid kernel
Change order to better match write batching reclaim algorithm
Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
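
One plausible reading of the two LdsPad notes above (a value of -1 means "use the default", the default matches VectorWidth, and padding is applied only when TLU=0) is sketched below. The function and argument names are hypothetical, and the combination of the two notes into a single rule is an assumption rather than a statement of how Tensile resolves the parameter.

```python
def resolve_lds_pad(lds_pad, tlu, vector_width):
    """Resolve an LdsPad setting to a concrete pad value.

    lds_pad      -- user setting; -1 requests the default behavior
    tlu          -- True when the TLU=1 case applies, False for TLU=0
    vector_width -- the VectorWidth of the solution

    Assumed rule: an explicit value is used unchanged; with -1, only
    the TLU=0 case is padded, and the pad equals VectorWidth.
    """
    if lds_pad != -1:
        return lds_pad
    return 0 if tlu else vector_width
```

For example, under this reading `resolve_lds_pad(-1, tlu=False, vector_width=4)` gives a pad of 4, while the same call with `tlu=True` gives 0.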

15 May 22:10

amcamd

v4.2.0

8040b1f

V4.2.0 Performance improvements

Features

Fractional global capability
Additional ResNet sizes
Round up for half vgprs
Initial code for PersistentKernel (disabled)
Feature inner unroll2
Enable BufferStore and buffer_atomic_cmpswap for GSU>1
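
For readers unfamiliar with the GSU (Global Split U) setting mentioned above, the general idea of splitting the summation dimension into several chunks and accumulating the partial results can be sketched in plain Python. This is only a CPU illustration of the technique: the function name is hypothetical, and the `+=` on `result` merely stands in for whatever atomic update the GPU kernel performs on the output.

```python
def split_u_dot(a, b, gsu):
    """Dot product of a and b with the summation range split into gsu chunks.

    Each chunk plays the role of one workgroup's share of the summation
    index. Chunks past the end of the data contribute nothing.
    """
    k = len(a)
    chunk = (k + gsu - 1) // gsu  # ceiling division
    result = 0.0
    for g in range(gsu):
        start = g * chunk
        end = min(start + chunk, k)
        partial = sum(x * y for x, y in zip(a[start:end], b[start:end]))
        result += partial  # stand-in for an atomic accumulate into C
    return result
```

Because several partial sums are combined into one output element, the accumulation has to be safe against concurrent updates on real hardware, which is one reason atomic operations come up when GSU is greater than 1.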

26 Apr 20:55

amcamd

v4.1.1

227344b

V4.1.1 Performance improvements

Features

Support LSHL_ADD
Vectorize the store-C path
Enable DirectToLds for half
Fix sync with DirectToLds when PrefetchLocalRead=0
Optimize solution merging using lookup
Align MAC blocks when using half datatype
Add mi25 Device 6860 to vega10
Train for DataInitTypeBeta: 0
Add ResNet1x1 to Exact sizes

10 Apr 22:31

amcamd

v4.0.2

f7e400b

v4.0.2 Performance improvements and initial mixed precision support

Features

Initial mixed precision support
Performance Improvements
- Use Buffer Load for global reads (saves registers, reduces instruction count)
- Support DirectToLds (saves registers, reduces latency)
- Reduce global read offset vgprs (saves registers)
- Use Buffer Store for global stores (reduces instruction count)
- Optimize global store address calculation (reduces instruction count)
- Support LdsPad to reduce LDS write bank conflicts
Improve debug for assembly path (asserts, state dump, init LDS)
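
The LdsPad item above relies on a general property of banked shared memory: padding each row shifts successive rows onto different banks. The sketch below illustrates that effect under simplified assumptions (32 banks, one word per element, each thread writing the first element of its own row). The bank count and access pattern are chosen for illustration and are not taken from the release notes.

```python
from collections import Counter

NUM_BANKS = 32  # assumed bank count, for illustration only


def max_bank_conflict(rows, row_stride, pad):
    """Worst-case number of writes landing in a single bank.

    `rows` threads each write element 0 of a different row, where each
    row occupies `row_stride + pad` words in memory.
    """
    banks = Counter((r * (row_stride + pad)) % NUM_BANKS for r in range(rows))
    return max(banks.values())
```

With a row stride of 32 words and no padding, all 32 writes map to the same bank; adding one word of padding spreads them across 32 distinct banks in this model.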

05 Mar 21:23

amcamd

v3.6.0

4543aeb

v3.6.0 Hgemm and thread safety fix

Features:

Hgemm
- assembly for gfx900, source for gfx803
Bug fixes:
- Additional thread safety fix for solution lookup and module storage

Releases: ROCm/Tensile

V4.7.0 Performance improvements, bug fixes, add assembly hpa_hgemm, initial source hpa_igemm

V4.6.0 Performance improvements, Bug fixes, add source hpa_hgemm

Make rocBLAS build with pre-ROCm 1.9 compilers work

V4.5.0 Performance improvements, Bug fixes, add hpa_hgemm

V4.4.0 Performance Improvements and Bug Fixes

V4.3.0 Performance Improvements and Bug Fixes

V4.2.0 Performance improvements

V4.1.1 Performance improvements

v4.0.2 Performance improvements and initial mixed precision support

v3.6.0 Hgemm and thread safety fix