V4.1.1 Performance improvements
Features
- Support LSHL_ADD
- Vectorize the store-C path
- Enable DirectToLds for half
- Fix sync with DirectToLds when PrefetchLocalRead=0
- Optimize solution merging using lookup
- Align MAC blocks when using half datatype
- Add MI25 (Device 6860) to vega10
- Train for DataInitTypeBeta: 0
- Add ResNet1x1 to Exact sizes