Commit 167f601

Update todo
1 parent 870946d · commit 167f601

2 files changed: +14 / -9 lines

TODO.md

Lines changed: 13 additions & 8 deletions

@@ -11,9 +11,13 @@

 - [ ] Support for ROCm/AMD GPUs
 - [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
-- [ ] Test performance on P40 (would be a good GPU to support)
-- [ ] Tunable kernel parameters
-- [ ] Test on Windows
+- [x] Test performance on P40 (would be a good GPU to support)
+- [ ] Improve performance on P40
+- [x] Tunable kernel parameters
+- [ ] More tunable kernel parameters
+- [x] Test on Windows
+- [ ] Easier extension loading on Windows
+- [ ] Setup instructions for Windows

 ## Testing

@@ -24,22 +28,23 @@

 - [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
 - [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
-- [ ] Provide alternative backend to allow layers on CPU
+- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah

 ## Speed optimization

 - [x] Support for de-quantizing select matrices at load time
-- [ ] Better vector-matrix multiplication for de-quantized matrices (or show that it's bandwidth-limited now)
+- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequant was a dead end)
 - [ ] Fused QKV projection
 - [x] Fused MLP
 - [x] Fused RoPE
-- [ ] Build attention mask in CUDA rather than PyTorch
+- [x] ~~Build attention mask in CUDA rather than PyTorch~~
 - [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
-- [ ] Figure out why inference appears to be CPU-bound
+- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
+- [ ] Reduce no. kernel launches to minimum (tail launch, fusion etc.)
 - [x] Measure PyTorch module overhead (negligible in eval mode)
 - [x] Examine if scaled_dot_product_attention is actually the best attention method for single tokens (it's not)
 - [ ] Implement attention in CUDA
-- [ ] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider
+- [x] Rewrite at least the quantized matmul kernel. Should be a bunch of special cases to consider

 ## Generation

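Two of the new items belong together: inference turned out to be CPU-bound because of kernel launch overhead, and the follow-up task is to cut the number of launches (tail launch, fusion etc.). As a rough illustration of why launch count matters for single-token decoding, here is a minimal, self-contained CUDA sketch that measures the fixed per-launch cost; it is not code from this repository, just the standard CUDA runtime API.

// Minimal sketch (assumed, not from this repository): time many launches of a
// near-empty kernel with CUDA events to estimate fixed per-launch overhead.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop_kernel(float* out)
{
    // Trivial write so the kernel isn't optimized away entirely.
    if (threadIdx.x == 0 && blockIdx.x == 0) out[0] = 1.0f;
}

int main()
{
    const int num_launches = 10000;
    float* out;
    cudaMalloc(&out, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    noop_kernel<<<1, 32>>>(out);   // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < num_launches; i++) noop_kernel<<<1, 32>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("avg per-launch time: %.2f us\n", 1000.0f * ms / num_launches);

    cudaFree(out);
    return 0;
}

At a few microseconds per launch, a decoding step that issues hundreds of small kernels pays a fixed, millisecond-scale cost regardless of how fast the GPU is, which is the kind of overhead the fused MLP / RoPE / QKV items aim to remove.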

exllama_ext/cuda_func/column_remap.cu

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 #include "column_remap.cuh"
 #include "../util.cuh"

-const int SHUF_BLOCKSIZE_X = 256;
+const int SHUF_BLOCKSIZE_X = 1024;
 const int SHUF_BLOCKSIZE_Y = 16;

 __global__ void column_remap_kernel
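
The only code change bumps the remap kernel's x block size from 256 to 1024, one of the "tunable kernel parameters" from the TODO above. The diff cuts off at the kernel's opening line, so the following is only a guess at how a column-remap kernel might use these constants; the signature and parameter names (x, x_new, x_map, height, width) are assumptions for illustration, not the repository's actual code.

// Hypothetical sketch, not the actual kernel from column_remap.cu: each block
// covers a SHUF_BLOCKSIZE_X x SHUF_BLOCKSIZE_Y tile of the output, and every
// thread copies one column slice through the index map x_map.
#include <cuda_fp16.h>
#include <cstdint>

const int SHUF_BLOCKSIZE_X = 1024;
const int SHUF_BLOCKSIZE_Y = 16;

__global__ void column_remap_kernel_sketch
(
    const half* __restrict__ x,          // source matrix, row-major [height, width]
    half* __restrict__ x_new,            // destination, same shape
    const uint32_t* __restrict__ x_map,  // x_new[r, c] = x[r, x_map[c]]
    const int height,
    const int width
)
{
    int column = blockIdx.x * SHUF_BLOCKSIZE_X + threadIdx.x;
    if (column >= width) return;

    int row_start = blockIdx.y * SHUF_BLOCKSIZE_Y;
    int row_end = min(row_start + SHUF_BLOCKSIZE_Y, height);
    int src_column = x_map[column];

    for (int row = row_start; row < row_end; row++)
    {
        x_new[row * width + column] = x[row * width + src_column];
    }
}

// Assumed launch configuration: one thread per output column within a block,
// SHUF_BLOCKSIZE_Y rows handled per block in y.
//   dim3 threads(SHUF_BLOCKSIZE_X, 1);
//   dim3 blocks((width + SHUF_BLOCKSIZE_X - 1) / SHUF_BLOCKSIZE_X,
//               (height + SHUF_BLOCKSIZE_Y - 1) / SHUF_BLOCKSIZE_Y);
//   column_remap_kernel_sketch<<<blocks, threads>>>(x, x_new, x_map, height, width);

Under that reading, raising SHUF_BLOCKSIZE_X from 256 to 1024 packs more columns into each thread block (1024 is the per-block thread limit on current NVIDIA GPUs), so wide matrices need a quarter as many blocks along x; whether that is what the change is actually tuning for isn't stated in the commit.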
