
- [ ] Support for ROCm/AMD GPUs
- [ ] Test that CUDA code works on GTX 10-series and RTX 20-series at some point
- [x] Test performance on P40 (would be a good GPU to support)
- [ ] Improve performance on P40
- [x] Tunable kernel parameters
- [ ] More tunable kernel parameters
- [x] Test on Windows
- [ ] Easier extension loading on Windows (see the sketch after this list)
- [ ] Setup instructions for Windows

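The Windows items above come down to how the C++/CUDA extension gets built and loaded. As a rough, hypothetical sketch only (the module name, source files, and the `BLOCK_SIZE` define are made up, and this is not the project's actual build code), JIT loading through PyTorch's built-in extension loader looks like the following; passing `-D` defines this way is also one simple route to tunable kernel parameters:

```python
# Hypothetical sketch: JIT-compile and load a CUDA extension with PyTorch's
# built-in loader. File names and the BLOCK_SIZE define are invented; this
# only illustrates the mechanism the Windows items above refer to.
from torch.utils.cpp_extension import load

ext = load(
    name="my_cuda_ext",                     # arbitrary module name
    sources=["ext_bindings.cpp",            # C++/pybind11 bindings (hypothetical file)
             "q4_matmul.cu"],               # CUDA kernels (hypothetical file)
    extra_cuda_cflags=["-DBLOCK_SIZE=128"], # compile-time tunable kernel parameter
    verbose=True,                           # print the ninja/nvcc build log
)

# On Windows this JIT step needs the MSVC toolchain (cl.exe) and a matching
# CUDA toolkit on PATH, which is where most of the friction comes from;
# shipping a prebuilt binary or a setup script would make loading easier.
```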

## Testing

...

- [x] ~~Fix layer streaming so it isn't unusably slow~~ (removed)
- [x] ~~Allow layer streaming to integrate with other features like device splitting~~ Nope
- [x] ~~Provide alternative backend to allow layers on CPU~~ Nah

## Speed optimization

- [x] Support for de-quantizing select matrices at load time
- [x] ~~Better vector-matrix multiplication for de-quantized matrices~~ (dequantization was a dead end)
- [ ] Fused QKV projection
- [x] Fused MLP
- [x] Fused RoPE
- [x] ~~Build attention mask in CUDA rather than PyTorch~~
- [x] ~~Disable attention mask when it isn't needed~~ (not possible with SDP)
- [x] Figure out why inference appears to be CPU-bound (kernel launch overhead)
- [ ] Reduce the number of kernel launches to a minimum (tail launches, fusion, etc.)
- [x] Measure PyTorch module overhead (negligible in eval mode)
- [x] Examine whether scaled_dot_product_attention is actually the best attention method for single tokens (it's not; see the sketch after this list)
- [ ] Implement attention in CUDA
- [x] Rewrite at least the quantized matmul kernel. There are a bunch of special cases to consider

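Two of the items above are really measurement questions: why token-by-token inference looks CPU-bound, and whether scaled_dot_product_attention is the right choice for the single-token decode step. The sketch below is a self-contained micro-benchmark in that spirit, not the project's code; the batch size, head count, head dimension, and cache length are made up:

```python
# Rough sketch: compare scaled_dot_product_attention against a naive
# matmul/softmax attention for the single-token decode case (one query token
# attending to a long cache). Shapes are invented for illustration.
import torch
import torch.nn.functional as F

def naive_single_token_attention(q, k, v):
    # q: (batch, heads, 1, head_dim), k/v: (batch, heads, past_len, head_dim)
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def bench(fn, *args, iters=100):
    # Average milliseconds per call, measured with CUDA events. At these tiny
    # per-token workloads the result can be dominated by per-kernel launch
    # overhead rather than math, which is the "CPU-bound" effect noted above.
    for _ in range(10):                      # warm-up
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

if __name__ == "__main__":
    assert torch.cuda.is_available(), "benchmark requires a CUDA device"
    q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.float16)     # one new token
    k = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)  # cached keys
    v = torch.randn(1, 32, 2048, 128, device="cuda", dtype=torch.float16)  # cached values
    print("sdpa  ms/token:", bench(F.scaled_dot_product_attention, q, k, v))
    print("naive ms/token:", bench(naive_single_token_attention, q, k, v))
```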

## Generation
