Looking over ISQ (based on your previous ask), I found a few missing pieces that I've learned, through trial and error, are helpful.
imatrix: If you look at the discussion here, you can see that calculating the importance matrix prior to quantization can offset some of the negative effects of quantization. In particular, this comment gives a great walkthrough of which tools to use to calculate the imatrix and how to use it when quantizing.
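Roughly, as I understand it from that discussion, the imatrix is built by running a calibration set through the model and accumulating the mean squared activations feeding each weight column; those become the per-column importance weights used later during quantization. A minimal Rust sketch of the accumulation step (made-up names, not llama.cpp's or mistral.rs's actual code):

```rust
/// Hypothetical sketch: accumulate per-column importance for one linear layer
/// by summing squared activations seen on a calibration set. This mirrors my
/// understanding of what llama.cpp's imatrix tooling computes; all names here
/// are illustrative.
struct ImportanceAccumulator {
    /// One entry per input column of the layer's weight matrix.
    sum_sq: Vec<f32>,
    samples: usize,
}

impl ImportanceAccumulator {
    fn new(in_features: usize) -> Self {
        Self { sum_sq: vec![0.0; in_features], samples: 0 }
    }

    /// Call once per calibration token with the activations feeding this layer.
    fn observe(&mut self, activations: &[f32]) {
        assert_eq!(activations.len(), self.sum_sq.len());
        for (acc, &a) in self.sum_sq.iter_mut().zip(activations) {
            *acc += a * a;
        }
        self.samples += 1;
    }

    /// Mean squared activation per column, i.e. the importance weight
    /// consulted when quantizing that column.
    fn importance(&self) -> Vec<f32> {
        let n = self.samples.max(1) as f32;
        self.sum_sq.iter().map(|s| s / n).collect()
    }
}
```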
Also, one of the key benefits of Ollama (a Go wrapper around llama.cpp) is in llm/memory.go. In the function EstimateGPULayers, it calculates, based on available VRAM (or system RAM for Metal), how many layers can be offloaded to the GPU. This number is then passed to the --n_gpu_layers option of llama.cpp.
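The core of that estimate is simple: subtract a fixed overhead (KV cache, output layer, scratch buffers) from free VRAM and divide by the per-layer footprint. A rough Rust sketch of the idea, not Ollama's actual Go implementation, with illustrative numbers:

```rust
/// Rough sketch (not Ollama's actual logic) of estimating how many transformer
/// layers fit in the available VRAM. All sizes are in bytes; `fixed_overhead`
/// stands in for the KV cache, output layer, and scratch buffers.
fn estimate_gpu_layers(
    free_vram: u64,
    layer_size: u64,
    fixed_overhead: u64,
    total_layers: usize,
) -> usize {
    if free_vram <= fixed_overhead || layer_size == 0 {
        return 0;
    }
    let budget = free_vram - fixed_overhead;
    let fit = (budget / layer_size) as usize;
    fit.min(total_layers)
}

fn main() {
    // Example: 8 GiB free, ~200 MiB per quantized layer, 1 GiB reserved, 32 layers.
    let n = estimate_gpu_layers(8 << 30, 200 << 20, 1 << 30, 32);
    println!("offload {n} layers (pass as --n_gpu_layers)");
}
```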
What are the chances of incorporating these ideas into ISQ? It would be great to go from safetensors / bf16 on disk to automagically optimal memory loading for inference. :-)
Oops. I should add that the imatrix calculation requires a calibration data file. The outcome of the previously referenced discussion seems to have settled on the file linked in this comment.
This sounds like a great feature which I would love to add. I have begun work on tracking memory usage in #392. I will look into applying the imatrix quants; from what I understand, it is a different quantization standard?
My understanding (IANAMLE) is that it calculates a matrix of weights that adjusts the quantized values of "important" tensors. This leads to a quantized model that more closely mimics the original.
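As a concrete illustration of that idea only (a made-up function, not llama.cpp's actual quantizer), the importance weights can be used to bias the choice of quantization scale toward preserving the weights that matter most, by minimizing an importance-weighted round-trip error instead of the plain error:

```rust
/// Hedged sketch: pick the scale for a block of weights that minimizes the
/// importance-weighted quantize/dequantize error. Illustrative only; the
/// real imatrix-aware quantizers are more sophisticated.
fn choose_scale(block: &[f32], importance: &[f32]) -> f32 {
    let max_abs = block.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        return 1.0;
    }
    let base = max_abs / 7.0; // naive scale for a 4-bit signed grid [-7, 7]
    let mut best = (base, f32::INFINITY);
    for i in 0..=20 {
        // Try scales from 0.8x to 1.2x of the naive max-abs choice.
        let scale = base * (0.8 + 0.02 * i as f32);
        let err: f32 = block
            .iter()
            .zip(importance)
            .map(|(&w, &imp)| {
                let q = (w / scale).round().clamp(-7.0, 7.0); // quantize
                let d = q * scale;                            // dequantize
                imp * (w - d) * (w - d)                       // weighted error
            })
            .sum();
        if err < best.1 {
            best = (scale, err);
        }
    }
    best.0
}
```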