Replies: 4 comments 18 replies
-
Can you provide a TL;DR of how this works?
-
@ggerganov
Overall, SparseGPT induces only a minor increase in perplexity post-pruning (less than or equal to 1.5), indicating that it can achieve significant weight reduction without sacrificing much accuracy.
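For reference, perplexity is just the exponential of the average negative log-likelihood per token, so the pre/post-pruning comparison boils down to a delta like in this toy sketch (the log-probabilities below are made-up placeholders, not numbers from the paper):

```python
import math

def perplexity(token_logprobs):
    """Perplexity is exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical scores for the same text from a dense model and a 50%-sparse model.
dense_ppl  = perplexity([-2.1, -0.4, -1.7, -0.9])
sparse_ppl = perplexity([-2.3, -0.5, -1.8, -1.0])
print(f"dense: {dense_ppl:.2f}  sparse: {sparse_ppl:.2f}  delta: {sparse_ppl - dense_ppl:.2f}")
```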
-
SparseGPT performs unstructured zeroing, so sparse kernels or a model-compilation step are needed to actually realize the gains. Structured pruning is available at https://github.com/horseee/LLaMA-Pruning, whose results llama.cpp could likely use more directly. I'm not sure whether that implementation or the integration in https://github.com/VainF/Torch-Pruning is better, or if they are identical.
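To make the distinction concrete, here is a minimal PyTorch sketch contrasting the two approaches; it illustrates the general idea only and is not code from SparseGPT or LLaMA-Pruning:

```python
import torch

torch.manual_seed(0)
w = torch.randn(8, 8)  # stand-in for one linear layer's weight matrix

# Unstructured zeroing (what SparseGPT produces): zero the 50% of weights with
# the smallest magnitude. The tensor keeps its shape, so a dense matmul does the
# same amount of work unless the runtime has sparse kernels or a compilation
# step that exploits the zeros.
threshold = w.abs().flatten().kthvalue(w.numel() // 2).values
unstructured = torch.where(w.abs() > threshold, w, torch.zeros_like(w))
print("unstructured sparsity:", (unstructured == 0).float().mean().item())

# Structured pruning: drop whole rows (output neurons), so the tensor really
# shrinks and any dense runtime, llama.cpp included, benefits directly.
keep = w.norm(dim=1).argsort(descending=True)[: w.shape[0] // 2]
structured = w[keep]
print("structured shape:", tuple(structured.shape))
```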
-
Hello,
https://github.com/AlpinDale/sparsegpt-for-LLaMA
https://arxiv.org/abs/2301.00774
"We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models."
Looks like someone has implemented SparseGPT for the LLaMA model. If I understand correctly, that means we can cut the size of the LLaMA models roughly in half without significant loss of precision.
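For intuition about what "one-shot, without any retraining" means in practice, here is a drastically simplified sketch that just does layer-wise magnitude pruning in a single pass; the real SparseGPT additionally uses Hessian-based weight reconstruction to preserve accuracy at 50% sparsity, and the toy model below is a placeholder, not an actual LLaMA checkpoint:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def one_shot_prune(model: nn.Module, sparsity: float = 0.5):
    """Zero the smallest-magnitude weights of every Linear layer in a single pass,
    with no gradient updates or retraining afterwards."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).to(w.dtype))

# Toy usage on a stand-in model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
one_shot_prune(model, 0.5)
zeros = sum((m.weight == 0).sum().item() for m in model.modules() if isinstance(m, nn.Linear))
total = sum(m.weight.numel() for m in model.modules() if isinstance(m, nn.Linear))
print(f"overall weight sparsity: {zeros / total:.2f}")
```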
I'd like to know what you think about it, and whether you're planning to compare its perplexity against a "normal" sized LLaMA model.
PS: In less than a month, 65B LLaMA will run on the Super Nintendo 😄