llama : add llm_build helper functions #3848
Conversation
Have you considered a more object-oriented approach? Basically create classes such as this, and then compose the models from objects:

struct llm_norm {
    float eps;
    tensor weight;
    tensor bias;

    tensor forward(x) {
        return norm(x, eps) * weight + bias;
    }
};

To be clear, I know this is nothing new, it's essentially the same as is done in PyTorch and other frameworks. It would be a larger refactor, but it may result in a more modular and easier-to-extend design.
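For illustration, a minimal ggml-flavored sketch of such a module could look like the following. This is not code from this PR; it assumes the ggml_norm(ctx, x, eps), ggml_mul and ggml_add calls as declared in ggml.h, and the struct name is made up:

#include "ggml.h"

// Sketch only: the proposed "module" style expressed with ggml calls.
struct llm_norm_module {
    float eps;
    struct ggml_tensor * weight; // learned scale
    struct ggml_tensor * bias;   // learned shift (may be NULL)

    struct ggml_tensor * forward(struct ggml_context * ctx, struct ggml_tensor * x) const {
        struct ggml_tensor * cur = ggml_norm(ctx, x, eps); // normalize along rows (mean 0, var 1)
        cur = ggml_mul(ctx, cur, weight);                  // apply scale
        if (bias) {
            cur = ggml_add(ctx, cur, bias);                // apply shift
        }
        return cur;
    }
};

The model would hold one such object per layer, and the graph build would reduce to chained forward() calls.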
Yeah, I can see the benefit. We would initialize the modules on model load and then the graph build would be a series of very short forward() calls. I think we can do this later: with these helper functions ready, they can simply be called from such modules.
And this is even more PyTorch-like. They have a functional API that does the actual work, which can then be called statically or through class members etc., making it even more composable.
I tested running Orca 3B - no problem, exact same results. CausalLM 14B - no problem, exact same results. Mistral Zephyr-beta and dolphin-2.1 70B both produced an allocation error; in both cases the difference was small, both were off by 2,016.
@KerfuffleV2 Great! Thank you for testing - I'll fix this after lunch.
@KerfuffleV2 Is this the repo for the first model? https://huggingface.co/HuggingFaceH4/zephyr-7b-beta Also, which command are you using?
https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/ - I tested with Q5_K_M, but I tried a different Mistral model that was Q2_K quantized and got the same error.
The 70B is https://huggingface.co/TheBloke/Dolphin-2.1-70B-GGUF/.
Command:
Compiled with:
System is x86 Linux.
Another opportunity for refactoring could be the input part, which seems to be identical in every model:

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);
        embd = ggml_get_rows(ctx0, model.tok_embeddings, inp_tokens);
    } else {
    #ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
    #endif
        embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);
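A sketch of what such an input helper could look like, assembled only from the snippet above (the function name, parameter list and callback type are illustrative; the PR's actual llm_build_inp_embd() listed below may differ):

#include <functional>
#include "ggml.h"
#include "llama.h" // for llama_batch

// Illustrative input-embedding helper built from the snippet above.
// The callback mirrors the cb(tensor, name, layer) calls used in the graph builders.
// (The GGML_USE_MPI special case from the original snippet is omitted here.)
using build_cb = std::function<void(struct ggml_tensor *, const char *, int)>;

static struct ggml_tensor * build_inp_embd_sketch(
        struct ggml_context * ctx0,
        const llama_batch   & batch,
        struct ggml_tensor  * tok_embeddings,
        int64_t               n_embd,
        int64_t               n_tokens,
        const build_cb      & cb) {
    struct ggml_tensor * embd;
    if (batch.token) {
        // token ids provided: create the token input and look up the embedding rows
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);
        embd = ggml_get_rows(ctx0, tok_embeddings, inp_tokens);
    } else {
        // raw embeddings provided: allocate the input tensor directly
        embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);
    return embd;
}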
I can confirm the allocation issue with Mistral and LLaMA2 70B is now fixed. Rechecked the perplexity results (just the first few blocks) - still identical to master.
So tested with perplexity: Mistral, LLaMAv2 70B, CausalLM 14B, Orca 3B.
I haven't tested anything more exotic like Persimmon, or done TG tests. (I assume if perplexity works those won't be broken.)
If I get a chance I'll check some more stuff.
This is getting very close to where you could just load the model definitions from a JSON file or something. Actually, it's probably possible with the current changes. I was looking at the LLaMA definition and messed around a little for my own amusement:
That's not JSON, but it's easy to see how it could be written that way.
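Purely as an illustration of the idea (this is not the snippet from the comment above; all names here are hypothetical), a table-driven description of a LLaMA-style block in C++ could look like this, and it maps almost one-to-one onto a JSON array:

// Hypothetical, illustration only: describing a transformer block as data
// instead of hand-written graph-building code.
enum class llm_op { norm, attn, add_residual, ffn };

struct llm_step {
    llm_op       op;     // which builder helper to invoke
    const char * name;   // tensor name to pass to the callback
};

static const llm_step llama_block_def[] = {
    { llm_op::norm,         "attn_norm" },
    { llm_op::attn,         "attn_out"  },
    { llm_op::add_residual, "attn_res"  },
    { llm_op::norm,         "ffn_norm"  },
    { llm_op::ffn,          "ffn_out"   },
    { llm_op::add_residual, "ffn_res"   },
};

A graph builder could then loop over such a table and dispatch to the corresponding llm_build_* helper, which is essentially what loading the definition from JSON would amount to.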
* llama : factor out ggml-alloc from graph graph build functions ggml-ci
* metal : disable kernel load log
* llama : factor out tensor offloading outside the build call (wip) ggml-ci
* llama : offload rest of the models ggml-ci
* llama : update offload log messages to print node index
* llama : comments
* llama : support offloading result_norm + comments
* llama : factor graph input into a function
* llama : do tensor offload only with CUDA
* llama : fix res_norm offloading
* llama : try to optimize offloading code
* llama : fix non-CUDA build
* llama : try to fix build
* llama : move refact in correct place + optimize graph input
* llama : refactor tensor offloading as callback
* llama : add layer index to all tensor names
* llama : add functional header
* llama : comment ggml-ci
* llama : remove obsolete map for layer counting
* llama : add llm_build helper functions (#3848)
* llama : add llm_build_norm helper function ggml-ci
* llama : add llm_build_ffn helper function (#3849) ggml-ci
* llama : add llm_build_k_shift helper ggml-ci
* llama : fix offloading after recent changes
* llama : add llm_build_kv_store helper ggml-ci
* llama : remove obsolete offload names
* llama : fix llm_build_k_shift to use n_head_kv instead of n_head
* llama : simplify falcon Q, K, V computation
* llama : remove obsolete comments in build graphs
* llama : add llm_build_kqv helper ggml-ci
* llama : minor
* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading
* llama : fix input allocation logic
* llama : update offload functions for KQ tensors
* llama : normalize tensor names ggml-ci
* llama : enable warning about not offloaded tensors
* llama : remove extra ; + deduplicate gate_b logic
* llama : add llm_build_inp_embd helper
Code reuse via functions
Factored some common code into separate functions (a small illustrative sketch of the shared pattern follows the list):
llm_build_inp_embd()
llm_build_norm()
llm_build_ffn()
llm_build_k_shift()
llm_build_kv_store()
llm_build_kqv()
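As a self-contained illustration of the shared pattern (hypothetical mini types, not llama.cpp's actual signatures): each helper builds one piece of the graph and reports the resulting tensor through a callback together with the layer index, so tensor naming and offloading policy stay in one place:

#include <cstdio>
#include <functional>

struct fake_tensor { const char * name = ""; };  // stand-in for ggml_tensor

// callback shape mirroring cb(tensor, name, layer_index) in the graph builders
using build_cb = std::function<void(fake_tensor &, const char *, int)>;

// a "helper" in the same spirit as llm_build_norm()/llm_build_ffn():
// build the ops, then hand the result to the callback
static fake_tensor build_norm_sketch(fake_tensor x, const build_cb & cb, int il) {
    fake_tensor out = x;          // ...the actual norm ops would be built here...
    cb(out, "attn_norm", il);     // caller assigns the name / decides offloading
    return out;
}

int main() {
    build_cb cb = [](fake_tensor & t, const char * name, int il) {
        t.name = name;
        std::printf("built %s for layer %d\n", name, il);
    };
    fake_tensor x;
    build_norm_sketch(x, cb, 0);
    return 0;
}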
Tensor offloading improvements
All this stuff is temporary as we will soon integrate a new backend implementation that should automatically take care of tensor offloading, so these changes can be seen as a preparation step for migrating to the new interface.
Added sanity checks for offloading. Look for the following messages in the output:
llama_new_context_with_model: kv self size = 2.75 MB
llama_build_graph: non-view tensors processed: 510/510   <--- this is good
llama_new_context_with_model: compute buffer total size = 22.75 MB
If the count of processed non-view tensors is lower than the total, some of the tensors in the graph have not been processed with the callback function. This can lead to inefficiencies during inference and can be debugged by enabling LLAMA_OFFLOAD_DEBUG (llama.cpp, lines 5075 to 5077 at fc5a26a).
Some observations (compared to master):