
llama : add llm_build helper functions #3848

Merged: 18 commits merged from llama-refactor-norm into llama-refactor on Oct 31, 2023

Conversation

ggerganov (Owner) commented Oct 29, 2023

Code reuse via functions

Factored some common code into separate functions:

  • llm_build_inp_embd()
  • llm_build_norm()
  • llm_build_ffn()
  • llm_build_k_shift()
  • llm_build_kv_store()
  • llm_build_qkv()
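
For illustration, here is a rough, hypothetical sketch of the general shape such a helper can take (modeled loosely on the norm helper; the name, callback type, and parameter list are assumptions, not the actual signature):

// Hypothetical sketch only -- not the actual llama.cpp implementation.
// Assumes "ggml.h" and a callback of the form cb(tensor, name, layer_index).
static struct ggml_tensor * build_norm_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,      // input activations
        struct ggml_tensor  * norm_w,   // per-layer norm weight
        struct ggml_tensor  * norm_b,   // optional bias (may be NULL)
        float                 eps,
        void                (*cb)(struct ggml_tensor * t, const char * name, int il),
        int                   il) {
    cur = ggml_rms_norm(ctx, cur, eps);
    cb(cur, "norm", il);

    cur = ggml_mul(ctx, cur, norm_w);        // scale
    if (norm_b) {
        cur = ggml_add(ctx, cur, norm_b);    // shift
    }
    cb(cur, "norm_out", il);

    return cur;
}

Each model's build function can then call such helpers instead of repeating the same few ggml calls per layer.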

Tensor offloading improvements

All this stuff is temporary as we will soon integrate a new backend implementation that should automatically take care of tensor offloading, so these changes can be seen as a preparation step for migrating to the new interface.

Added sanity checks for offloading. Look for the following messages in the output:

llama_new_context_with_model: kv self size  =    2.75 MB
llama_build_graph: non-view tensors processed: 510/510                              <--- this is good
llama_new_context_with_model: compute buffer total size = 22.75 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 708/740                              <--- this is bad
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 76.66 MB

The latter indicates that some of the tensors in the graph have not been processed with the callback function. This can lead to inefficiencies during inference and can be debugged by enabling LLAMA_OFFLOAD_DEBUG:

llama.cpp, lines 5075 to 5077 (at fc5a26a):

//#define LLAMA_OFFLOAD_DEBUG

Some observations

  • Persimmon's Q, K, V implementation is very cumbersome. There is a high probability that something is not right there
  • ALiBi models might not work well with the K-shift (also on master):
# MPT-7B, small context of 256 to trigger shift more often, starts generating junk after first shift
make -j main && ./main -m ./models/mpt-7b/ggml-model-q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -c 256 -n 512 -t 8 -ngl 999 -s 1

ggerganov added the refactoring and need feedback labels Oct 29, 2023
slaren (Collaborator) commented Oct 29, 2023

Have you considered a more object-oriented approach? Basically, create classes such as this, and then compose the models from objects.

struct llm_norm {
    float eps;
    tensor weight;
    tensor bias;
    tensor forward(x) {
        return norm(x, eps) * weight + bias;
    }
};

To be clear, I know this is nothing new; it's essentially the same as what is done in PyTorch and other frameworks. It would be a larger refactor, but it may result in a more modular and easier-to-extend design.
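
A self-contained toy illustration of that composition pattern (hypothetical names, with a plain std::vector<float> standing in for a ggml tensor -- this is not llama.cpp code, just the shape of the idea):

#include <cmath>
#include <cstddef>
#include <vector>

using tensor = std::vector<float>;

// a module owns its weights and exposes forward(), as in PyTorch's nn.Module
struct llm_norm {
    float  eps;
    tensor weight;
    tensor bias;

    tensor forward(const tensor & x) const {
        // RMS-style normalization, then elementwise scale and shift
        float sum_sq = 0.0f;
        for (float v : x) sum_sq += v * v;
        const float inv_rms = 1.0f / std::sqrt(sum_sq / x.size() + eps);

        tensor y(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            y[i] = x[i] * inv_rms * weight[i] + bias[i];
        }
        return y;
    }
};

// a block (and ultimately a model) is then just a composition of modules
struct llm_block {
    llm_norm attn_norm;
    // attention and FFN modules would follow the same pattern

    tensor forward(const tensor & x) const {
        return attn_norm.forward(x);
    }
};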

ggerganov (Owner, Author) commented:

Yeah, I can see the benefit. We will initialize the modules on model load, and the graph build would then just be short forward calls.

I think we can do this later: with these helper functions ready, they can simply be called in forward(x) using the struct's members. So the effort from this refactoring will probably not be wasted and can serve as a first step towards the object-oriented approach.

ggerganov changed the title from "llama : add llm_build_norm helper function" to "llama : add llm_build helper functions" Oct 29, 2023
monatis (Collaborator) commented Oct 29, 2023

> having these helper functions ready, they can be simply called in forward(x) using the struct's members.

And this is even more PyTorch-like: they have a functional API that does the actual work, which can then be called statically or with class members, etc., making it even more composable.

ggerganov marked this pull request as ready for review October 29, 2023
KerfuffleV2 (Collaborator) commented:

I tested running perplexity on a few models:

Orca 3B - no problem, exact same results.

CausalLM 14B - no problem, exact same results.

Mistral Zephyr-beta:

pr: ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

dolphin-2.1 70B:

pr: ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 58718240)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

In both cases the difference was small; both were off by 2016 bytes.

ggerganov (Owner, Author) commented:

@KerfuffleV2 Great! Thank you for testing - I'll fix this after lunch.

ggerganov (Owner, Author) commented Oct 30, 2023

@KerfuffleV2 Is this the repo for the first model?

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

Also, which command are you using?

KerfuffleV2 (Collaborator) commented:

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/

I tested with Q5_K_M, but I tried a different Mistral model that was Q2_K quantized and got the same error:

ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

Also using -ngl 0 doesn't seem to help (this is built with HIPBLAS though).

The 70B is https://huggingface.co/TheBloke/Dolphin-2.1-70B-GGUF/

Command:

./perplexity.pr -f /raid/vantec/ai/wikitext-2-raw/wiki.test.raw -m /blah/openhermes-2-mistral-7b.q2_k.gguf -ngl 0 --log-disable

Compiled with:

make -j8 GPU_TARGETS=gfx1030 LLAMA_HIPBLAS=1 perplexity

System is x86 Linux.

ggerganov (Owner, Author) commented:

The ggml-alloc issue should be fixed with 2926ef6

slaren (Collaborator) commented Oct 31, 2023

Another opportunity for refactoring could be the input part, which seems to be identical in every model:

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);

        embd = ggml_get_rows(ctx0, model.tok_embeddings, inp_tokens);
    } else {
#ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
#endif

        embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);

I noticed that llm_build_persimmon is missing the MPI assert, so making a common function for this would reduce the chance of this kind of error.
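
A minimal sketch of what such a shared helper could look like, simply wrapping the code quoted above (the name, callback type, and signature here are hypothetical, not the final API):

// Hypothetical wrapper around the input code quoted above -- not the final helper.
static struct ggml_tensor * build_inp_embd_sketch(
        struct ggml_context * ctx,
        const llama_batch   & batch,
        struct ggml_tensor  * tok_embeddings,
        int64_t               n_embd,
        int64_t               n_tokens,
        void                (*cb)(struct ggml_tensor * t, const char * name, int il)) {
    struct ggml_tensor * embd;

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);

        embd = ggml_get_rows(ctx, tok_embeddings, inp_tokens);
    } else {
#ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
#endif
        embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);

    return embd;
}

With every build function routed through one such helper, the MPI assert (and any future input handling) only has to be written once.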

KerfuffleV2 (Collaborator) left a comment:


I can confirm the allocation issue with Mistral and LLaMA2 70B is now fixed. I rechecked the perplexity results (just the first few blocks) and they are still identical to master.

So, tested with perplexity: Mistral, LLaMA2 70B, CausalLM 14B, Orca 3B.

I haven't tested anything more exotic like Persimmon, and I haven't done TG tests. (I assume that if perplexity works, those won't be broken.)

If I get a chance I'll check some more stuff.

ggerganov (Owner, Author) commented:

> Another opportunity for refactoring could be the input part, which seems to be identical in every model:

Added llm_build_inp_embd() in 7923b70

KerfuffleV2 (Collaborator) commented:

This is getting very close to where you could just load the model definitions from a JSON file or something. Actually, it's probably possible with the current changes.

I was looking at the LLaMA definition and messed around a little for my own amusement:

pre_repeating {
  cond has_batch_token {
    new_tensor inp_tokens 1 i32 param:n_tokens
    ggml_op get_rows inp_tokens inpL model_tensor:tok_embeddings inp_tokens
  }
  not_cond has_batch_token {
    new_tensor inpL 2 f32 model_param:n_embed param:n_tokens 
  }
  new_tensor inp_pos 1 i32 param:n_tokens
  new_tensor KQ_scale 1 f32 1
  new_tensor KQ_mask 3 f32 param:n_kv param:n_tokens 1
  rope_shift_if_needed LLM_ROPE
}

repeating {
  bind_tensor inpL inpSA
  llm_op build_norm model_layers:attn_norm LLM_NORM_RMS
}

post_repeating {
}

That's not JSON, but it's easy to see how it could be written as such.

ggerganov merged commit 5baefef into llama-refactor Oct 31, 2023
Galunid mentioned this pull request Nov 1, 2023
ggerganov added a commit that referenced this pull request Nov 1, 2023
* llama : factor out ggml-alloc from graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (#3848)

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Labels
need feedback, refactoring
4 participants