
llama : add llm_build helper functions #3848

Merged: 18 commits merged from llama-refactor-norm into llama-refactor on Oct 31, 2023

Conversation

ggerganov (Owner) commented Oct 29, 2023

Code reuse via functions

Factored some common code into separate functions:

  • llm_build_inp_embd()
  • llm_build_norm()
  • llm_build_ffn()
  • llm_build_k_shift()
  • llm_build_kv_store()
  • llm_build_qkv()
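
For illustration, here is a rough, hypothetical sketch of the general shape such a helper can take (modeled loosely on the norm helper; the name, callback type, and parameter list are assumptions, not the actual signature):

// Hypothetical sketch only -- not the actual llama.cpp implementation.
// Assumes "ggml.h" and a callback of the form cb(tensor, name, layer_index).
static struct ggml_tensor * build_norm_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * cur,      // input activations
        struct ggml_tensor  * norm_w,   // per-layer norm weight
        struct ggml_tensor  * norm_b,   // optional bias (may be NULL)
        float                 eps,
        void                (*cb)(struct ggml_tensor * t, const char * name, int il),
        int                   il) {
    cur = ggml_rms_norm(ctx, cur, eps);
    cb(cur, "norm", il);

    cur = ggml_mul(ctx, cur, norm_w);        // scale
    if (norm_b) {
        cur = ggml_add(ctx, cur, norm_b);    // shift
    }
    cb(cur, "norm_out", il);

    return cur;
}

Each model's build function can then call such helpers instead of repeating the same few ggml calls per layer.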

Tensor offloading improvements

All this stuff is temporary as we will soon integrate a new backend implementation that should automatically take care of tensor offloading, so these changes can be seen as a preparation step for migrating to the new interface.

Added sanity checks for offloading. Look for the following messages in the output:

llama_new_context_with_model: kv self size  =    2.75 MB
llama_build_graph: non-view tensors processed: 510/510                              <--- this is good
llama_new_context_with_model: compute buffer total size = 22.75 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 708/740                              <--- this is bad
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 76.66 MB

The latter indicates that some of the tensors in the graph have not been processed with the callback function. This can lead to inefficiencies during inference and can be debugged by enabling LLAMA_OFFLOAD_DEBUG:

llama.cpp, lines 5075 to 5077 (at fc5a26a):

//#define LLAMA_OFFLOAD_DEBUG

Some observations

  • Persimmon's Q, K, V implementation is very cumbersome. There is a high probability that something is not right there
  • ALiBi models might not work well with the K-shift (also on master):
# MPT-7B, small context of 256 to trigger shift more often, starts generating junk after first shift
make -j main && ./main -m ./models/mpt-7b/ggml-model-q4_0.gguf -p "I believe the meaning of life is" --ignore-eos -c 256 -n 512 -t 8 -ngl 999 -s 1

ggerganov added the refactoring and need feedback labels Oct 29, 2023
slaren (Collaborator) commented Oct 29, 2023

Have you considered a more object-oriented approach? Basically, create classes such as this, and then compose the models from objects.

struct llm_norm {
    float eps;
    tensor weight;
    tensor bias;
    tensor forward(x) {
        return norm(x, eps) * weight + bias;
    }
};

To be clear, I know this is nothing new; it's essentially the same as what is done in PyTorch and other frameworks. It would be a larger refactor, but it may result in a more modular and easier-to-extend design.
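
A self-contained toy illustration of that composition pattern (hypothetical names, with a plain std::vector<float> standing in for a ggml tensor -- this is not llama.cpp code, just the shape of the idea):

#include <cmath>
#include <cstddef>
#include <vector>

using tensor = std::vector<float>;

// a module owns its weights and exposes forward(), as in PyTorch's nn.Module
struct llm_norm {
    float  eps;
    tensor weight;
    tensor bias;

    tensor forward(const tensor & x) const {
        // RMS-style normalization, then elementwise scale and shift
        float sum_sq = 0.0f;
        for (float v : x) sum_sq += v * v;
        const float inv_rms = 1.0f / std::sqrt(sum_sq / x.size() + eps);

        tensor y(x.size());
        for (size_t i = 0; i < x.size(); ++i) {
            y[i] = x[i] * inv_rms * weight[i] + bias[i];
        }
        return y;
    }
};

// a block (and ultimately a model) is then just a composition of modules
struct llm_block {
    llm_norm attn_norm;
    // attention and FFN modules would follow the same pattern

    tensor forward(const tensor & x) const {
        return attn_norm.forward(x);
    }
};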

ggerganov (Owner, Author) commented:

Yeah, I can see the benefit. We will initialize the modules on model load, and the graph build would then just be short forward calls.

I think we can do this later: with these helper functions ready, they can simply be called in forward(x) using the struct's members. So the effort from this refactoring will probably not be wasted and can serve as a first step towards the object-oriented approach.

ggerganov changed the title from "llama : add llm_build_norm helper function" to "llama : add llm_build helper functions" Oct 29, 2023
monatis (Collaborator) commented Oct 29, 2023

> having these helper functions ready, they can be simply called in forward(x) using the struct's members.

And this is even more PyTorch-like: they have a functional API that does the actual work, which can then be called statically or with class members, etc., making it even more composable.

ggerganov marked this pull request as ready for review October 29, 2023
KerfuffleV2 (Collaborator) commented:

I tested running perplexity on a few models:

Orca 3B - no problem, exact same results.

CausalLM 14B - no problem, exact same results.

Mistral Zephyr-beta:

pr: ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

dolphin-2.1 70B:

pr: ggml_allocr_alloc: not enough space in the buffer (needed 58720256, largest block available 58718240)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

In both cases the difference was small; both were off by 2016 bytes.

ggerganov (Owner, Author) commented:

@KerfuffleV2 Great! Thank you for testing - I'll fix this after lunch.

ggerganov (Owner, Author) commented Oct 30, 2023

@KerfuffleV2 Is this the repo for the first model?

https://huggingface.co/HuggingFaceH4/zephyr-7b-beta

Also, which command are you using?

KerfuffleV2 (Collaborator) commented:

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/

I tested with Q5_K_M, but I tried a different Mistral model that was Q2_K quantized and got the same error:

ggml_allocr_alloc: not enough space in the buffer (needed 29360128, largest block available 29358112)
GGML_ASSERT: ggml-alloc.c:148: !"not enough space in the buffer"

Also using -ngl 0 doesn't seem to help (this is built with HIPBLAS though).

The 70B is https://huggingface.co/TheBloke/Dolphin-2.1-70B-GGUF/

Command:

./perplexity.pr -f /raid/vantec/ai/wikitext-2-raw/wiki.test.raw -m /blah/openhermes-2-mistral-7b.q2_k.gguf -ngl 0 --log-disable

Compiled with:

make -j8 GPU_TARGETS=gfx1030 LLAMA_HIPBLAS=1 perplexity

System is x86 Linux.

ggerganov (Owner, Author) commented:

The ggml-alloc issue should be fixed with 2926ef6

slaren (Collaborator) commented Oct 31, 2023

Another opportunity for refactoring could be the input part, which seems to be identical in every model:

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);

        embd = ggml_get_rows(ctx0, model.tok_embeddings, inp_tokens);
    } else {
#ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
#endif

        embd = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);

I noticed that llm_build_persimmon is missing the MPI assert, so making a common function for this would reduce the chance of this kind of error.
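
A minimal sketch of what such a shared helper could look like, simply wrapping the code quoted above (the name, callback type, and signature here are hypothetical, not the final API):

// Hypothetical wrapper around the input code quoted above -- not the final helper.
static struct ggml_tensor * build_inp_embd_sketch(
        struct ggml_context * ctx,
        const llama_batch   & batch,
        struct ggml_tensor  * tok_embeddings,
        int64_t               n_embd,
        int64_t               n_tokens,
        void                (*cb)(struct ggml_tensor * t, const char * name, int il)) {
    struct ggml_tensor * embd;

    if (batch.token) {
        struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_tokens);
        cb(inp_tokens, "inp_tokens", -1);

        embd = ggml_get_rows(ctx, tok_embeddings, inp_tokens);
    } else {
#ifdef GGML_USE_MPI
        GGML_ASSERT(false && "not implemented");
#endif
        embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens);
    }
    cb(embd, "inp_embd", -1);

    return embd;
}

With every build function routed through one such helper, the MPI assert (and any future input handling) only has to be written once.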

KerfuffleV2 (Collaborator) left a comment:


I can confirm the allocation issue with Mistral and LLaMA2 70B is now fixed. I rechecked the perplexity results (just the first few blocks) and they are still identical to master.

So, tested with perplexity: Mistral, LLaMA2 70B, CausalLM 14B, Orca 3B.

I haven't tested anything more exotic like Persimmon, and I haven't done TG tests. (I assume that if perplexity works, those won't be broken.)

If I get a chance I'll check some more stuff.

ggerganov (Owner, Author) commented:

> Another opportunity for refactoring could be the input part, which seems to be identical in every model:

Added llm_build_inp_embd() in 7923b70

KerfuffleV2 (Collaborator) commented:

This is getting very close to where you could just load the model definitions from a JSON file or something. Actually, it's probably possible with the current changes.

I was looking at the LLaMA definition and messed around a little for my own amusement:

pre_repeating {
  cond has_batch_token {
    new_tensor inp_tokens 1 i32 param:n_tokens
    ggml_op get_rows inp_tokens inpL model_tensor:tok_embeddings inp_tokens
  }
  not_cond has_batch_token {
    new_tensor inpL 2 f32 model_param:n_embed param:n_tokens 
  }
  new_tensor inp_pos 1 i32 param:n_tokens
  new_tensor KQ_scale 1 f32 1
  new_tensor KQ_mask 3 f32 param:n_kv param:n_tokens 1
  rope_shift_if_needed LLM_ROPE
}

repeating {
  bind_tensor inpL inpSA
  llm_op build_norm model_layers:attn_norm LLM_NORM_RMS
}

post_repeating {
}

That's not JSON, but it's easy to see how it could be written as such.

ggerganov merged commit 5baefef into llama-refactor Oct 31, 2023
Galunid mentioned this pull request Nov 1, 2023
ggerganov added a commit that referenced this pull request Nov 1, 2023
* llama : factor out ggml-alloc from graph build functions

ggml-ci

* metal : disable kernel load log

* llama : factor out tensor offloading outside the build call (wip)

ggml-ci

* llama : offload rest of the models

ggml-ci

* llama : update offload log messages to print node index

* llama : comments

* llama : support offloading result_norm + comments

* llama : factor graph input into a function

* llama : do tensor offload only with CUDA

* llama : fix res_norm offloading

* llama : try to optimize offloading code

* llama : fix non-CUDA build

* llama : try to fix build

* llama : move refact in correct place + optimize graph input

* llama : refactor tensor offloading as callback

* llama : add layer index to all tensor names

* llama : add functional header

* llama : comment

ggml-ci

* llama : remove obsolete map for layer counting

* llama : add llm_build helper functions (#3848)

* llama : add llm_build_norm helper function

ggml-ci

* llama : add llm_build_ffn helper function (#3849)

ggml-ci

* llama : add llm_build_k_shift helper

ggml-ci

* llama : fix offloading after recent changes

* llama : add llm_build_kv_store helper

ggml-ci

* llama : remove obsolete offload names

* llama : fix llm_build_k_shift to use n_head_kv instead of n_head

* llama : simplify falcon Q, K, V computation

* llama : remove obsolete comments in build graphs

* llama : add llm_build_kqv helper

ggml-ci

* llama : minor

* llama : add LLAMA_OFFLOAD_DEBUG + fix starcoder offloading

* llama : fix input allocation logic

* llama : update offload functions for KQ tensors

* llama : normalize tensor names

ggml-ci

* llama : enable warning about not offloaded tensors

* llama : remove extra ; + deduplicate gate_b logic

* llama : add llm_build_inp_embd helper
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Labels
need feedback, refactoring
4 participants