How to properly initiate 128K context model? #2963
-
None. Just the model. The point of GGUF is that it already knows what its native context length is.
-
This is next level. Can someone please explain how these larger-context models are generated? What is required to make a 128k-context version of a 70B-parameter model?
-
A very, very crude simplification: a model, as a neural network, is built from "cells" that point at other cells, take the values from those cells, do something with them (say, multiply or divide by a fixed number specific to that particular cell), and let other cells use the result. Some cells are designated as inputs and do not point at anything; you (the code that runs the model) put your text, encoded into numbers, word by word, into each input cell. The code executes the model by going cell by cell, copying the values from the cells each one points at, computing the multiplication/division that cell wants to do, and so on. The cells are chained from the input cells all the way to the output cells. Once the program has gone through all of the cells and there is a new value in the output cell, that is a word the model has produced. That word is then copied back into one of the remaining free input cells, and the whole thing repeats until you stop it, or until the model puts a special value in the output cell, meaning it has run out of ideas. A model file consists of the definitions of all the cells: what they point at, and what math they do with the values they pull from the cells they point at.

Context size is, more or less, the number of input cells in a model. If you just add more input cells, the other cells simply do not point at the newly added ones and ignore them. You have to do some sort of training on the model with those added input cells: run it many, many times and let it slightly modify the cell pointers and the math operations they perform. To do that, you typically need training data with long enough input and output examples, because otherwise the new input cells are always empty during training, the model just "learns" that they are useless, and it never starts using them.

There are other methods of making the model consider the newly added input cells, and those involve hacky and clever manipulations of how the model runs. Naively, you could let the model randomly point at N input cells, different ones on each "run" (a run, in this context, being the process of generating a single word of output). Or you could let the model run a couple of times on parts of the input, extract from those runs how interested the model is in particular words, and "compress" the input so it fits in the predefined input cells. All sorts of crazy things are being invented pretty much every week to solve this problem. Please note that this is a giant simplification.
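To make the "clever manipulation of how the model runs" idea concrete, here is a minimal sketch of linear RoPE position interpolation, one of the simplest such tricks: positions are squeezed by a scale factor so that a longer input still lands inside the position range the model saw during training. The dimensions and scale factor below are illustrative assumptions, not values from any particular model.

import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to each (position, dimension pair).

    With scale > 1 (linear interpolation), position p is treated as p / scale,
    so a sequence `scale` times longer stays inside the angle range the model
    was trained on."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # per-pair frequencies
    return np.outer(positions / scale, inv_freq)                # (seq_len, head_dim // 2)

# Illustrative numbers only: a model trained at 4k context stretched to 16k.
orig_ctx, new_ctx, head_dim = 4096, 16384, 128
scale = new_ctx / orig_ctx  # 4.0

plain  = rope_angles(np.arange(new_ctx), head_dim)               # angles run far past the trained range
interp = rope_angles(np.arange(new_ctx), head_dim, scale=scale)  # squeezed back into it
print(plain[-1, 0], interp[-1, 0])  # last position, lowest dimension pair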
-
The RAM required for a 13B model at 128k context must be intense.
-
I just tested it .... more than 100 GB of RAM or VRAM is required.
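That lines up with a back-of-the-envelope estimate of the KV cache alone. A minimal sketch, assuming Llama-2-13B dimensions (40 layers, hidden size 5120) and an fp16 cache; quantized KV caches would shrink this:

# Rough KV-cache size estimate (model weights not included).
n_layers   = 40        # Llama-2-13B
hidden     = 5120      # n_heads * head_dim = 40 * 128
bytes_fp16 = 2
ctx        = 131072    # 128k tokens

per_token = 2 * n_layers * hidden * bytes_fp16   # one K and one V vector per layer
total_gib = per_token * ctx / 2**30
print(f"{per_token} bytes/token, ~{total_gib:.0f} GiB at full 128k context")
# -> 819200 bytes/token, ~100 GiB, before the weights themselves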
-
Has anyone gotten good results with the Yarn model when it's GGUF-quantized? With Q4_K_M, the input is an excerpt from a novel, and as these aren't chat finetunes I just let the model continue writing the story.
-
The model's rope scaling config is:

"rope_scaling": {
    "factor": 32.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn",
    "finetuned": true
}

Looks like yarn is similar to ntk (v2?):

def _rope_scaling_validation(self):
    """
    Validate the `rope_scaling` configuration.
    """
    if self.rope_scaling is None:
        return

    if not isinstance(self.rope_scaling, dict):
        raise ValueError(
            "`rope_scaling` must be a dictionary, "
            f"got {self.rope_scaling}"
        )
    rope_scaling_type = self.rope_scaling.get("type", None)
    rope_scaling_factor = self.rope_scaling.get("factor", None)
    if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]:
        raise ValueError(
            f"`rope_scaling`'s name field must be one of ['linear', 'dynamic', 'ntk-by-parts', 'yarn', 'dynamic-yarn'], got {rope_scaling_type}"
        )
    if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be an float > 1, got {rope_scaling_factor}")
    if rope_scaling_type == "ntk-by-parts" or rope_scaling_type == "yarn" or rope_scaling_type == "dynamic-yarn":
        original_max_position_embeddings = self.rope_scaling.get("original_max_position_embeddings", None)
        if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
            raise ValueError(f"`rope_scaling.original_max_position_embeddings` must be set to an int when using ntk-by-parts, yarn, and dynamic-yarn")
-
Like here
https://huggingface.co/TheBloke/Yarn-Llama-2-13B-128K-GGUF
What parameters do I pass on the command line ....
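If the GGUF metadata already carries the YaRN parameters (as the first reply suggests), a minimal llama.cpp invocation only needs the model and the context size you actually want; the filename below is illustrative:

./main -m yarn-llama-2-13b-128k.Q4_K_M.gguf -c 16384 -f prompt.txt -n 256

-c sets the context window for this run (you don't have to go all the way to 131072 and pay the full KV-cache cost), -f is the prompt file, and -n the number of tokens to generate.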