How to properly initiate 128K context model? #2963
-
None. Just the model. The point of GGUF is that it already knows what its native context length is.
-
This is next level. Can someone please explain how these larger-context models are generated? What is required to make a 128k-context version of a 70B-parameter model?
-
A very, very crude simplification: a model, as a neural network, is built from "cells" that point at other cells, take the values from those cells, do something with them (say, multiply or divide by a fixed number specific to that particular cell), and let other cells use the result. Some cells are designated as inputs and do not point at anything; you (the code that runs the model) put your text, encoded into numbers, word by word, into each input cell. The code executes the model by going cell by cell, copying the values from the cells each one points at, computing the multiplication/division that cell wants to do, and so on. The cells are chained from the input cells all the way to the output cells. Once the program has gone through all of the cells and there is a new value in the output cell, that is a word the model has produced. That word is then copied back into one of the remaining free input cells, and the whole thing repeats until you stop it, or until the model puts a special value in the output cell, meaning it has run out of ideas. A model file consists of the definitions of all the cells: what they point at, and what math they do with the values they pull from the cells they point at.

Context size is, more or less, the number of input cells in a model. If you just add more input cells, the other cells simply do not point at the newly added ones and ignore them. You have to do some sort of training on the model with those added input cells: run it many, many times and let it slightly modify the cell pointers and the math operations they perform. To do that, you typically need training data with long enough input and output examples, because otherwise the new input cells are always empty during training, the model just "learns" that they are useless, and it never starts using them.

There are other methods of making the model consider the newly added input cells, and those involve hacky and clever manipulations of how the model runs. Naively, you could let the model randomly point at N input cells, different ones on each "run" (a run, in this context, being the process of generating a single word of output). Or you could let the model run a couple of times on parts of the input, extract from those runs how interested the model is in particular words, and "compress" the input so it fits in the predefined input cells. All sorts of crazy things are being invented pretty much every week to solve this problem. Please note that this is a giant simplification.
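To make the "clever manipulation of how the model runs" idea concrete, here is a minimal sketch of linear RoPE position interpolation, one of the simplest such tricks: positions are squeezed by a scale factor so that a longer input still lands inside the position range the model saw during training. The dimensions and scale factor below are illustrative assumptions, not values from any particular model.

import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to each (position, dimension pair).

    With scale > 1 (linear interpolation), position p is treated as p / scale,
    so a sequence `scale` times longer stays inside the angle range the model
    was trained on."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)  # per-pair frequencies
    return np.outer(positions / scale, inv_freq)                # (seq_len, head_dim // 2)

# Illustrative numbers only: a model trained at 4k context stretched to 16k.
orig_ctx, new_ctx, head_dim = 4096, 16384, 128
scale = new_ctx / orig_ctx  # 4.0

plain  = rope_angles(np.arange(new_ctx), head_dim)               # angles run far past the trained range
interp = rope_angles(np.arange(new_ctx), head_dim, scale=scale)  # squeezed back into it
print(plain[-1, 0], interp[-1, 0])  # last position, lowest dimension pair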
-
The RAM required for a 13B model at 128k context must be intense.
-
I just tested it .... more than 100 GB of RAM or VRAM is required.
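That lines up with a back-of-the-envelope estimate of the KV cache alone. A minimal sketch, assuming Llama-2-13B dimensions (40 layers, hidden size 5120) and an fp16 cache; quantized KV caches would shrink this:

# Rough KV-cache size estimate (model weights not included).
n_layers   = 40        # Llama-2-13B
hidden     = 5120      # n_heads * head_dim = 40 * 128
bytes_fp16 = 2
ctx        = 131072    # 128k tokens

per_token = 2 * n_layers * hidden * bytes_fp16   # one K and one V vector per layer
total_gib = per_token * ctx / 2**30
print(f"{per_token} bytes/token, ~{total_gib:.0f} GiB at full 128k context")
# -> 819200 bytes/token, ~100 GiB, before the weights themselves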
-
Has anyone gotten good results with the Yarn model when it's GGUF-quantized? With Q4_K_M, the input is an excerpt from a novel, and as these aren't chat finetunes I just let the model continue writing the story.
-
The model's rope scaling config is:

"rope_scaling": {
    "factor": 32.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn",
    "finetuned": true
}

Looks like yarn is similar to ntk (v2?):

def _rope_scaling_validation(self):
    """
    Validate the `rope_scaling` configuration.
    """
    if self.rope_scaling is None:
        return

    if not isinstance(self.rope_scaling, dict):
        raise ValueError(
            "`rope_scaling` must be a dictionary, "
            f"got {self.rope_scaling}"
        )
    rope_scaling_type = self.rope_scaling.get("type", None)
    rope_scaling_factor = self.rope_scaling.get("factor", None)
    if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]:
        raise ValueError(
            f"`rope_scaling`'s name field must be one of ['linear', 'dynamic', 'ntk-by-parts', 'yarn', 'dynamic-yarn'], got {rope_scaling_type}"
        )
    if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be an float > 1, got {rope_scaling_factor}")
    if rope_scaling_type == "ntk-by-parts" or rope_scaling_type == "yarn" or rope_scaling_type == "dynamic-yarn":
        original_max_position_embeddings = self.rope_scaling.get("original_max_position_embeddings", None)
        if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
            raise ValueError(f"`rope_scaling.original_max_position_embeddings` must be set to an int when using ntk-by-parts, yarn, and dynamic-yarn")
-
Like here
https://huggingface.co/TheBloke/Yarn-Llama-2-13B-128K-GGUF
What parameters do I pass on the command line ....
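If the GGUF metadata already carries the YaRN parameters (as the first reply suggests), a minimal llama.cpp invocation only needs the model and the context size you actually want; the filename below is illustrative:

./main -m yarn-llama-2-13b-128k.Q4_K_M.gguf -c 16384 -f prompt.txt -n 256

-c sets the context window for this run (you don't have to go all the way to 131072 and pay the full KV-cache cost), -f is the prompt file, and -n the number of tokens to generate.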