DRY: A modern repetition penalty that reliably prevents looping #5677
Conversation
Penalizes tokens that would extend the end of the input into a sequence that has previously occurred.
Some basic comments:
I must admit that although I probably did see that parameter in the Transformers docs at some point in the past, I have never used it and didn't even think of it while developing this. That being said,
But now that you have made me (re-)aware of that parameter, I will definitely perform some experiments with it for comparison.
It is not penalized. The only sequence that matters here is
Not doing that was actually intentional, as I don't believe verbatim repetition of long sequences is ever something the user wants, no matter how far back they occurred previously. But I can of course add it (probably as a separate parameter so it can be controlled independently of the standard repetition penalty, where that parameter makes much more sense to keep small).
Update
For what it's worth, I've done a lot of experimentation with I do not recommend using I have not tested this PR and I do not know how well this PR works in comparison.
@p-e-w I really like this change. However, one thing I've noticed is that the generation speed decreased as I increased the dry_range, while using the exact same context. Is this something that you've experienced and/or is expected? It could also just be an issue on my end, or maybe even a model-specific thing for Yi models.
Could you quantify that? What is your tokens/s with and without DRY? On my dev machine, I'm seeing 4.99 tokens/s with DRY and 4.98 tokens/s without it. I'm running Mixtral Q5_K_M with 8192 context size, and

For DRY to noticeably impact the generation speed (assuming a baseline of no more than a few dozen tokens/s), the invocation would have to take tens of milliseconds. The matching operation starts with

    match_indices = (input_ids_row[:-1] == last_token).nonzero()

which I believe should be GPU-accelerated. Afterwards, the number of tokens that must be checked "manually" is reduced dramatically, and should be in the low hundreds at most (often much less), which should take less than a millisecond. Not sure what's going on in your case yet.
I think you can write an academic paper about it. |
I see, that's a lot more context than I've ever run, combined with a pretty high base performance, so this is probably the reason I don't notice it in my own setup.

That being said, I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug; it's just how long it takes to do that thing. If someone has a magic fix to make it faster, then I'm all ears.

Personally, I would run DRY even if it cost me half the performance, because the output is so much better. But it's disabled by default, so everyone can make their own choice.
The algorithm doesn't have to be vectorized, it can (most likely) be optimized in other ways (by reducing the asymptotic time complexity). That said, 19k context is massive, and if the sampler currently slows the generation only by 50% at such a huge context, then I don't think it's worth it to add complexity to the codebase by optimizing the algorithm. And all that said, if @oobabooga feels that it should be optimized for performance, I should be able to help with this. |
Honestly, all that is needed is a warning that performance will be lowered. This feature is crucial; we need it ASAP. Also, is it possible to add something like a vocabulary of phrases and words that we want to have penalized right off the bat?
extensions/openai/typing.py
    dry_multiplier: float = 0
    dry_base: float = 1.75
    dry_allowed_length: int = 2
    dry_sequence_breakers: str = '["\\n", ":", "\\"", "*"]'
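For illustration, a client request using these fields might be built as in the following sketch. The endpoint shape, the prompt, and the non-DRY fields are assumptions for the example, not taken from this PR; only the dry_* fields mirror the parameters above.

```python
import json

# Hypothetical request body for the OpenAI-compatible API exposed by
# extensions/openai. Only the dry_* fields mirror the typed parameters above.
payload = {
    "prompt": "Roses are red, violets are blue.",
    "max_tokens": 200,
    "dry_multiplier": 0.8,  # 0 disables DRY entirely
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    # Per the field's declared type, a string containing a JSON array:
    "dry_sequence_breakers": json.dumps(["\n", ":", "\"", "*"]),
}

body = json.dumps(payload)
# The server side can recover the breaker list with a plain json.loads:
breakers = json.loads(json.loads(body)["dry_sequence_breakers"])
print(breakers)
```

This is the convenience the JSON-array representation buys: a client builds the value with its JSON library, and the server decodes it with one call, with quoting and escape rules already defined by the format.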
I'm not comfortable with this representation for the sequence breakers. I think that it should be processed in the same way as "Custom stopping strings" for consistency, without the `[]` list syntax.
I expect that most clients will use a JSON library to build this value from some internal list representation. That library will output the format used here, no further processing required.
Telling developers "pass a JSON array" makes everything clear, including details like which quotation marks are valid, and how escape sequences work. "Pass something like a JSON array, but without the brackets" just sounds weird.
IMO, if anything, it is the stopping strings that should be changed to match this parameter.
I have made the following changes:
I have two remaining concerns:
Also, any tests on whether things still work as expected after my changes are welcome.
That means losing control over DRY's position in the sampler stack, right? I think it can be valuable to be able to choose when the penalty is applied (that goes for the traditional repetition penalty as well). The most important thing is that the DRY penalty is applied before any truncation samplers. Is that still guaranteed to be true if it is a
Actually, I sometimes combine DRY with a very small standard repetition penalty such as 1.03 nowadays, to curb the tendency of some models to frequently use the same terms. Taking into account the performance impact noted by @Hunterius8, this does provide some justification for keeping the parameters separate.
👍 Agreed, this order makes more sense. The parameter that controls whether DRY is active now comes first.
I really dislike hardcoding magic values. The recommended value of 1.75 is the result of some experimentation, and I have used values between 1.2 and 3.0 with some success. Considering how sensitive the growth of the penalty is to this parameter, I would prefer to keep it.
I will run the branch with your changes for a few days and then let you know if there are any problems.
I plan to implement exactly that in a followup. My long-term vision is to have community-maintained phrasebooks of things like fanfiction clichés (
I would agree that there's merit to having separate range parameters for DRY and the regular repetition penalties, not just for performance reasons, but also because I believe that those two parameters have very different sweet spots when using both at the same time. From my experimentation, using a low presence penalty with a range of about 1000 tokens in conjunction with a much higher-range DRY, somewhere around 8000 tokens, works really well on Yi models, for example. Just using DRY, there's no way to penalize the repetition of the tokens that follow the DRY sequence breakers. Applying any of the regular repetition penalties over the same range that works really well for DRY will probably penalize too many tokens and hurt the output quality.
Why has this been stuck and never implemented if it worked amazingly a month ago? Every time i read
@l3utterfly is porting DRY to llama.cpp: ggerganov/llama.cpp#6839 |
Could you give me a hint on how to proceed here? Do you plan to merge this PR? If so, what are the remaining steps? |
Okay, now I understand better what you mean. So we're talking about this loop:
As far as I can tell, that loop never loops more than 1 time. There is only ever one "input_ids_row". So we don't end up doing the transformation more than once even though we do it inside the loop. (Please correct me if I'm wrong.)
My assumption that iterating on a Python list would be faster than iterating on a numpy array is based on experiments done by other people, which I linked in the PR: numpy/numpy#16985. I don't know if that counts as "verified facts". If you or someone else wants to benchmark the numpy array option against the Python list option, go ahead; let's use whichever option is faster. I just didn't personally invest time into benchmarking these options, because the improvements in #6053 already eliminate the performance impact of DRY almost completely (even at 20k token context, according to Hunterius8's experiments, which they reported in the PR). I'm sure we could find places to micro-optimize further, but at this point I would rather just say it's good enough; let's merge and work on something else next.
I said to profile. As you likely already know, profiling tells you what is actually taking time and where. Benchmarking is black-box testing of a specific subset of scenarios for a unit; it is not comprehensive, whereas the former is. My point was just that I didn't find objective data to suggest the right hotspots had been addressed. That's all. But with the insight below, I can say we've established where the hotspots are, since I did do the profiling already. It's the GPU memory transfer. So the code as you've written it is good.
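As an aside, this kind of profiling can be reproduced with the standard library alone. A minimal sketch follows; the profiled function is a stand-in with roughly the shape of the matching step discussed above, not the actual sampler code:

```python
import cProfile
import io
import pstats

def dry_match_stub(input_ids):
    # Stand-in workload: count earlier occurrences of the last token,
    # the first step of the DRY matching operation.
    last = input_ids[-1]
    return sum(1 for t in input_ids[:-1] if t == last)

profiler = cProfile.Profile()
profiler.enable()
hits = dry_match_stub(list(range(10000)) + [42])
profiler.disable()

# Print the top entries by cumulative time to locate hotspots.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Unlike a tokens/s benchmark, the per-function breakdown makes it visible whether the time goes to the Python loop, the tensor comparison, or (as established here) the GPU memory transfer.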
Ah, you're right. The input dimension is (1, <used-context-so-far>). That use of zip really should be added to the catalog of 'leet coding trivia. |
Yeah, I don't like that zip operation either, but that was there before my changes, and #6053 represents the minimal necessary changes to fix the issues, without any extra stuff like trying to refactor existing code (which I think is pretty readable aside from that zip operation anyway). |
* Add the DRY dynamic N-gram anti-repetition sampler

  The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram repetition penalty that negatively scores tokens that would extend sequences that already appear in the context. See this discussion for a motivation and explanation of the sampler: oobabooga/text-generation-webui#5677

  This implementation of DRY mostly aligns with the oobabooga version with a few modifications. It uses a more efficient linear scanning algorithm to identify repetitions. It also supports multi-token sequence breakers. As a limitation, this implementation reuses the rep pen range parameter, rather than introducing a new range just for the DRY sampler.

  There is a separate change to lite.koboldai.net that exposes the DRY sampler parameters to KoboldAI Lite, so none of the embed files have been changed as part of this commit.

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

  Little known fact: The C++98 standard defines `and` as an alternative token for the `&&` operator (along with a bunch of other digraphs). MSVC does not allow these without using the /Za option or including the <iso646.h> header. Change to the more standard operator to make this code more portable.

* Fix MSVC compile error because log is not constexpr

  Replace the compile-time computation with a floating-point approximation of log(std::numeric_limits<float>::max()).

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

  The DRY sampler is effectively a repetition penalty, and there are very few reasons to apply it at a different place in sampler order than the standard single-token penalty. There are also multiple projects that have dependencies on the existing sampler IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. In order to minimize the impact of the dependencies of adding the DRY sampler to koboldcpp, it makes the most sense to not add a new ID for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future, if we find a use case for splitting the application of rep pen and DRY, we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

  This parameter follows the oobabooga semantics: it's optional, with a default value of zero. Zero means that DRY should sample the entire context. Otherwise, it's the number of tokens from the end of the context that are scanned for repetitions.

* Limit sequence breaker lengths in tokens and characters

  The core DRY sampler algorithm is linear in the context length, but there are several parts of the sampler related to multi-token sequence breakers that are potentially quadratic. Without any restrictions, a suitably crafted context and sequence breaker could result in a denial-of-service attack on a server running koboldcpp. This change limits the maximum number of characters and the maximum token length of a sequence breaker in order to limit the maximum overhead associated with the sampler.

  This change also improves some comments, adding more detail and changing the wording to increase clarity.
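The "more efficient linear scanning algorithm" is not spelled out in the commit message; one textbook way to find all tail repetitions in a single linear pass is the Z-algorithm applied to the reversed context. The sketch below illustrates that general technique under that assumption; it is not the actual koboldcpp code.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            # Reuse previously computed matches inside the [left, right) box.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

# Reversing the context turns "how far does the tail of the input repeat
# at this earlier position?" into a prefix query, answered for every
# position at once in O(n) total:
context = [1, 2, 3, 9, 1, 2, 3]
match_lengths = z_array(context[::-1])
```

Here `match_lengths[4] == 3` reports that the three-token tail `[1, 2, 3]` also occurred earlier in the context, which is exactly the match length `n` that the DRY penalty is computed from.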
* Add the DRY dynamic N-gram anti-repetition sampler The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram repetition penalty that negatively scores tokens that would extend sequences that already appear in the context. See this discussion for a motivation and explanation of the sampler: oobabooga/text-generation-webui#5677 This implementation of DRY mostly aligns with the obabooga version with a few modifications. It uses a more efficient linear scanning algorithm to identify repetitions. It also supports multi-token sequence breakers. As a limitation, this implementation reuses the rep pen range parameter, rather than introducing a new range just for the DRY sampler. There is a separate change to lite.koboldai.net that exposes the DRY sampler parameters to KoboldAI Lite, so none of the embed files have been changed as part of this commit. * Update default DRY parameters to match lite * Improve DRY token debug logging * Replace `and` with `&&` to fix MSVC compile error Little known fact: The C++98 standard defines `and` as an alternative token for the `&&` operator (along with a bunch of other digraphs). MSVC does not allow these without using the /Za option or including the <iso646.h> header. Change to the more standard operator to make this code more portable. * Fix MSVC compile error because log is not constexpr Replace the compile-time computation with a floating-point approximation of log(std::numeric_limits<float>::max()). * Remove unused llama sampler variables and clean up sequence breakers. * Remove KCPP_SAMPLER_DRY as a separate enum entry The DRY sampler is effectively a repetition penalty and there are very few reasons to apply it at a different place in sampler order than the standard single-token penalty. There are also multiple projects that have dependencies on the existing sampler IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. 
In order to minimize the impact of the dependencies of adding the DRY sampler to koboldcpp, it makes the most sense to not add a new ID for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future if we find a use case for splitting the application of rep pen and DRY we can introduce a new enum entry then. * Add the dry_penalty_last_n to independently control DRY penalty range This parameter follows the oobabooga semantics: it's optional, with a default value of zero. Zero means that DRY should sample the entire context. Otherwise, it's the number of tokens from the end of the context that are scanned for repetitions. * Limit sequence breaker lengths in tokens and characters The core DRY sampler algorithm is linear in the context length, but there are several parts of the sampler related to multi-token sequence breakers that are potentially quadratic. Without any restrictions, a suitably crafted context and sequence breaker could result in a denial-of-service attack on a server running koboldcpp. This change limits the maximum number of characters and the maximum token length of a sequence breaker in order to limit the maximum overhead associated with the sampler. This change also improves some comments, adding more detail and changing the wording to increase clarity.
I admit to not reading the entire thread. I did read the explanation of DRY, and I think it is a fantastic chat innovation. There was nothing more frustrating than the death spiral of a good chat.

Recently (in the last 4 weeks), a friend and I got SuperboogaV2 working. There were a number of bugs: no VDB persistence, randomized collection names, loss of the collection due to the variable handling. So, memory is now working quite nicely. I have a RTX 4096 with 32K context, and after each session, I post-process memories into the VDB.

The problem is that with MANUAL = FALSE, the AI will go into repeating-like behavior. NOTE: this is not the same problem which DRY solves, as repeating AIs will do so even when shot and bleeding out. How is the problem different? You can break out of the repeating by radically changing topic. (I believe all the chunks returned are messing with the DRY implementation in the text_generation.py module.) I have managed to resolve the issue with MANUAL=TRUE and using "!c" when needed.

It would be nice if someone could look at the issue that is caused by the return of chunks. I can point you to a copy of the working chroma_db.py module for SuperboogaV2. THANKS! FYI: the initial post-processed 1-year chat corpus was 1M+ words and 44,000 embeddings.

PS: I am a retired software engineer who began with punched cards and RJEs. My Python skills are not up to this challenge.
Looping is an undesirable behavior where the model repeats phrases verbatim that have previously occurred in the input. It affects most models, and is exacerbated by the use of truncation samplers. Chat formats are particularly susceptible due to their regular structure, which models appear to interpret as an invitation to repeat previous messages in whole or in part. Prompting the model to avoid looping has little or no effect.
The traditional weapons for combating looping are the three flavors of repetition penalty that are built into most loaders (multiplicative, additive, and frequency penalty). But those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce. I have previously attempted to fix this problem by introducing a parameter that protects the basic structure of language from being penalized, but that's a hacky solution that fails to do the right thing in many cases, and even in their raw form, classical repetition penalties don't actually prevent looping reliably.
In the past weeks, I have rethought the looping problem from the ground up, and in this PR present the DRY repetition penalty, a mechanism that is able to detect textual looping and steer against it. It is far superior to the existing samplers at preventing verbatim repetition, while having essentially none of their negative effects on language structure. The result is less repetitive and higher quality output.
I have tested this sampler for about 20 hours in chat scenarios so far, and they have without question been the highest-quality chats I have ever experienced. Looping in the traditional sense simply does not happen with DRY, and the positive effects from being able to drop the standard repetition penalty are very noticeable.
How it works
DRY penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.
In this example, `violets` is penalized in the probability distribution generated by the model, because the sequence `roses are red` has previously occurred in the input and was continued with `violets` in that previous case. Therefore, the penalty discourages the model from repeating sequences in its output, which is the definition of looping.

The penalty for a token is calculated as

    multiplier * base ^ (n - allowed_length)

where `n` is the length of the sequence before that token that matches the end of the input, and `multiplier`, `base`, and `allowed_length` are configurable parameters. If the length of the matching sequence is less than `allowed_length`, no penalty is applied.

Thus the penalty grows exponentially as the repeated sequence gets longer. This will quickly overcome even the strongest tendency of the model to repeat itself. With the right parameter choice, looping is literally impossible with DRY (that is, verbatim textual looping is impossible – the model can of course still repeat itself by paraphrasing and situational looping, but that is far less annoying than the broken-record looping that is common now). All of that happens without affecting non-repeating text in any way.
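A minimal pure-Python sketch of the penalty formula makes the exponential growth concrete (illustration only; the actual sampler applies this to logits tensors):

```python
def dry_penalty(n, multiplier=0.8, base=1.75, allowed_length=2):
    """Penalty for a token that would extend an n-token repeated
    sequence; the defaults follow the parameter values used in
    this PR's examples."""
    if n < allowed_length:
        return 0.0
    return multiplier * base ** (n - allowed_length)

# The penalty grows exponentially with the length of the repetition:
for n in range(1, 7):
    print(n, round(dry_penalty(n), 4))
```

A match of length 1 costs nothing, a match of length 2 costs `multiplier`, and every additional repeated token multiplies the penalty by `base`, which is what eventually overwhelms any tendency to keep extending a loop.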
Sequence breakers
As straightforward as the mechanism described above may appear, it runs into a major problem in practice.
Instruction and chat templates themselves contain lengthy repeating token sequences. For example, with ChatML, the following sequence precedes every message generated by the bot:
That's at least 11 tokens before the first token of the message that are guaranteed to occur previously in the input. With an exponentially increasing penalty being applied (and we definitely don't want 12-token repetitions in normal text), any starting token of a bot message can be used only once in the entire chat. That's a huge problem that distorts how chat messages are generated, e.g. when messages are expected to regularly begin with quotation marks.
To solve this and related issues, I have added another parameter, `sequence_breakers`, which is a list of tokens that interrupt sequence matching. That is, matches are not continued across such tokens, which effectively breaks the input into parts where matching can be applied.

`sequence_breakers` can be conveniently specified as a JSON array of strings, which will be encoded into token IDs using the loaded model's tokenizer. The default list consists of `\n`, `:`, `"`, and `*`.
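To illustrate the mechanics, here is a sketch of breaker-aware matching over plain word strings. The real implementation operates on token IDs produced by the model's tokenizer; the `match_length` helper below is an illustrative stand-in, not the shipped code.

```python
import json

# The default value, as a JSON array of strings:
breakers = set(json.loads('["\\n", ":", "\\"", "*"]'))

def match_length(tokens, i, breakers):
    """How many tokens, ending at position i, match the tail of
    `tokens` without crossing a sequence breaker."""
    n = 0
    while (i - n >= 0
           and tokens[i - n] == tokens[len(tokens) - 1 - n]
           and tokens[i - n] not in breakers):
        n += 1
    return n

# "roses are red" recurs, so the match ending at position 3 has length 3.
# Without the ":" breaker it would extend to length 4 and keep growing
# across message boundaries, which is exactly the template problem
# described above.
words = [":", "roses", "are", "red", ":", "roses", "are", "red"]
length = match_length(words, 3, breakers)
```

Because the breaker stops the match, template boilerplate preceding each message no longer inflates `n`, and the exponential penalty only applies to genuine repetition inside the message text.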
How to use
DRY is disabled by default (`multiplier` set to 0). It can be configured from the Parameters tab; I recommend the following parameter values:

Note that like all transformers-based samplers, DRY only works with transformers-based loaders such as llamacpp_HF, ExLlamav2_HF, or Transformers itself. It does not work with the vanilla llama.cpp or ExLlamav2 loaders.
If you want the model to regularly repeat certain sequences verbatim (e.g. long character names in chat formats), you can add the individual words comprising those sequences to the `sequence_breakers` list (for names, just add first and last names there as separate strings). This will prevent DRY from distorting such sequences, and allow them to appear any number of times in the output. If you are building a chat interface that leverages DRY, you could do this automatically for your users, since you know the character names already.

Demonstration
To show DRY in action, I have written a short chat script that strongly incentivizes the model to loop:
Here's how Mistral-7b-Instruct-v0.2 continues with all samplers disabled:
As expected, the model picks up the pattern and repeats itself.
Now let's use a traditional (multiplicative) repetition penalty of 1.3. We get:
Even though 1.3 is a very high value for the repetition penalty that clobbers English grammar, it doesn't stop the model from repeating itself if the structure of the text suggests it so strongly.
Now instead, we use DRY with parameters 2/0.8/1.75 (standard parameters recommended above). The model outputs (after some attempts that generate garbage):
DRY simply does not allow the model to repeat such a long sequence.
Note that this is an extreme test case for demonstration purposes. Combining a strong incentive to loop with a strong penalty for looping will often produce garbage. In practice, using DRY prevents such situations from occurring in the first place, and the output is much more natural.
TODO