DRY: A modern repetition penalty that reliably prevents looping #5677
Conversation
Penalizes tokens that would extend the end of the input into a sequence that has previously occurred.
Some basic comments:
I must admit that although I probably did see that parameter in the Transformers docs at some point in the past, I have never used it and didn't even think of it while developing this. That being said,
But now that you have made me (re-)aware of that parameter, I will definitely perform some experiments with it for comparison.
It is not penalized. The only sequence that matters here is
Not doing that was actually intentional, as I don't believe verbatim repetition of long sequences is ever something the user wants, no matter how far back they occurred previously. But I can of course add it (probably as a separate parameter so it can be controlled independently of the standard repetition penalty, where that parameter makes much more sense to keep small).
Update
For what it's worth, I've done a lot of experimentation with I do not recommend using I have not tested this PR and I do not know how well this PR works in comparison.
@p-e-w I really like this change. However, one thing I've noticed is that the generation speed decreased as I increased the dry_range, while using the exact same context. Is this something that you've experienced and/or is expected? It could also just be an issue on my end, or maybe even a model-specific thing for Yi models.
Could you quantify that? What is your tokens/s with and without DRY? On my dev machine, I'm seeing 4.99 tokens/s with DRY and 4.98 tokens/s without it. I'm running Mixtral Q5_K_M with 8192 context size, and

For DRY to noticeably impact the generation speed (assuming a baseline of no more than a few dozen tokens/s), the invocation would have to take tens of milliseconds. The matching operation starts with

    match_indices = (input_ids_row[:-1] == last_token).nonzero()

which I believe should be GPU-accelerated. Afterwards, the number of tokens that must be checked "manually" is reduced dramatically, and should be in the low hundreds at most (often much less), which should take less than a millisecond. Not sure what's going on in your case yet.
I think you can write an academic paper about it. |
I see, that's a lot more context than I've ever run, combined with a pretty high base performance, so this is probably the reason I don't notice it in my own setup.

That being said, I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug; it's just how long it takes to do that thing. If someone has a magic fix to make it faster, then I'm all ears.

Personally, I would run DRY even if it cost me half the performance, because the output is so much better. But it's disabled by default, so everyone can make their own choice.
The algorithm doesn't have to be vectorized, it can (most likely) be optimized in other ways (by reducing the asymptotic time complexity). That said, 19k context is massive, and if the sampler currently slows the generation only by 50% at such a huge context, then I don't think it's worth it to add complexity to the codebase by optimizing the algorithm. And all that said, if @oobabooga feels that it should be optimized for performance, I should be able to help with this. |
Honestly, all that is needed is a warning that performance will be lowered. This feature is crucial; we need it ASAP. Also, is it possible to add something like a vocabulary of phrases and words that we want to have penalized right off the bat?
extensions/openai/typing.py
    dry_multiplier: float = 0
    dry_base: float = 1.75
    dry_allowed_length: int = 2
    dry_sequence_breakers: str = '["\\n", ":", "\\"", "*"]'
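For illustration, a client request using these fields might be built as in the following sketch. The endpoint shape, the prompt, and the non-DRY fields are assumptions for the example, not taken from this PR; only the dry_* fields mirror the parameters above.

```python
import json

# Hypothetical request body for the OpenAI-compatible API exposed by
# extensions/openai. Only the dry_* fields mirror the typed parameters above.
payload = {
    "prompt": "Roses are red, violets are blue.",
    "max_tokens": 200,
    "dry_multiplier": 0.8,  # 0 disables DRY entirely
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    # Per the field's declared type, a string containing a JSON array:
    "dry_sequence_breakers": json.dumps(["\n", ":", "\"", "*"]),
}

body = json.dumps(payload)
# The server side can recover the breaker list with a plain json.loads:
breakers = json.loads(json.loads(body)["dry_sequence_breakers"])
print(breakers)
```

This is the convenience the JSON-array representation buys: a client builds the value with its JSON library, and the server decodes it with one call, with quoting and escape rules already defined by the format.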
I'm not comfortable with this representation for the sequence breakers. I think that it should be processed in the same way as "Custom stopping strings" for consistency, without the `[]` list syntax.
I expect that most clients will use a JSON library to build this value from some internal list representation. That library will output the format used here, no further processing required.
Telling developers "pass a JSON array" makes everything clear, including details like which quotation marks are valid, and how escape sequences work. "Pass something like a JSON array, but without the brackets" just sounds weird.
IMO, if anything, it is the stopping strings that should be changed to match this parameter.
I have made the following changes:
I have two remaining concerns:
Also, any tests on whether things still work as expected after my changes are welcome.
That means losing control over DRY's position in the sampler stack, right? I think it can be valuable to be able to choose when the penalty is applied (that goes for the traditional repetition penalty as well). The most important thing is that the DRY penalty is applied before any truncation samplers. Is that still guaranteed to be true if it is a
Actually, I sometimes combine DRY with a very small standard repetition penalty such as 1.03 nowadays, to curb the tendency of some models to frequently use the same terms. Taking into account the performance impact noted by @Hunterius8, this does provide some justification for keeping the parameters separate.
👍 Agreed, this order makes more sense. The parameter that controls whether DRY is active now comes first.
I really dislike hardcoding magic values. The recommended value of 1.75 is the result of some experimentation, and I have used values between 1.2 and 3.0 with some success. Considering how sensitive the growth of the penalty is to this parameter, I would prefer to keep it.
I will run the branch with your changes for a few days and then let you know if there are any problems.
I plan to implement exactly that in a followup. My long-term vision is to have community-maintained phrasebooks of things like fanfiction clichés (
I would agree that there's merit to having separate range parameters for DRY and the regular repetition penalties, not just for performance reasons, but also because I believe that those two parameters have very different sweet spots when using both at the same time. From my experimentation, using a low presence penalty with a range of about 1000 tokens in conjunction with a much higher-range DRY, somewhere around 8000 tokens, works really well on Yi models, for example. Just using DRY, there's no way to penalize the repetition of the tokens that follow the DRY sequence breakers. Applying any of the regular repetition penalties over the same range that works really well for DRY will probably penalize too many tokens and hurt the output quality.
Why has this been stuck and never implemented if it worked amazingly a month ago? Every time i read
@l3utterfly is porting DRY to llama.cpp: ggerganov/llama.cpp#6839 |
Could you give me a hint on how to proceed here? Do you plan to merge this PR? If so, what are the remaining steps? |
Okay, now I understand better what you mean. So we're talking about this loop:
As far as I can tell, that loop never loops more than 1 time. There is only ever one "input_ids_row". So we don't end up doing the transformation more than once even though we do it inside the loop. (Please correct me if I'm wrong.)
My assumption that iterating on a Python list would be faster than iterating on a numpy array is based on experiments done by other people, which I linked in the PR: numpy/numpy#16985. I don't know if that counts as "verified facts". If you or someone else wants to benchmark the numpy array option against the Python list option, go ahead; let's use whichever option is faster. I just didn't personally invest time into benchmarking these options, because the improvements in #6053 already eliminate the performance impact of DRY almost completely (even at 20k token context, according to Hunterius8's experiments, which they reported in the PR). I'm sure we could find places to micro-optimize further, but at this point I would rather just say it's good enough; let's merge and work on something else next.
I said to profile. As you likely already know, profiling tells you what is actually taking time and where. Benchmarking is black-box testing of a specific subset of scenarios for a unit; it is not comprehensive, whereas the former is. My point was just that I didn't find objective data to suggest the right hotspots had been addressed. That's all. But with the insight below, I can say we've established where the hotspots are, since I did do the profiling already. It's the GPU memory transfer. So the code as you've written it is good.
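As an aside, this kind of profiling can be reproduced with the standard library alone. A minimal sketch follows; the profiled function is a stand-in with roughly the shape of the matching step discussed above, not the actual sampler code:

```python
import cProfile
import io
import pstats

def dry_match_stub(input_ids):
    # Stand-in workload: count earlier occurrences of the last token,
    # the first step of the DRY matching operation.
    last = input_ids[-1]
    return sum(1 for t in input_ids[:-1] if t == last)

profiler = cProfile.Profile()
profiler.enable()
hits = dry_match_stub(list(range(10000)) + [42])
profiler.disable()

# Print the top entries by cumulative time to locate hotspots.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Unlike a tokens/s benchmark, the per-function breakdown makes it visible whether the time goes to the Python loop, the tensor comparison, or (as established here) the GPU memory transfer.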
Ah, you're right. The input dimension is (1, <used-context-so-far>). That use of zip really should be added to the catalog of 'leet coding trivia. |
Yeah, I don't like that zip operation either, but that was there before my changes, and #6053 represents the minimal necessary changes to fix the issues, without any extra stuff like trying to refactor existing code (which I think is pretty readable aside from that zip operation anyway). |
* Add the DRY dynamic N-gram anti-repetition sampler

  The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram repetition penalty that negatively scores tokens that would extend sequences that already appear in the context. See this discussion for a motivation and explanation of the sampler: oobabooga/text-generation-webui#5677

  This implementation of DRY mostly aligns with the oobabooga version with a few modifications. It uses a more efficient linear scanning algorithm to identify repetitions. It also supports multi-token sequence breakers. As a limitation, this implementation reuses the rep pen range parameter, rather than introducing a new range just for the DRY sampler.

  There is a separate change to lite.koboldai.net that exposes the DRY sampler parameters to KoboldAI Lite, so none of the embed files have been changed as part of this commit.

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

  Little known fact: The C++98 standard defines `and` as an alternative token for the `&&` operator (along with a bunch of other digraphs). MSVC does not allow these without using the /Za option or including the <iso646.h> header. Change to the more standard operator to make this code more portable.

* Fix MSVC compile error because log is not constexpr

  Replace the compile-time computation with a floating-point approximation of log(std::numeric_limits<float>::max()).

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

  The DRY sampler is effectively a repetition penalty, and there are very few reasons to apply it at a different place in sampler order than the standard single-token penalty. There are also multiple projects that have dependencies on the existing sampler IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. In order to minimize the impact of the dependencies of adding the DRY sampler to koboldcpp, it makes the most sense to not add a new ID for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future, if we find a use case for splitting the application of rep pen and DRY, we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

  This parameter follows the oobabooga semantics: it's optional, with a default value of zero. Zero means that DRY should sample the entire context. Otherwise, it's the number of tokens from the end of the context that are scanned for repetitions.

* Limit sequence breaker lengths in tokens and characters

  The core DRY sampler algorithm is linear in the context length, but there are several parts of the sampler related to multi-token sequence breakers that are potentially quadratic. Without any restrictions, a suitably crafted context and sequence breaker could result in a denial-of-service attack on a server running koboldcpp. This change limits the maximum number of characters and the maximum token length of a sequence breaker in order to limit the maximum overhead associated with the sampler.

  This change also improves some comments, adding more detail and changing the wording to increase clarity.
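The "more efficient linear scanning algorithm" is not spelled out in the commit message; one textbook way to find all tail repetitions in a single linear pass is the Z-algorithm applied to the reversed context. The sketch below illustrates that general technique under that assumption; it is not the actual koboldcpp code.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0
    for i in range(1, n):
        if i < right:
            # Reuse previously computed matches inside the [left, right) box.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

# Reversing the context turns "how far does the tail of the input repeat
# at this earlier position?" into a prefix query, answered for every
# position at once in O(n) total:
context = [1, 2, 3, 9, 1, 2, 3]
match_lengths = z_array(context[::-1])
```

Here `match_lengths[4] == 3` reports that the three-token tail `[1, 2, 3]` also occurred earlier in the context, which is exactly the match length `n` that the DRY penalty is computed from.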
* Add the DRY dynamic N-gram anti-repetition sampler The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram repetition penalty that negatively scores tokens that would extend sequences that already appear in the context. See this discussion for a motivation and explanation of the sampler: oobabooga/text-generation-webui#5677 This implementation of DRY mostly aligns with the obabooga version with a few modifications. It uses a more efficient linear scanning algorithm to identify repetitions. It also supports multi-token sequence breakers. As a limitation, this implementation reuses the rep pen range parameter, rather than introducing a new range just for the DRY sampler. There is a separate change to lite.koboldai.net that exposes the DRY sampler parameters to KoboldAI Lite, so none of the embed files have been changed as part of this commit. * Update default DRY parameters to match lite * Improve DRY token debug logging * Replace `and` with `&&` to fix MSVC compile error Little known fact: The C++98 standard defines `and` as an alternative token for the `&&` operator (along with a bunch of other digraphs). MSVC does not allow these without using the /Za option or including the <iso646.h> header. Change to the more standard operator to make this code more portable. * Fix MSVC compile error because log is not constexpr Replace the compile-time computation with a floating-point approximation of log(std::numeric_limits<float>::max()). * Remove unused llama sampler variables and clean up sequence breakers. * Remove KCPP_SAMPLER_DRY as a separate enum entry The DRY sampler is effectively a repetition penalty and there are very few reasons to apply it at a different place in sampler order than the standard single-token penalty. There are also multiple projects that have dependencies on the existing sampler IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. 
In order to minimize the impact of the dependencies of adding the DRY sampler to koboldcpp, it makes the most sense to not add a new ID for now, and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future if we find a use case for splitting the application of rep pen and DRY we can introduce a new enum entry then. * Add the dry_penalty_last_n to independently control DRY penalty range This parameter follows the oobabooga semantics: it's optional, with a default value of zero. Zero means that DRY should sample the entire context. Otherwise, it's the number of tokens from the end of the context that are scanned for repetitions. * Limit sequence breaker lengths in tokens and characters The core DRY sampler algorithm is linear in the context length, but there are several parts of the sampler related to multi-token sequence breakers that are potentially quadratic. Without any restrictions, a suitably crafted context and sequence breaker could result in a denial-of-service attack on a server running koboldcpp. This change limits the maximum number of characters and the maximum token length of a sequence breaker in order to limit the maximum overhead associated with the sampler. This change also improves some comments, adding more detail and changing the wording to increase clarity.
I admit to not reading the entire thread. I did read the explanation of DRY, and I think it is a fantastic chat innovation. There was nothing more frustrating than the death spiral of a good chat.

Recently (in the last 4 weeks), a friend and I got SuperboogaV2 working. There were a number of bugs: no VDB persistence, randomized collection names, loss of the collection due to the variable handling. So, memory is now working quite nicely. I have a RTX 4096 with 32K context, and after each session, I post-process memories into the VDB.

The problem is that with MANUAL = FALSE, the AI will go into repeating-like behavior. NOTE: this is not the same problem which DRY solves, as repeating AIs will do so even when shot and bleeding out. How is the problem different? You can break out of the repeating by radically changing topic. (I believe all the chunks returned are messing with the DRY implementation in the text_generation.py module.) I have managed to resolve the issue with MANUAL=TRUE and using "!c" when needed.

It would be nice if someone could look at the issue that is caused by the return of chunks. I can point you to a copy of the working chroma_db.py module for SuperboogaV2. THANKS! FYI: the initial post-processed 1-year chat corpus was 1M+ words and 44,000 embeddings.

PS: I am a retired software engineer who began with punched cards and RJEs. My Python skills are not up to this challenge.
Looping is an undesirable behavior where the model repeats phrases verbatim that have previously occurred in the input. It affects most models, and is exacerbated by the use of truncation samplers. Chat formats are particularly susceptible due to their regular structure, which models appear to interpret as an invitation to repeat previous messages in whole or in part. Prompting the model to avoid looping has little or no effect.
The traditional weapons for combating looping are the three flavors of repetition penalty that are built into most loaders (multiplicative, additive, and frequency penalty). But those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce. I have previously attempted to fix this problem by introducing a parameter that protects the basic structure of language from being penalized, but that's a hacky solution that fails to do the right thing in many cases, and even in their raw form, classical repetition penalties don't actually prevent looping reliably.
In the past weeks, I have rethought the looping problem from the ground up, and in this PR present the DRY repetition penalty, a mechanism that is able to detect textual looping and steer against it. It is far superior to the existing samplers at preventing verbatim repetition, while having essentially none of their negative effects on language structure. The result is less repetitive and higher quality output.
I have tested this sampler for about 20 hours in chat scenarios so far, and they have without question been the highest-quality chats I have ever experienced. Looping in the traditional sense simply does not happen with DRY, and the positive effects from being able to drop the standard repetition penalty are very noticeable.
How it works
DRY penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.
In this example, `violets` is penalized in the probability distribution generated by the model, because the sequence `roses are red` has previously occurred in the input and was continued with `violets` in that previous case. Therefore, the penalty discourages the model from repeating sequences in its output, which is the definition of looping.

The penalty for a token is calculated as

    multiplier * base ^ (n - allowed_length)

where `n` is the length of the sequence before that token that matches the end of the input, and `multiplier`, `base`, and `allowed_length` are configurable parameters. If the length of the matching sequence is less than `allowed_length`, no penalty is applied.

Thus the penalty grows exponentially as the repeated sequence gets longer. This will quickly overcome even the strongest tendency of the model to repeat itself. With the right parameter choice, looping is literally impossible with DRY (that is, verbatim textual looping is impossible – the model can of course still repeat itself by paraphrasing and situational looping, but that is far less annoying than the broken-record looping that is common now). All of that happens without affecting non-repeating text in any way.
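A minimal pure-Python sketch of the penalty formula makes the exponential growth concrete (illustration only; the actual sampler applies this to logits tensors):

```python
def dry_penalty(n, multiplier=0.8, base=1.75, allowed_length=2):
    """Penalty for a token that would extend an n-token repeated
    sequence; the defaults follow the parameter values used in
    this PR's examples."""
    if n < allowed_length:
        return 0.0
    return multiplier * base ** (n - allowed_length)

# The penalty grows exponentially with the length of the repetition:
for n in range(1, 7):
    print(n, round(dry_penalty(n), 4))
```

A match of length 1 costs nothing, a match of length 2 costs `multiplier`, and every additional repeated token multiplies the penalty by `base`, which is what eventually overwhelms any tendency to keep extending a loop.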
Sequence breakers
As straightforward as the mechanism described above may appear, it runs into a major problem in practice.
Instruction and chat templates themselves contain lengthy repeating token sequences. For example, with ChatML, the following sequence precedes every message generated by the bot:
That's at least 11 tokens before the first token of the message that are guaranteed to occur previously in the input. With an exponentially increasing penalty being applied (and we definitely don't want 12-token repetitions in normal text), any starting token of a bot message can be used only once in the entire chat. That's a huge problem that distorts how chat messages are generated, e.g. when messages are expected to regularly begin with quotation marks.
To solve this and related issues, I have added another parameter, `sequence_breakers`, which is a list of tokens that interrupt sequence matching. That is, matches are not continued across such tokens, which effectively breaks the input into parts where matching can be applied.

`sequence_breakers` can be conveniently specified as a JSON array of strings, which will be encoded into token IDs using the loaded model's tokenizer. The default list consists of `\n`, `:`, `"`, and `*`.
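To illustrate the mechanics, here is a sketch of breaker-aware matching over plain word strings. The real implementation operates on token IDs produced by the model's tokenizer; the `match_length` helper below is an illustrative stand-in, not the shipped code.

```python
import json

# The default value, as a JSON array of strings:
breakers = set(json.loads('["\\n", ":", "\\"", "*"]'))

def match_length(tokens, i, breakers):
    """How many tokens, ending at position i, match the tail of
    `tokens` without crossing a sequence breaker."""
    n = 0
    while (i - n >= 0
           and tokens[i - n] == tokens[len(tokens) - 1 - n]
           and tokens[i - n] not in breakers):
        n += 1
    return n

# "roses are red" recurs, so the match ending at position 3 has length 3.
# Without the ":" breaker it would extend to length 4 and keep growing
# across message boundaries, which is exactly the template problem
# described above.
words = [":", "roses", "are", "red", ":", "roses", "are", "red"]
length = match_length(words, 3, breakers)
```

Because the breaker stops the match, template boilerplate preceding each message no longer inflates `n`, and the exponential penalty only applies to genuine repetition inside the message text.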
How to use
DRY is disabled by default (`multiplier` set to 0). It can be configured from the Parameters tab; I recommend the following parameter values:

Note that like all transformers-based samplers, DRY only works with transformers-based loaders such as llamacpp_HF, ExLlamav2_HF, or Transformers itself. It does not work with the vanilla llama.cpp or ExLlamav2 loaders.
If you want the model to regularly repeat certain sequences verbatim (e.g. long character names in chat formats), you can add the individual words comprising those sequences to the `sequence_breakers` list (for names, just add first and last names there as separate strings). This will prevent DRY from distorting such sequences, and allow them to appear any number of times in the output. If you are building a chat interface that leverages DRY, you could do this automatically for your users, since you know the character names already.

Demonstration
To show DRY in action, I have written a short chat script that strongly incentivizes the model to loop:
Here's how Mistral-7b-Instruct-v0.2 continues with all samplers disabled:
As expected, the model picks up the pattern and repeats itself.
Now let's use a traditional (multiplicative) repetition penalty of 1.3. We get:
Even though 1.3 is a very high value for the repetition penalty that clobbers English grammar, it doesn't stop the model from repeating itself if the structure of the text suggests it so strongly.
Now instead, we use DRY with parameters 2/0.8/1.75 (standard parameters recommended above). The model outputs (after some attempts that generate garbage):
DRY simply does not allow the model to repeat such a long sequence.
Note that this is an extreme test case for demonstration purposes. Combining a strong incentive to loop with a strong penalty for looping will often produce garbage. In practice, using DRY prevents such situations from occurring in the first place, and the output is much more natural.
TODO