DRY: A modern repetition penalty that reliably prevents looping #5677

Merged (13 commits) on May 20, 2024

Conversation

@p-e-w (Contributor) commented Mar 10, 2024

Looping is an undesirable behavior where the model repeats phrases verbatim that have previously occurred in the input. It affects most models, and is exacerbated by the use of truncation samplers. Chat formats are particularly susceptible due to their regular structure, which models appear to interpret as an invitation to repeat previous messages in whole or in part. Prompting the model to avoid looping has little or no effect.

The traditional weapons for combating looping are the three flavors of repetition penalty built into most loaders (multiplicative, additive, and frequency penalty). But those samplers are rather blunt instruments that distort the grammar of standard language, which the model has been painstakingly trained to reproduce. I have previously attempted to fix this problem by introducing a parameter that protects the basic structure of language from being penalized, but that is a hacky solution that fails to do the right thing in many cases, and even in their raw form, classical repetition penalties don't actually prevent looping reliably.

In the past weeks, I have rethought the looping problem from the ground up, and in this PR present the DRY repetition penalty, a mechanism that is able to detect textual looping and steer against it. It is far superior to the existing samplers at preventing verbatim repetition, while having essentially none of their negative effects on language structure. The result is less repetitive and higher quality output.

I have tested this sampler for about 20 hours in chat scenarios so far, and those have without question been the highest-quality chats I have ever experienced. Looping in the traditional sense simply does not happen with DRY, and the positive effects of being able to drop the standard repetition penalty are very noticeable.

How it works

DRY penalizes tokens that would extend the end of the input into a sequence that has previously occurred in the input.

[Diagram: the input ends with "roses are red"; because "roses are red" occurred earlier in the input and was followed by "violets", the token "violets" is penalized]

In this example, violets is penalized in the probability distribution generated by the model because the sequence roses are red has previously occurred in the input, and has been continued with violets in that previous case. Therefore, the penalty discourages the model from repeating sequences in its output, which is the definition of looping.

The penalty for a token is calculated as

multiplier * base ^ (n - allowed_length)

where n is the length of the sequence before that token that matches the end of the input, and multiplier, base, and allowed_length are configurable parameters. If the length of the matching sequence is less than allowed_length, no penalty is applied.

Thus the penalty grows exponentially as the repeated sequence gets longer. This will quickly overcome even the strongest tendency of the model to repeat itself. With the right parameter choice, looping is literally impossible with DRY (that is, verbatim textual looping is impossible – the model can of course still repeat itself by paraphrasing and situational looping, but that is far less annoying than the broken-record looping that is common now). All of that happens without affecting non-repeating text in any way.
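
As a concrete illustration, here is a minimal Python sketch of that formula, using the recommended values that appear later in this PR (multiplier 0.8, base 1.75, allowed_length 2). It is only an illustration of the arithmetic, not the PR's actual code:

```python
def dry_penalty(match_length: int,
                multiplier: float = 0.8,
                base: float = 1.75,
                allowed_length: int = 2) -> float:
    """Penalty subtracted from a token's logit when the `match_length` tokens
    preceding it also occurred earlier in the input, followed by that token."""
    if match_length < allowed_length:
        return 0.0  # short overlaps are never penalized
    return multiplier * base ** (match_length - allowed_length)

# The penalty grows exponentially with the length of the repeated sequence:
# dry_penalty(2) -> 0.8, dry_penalty(3) -> 1.4, dry_penalty(4) -> 2.45, dry_penalty(6) -> ~7.5
```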

Sequence breakers

As straightforward as the mechanism described above may appear, it runs into a major problem in practice.

Instruction and chat templates themselves contain lengthy repeating token sequences. For example, with ChatML, the following sequence precedes every message generated by the bot:

\n
<|im_end|> \n
<|im_start|>assistant \n
Bot name: 

That's at least 11 tokens before the first token of the message that are guaranteed to occur previously in the input. With an exponentially increasing penalty being applied (and we definitely don't want 12-token repetitions in normal text), any starting token of a bot message can be used only once in the entire chat. That's a huge problem that distorts how chat messages are generated, e.g. when messages are expected to regularly begin with quotation marks.

To solve this and related issues, I have added another parameter, sequence_breakers, which is a list of tokens that interrupt sequence matching. That is, matches are not continued across such tokens, which effectively breaks the input into parts where matching can be applied.

sequence_breakers can be conveniently specified as a JSON array of strings, which will be encoded into token IDs using the loaded model's tokenizer. The default list consists of \n, :, ", and *.
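
A rough sketch of how such a value might be decoded (assuming a Hugging Face-style tokenizer with an encode method; a later comment in this thread notes that only the last token of each string is extracted, which this sketch follows — it is an illustration of the idea, not the PR's exact code):

```python
import json

def parse_sequence_breakers(value: str, tokenizer) -> set[int]:
    """Turn a JSON array of strings (e.g. '["\\n", ":", "\\"", "*"]') into the
    set of token IDs that interrupt sequence matching. Only the last token of
    each encoded string is used."""
    breaker_ids = set()
    for text in json.loads(value):
        token_ids = tokenizer.encode(text, add_special_tokens=False)
        breaker_ids.add(token_ids[-1])
    return breaker_ids
```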

How to use

DRY is disabled by default (multiplier set to 0). It can be configured from the Parameters tab; I recommend the following parameter values:

[Screenshot: recommended parameter values in the Parameters tab: dry_multiplier 0.8, dry_base 1.75, dry_allowed_length 2]

Note that like all transformers-based samplers, DRY only works with transformers-based loaders such as llamacpp_HF, ExLlamav2_HF, or Transformers itself. It does not work with the vanilla llama.cpp or ExLlamav2 loaders.

If you want the model to regularly repeat certain sequences verbatim (e.g. long character names in chat formats), you can add the individual words comprising those sequences to the sequence_breakers list (for names, just add first and last names there as separate strings). This will prevent DRY from distorting such sequences, and allow them to appear any number of times in the output. If you are building a chat interface that leverages DRY, you could do this automatically for your users as you know the character names already.
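
For example, to let a character name like Chiharu Yamada (mentioned later in this thread) repeat freely, the parameter could be set to the default list plus the name's parts, in the same string format as the API default shown further down:

```python
dry_sequence_breakers = '["\\n", ":", "\\"", "*", "Chiharu", "Yamada"]'
```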

Demonstration

To show DRY in action, I have written a short chat script that strongly incentivizes the model to loop:

Detective: Where were you last night at 6 PM?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Did you know the victim personally?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have money problems?

Suspect: On the advice of my attorneys, I invoke my Fifth Amendment right to not answer that question.

Detective: Do you have a criminal record?

Suspect: On the advice of my attorneys, I invoke

Here's how Mistral-7b-Instruct-v0.2 continues with all samplers disabled:

my Fifth Amendment right to not answer that question.

As expected, the model picks up the pattern and repeats itself.

Now let's use a traditional (multiplicative) repetition penalty of 1.3. We get:

my Fifth Amendment right to not answer that question.

Even though 1.3 is a very high value for the repetition penalty that clobbers English grammar, it doesn't stop the model from repeating itself if the structure of the text suggests it so strongly.

Now instead, we use DRY with parameters 2/0.8/1.75 (allowed_length 2, multiplier 0.8, base 1.75, i.e. the standard parameters recommended above). The model outputs (after some attempts that generate garbage):

secrecy of the grand jury proceedings, which includes my criminal history, if any.

DRY simply does not allow the model to repeat such a long sequence.

Note that this is an extreme test case for demonstration purposes. Combining a strong incentive to loop with a strong penalty for looping will often produce garbage. In practice, using DRY prevents such situations from occurring in the first place, and the output is much more natural.

TODO

  • I have read the Contributing guidelines.
  • More testing (I have rewritten this cleanly from scratch after hacking on the codebase while experimenting, so this version isn't as well tested as what I used previously).
  • Make sure this works over the API.

Penalizes tokens that would extend the end of the input into a sequence that has previously occurred.
@oobabooga (Owner) commented Mar 10, 2024

Some basic comments:

  1. Have you compared how well this works vs the existing no_repeat_ngram_size parameter?
  2. To end a chat turn, the model has to generate something like \nChiharu Yamada: or \nYou:. Is that penalized, such that the model is artificially forced to generate longer replies, or is sequence_breakers enough to prevent this artifact?
  3. repetition_penalty_range should probably be considered in this parameter, just like it is considered in the existing repetition/frequency/presence penalty parameters.

@p-e-w (Contributor, Author) commented Mar 10, 2024

Have you compared how well this works vs the existing no_repeat_ngram_size parameter?

I must admit that although I probably did see that parameter in the Transformers docs at some point in the past, I have never used it and didn't even think of it while developing this.

That being said, no_repeat_ngram_size (which appears to completely forbid all n-gram repetitions over a certain length, and completely allow all below that length) strikes me as something that would produce very unnatural outputs, where suddenly the model slams into a concrete wall where the token it might strongly prefer above all others is hard-disallowed. By contrast, DRY steers the model away from repetition over several successive generation steps, finding the balance point where the model's tendency to repeat is overcome by the penalty. This allows "necessary" repetitions to occur (such as fixed turns of phrase) if the probability distribution is sufficiently skewed, while idle looping is smoothly avoided at an early stage.

no_repeat_ngram_size also appears to lack an equivalent to dry_sequence_breakers, which would make it borderline unusable in practice, just as DRY was before I introduced that parameter.

But now that you have made me (re-)aware of that parameter, I will definitely perform some experiments with it for comparison.

To end a chat turn, the model has to generate something like \nChiharu Yamada: or \nYou:. Is that penalized, such that the model is artificially forced to generate longer replies, or is sequence_breakers enough to prevent this artifact?

It is not penalized. \n is a sequence breaker, so Ch (the first token comprising Chiharu) isn't penalized at all, since there is no preceding sequence that could have previously occurred in the input. The same is true for everything following :, which is also a sequence breaker. Also, sequence breakers themselves are never penalized, so \n etc. can always be freely generated (unlike with the standard repetition penalty, which can lead to wall-of-text replies).

The only sequence that matters here is Chiharu Yamada (5 tokens in Mistral). With the standard parameters, aru will receive an additive penalty of 0.8, which shouldn't be a problem, but will grow rapidly from there. With very long names that are expected to be repeated verbatim in the output every time, this can become an issue, and I have noticed it a few times in my testing. This is of course inherent in every repetition penalty system, and I doubt there's an automated way to handle this, especially since the name can occur not only in the label but also in the message itself. In exceptional cases where this becomes enough of an issue to corrupt names, adding the first name to dry_sequence_breakers (which will automatically extract the last token comprising it) should suffice.
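
To make "grows rapidly from there" concrete, here is the arithmetic with the recommended 0.8/1.75/2 parameters, assuming for illustration that Chiharu Yamada tokenizes as Ch / ih / aru / Yam / ada (an assumed split; the PR only states that the name is 5 tokens in Mistral and that Ch is the first token):

```python
# penalty = multiplier * base ** (n - allowed_length)
for n, token in [(2, "aru"), (3, "Yam"), (4, "ada")]:
    print(token, round(0.8 * 1.75 ** (n - 2), 2))
# aru 0.8
# Yam 1.4
# ada 2.45
```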

repetition_penalty_range should probably be considered in this parameter, just like it is considered in the existing repetition/frequency/presence penalty parameters.

Not doing that was actually intentional, as I don't believe verbatim repetition of long sequences is ever something the user wants, no matter how far back they occurred previously. But I can of course add it (probably as a separate parameter so it can be controlled independently of the standard repetition penalty, where that parameter makes much more sense to keep small).

@p-e-w (Contributor, Author) commented Mar 12, 2024

Update

  • Added a parameter to control the range over which DRY looks for matching sequences in the input, mirroring the classical repetition penalties.
  • More testing with both chat and creative writing. Confirmed that the recommended parameters work well for both use cases.
  • Confirmed that the parameters work over the API.
  • Did some experiments with no_repeat_ngram_size. As expected, that parameter is unusable for chat formats. Even without template markup, chat logs at minimum need to contain repeating structures like "\n\nName: ", which is already 6 tokens (more if the name is more complex). So to generate well-formed chat output, no_repeat_ngram_size must be at least 7. But that means that such pearls of GPT prose as her voice barely above a whisper cannot be penalized. And I certainly don't want to see such phrases twice in a chat (I don't even want to see them once, but that's not something a sampler can fix 🤷). By comparison, DRY can easily prevent even shorter phrases from repeating. DRY can also emulate no_repeat_ngram_size by setting dry_multiplier to a huge number and dry_allowed_length to no_repeat_ngram_size - 1, which gives you essentially the original no_repeat_ngram_size plus the benefit of sequence breakers (see the sketch below). Overall, I just don't think setting a hard limit on how long repeating sequences may be is the right approach for natural languages.
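
A sketch of the emulation mentioned in the last bullet above; the parameter names are the ones this PR introduces, and the specific "huge number" is an arbitrary choice for illustration:

```python
# Roughly emulate no_repeat_ngram_size = N with DRY, while keeping sequence breakers:
N = 7
dry_settings = {
    "dry_multiplier": 1e9,          # so large that any penalized token is effectively banned
    "dry_base": 1.75,               # mostly irrelevant here; the multiplier already dominates
    "dry_allowed_length": N - 1,    # matches shorter than N - 1 tokens stay unpenalized,
                                    # so no repeated N-gram can ever be completed
}
```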

@belladoreai (Contributor) commented Mar 12, 2024

For what it's worth, I've done a lot of experimentation with no_repeat_ngram_size in the past, and I can confirm it's fairly useless in a chat context. It might be useful in other contexts, especially those where the input is relatively small. But when a chat message history grows, using no_repeat_ngram_size typically leads to situations where the model is intentionally writing broken English (like writing "engglish" instead of "english"), where the brokenness of the language just grows more and more absurd over time. This seems to happen because in many cases (especially with smaller models) the model perceives repetitive output to be extremely likely; so likely that even broken versions of the repetitive output appear more likely than some other alternative continuation of the text. So when we prevent the model from generating the exact same repetitive continuation, it chooses a broken alternative version of the same repetitive text instead of choosing some more natural text.

I do not recommend using no_repeat_ngram_size except at very high values, if no other "circuit breaker" for repetition exists.

I have not tested this PR and I do not know how well this PR works in comparison.

@Hunterius8 commented

@p-e-w I really like this change. However, one thing I've noticed is that the generation speed decreased as I increased the dry_range, while using the exact same context. Is this something you've experienced, and/or is it expected? It could also just be an issue on my end, or maybe even a model-specific thing for Yi models.

@p-e-w (Contributor, Author) commented Mar 29, 2024

@Hunterius8

Could you quantify that? What is your tokens/s with and without DRY?

On my dev machine, I'm seeing 4.99 tokens/s with DRY and 4.98 tokens/s without it. I'm running Mixtral Q5_K_M with 8192 context size, and dry_range = 0, meaning it goes over the full context window.

For DRY to noticeably impact the generation speed (assuming a baseline of no more than a few dozen tokens/s), the invocation would have to take tens of milliseconds. The matching operation starts with

match_indices = (input_ids_row[:-1] == last_token).nonzero()

which I believe should be GPU-accelerated. Afterwards, the number of tokens that must be checked "manually" is reduced dramatically, and should be in the low hundreds at most (often much less), which should take less than a millisecond. Not sure what's going on in your case yet.
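
For reference, here is a minimal sketch of the matching step described above (PyTorch-style; the names and structure are illustrative and simplified, not the PR's exact code):

```python
import torch

def repeated_suffix_lengths(input_ids_row: torch.Tensor,
                            breaker_ids: set[int]) -> dict[int, int]:
    """For every earlier occurrence of the current last token, measure how many
    tokens before it match the end of the input (stopping at sequence breakers),
    and record the token that followed that occurrence as a penalty candidate."""
    last_token = input_ids_row[-1].item()
    # GPU-friendly: locate all earlier occurrences of the last token at once.
    match_indices = (input_ids_row[:-1] == last_token).nonzero()

    match_lengths: dict[int, int] = {}
    for i in match_indices.flatten().tolist():
        candidate = input_ids_row[i + 1].item()  # token that followed this occurrence
        if candidate in breaker_ids:
            continue
        # Extend the match backwards from position i and from the end of the input.
        length = 1
        while (i - length >= 0
               and input_ids_row[i - length].item() not in breaker_ids
               and input_ids_row[i - length] == input_ids_row[-1 - length]):
            length += 1
        match_lengths[candidate] = max(match_lengths.get(candidate, 0), length)
    return match_lengths

# The exponential penalty from the formula above is then subtracted from the logits
# of each candidate whose match length reaches dry_allowed_length.
```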

@Hunterius8 commented Mar 30, 2024

@p-e-w

Yeah, ran through a few generations again, here's the tokens/s with every sampler turned off.
[Screenshot: tokens/s with all samplers turned off]

Then for the next one I turned on just DRY and set the range to 2048.
[Screenshot: tokens/s with only DRY enabled, range 2048]

And for the last one I set the DRY range to 0.
[Screenshot: tokens/s with only DRY enabled, range 0 (full context)]

On my end at least, it seems to have a pretty big impact on the generation speed. I'm wondering if it isn't because I'm using an exl2 quant of the yi-34b-200k model, so I'll retry with a gguf model later.

Update: Got the same results for the gguf models I tested. The issue also persisted on a completely fresh install.

@Touch-Night (Contributor) commented

I think you can write an academic paper about it.

@p-e-w (Contributor, Author) commented Apr 6, 2024

@Hunterius8

I see, that's a lot more context than I've ever run, combined with a pretty high base performance, so this is probably the reason I don't notice it in my own setup.

That being said, I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug, it's just how long it takes to do that thing. If someone has a magic fix to make it faster then I'm all ears. Personally, I would run DRY even if it cost me half the performance, because the output is so much better. But it's disabled by default so everyone can make their own choice.

@belladoreai (Contributor) commented

I'm not sure what can be done about it, because I don't think the algorithm can be vectorized the way other samplers are. This isn't really a bug, it's just how long it takes to do that thing. If someone has a magic fix to make it faster then I'm all ears.

The algorithm doesn't have to be vectorized, it can (most likely) be optimized in other ways (by reducing the asymptotic time complexity).

That said, 19k context is massive, and if the sampler currently slows the generation only by 50% at such a huge context, then I don't think it's worth it to add complexity to the codebase by optimizing the algorithm.

And all that said, if @oobabooga feels that it should be optimized for performance, I should be able to help with this.

@Priestru commented Apr 9, 2024

Honestly, all that is needed is a warning that performance will be lower. This thing is crucial; we need it ASAP. Also, is it possible to add something like a vocabulary of phrases and words that we want penalized right off the bat?

dry_multiplier: float = 0
dry_base: float = 1.75
dry_allowed_length: int = 2
dry_sequence_breakers: str = '["\\n", ":", "\\"", "*"]'
@oobabooga (Owner) commented on this diff:

I'm not comfortable with this representation for the sequence breakers. I think that it should be processed in the same way as "Custom stopping strings" for consistency, without the [] list syntax.

@p-e-w (Contributor, Author) replied:

I expect that most clients will use a JSON library to build this value from some internal list representation. That library will output the format used here, no further processing required.

Telling developers "pass a JSON array" makes everything clear, including details like which quotation marks are valid, and how escape sequences work. "Pass something like a JSON array, but without the brackets" just sounds weird.

IMO, if anything, it is the stopping strings that should be changed to match this parameter.

@oobabooga (Owner) commented

I have made the following changes:

  • Make it a LogitsProcessor like other repetition penalties
  • Reuse the repetition_penalty_range parameter (I don't want to add a new parameter that does the same thing, and there is no reason to use more than 1 type of repetition penalty at the same time)
  • Minor UI changes

My remaining concerns are two:

  1. The dry_sequence_breakers format, as commented above
  2. About the base and multiplier parameters, is base really needed? Is there a reason not to hardcode it at 1.75 and leave only multiplier, for simplicity and fewer parameters?

@oobabooga (Owner) commented

Also, any tests on whether things still work as expected after my changes are welcome.

@p-e-w (Contributor, Author) commented Apr 13, 2024

Make it a LogitsProcessor like other repetition penalties

That means losing control over DRY's position in the sampler stack, right? I think it can be valuable to be able to choose when the penalty is applied (that goes for the traditional repetition penalty as well).

The most important thing is that the DRY penalty is applied before any truncation samplers. Is that still guaranteed to be true if it is a LogitsProcessor?

Reuse the repetition_penalty_range parameter (I don't want to add a new parameter that does the same thing, and there is no reason to use more than 1 type of repetition penalty at the same time)

Actually, I sometimes combine DRY with a very small standard repetition penalty such as 1.03 nowadays, to curb the tendency of some models to frequently use the same terms. Taking into account the performance impact noted by @Hunterius8, this does provide some justification for keeping the parameters separate.

Minor UI changes

👍 Agreed, this order makes more sense. The parameter that controls whether DRY is active now comes first.

About the base and multiplier parameters, is base really needed? Is there a reason not to hardcode it at 1.75 and leave only multiplier, for simplicity and fewer parameters?

I really dislike hardcoding magic values. The recommended value of 1.75 is the result of some experimentation, and I have used values between 1.2 and 3.0 with some success. Considering how sensitive the growth of the penalty is to this parameter, I would prefer to keep it.

Also, any tests on whether things still work as expected after my changes are welcome.

I will run the branch with your changes for a few days and then let you know if there are any problems.

@p-e-w (Contributor, Author) commented Apr 13, 2024

@Priestru

Also, is it possible to add something like a vocabulary of phrases and words that we want penalized right off the bat?

I plan to implement exactly that in a followup. My long-term vision is to have community-maintained phrasebooks of things like fanfiction clichés (her tears were clinging to her eyelashes like morning dew etc.) that people can select in a frontend like SillyTavern, which will then be passed to DRY in order to prevent such garbage from ever appearing in the output.

@Hunterius8 commented

I would agree that there's merit to having separate range parameters for DRY and the regular repetition penalties, not just for performance reasons, but also because I believe those two parameters have very different sweet spots when using both at the same time. From my experimentation, using a low presence penalty with a range of about 1000 tokens in conjunction with a much higher DRY range, somewhere around 8000 tokens, works really well on Yi models, for example.

Just using DRY, there's no way to penalize the repetition of the tokens that follow the DRY sequence breakers. Applying any of the regular repetition penalties over the same range that works really well for DRY will probably penalize too many tokens and hurt the output quality.

@Priestru commented

Why is this stuck and still not merged when it worked amazingly a month ago? Every time I read barely above a whisper I log in here to check whether this has been added yet, and it never has. Manipulating the code to get this working locally becomes increasingly hard, especially after no-value commits have been added to this pull request. I have to choose between other updates and this. @p-e-w, is there any chance your magnificent creation could find its way into production? Maybe something else supports it?

@p-e-w (Contributor, Author) commented Apr 24, 2024

@l3utterfly is porting DRY to llama.cpp: ggerganov/llama.cpp#6839

@p-e-w (Contributor, Author) commented Apr 26, 2024

@oobabooga

Could you give me a hint on how to proceed here? Do you plan to merge this PR? If so, what are the remaining steps?

@belladoreai (Contributor) commented

I specifically referred to this, where, if the tensor resides on the GPU (which it does in most cases on a GPU-enabled system), the tolist operation will make an implicit GPU->RAM memory transfer and then convert the values to a Python list.

Okay, now I understand better what you mean. So we're talking about this loop:

for input_ids_row, scores_row in zip(input_ids, scores):
    input_ids = input_ids_row.tolist()

As far as I can tell, that loop never loops more than 1 time. There is only ever one "input_ids_row". So we don't end up doing the transformation more than once even though we do it inside the loop. (Please correct me if I'm wrong.)

It read to me that the decision was based on assumptions instead of verified facts. Apologies if that was a misunderstanding of what you meant to convey.

My assumption that iterating on a Python list would be faster than iterating on a numpy array is based on experiments done by other people, which I linked in the PR: numpy/numpy#16985

I don't know if that counts as "verified facts". If you or someone else wants to benchmark the numpy array option against the Python list option, go ahead; let's use whichever option is faster. I just didn't personally invest time into benchmarking these options, because the improvements in #6053 already eliminate the performance impact of DRY almost completely (even at 20k token context, according to Hunterius8's experiments reported in the PR). I'm sure we could find places to micro-optimize further, but at this point I would rather just say it's good enough, let's merge and work on something else next.

@jojje commented Jun 3, 2024

I just didn't personally invest time into benchmarking....

I said to profile. As you likely already know, profiling tells you what is actually taking time and where. Benchmarking is black-box testing of a specific subset of scenarios for a unit; it is not comprehensive, whereas the former is. My point was just that I didn't find objective data to suggest the right hotspots had been addressed. That's all.

But with the insight below, I can say we've established where the hotspots are, since I did do the profiling already. It's the GPU memory transfer. So the code as you've written it is good.

As far as I can tell, that loop never loops more than 1 time

Ah, you're right. The input dimension is (1, <used-context-so-far>).
The use of zip for that operation then seems to be a very convoluted way of expressing squeeze (dropping the first dimension) from each of the two tensors.

That use of zip really should be added to the catalog of 'leet coding trivia.
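
In other words (a self-contained illustration of the point, not the PR's code), with a batch dimension of 1 the loop and a squeeze do the same thing:

```python
import torch

input_ids = torch.randint(0, 1000, (1, 8))  # shape (1, seq_len): batch size is 1
scores = torch.randn(1, 1000)

# The zip loop iterates exactly once over the batch dimension...
for input_ids_row, scores_row in zip(input_ids, scores):
    pass

# ...so it is effectively just dropping the leading batch dimension:
input_ids_row, scores_row = input_ids.squeeze(0), scores.squeeze(0)
```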

@belladoreai (Contributor) commented

Yeah, I don't like that zip operation either, but that was there before my changes, and #6053 represents the minimal necessary changes to fix the issues, without any extra stuff like trying to refactor existing code (which I think is pretty readable aside from that zip operation anyway).

pi6am added a commit to pi6am/koboldcpp that referenced this pull request Jul 8, 2024
The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
oobabooga/text-generation-webui#5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 8, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 11, 2024
LostRuins pushed a commit to LostRuins/koboldcpp that referenced this pull request Jul 13, 2024
* Add the DRY dynamic N-gram anti-repetition sampler

The DRY (Do not Repeat Yourself) sampler is a dynamic N-gram
repetition penalty that negatively scores tokens that would extend
sequences that already appear in the context.

See this discussion for a motivation and explanation of the sampler:
oobabooga/text-generation-webui#5677

This implementation of DRY mostly aligns with the oobabooga version
with a few modifications. It uses a more efficient linear scanning
algorithm to identify repetitions. It also supports multi-token
sequence breakers. As a limitation, this implementation reuses
the rep pen range parameter, rather than introducing a new range
just for the DRY sampler.

There is a separate change to lite.koboldai.net that exposes the DRY
sampler parameters to KoboldAI Lite, so none of the embed files have
been changed as part of this commit.

* Update default DRY parameters to match lite

* Improve DRY token debug logging

* Replace `and` with `&&` to fix MSVC compile error

Little known fact: The C++98 standard defines `and` as an
alternative token for the `&&` operator (along with a bunch
of other digraphs). MSVC does not allow these without using
the /Za option or including the <iso646.h> header. Change to
the more standard operator to make this code more portable.

* Fix MSVC compile error because log is not constexpr

Replace the compile-time computation with a floating-point
approximation of log(std::numeric_limits<float>::max()).

* Remove unused llama sampler variables and clean up sequence breakers.

* Remove KCPP_SAMPLER_DRY as a separate enum entry

The DRY sampler is effectively a repetition penalty and there
are very few reasons to apply it at a different place in sampler
order than the standard single-token penalty. There are also
multiple projects that have dependencies on the existing sampler
IDs, including KoboldAI, KoboldAI Lite, and Silly Tavern. In order
to minimize the impact of the dependencies of adding the DRY sampler
to koboldcpp, it makes the most sense to not add a new ID for now,
and instead to piggyback on KCPP_SAMPLER_REP_PEN. In the future
if we find a use case for splitting the application of rep pen and DRY
we can introduce a new enum entry then.

* Add the dry_penalty_last_n to independently control DRY penalty range

This parameter follows the oobabooga semantics: it's optional, with a
default value of zero. Zero means that DRY should sample the entire
context. Otherwise, it's the number of tokens from the end of the
context that are scanned for repetitions.

* Limit sequence breaker lengths in tokens and characters

The core DRY sampler algorithm is linear in the context length, but
there are several parts of the sampler related to multi-token
sequence breakers that are potentially quadratic. Without any
restrictions, a suitably crafted context and sequence breaker could
result in a denial-of-service attack on a server running koboldcpp.
This change limits the maximum number of characters and the maximum
token length of a sequence breaker in order to limit the maximum
overhead associated with the sampler.

This change also improves some comments, adding more detail and
changing the wording to increase clarity.
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
@MarkShotGitHub commented

I admit to not reading the entire thread. I did read the explanation of DRY, and I think it is a fantastic chat innovation. There was nothing more frustrating than the death spiral of a good chat.

Recently (in the last 4 weeks), a friend and I got SuperboogaV2 working. There were a number of bugs: no VDB persistence, randomized collection names, loss of the collection due to the variable handling. So memory is now working quite nicely. I have an RTX 4096 with 32K context, and after each session I post-process memories into the VDB.

The problem is that with MANUAL = FALSE, the AI will go into repeating-like behavior. NOTE: this is not the same problem that DRY solves, as repeating AIs will keep doing so even when shot and bleeding out. How is this problem different? You can break out of the repetition by radically changing the topic. (I believe all the returned chunks are messing with the DRY implementation in the text_generation.py module.)

I have managed to resolve the issue by setting MANUAL=TRUE and using "!c" when needed.

It would be nice if someone could look at the issue that is caused by the return of chunks. I can point you to a copy of the working chroma_db.py module for SuperboogaV2.

THANKS!

FYI: The initial post-processed one-year chat corpus was 1M+ words and 44,000 embeddings.

PS: I am a retired software engineer who began with punched cards and RJEs. My Python skills are not up to this challenge.

@ghost mentioned this pull request Sep 8, 2024
@p-e-w mentioned this pull request Sep 20, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Oct 22, 2024