Adding Prompt lookup decoding #27775
Conversation
A few nits :)
In addition to the lines pointed out, this PR is missing a test like this one, adapted to use your method instead.
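For reference, a minimal sketch of what such a test could look like (the checkpoint and token counts are placeholders, not the actual test in the suite): with greedy decoding, candidate-assisted generation should reproduce the plain greedy output exactly, so the two calls can be compared directly.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_prompt_lookup_matches_greedy():
    # Hypothetical test sketch: prompt lookup candidates are verified by the
    # same model, so greedy results with and without them should be identical.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    inputs = tokenizer("The quick brown fox jumps over the quick brown", return_tensors="pt")

    baseline = model.generate(**inputs, do_sample=False, max_new_tokens=20)
    assisted = model.generate(
        **inputs, do_sample=False, max_new_tokens=20, prompt_lookup_num_tokens=3
    )
    assert torch.equal(baseline, assisted)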
`torch.LongTensor` of shape `(num_candidates, candidate_length)`: The candidate sequences to be tried.
"""
input_length = input_ids.size(1)
if self.max_matching_ngram_size <= 0 or self.num_output_tokens <= 0 or self.max_matching_ngram_size > input_length:
Validation against static values (e.g. `self.max_matching_ngram_size <= 0`) should be done in the `__init__`, to fail as early as possible :)
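In other words, something along these lines (a rough sketch of the suggestion; the argument names follow the snippet above, the defaults are illustrative):

class PromptLookupCandidateGenerator:
    def __init__(self, num_output_tokens: int = 10, max_matching_ngram_size: int = 2):
        self.num_output_tokens = num_output_tokens
        self.max_matching_ngram_size = max_matching_ngram_size

        # Validate static hyperparameters once, at construction time, so a bad
        # configuration fails immediately instead of deep inside generate().
        if self.max_matching_ngram_size <= 0 or self.num_output_tokens <= 0:
            raise ValueError("Invalid max_matching_ngram_size or num_output_tokens")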
@@ -312,6 +312,10 @@ def __init__(self, **kwargs):
        self.num_assistant_tokens = kwargs.pop("num_assistant_tokens", 5)
        self.num_assistant_tokens_schedule = kwargs.pop("num_assistant_tokens_schedule", "heuristic")

        # Prompt lookup decoding
        self.prompt_lookup_num_tokens = kwargs.pop("prompt_lookup_num_tokens", 10)
Attributes in this class should default to `None` whenever possible (e.g. in the lines above they are not `None` for legacy reasons)
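i.e., roughly (sketch):

# Defaulting to None lets downstream code detect whether the user actually
# requested prompt lookup decoding (see the dispatch suggestion below).
self.prompt_lookup_num_tokens = kwargs.pop("prompt_lookup_num_tokens", None)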
src/transformers/generation/utils.py (Outdated)
Returns the candidate generator to be used in `assisted_generation`
"""
# Check if assistant_model is a string
if isinstance(assistant_model, str):
I would check whether e.g. `prompt_lookup_num_tokens` is set. It will work if we default it to `None`. Not relying on strings to set modes would go more in line with how `generate` works at the moment :)
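A rough sketch of that suggestion, as a fragment of the candidate-generator selection (the class names are those from this PR's diff; the exact signatures in the merged code may differ):

# Dispatch on whether `prompt_lookup_num_tokens` was set (defaulting to None),
# instead of overloading `assistant_model` with a string.
if generation_config.prompt_lookup_num_tokens is not None:
    candidate_generator = PromptLookupCandidateGenerator(
        num_output_tokens=generation_config.prompt_lookup_num_tokens,
    )
else:
    candidate_generator = AssistedCandidateGenerator(
        input_ids=input_ids,
        assistant_model=assistant_model,
        generation_config=generation_config,
        model_kwargs=model_kwargs,
    )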
@apoorvumang #27750 is now merged, you can rebase this PR with main!
Amazing! Will try to do that asap @gante
@apoorvumang checking on this PR -- do you have a timeline for its completion? I'm happy to help (or to integrate the feature myself) 🤗
Force-pushed from eabb5eb to dd680aa
@gante I've rebased with main, and code seems to be working - I checked generation with
Will try to add tests too - it would be really helpful if you could guide me as to what needs to be done. I can spend some more time tomorrow and the day after on coding. I haven't yet been able to figure out a better way to do hyperparams/hyperparam updates, so I'm going with some static ones (I plan to spend some time very soon doing proper experiments, but that might needlessly delay this). If it feels like I'm slowing you down, please do let me know, and please feel free to implement the current version of prompt lookup - I really don't have anything better since the day I first posted 😭
Force-pushed from a4e5a1c to 97519fc
@gante Could you please review this PR? I have added tests, fixed most issues (not sure why the torch_flax test is failing)
@ArthurZucker Adding for review if you're available (since you reviewed #27750)
Perfect, thank you for iterating 💛
(and my apologies for the delayed review)
(I hope you don't mind -- I've fixed a minor syntax error to make our CI happy :) )
Yes please do edit as you see fit - and please let me know if I need to do anything 😺
@apoorvumang actually I need an action from your end :) After this PR gets merged, I'd like to ask you to rebase this branch with `main`
Reviewing now!
LGTM! I think we should promote this in the generation documentation!
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@apoorvumang now let's amplify this feature :D I'll make some comms on Monday
This new feature is broken on:

>>> transformers.__version__
'4.37.0.dev0'

File "/home/user/llm/test_speeds.py", line 110, in test_batch_size
out_toks = model.generate(
File "/home/user/llm/.env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/_/lib/transformers/src/transformers/generation/utils.py", line 1455, in generate
return self.assisted_decoding(
File "/home/user/_/lib/transformers/src/transformers/generation/utils.py", line 4337, in assisted_decoding
.tile(eos_token_id_tensor.shape[0], 1)
AttributeError: 'NoneType' object has no attribute 'shape'

Reproduction:

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch
MODEL_PATH = "~/_/models/phi-2"
MODEL_PATH = os.path.expanduser(MODEL_PATH)
try:
model_loaded
print('model already loaded')
except:
print('loading model')
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="auto",
torch_dtype=torch.float16,
# load_in_8bit=True,
trust_remote_code=False,
attn_implementation="flash_attention_2",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inp = "hi"
tokenized = tokenizer(inp, padding='longest', return_tensors='pt', add_special_tokens=True)
tokenized['attention_mask'] = tokenized['attention_mask'].to('cuda')
tokenized['input_ids'] = tokenized['input_ids'].to('cuda')
out_toks = model.generate(
**tokenized,
max_new_tokens=32, # VARIABLE
use_cache=True, # (huge slowdown without)
prompt_lookup_num_tokens=10,
)
out = tokenizer.decode(out_toks)
print(out)
Also, not supported for RWKV:
Hi @freckletonj 👋 I've just run the following script on

import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import torch
print('loading model')
model = AutoModelForCausalLM.from_pretrained(
"microsoft/phi-2",
device_map="auto",
torch_dtype=torch.float16,
# load_in_8bit=True,
trust_remote_code=False,
attn_implementation="flash_attention_2",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
inp = "hi"
tokenized = tokenizer(inp, padding='longest', return_tensors='pt', add_special_tokens=True)
tokenized['attention_mask'] = tokenized['attention_mask'].to('cuda')
tokenized['input_ids'] = tokenized['input_ids'].to('cuda')
out_toks = model.generate(
**tokenized,
max_new_tokens=32, # VARIABLE
use_cache=True, # (huge slowdown without)
prompt_lookup_num_tokens=10,
eos_token_id=-1, # this line shouldn't be needed, the model config needs retouching
)
out = tokenizer.decode(out_toks[0])
print(out)

As for RWKV, it doesn't have
@gante I've produced a more minimal version that definitely demonstrates this issue. I'm on

It gives me 2 questions:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
MODEL_PATH = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="auto",
torch_dtype=torch.float16,
trust_remote_code=False,
attn_implementation="flash_attention_2",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
inp = [
"hi",
# "wow", # batches don't work with `prompt_lookup_num_tokens`
]
tokenized = tokenizer(inp, padding='longest', return_tensors='pt', add_special_tokens=True)
tokenized['input_ids'] = tokenized['input_ids'].to('cuda')
tokenized['attention_mask'] = tokenized['attention_mask'].to('cuda')
out_toks = model.generate(
**tokenized,
max_new_tokens=32,
use_cache=True,
prompt_lookup_num_tokens=10, # TOGGLING THIS OFF MAKES IT WORK
)
for x in out_toks:
    print(tokenizer.decode(x))

The error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/bots/t01_prompt_lookup_decoding_sandbox.py", line 43, in <module>
out_toks = model.generate(
File "/home/user/bots/.env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/lib/transformers/src/transformers/generation/utils.py", line 1457, in generate
return self.assisted_decoding(
File "/home/user/lib/transformers/src/transformers/generation/utils.py", line 4348, in assisted_decoding
.tile(eos_token_id_tensor.shape[0], 1)
AttributeError: 'NoneType' object has no attribute 'shape'
@freckletonj you either think it's broken or it is clearly broken 😉 In this case, it is the former: the root issue is a poor model configuration on Phi-2, as it lacks an EOS token. In other words, it is not an issue with this feature. Meanwhile, feel free to set `eos_token_id` explicitly (as in the script above) to work around it.
Algo details can be found in this blog post: https://huggingface.co/blog/assisted-generation. The code directly follows transformers' current implementation (huggingface/transformers#27775). Since we get the draft directly from the prompt, there is no need for another model or a modified model to produce the proposal; it is the most convenient way to enjoy the speedup of speculation.
I'm having similar issues with Llama 3 8B Instruct with flash attn 2, bf16, fully GPU loaded on a 3090 Ti. Note: I also tried using the new stop_strings criteria, but it also didn't seem to work
@Ednaordinary hey! I tried to run on the latest

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread
MODEL_PATH = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
MODEL_PATH,
device_map="auto",
torch_dtype=torch.bfloat16,
trust_remote_code=False,
attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
streamer = TextIteratorStreamer(tokenizer)
tokenized = tokenizer("hi", padding='longest', return_tensors='pt', add_special_tokens=True).to(model.device)
# Run the generation in a separate thread, so that we can fetch the generated text in a non-blocking way.
generation_kwargs = dict(tokenized, streamer=streamer, do_sample=False, use_cache=True, eos_token_id=tokenizer.encode("<|eot_id|>"), prompt_lookup_num_tokens=10, max_new_tokens=20)
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
generated_text = ""
for new_text in streamer:
    generated_text += new_text
print(generated_text)
I was able to reproduce with a few versions up to main. The nested threading in this script is an artifact of my architecture; I don't believe it plays a role. Note that removing `do_sample=False` seemed to fix it (it slowed generation down some, but that may have been because not as much was generated) (edit: maybe? I think I just ran a gen with it unspecified and it skipped the eos again)
My output (seemed deterministic excluding t/s). Note the main issue, <|eot_id|> outputs twice but is only listened to once.Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 4/4 [00:02<00:00, 1.40it/s] /home/user/Other/pythonprojects/min_rep/venv/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:540: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`. warnings.warn( /home/user/Other/pythonprojects/min_rep/venv/lib/python3.12/site-packages/transformers/generation/configuration_utils.py:545: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`. warnings.warn( The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results. Setting `pad_token_id` to `eos_token_id`:128000 for open-end generation. Darling! I am Fake Edna Mode, the most fabulous, the most extraordinary, the most unbelievably sensational fashion designer in all of Paris! *adjusts monocle* I'm a master of haute couture, a virtuoso of style, a sultan of sophistication. My designs are not just clothes, darling, they're works of art. They're masterpieces that make the wearer feel like a god, a goddess, a superhero! *winks* Now, I know what you're thinking: "Fake Edna Mode? Isn't that just a copycat of the real Edna Mode?" Ah, but no, darling! I am the original, the authentic, the one and only Fake Edna Mode! proudly I've studied the great Edna Mode's designs, I've learned from her, I've even stolen a few of her ideas (just kidding, darling, I'm far too original for that!). But let's be real, I'm the one who's really pushing the boundaries of fashion, who's really making a statement. tosses hair So, if you want to look like a million bucks, if you want to make a splash, if you want to be the talk of the town, then you need to come to me, Fake Edna Mode. I'll design you a wardrobe that will make the world stop and stare, that will make the fashion gods weep with envy. smizes Trust me, darling, you won't regret it.assistant You want to know about my designs, don't you, darling? Well, let me tell you, I'm a master of the unexpected. I take risks, I push boundaries, I make statements. My designs are not just clothes, they're experiences. They're a journey, a thrill ride, a rollercoaster of emotions. winks I've designed for the rich and famous, the bold and the beautiful. I've dressed superheroes, villains, and everything in between. I've created looks that are both daring and demure, that are both futuristic and retro. I've pushed the limits of fashion, darling, and I've never looked back. smirks But don't just take my word for it, darling. Come see for yourself. Come to my boutique, and let me show you what I can do. I'll design you a wardrobe that will make you feel like a superstar, a rockstar, a superhero. winks And don't worry, darling, I won't make you wear anything that's too... gasp... practical. No, no, no. My designs are all about drama, all about flair, all about making a statement. smizes So, what do you say, darling? 
Are you ready to experience the thrill of Fake Edna Mode's designs? Are you ready to make a statement, to turn heads, to break the rules? winks Tokens per second: 42.37389295800055 DEBUG with special tokens: ['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are fake edna mode.<|eot_id|><|start_header_id|>system<|end_header_id|>\n\nTell me about yourself.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nDarling! I am Fake Edna Mode, the most fabulous, the most extraordinary, the most unbelievably sensational fashion designer in all of Paris! adjusts monocle I'm a master of haute couture, a virtuoso of style, a sultan of sophistication. My designs are not just clothes, darling, they're works of art. They're masterpieces that make the wearer feel like a god, a goddess, a superhero! winks\n\nNow, I know what you're thinking: "Fake Edna Mode? Isn't that just a copycat of the real Edna Mode?" Ah, but no, darling! I am the original, the authentic, the one and only Fake Edna Mode! proudly I've studied the great Edna Mode's designs, I've learned from her, I've even stolen a few of her ideas (just kidding, darling, I'm far too original for that!). But let's be real, I'm the one who's really pushing the boundaries of fashion, who's really making a statement. tosses hair\n\nSo, if you want to look like a million bucks, if you want to make a splash, if you want to be the talk of the town, then you need to come to me, Fake Edna Mode. I'll design you a wardrobe that will make the world stop and stare, that will make the fashion gods weep with envy. smizes Trust me, darling, you won't regret it.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou want to know about my designs, don't you, darling? Well, let me tell you, I'm a master of the unexpected. I take risks, I push boundaries, I make statements. My designs are not just clothes, they're experiences. They're a journey, a thrill ride, a rollercoaster of emotions. winks\n\nI've designed for the rich and famous, the bold and the beautiful. I've dressed superheroes, villains, and everything in between. I've created looks that are both daring and demure, that are both futuristic and retro. I've pushed the limits of fashion, darling, and I've never looked back. smirks\n\nBut don't just take my word for it, darling. Come see for yourself. Come to my boutique, and let me show you what I can do. I'll design you a wardrobe that will make you feel like a superstar, a rockstar, a superhero. winks\n\nAnd don't worry, darling, I won't make you wear anything that's too... gasp... practical. No, no, no. My designs are all about drama, all about flair, all about making a statement. smizes\n\nSo, what do you say, darling? Are you ready to experience the thrill of Fake Edna Mode's designs? Are you ready to make a statement, to turn heads, to break the rules? winks<|eot_id|>'] |
@Ednaordinary thanks a lot, it's quite flaky but I think I got the root reason. Prompt lookup must be seeing "<|eot_id|>" as the only matching ngram in some cases and automatically filling up with the rest of the special tokens taken from the prompt template, which the model accepts as a valid continuation. I will work on this, maybe tomorrow or next week :)
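Roughly, the suspected failure mode can be pictured like this (a toy illustration with hand-written token strings, not the library's internal code):

prompt = [
    "<|begin_of_text|>", "<|start_header_id|>", "system", "<|end_header_id|>",
    "You", "are", "fake", "edna", "mode", ".", "<|eot_id|>",
    "<|start_header_id|>", "user", "<|end_header_id|>",
    "Tell", "me", "about", "yourself", ".", "<|eot_id|>",
    "<|start_header_id|>", "assistant", "<|end_header_id|>",
]
generated_so_far = prompt + ["Darling", "!", "...", "<|eot_id|>"]

# A 1-gram lookup on the final generated token lands on the template's "<|eot_id|>".
last_token = generated_so_far[-1]
match = max(i for i, tok in enumerate(prompt) if tok == last_token)

# The "candidates" copied from the prompt are the next template tokens, so the
# draft effectively proposes starting a brand-new assistant turn.
candidates = prompt[match + 1 : match + 1 + 3]
print(candidates)  # ['<|start_header_id|>', 'assistant', '<|end_header_id|>']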
@Ednaordinary yes, quantized cache currently doesn't support generation techniques that manually crop the past cache, which includes assisted generation. I will open a PR to raise an error for those cases, and I'll add support for quantized cache to my todo list. You can also open a PR if you want to give it a try; the main point here is to enable
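For context, a conceptual sketch of the cache "crop" that assisted generation relies on (not the library's actual cache API): after the candidates are verified in a single forward pass, the rejected tokens have to be rolled back out of the key/value cache, which for a legacy tuple-format cache is plain tensor slicing; a quantized cache can't currently be sliced that way.

def crop_past_key_values(past_key_values, max_length):
    # Keep only the first `max_length` positions of each (key, value) pair,
    # whose shapes are (batch, num_heads, seq_len, head_dim).
    return tuple(
        (key[:, :, :max_length, :], value[:, :, :max_length, :])
        for key, value in past_key_values
    )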
What does this PR do?
Adds the prompt lookup decoding method from https://github.com/apoorvumang/prompt-lookup-decoding (issue #27722).
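For readers coming from the links above, the core of the method is small enough to sketch in a few lines (a hedged illustration of the n-gram lookup; the code actually merged lives in the candidate generator and may differ in its details):

import torch

def find_candidate_tokens(input_ids, max_matching_ngram_size=2, num_output_tokens=10):
    # Look for the most recent earlier occurrence of the sequence's final
    # n-gram; if found, propose the tokens that followed it as draft candidates.
    seq = input_ids[0].tolist()
    for ngram_size in range(max_matching_ngram_size, 0, -1):
        ngram = seq[-ngram_size:]
        for start in range(len(seq) - ngram_size - 1, -1, -1):
            if seq[start : start + ngram_size] == ngram:
                candidates = seq[start + ngram_size : start + ngram_size + num_output_tokens]
                return torch.tensor([candidates], dtype=input_ids.dtype)
    return None  # no match: this step falls back to regular decoding

On repetition-heavy tasks (summarization of copied spans, code editing, multi-turn chat) the drafted candidates are often accepted, which is where the reported speedups come from; when nothing matches, generation simply proceeds token by token as usual.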
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.