
garbage output from h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b #281

Closed
tjedwards opened this issue Jun 27, 2023 · 5 comments · Fixed by #298

Comments

@tjedwards

Using the simple Python script from the "supported-models" page, I was able to successfully generate output from TheBloke/Wizard-Vicuna-13B-Uncensored-HF, but h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b generates garbage.

I'm running CUDA 11.7.1 on RHEL 8.4 with an NVIDIA A100-SXM-80GB.

Here's the script:

import sys
from vllm import LLM
llm = LLM(model=sys.argv[1])
output = llm.generate("Hello, my name is")
print(output)

Here's output from TheBloke:

prompt='Hello, my name is'
text="Bastian Mehl and I'm going to talk about how we can solve"

And here's output from h2oai:

prompt='Hello, my name is'
text='\u0442\u0435 Business up t");ymbol\u7532 _ itsardervesag t beskrevs t \u201c

Here's the full output:

Loading h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b
INFO 06-27 10:46:22 llm_engine.py:59] Initializing an LLM engine with config: model='/tmp/h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b', dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
INFO 06-27 10:46:22 tokenizer_utils.py:30] Using the LLaMA fast tokenizer in 'hf-internal-testing/llama-tokenizer' to avoid potential protobuf errors.
INFO 06-27 10:51:57 llm_engine.py:128] # GPU blocks: 3808, # CPU blocks: 327
Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  3.17it/s]
[RequestOutput(request_id=0, prompt='Hello, my name is', prompt_token_ids=[1, 15043, 29892, 590, 1024, 338], outputs=[CompletionOutput(index=0, text='\u0442\u0435 Business up t");ymbol\u7532 _ itsardervesag t beskrevs t \u201c', token_ids=[730, 15197, 701, 260, 1496, 2789, 31843, 903, 967, 538, 20098, 351, 260, 7718, 260, 1346], cumulative_logprob=-90.18120217323303, logprobs={}, finish_reason=length)], finished=True)]
@zhuohan123
Member

This seems like a tokenizer issue. We are adding support for custom tokenizers (#111). In the meantime, you can directly modify the function below to use the correct tokenizer.

def get_tokenizer(

@tjedwards
Author

Thanks, I'll give that a try! BTW I just tried openlm-research/open_llama_13b and it produced viable results.

@tjedwards
Author

It was an easy fix:

    if "open_llama" in model_name or "open-llama" in model_name:
        kwargs["use_fast"] = False

I added the extra "or" to cover the dash variant, and all is well!
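For context, here is that check pulled out as a self-contained helper. The function name is made up for illustration; in vLLM at the time the two-line check above simply sat inside `get_tokenizer` before the tokenizer was loaded:

```python
def needs_slow_tokenizer(model_name: str) -> bool:
    """Return True for OpenLLaMA-style model names, which (at the time of
    this issue) needed use_fast=False so the slow SentencePiece tokenizer
    was used instead of the broken fast conversion.

    OpenLLaMA checkpoints appear under both spellings, e.g.
    "openlm-research/open_llama_13b" and
    "h2oai/h2ogpt-gm-oasst1-en-2048-open-llama-13b".
    """
    return "open_llama" in model_name or "open-llama" in model_name

# Inside get_tokenizer, before loading the tokenizer, one would then set:
#     if needs_slow_tokenizer(model_name):
#         kwargs["use_fast"] = False
```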

@tjedwards
Author

tjedwards commented Jun 28, 2023

In hindsight, this is better:

if "open" in model_name and "llama" in model_name:
    kwargs["use_fast"] = False

Ah, never mind! I just looked at the changes to allow a custom tokenizer, and this whole test goes away. :)

@WoosukKwon
Collaborator

@tjedwards We've added a new argument tokenizer_mode, which can be either auto or slow.

Try out the following:

llm = LLM(model="openlm-research/open_llama_13b", tokenizer_mode="slow")
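As a rough sketch of what the two modes mean, one can think of tokenizer_mode as mapping onto the use_fast flag discussed above (the mapping and function name here are assumptions inferred from this thread, not vLLM's actual code):

```python
def use_fast_from_mode(tokenizer_mode: str) -> bool:
    # "auto" uses the fast tokenizer when one is available;
    # "slow" always forces the slow (SentencePiece) tokenizer,
    # which is the workaround for the garbage output in this issue.
    if tokenizer_mode not in ("auto", "slow"):
        raise ValueError(f"unknown tokenizer_mode: {tokenizer_mode!r}")
    return tokenizer_mode == "auto"
```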

yukavio pushed a commit to yukavio/vllm that referenced this issue Jul 3, 2024
status should have an initial value

Co-authored-by: dhuangnm <dhuang@MacBook-Pro-2.local>
jikunshang pushed a commit to jikunshang/vllm that referenced this issue Sep 24, 2024
Increase garbage collector's threshold in order to reduce its frequency