
[Bugfix] add truncate_prompt_tokens to work offline, directly from LLM class. #4598

Closed
yecohn wants to merge 3 commits from the truncate_usage_pr branch

Conversation

@yecohn (Contributor) commented May 4, 2024

Fixes #4507.

  • Added a `_validate_prompt` function to make sure the prompt is in the right format and to run some validations.
  • `truncate_prompt_tokens` should truncate the prompt from the left, as is done in vllm/entrypoints/openai/serving_engine.py line 182 (see the sketch below).
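
A minimal sketch of what such a helper could look like, written here as a standalone function (the name `_validate_prompt` comes from the PR description; the tokenizer is passed in explicitly rather than taken from the engine, and the exact signature is illustrative, not the merged code):

```python
from typing import List, Optional, Tuple


def _validate_prompt(
    prompt: Optional[str] = None,
    prompt_token_ids: Optional[List[int]] = None,
    truncate_prompt_tokens: Optional[int] = None,
    tokenizer=None,
) -> Tuple[Optional[str], Optional[List[int]]]:
    """Check the prompt inputs and optionally left-truncate the token ids."""
    if prompt is None and prompt_token_ids is None:
        raise ValueError("Either prompt or prompt_token_ids must be provided.")
    if truncate_prompt_tokens is not None:
        if truncate_prompt_tokens < 1:
            raise ValueError("truncate_prompt_tokens must be >= 1.")
        if prompt_token_ids is None:
            # Tokenize the string prompt first, then truncate.
            prompt_token_ids = tokenizer(prompt).input_ids
        # Keep only the last `truncate_prompt_tokens` tokens (left truncation),
        # mirroring the behaviour of the OpenAI-compatible server.
        prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
    return prompt, prompt_token_ids
```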

@simon-mo (Collaborator) commented May 4, 2024

@tdoublep can you help review this?

@tdoublep (Member) commented May 6, 2024

@simon-mo sure, will try to get to that later today

Comment on lines +261 to +266

```python
    def _validate_prompt(
        self,
        prompt: Optional[str] = None,
        prompt_token_ids: Optional[List[int]] = None,
        truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
    ) -> Tuple[Optional[str], Optional[List[int]]]:
```
Member: This function re-implements the same logic as the existing one in the OpenAI serving code. It's worth thinking about whether we should have it implemented in a single place and re-used accordingly. I think that would be preferable, unless there is some strong reason to have it duplicated.

Member: I'm also studying #3512 to understand if that needs to be merged before this one.

yecohn (Author): Yes, this is a good point. Let me know what you think is best and I'll implement it.

Member: For now let's just make sure that the behaviour is equivalent (see the discussion below re: using truncation from the tokenizer). I think it would make more sense to think about refactoring after #3512 is merged, since after that point the completions API will be using the tokenizer from the engine anyway.

Comment on lines +273 to +277

```python
if prompt_token_ids is None:
    prompt_token_ids = self.llm_engine.tokenizer.tokenizer(
        prompt).input_ids
prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
```
Member: This isn't exactly the same behaviour as the equivalent code in the completions API. In that case, we use the tokenizer's built-in ability to perform truncation when the user provides a prompt, and only implement the truncation ourselves when the user provides prompt ids. I think it would be better to do the same here, for consistency.
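
For reference, a rough sketch of the two paths described above, using a Hugging Face tokenizer directly (the model name is illustrative; the serving code obtains the tokenizer from the engine, configured for left truncation):

```python
from transformers import AutoTokenizer

# Illustrative model; the real code uses the engine's tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2", truncation_side="left")
truncate_prompt_tokens = 8

# Case 1: the user supplies a string prompt -> let the tokenizer truncate.
input_ids = tokenizer(
    "a very long prompt " * 20,
    truncation=True,
    max_length=truncate_prompt_tokens,
).input_ids

# Case 2: the user supplies token ids -> truncate manually from the left.
user_token_ids = list(range(100))
truncated_ids = user_token_ids[-truncate_prompt_tokens:]
```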

yecohn (Author) commented May 6, 2024: Yes indeed. The thing I noticed, though, is that using the tokenizer with kwargs does a right truncation rather than a left one, basically doing `prompt_token_ids[:truncate_prompt_tokens]`. I'll recheck to make sure I'm not mistaken.

Here is the CI log I got:

```
Preparing working directory 2s
  # Creating "/workspace/build/buildkite/vllm/ci"
  $ cd /workspace/build/buildkite/vllm/ci
  $ git clone -v -- https://github.com/vllm-project/vllm.git .
  Cloning into '.'...
  POST git-upload-pack (175 bytes)
  POST git-upload-pack (gzip 2402 to 1225 bytes)
  remote: Enumerating objects: 16905, done.
  remote: Counting objects: 100% (3599/3599), done.
  remote: Compressing objects: 100% (711/711), done.
  remote: Total 16905 (delta 3181), reused 2987 (delta 2887), pack-reused 13306
  Receiving objects: 100% (16905/16905), 8.71 MiB | 20.47 MiB/s, done.
  Resolving deltas: 100% (12703/12703), done.
  $ git clean -ffxdq
  # Fetch and checkout pull request head from GitHub
  $ git fetch -- origin refs/pull/4598/head
  remote: Enumerating objects: 15, done.
  remote: Counting objects: 100% (15/15), done.
  remote: Compressing objects: 100% (10/10), done.
  remote: Total 15 (delta 7), reused 13 (delta 5), pack-reused 0
  Unpacking objects: 100% (15/15), 3.66 KiB | 749.00 KiB/s, done.
  From https://github.com/vllm-project/vllm
  * branch refs/pull/4598/head -> FETCH_HEAD
  # FETCH_HEAD is now 968d5e9651ca28b90498e282e3095f9654ceebe0
  $ git checkout -f 968d5e9
  Note: switching to '968d5e9651ca28b90498e282e3095f9654ceebe0'.
   
  You are in 'detached HEAD' state. You can look around, make experimental
  changes and commit them, and you can discard any commits you make in this
  state without impacting any branches by switching back to a branch.
   
  If you want to create a new branch to retain commits you create, you may
  do so (now or later) by using -c with the switch command. Example:
   
  git switch -c
   
  Or undo this operation with:
   
  git switch -
   
  Turn off this advice by setting config variable advice.detachedHead to false
   
  HEAD is now at 968d5e9 fix dependency injection
  # Cleaning again to catch any post-checkout changes
  $ git clean -ffxdq
  # Checking to see if Git data needs to be sent to Buildkite
  $ buildkite-agent meta-data exists buildkite:git:commit
  Running commands
  $ trap 'kill -- $$' INT TERM QUIT; cp -r ~/.ssh /workspace/.ssh && chmod -R 777 /workspace
  Some containers had unknown exit statuses. Perhaps they were in ImagePullBackOff.

```

yecohn (Author): Regarding the build, I got an error: `buildkite/ci/pr/distributed-tests-multiple-groups — Failed (exit status -10)`. Do you have any idea what it could be, and how I could debug it?

Member: @yecohn Regarding the truncation-side issue: you are right that, by default, the tokenizers truncate from the right. That's why we added this argument when calling get_tokenizer, to set the truncation side to the left.
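
To make the difference concrete, a small sketch with a Hugging Face tokenizer (assuming, as described above, that the relevant kwarg is forwarded to the underlying tokenizer; the model name is illustrative):

```python
from transformers import AutoTokenizer

text = "one two three four five six seven eight nine ten"

# Default behaviour: truncation from the right (keeps the start of the prompt).
right_tok = AutoTokenizer.from_pretrained("gpt2")
kept_right = right_tok(text, truncation=True, max_length=3).input_ids

# With truncation_side="left": keeps the end of the prompt instead.
left_tok = AutoTokenizer.from_pretrained("gpt2", truncation_side="left")
kept_left = left_tok(text, truncation=True, max_length=3).input_ids

print(right_tok.decode(kept_right))  # beginning of the text
print(left_tok.decode(kept_left))    # end of the text
```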

Re: the build error, I don't think it looks related to your changes.

@yecohn yecohn closed this Jul 23, 2024
@yecohn yecohn deleted the truncate_usage_pr branch July 23, 2024 16:16
@JackChuang

Hi @tdoublep, @yecohn, and @simon-mo,
It looks like this feature has not been merged yet. According to my tests, truncate_prompt_tokens still does not work for offline inference.

I'd like to see this feature implemented. In my case, I need the truncation to occur in vLLM, and I would also like to experiment with the generated KV caches.

Are there any plans to support this feature? Can I help in any way?
Or, alternatively, are there any other methods to perform prompt truncation while running offline inference in vLLM?

Thanks,
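
Until this lands, one possible workaround is to truncate the prompt yourself before calling the offline API; a hedged sketch, assuming a vLLM version whose `LLM.generate` accepts a `TokensPrompt` (the model name and truncation length are illustrative):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

model = "facebook/opt-125m"    # illustrative model
truncate_prompt_tokens = 128   # illustrative limit

# Left-truncate with the HF tokenizer, matching the serving behaviour.
tokenizer = AutoTokenizer.from_pretrained(model, truncation_side="left")
token_ids = tokenizer(
    "a very long prompt ...",
    truncation=True,
    max_length=truncate_prompt_tokens,
).input_ids

llm = LLM(model=model)
outputs = llm.generate(
    TokensPrompt(prompt_token_ids=token_ids),
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```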
