[Bugfix] add truncate_prompt_tokens to work offline, directly from LLM class. #4598
Conversation
@tdoublep can you help review this?

@simon-mo sure, will try to get to that later today
```python
    self,
    prompt: Optional[str] = None,
    prompt_token_ids: Optional[List[int]] = None,
    truncate_prompt_tokens: Optional[Annotated[int, Field(ge=1)]] = None
) -> Tuple[Optional[str], Optional[List[int]]]:
```
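For readers following along: the `Annotated[int, Field(ge=1)]` constraint in this signature is only enforced when the value actually passes through pydantic validation. A minimal sketch of that behaviour (assuming pydantic v2; illustrative only, not code from this PR):

```python
# Minimal sketch (assumes pydantic v2; illustrative, not from the PR).
# The ge=1 bound on truncate_prompt_tokens is enforced only when the
# value is validated by pydantic, e.g. through a TypeAdapter.
from typing import Annotated, Optional

from pydantic import Field, TypeAdapter, ValidationError

TruncatePromptTokens = Annotated[int, Field(ge=1)]
adapter = TypeAdapter(Optional[TruncatePromptTokens])

print(adapter.validate_python(8))     # -> 8
print(adapter.validate_python(None))  # -> None (truncation disabled)
try:
    adapter.validate_python(0)        # violates ge=1
except ValidationError as e:
    print("rejected:", e.errors()[0]["type"])  # 'greater_than_equal'
```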
This function re-implements the same logic as this function. It's worth thinking about whether we should have it implemented in a single place and re-used accordingly. I think that would be preferable, unless there is some strong reason to have it duplicated.
I'm also studying #3512 to understand whether it needs to be merged before this one.
Yes, this is a good point. Let me know what you think is best and I'll implement it.
For now, let's just make sure that the behaviour is equivalent (see the discussion below re: using truncation from the tokenizer). I think it would make more sense to think about refactoring after #3512 is merged, because after that point the completions API will be using the tokenizer from the engine anyway.
```python
if prompt_token_ids is None:
    prompt_token_ids = self.llm_engine.tokenizer.tokenizer(
        prompt).input_ids
prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
```
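Note that the slice above implements *left* truncation, i.e. it keeps the last `truncate_prompt_tokens` tokens. A tiny self-contained illustration of how that differs from right truncation, which becomes relevant in the discussion below:

```python
# Illustration only: left vs. right truncation of a token ID list.
prompt_token_ids = [101, 102, 103, 104, 105]
n = 3

print(prompt_token_ids[-n:])  # [103, 104, 105] - left truncation, keeps the tail
print(prompt_token_ids[:n])   # [101, 102, 103] - right truncation, keeps the head
```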
This isn't exactly the same behaviour as the equivalent code in the completions API. In that case, we use the built-in capability of the tokenizer to perform truncation when the user provides a prompt, and only implement the truncation ourselves when the user provides prompt token IDs. I think it would be better to do the same here for consistency.
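For illustration, here is a hedged sketch of the dual-path behaviour described above (the helper name and exact signature are hypothetical, not taken from the PR): tokenizer-side truncation for string prompts, manual left-slicing for caller-supplied token IDs.

```python
# Hedged sketch (hypothetical helper, not from the PR) of the dual-path
# behaviour: tokenizer-side truncation for string prompts, manual
# left-slicing for caller-supplied token IDs.
from typing import List, Optional


def truncate_prompt_inputs(tokenizer,
                           prompt: Optional[str],
                           prompt_token_ids: Optional[List[int]],
                           truncate_prompt_tokens: Optional[int]) -> List[int]:
    if prompt_token_ids is not None:
        # The caller already tokenized: slice ourselves, keeping the tail.
        if truncate_prompt_tokens is not None:
            prompt_token_ids = prompt_token_ids[-truncate_prompt_tokens:]
        return prompt_token_ids
    # String prompt: let the tokenizer truncate. This only keeps the tail
    # if the tokenizer was created with truncation_side="left".
    if truncate_prompt_tokens is not None:
        return tokenizer(prompt, truncation=True,
                         max_length=truncate_prompt_tokens).input_ids
    return tokenizer(prompt).input_ids
```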
Yes indeed. The thing I noticed, though, is that using the tokenizer with kwargs does a right truncation rather than a left one, basically doing `prompt_token_ids[:truncate_prompt_tokens]`. I'll recheck to make sure I am not mistaken.
Here is the CI log I got:
```
Preparing working directory (2s)
# Creating "/workspace/build/buildkite/vllm/ci"
$ cd /workspace/build/buildkite/vllm/ci
$ git clone -v -- https://github.com/vllm-project/vllm.git .
Cloning into '.'...
POST git-upload-pack (175 bytes)
POST git-upload-pack (gzip 2402 to 1225 bytes)
remote: Enumerating objects: 16905, done.
remote: Counting objects: 100% (3599/3599), done.
remote: Compressing objects: 100% (711/711), done.
remote: Total 16905 (delta 3181), reused 2987 (delta 2887), pack-reused 13306
Receiving objects: 100% (16905/16905), 8.71 MiB | 20.47 MiB/s, done.
Resolving deltas: 100% (12703/12703), done.
$ git clean -ffxdq
# Fetch and checkout pull request head from GitHub
$ git fetch -- origin refs/pull/4598/head
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 15 (delta 7), reused 13 (delta 5), pack-reused 0
Unpacking objects: 100% (15/15), 3.66 KiB | 749.00 KiB/s, done.
From https://github.com/vllm-project/vllm
 * branch            refs/pull/4598/head -> FETCH_HEAD
# FETCH_HEAD is now 968d5e9651ca28b90498e282e3095f9654ceebe0

$ git checkout -f 968d5e9
Note: switching to '968d5e9651ca28b90498e282e3095f9654ceebe0'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 968d5e9 fix dependency injection
# Cleaning again to catch any post-checkout changes
$ git clean -ffxdq
# Checking to see if Git data needs to be sent to Buildkite
$ buildkite-agent meta-data exists buildkite:git:commit

Running commands
$ trap 'kill -- $$' INT TERM QUIT; cp -r ~/.ssh /workspace/.ssh && chmod -R 777 /workspace
Some containers had unknown exit statuses. Perhaps they were in ImagePullBackOff.
```
Regarding the build, I got an error: `buildkite/ci/pr/distributed-tests-multiple-groups — Failed (exit status -10)`. Do you have any idea what it could be, and how I could debug it?
@yecohn Regarding the truncation-side issue: you are right that, by default, tokenizers truncate from the right. That's why we added this argument when calling get_tokenizer, to set the truncation side to left.

Re: the build error, I don't think it looks related to your changes.
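To make the truncation-side point above concrete, here is a minimal standalone check (plain Hugging Face transformers, with gpt2 purely as an example model; not code from vLLM) that `truncation_side="left"` makes the tokenizer keep the last `max_length` tokens, matching the `[-n:]` slice:

```python
# Minimal sketch (assumption: plain Hugging Face transformers; gpt2 is
# used only because its tokenizer adds no special tokens, so the
# comparison with a raw slice is exact).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", truncation_side="left")
prompt = "one two three four five six seven eight"

full_ids = tok(prompt).input_ids
truncated = tok(prompt, truncation=True, max_length=3).input_ids

assert truncated == full_ids[-3:]  # left truncation keeps the tail
print(truncated)
```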
Hi @tdoublep, @yecohn, and @simon-mo, I'd like to see this feature implemented. In my case, I need the truncation to occur in vLLM, and I would also like to experiment with the generated KV caches. Are there any plans to support this feature? Can I help in any way? Thanks!
Fixes #4507.