Using web-hosted model for inference #44
Comments
Hi @dnnp2011, thank you for your support with LLMLingua. In fact, you can use any web-hosted version of a model, as long as it provides an interface similar to 'logprob' for calculating perplexity.
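For illustration (not LLMLingua's internal code), here is a minimal sketch of how per-token log probabilities returned by such an interface could be turned into perplexity; the helper name and the example values are made up:

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Compute perplexity from a list of per-token log probabilities.

    The first entry is often None (the first token has no conditioning
    context), mirroring the OpenAI-style response shown later in this thread.
    """
    valid = [lp for lp in token_logprobs if lp is not None]
    return math.exp(-sum(valid) / len(valid))

# Example with made-up log probabilities:
print(perplexity_from_logprobs([None, -6.97, -2.05, -8.89]))
```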
Thanks for getting back to me, @iofu728. How exactly do I implement this in practice? I'm not clear on how to pass HuggingFace or OpenAI API details to define the model host, or how to pass along API keys. The only reference to something like this I've seen is using OpenAI embeddings for the re-ranking step.
Hi @dnnp2011, sorry, the fact is that it's currently not possible to use a web-hosted API for this purpose, as we can't obtain the log probabilities of the prompt part through such an API. Previously, it was feasible to get the log probabilities of the prompt by calling the OpenAI API and setting max_tokens=0. Therefore, unless there is an API available that provides the log probabilities for the prompt, we can only implement this through self-deployed models.
@iofu728 can we use an old OpenAI API version? Do you know in which version it was available?
After confirming on our side, some OpenAI models can return log probabilities for the prompt side. You can refer to the following code:

```python
import openai  # legacy (pre-1.0) OpenAI SDK interface

logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
```

Output:

```
<OpenAIObject text_completion id=-at > JSON: {
  "id": "",
  "object": "text_completion",
  "created": 1707295146,
  "model": "davinci-002",
  "choices": [
    {
      "text": "Please return the logprobs",
      "index": 0,
      "logprobs": {
        "tokens": ["Please", " return", " the", " log", "pro", "bs"],
        "token_logprobs": [null, -6.9668007, -2.047512, -8.885729, -13.960022, -5.479665],
        "top_logprobs": null,
        "text_offset": [0, 6, 13, 17, 21, 24]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}
```
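Building on the response above, here is a small sketch of how the echoed result could be parsed to recover the prompt-side log probabilities (the field names follow the JSON shown; the first entry is null because the first token has no conditioning context):

```python
# Extract prompt-side log probabilities from the echoed Completion response.
choice = logp["choices"][0]
tokens = choice["logprobs"]["tokens"]
token_logprobs = choice["logprobs"]["token_logprobs"]

# Pair each token with its log probability; the first entry is None.
for tok, lp in zip(tokens, token_logprobs):
    print(f"{tok!r}: {lp}")
```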
vllm-project/vllm#1203
lm-sys/FastChat#2612
Hi @DumoeDss, thank you for the information. It seems very useful, especially FastChat, which appears to support echo, enabling the return of logprobs from the prompt side. We will consider using the relevant engine in the future. If you are willing to work on the adaptation, we would greatly welcome it.
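For reference, a hedged sketch of what querying a locally running FastChat OpenAI-compatible server for prompt logprobs might look like; whether the server actually honors echo with max_tokens=0 is exactly what lm-sys/FastChat#2612 discusses, so treat the parameters and the model name below as assumptions:

```python
import openai  # legacy (pre-1.0) SDK, matching the snippet above

# Point the client at a locally running FastChat OpenAI-compatible server
# (e.g. started with `python -m fastchat.serve.openai_api_server`).
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"  # FastChat's server does not require a real key

# Assumption: the server honors echo/logprobs on the Completion endpoint.
resp = openai.Completion.create(
    model="vicuna-7b-v1.5",  # hypothetical model name served by FastChat
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
print(resp["choices"][0]["logprobs"]["token_logprobs"])
```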
@iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet. |
Hi @DumoeDss, the core issue involves implementing the …
@iofu728 First of all, the two issues/PRs I mentioned above on GitHub both apply to the completion interface and don't support the chat completion interface. The PRs for vLLM do support chat completion, but after trying them out I realized that they don't work very well. The instruction I use is "Please repeat the following and do not output anything else: …". I tried using the models yi-34B-chat, qwen1.5-0.5B-chat, qwen1.5-1.8B-chat, qwen1.5-4B-chat, and qwen1.5-7B-chat; the output of the models above 4B is slightly more satisfactory, but there are cases where the output does not match the original text, which makes it impossible to calculate the original logprobs. There is a modification here where I added the interface using FastAPI, which might be an acceptable solution. I sent you an email so we can continue the discussion~
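This is not the commenter's actual modification, but a minimal sketch of what such a FastAPI endpoint might look like: it returns per-token log probabilities for a prompt using a locally loaded Hugging Face model. The route name, payload fields, and the model choice are illustrative assumptions.

```python
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen1.5-0.5B-Chat"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

app = FastAPI()

class LogprobRequest(BaseModel):
    prompt: str

@app.post("/logprobs")
def prompt_logprobs(req: LogprobRequest):
    enc = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**enc).logits  # (1, seq_len, vocab)
    # Log-probability of each token given the preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target_ids = enc["input_ids"][:, 1:]
    token_logprobs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    # First token has no context, mirroring the `null` in the OpenAI response.
    return {
        "tokens": tokens,
        "token_logprobs": [None] + token_logprobs[0].tolist(),
    }
```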
Hi @DumoeDss, thank you for your help. However, there seems to be an issue with the API call parameters. You can refer to the following:

```python
logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
```

By setting `echo=True` and `max_tokens=0`, the log probabilities of the prompt tokens are returned.
To chime in here: support for models hosted through Azure AI Studio would be fantastic too.
Currently the NousResearch/Llama-2-7b-chat-hf model appears to be running locally on my machine, which can take quite a while for long prompts. I'd like to use more AI-optimized hardware to speed this process up. Is it possible to use a web-hosted version of the model, or a different web-hosted model entirely?
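For context, a minimal sketch of how the compressor is typically instantiated with a locally loaded model; the parameter names follow LLMLingua's README and may differ slightly across versions:

```python
from llmlingua import PromptCompressor

# Local deployment as described above; device_map="cuda" keeps the
# compressor model on GPU rather than CPU.
llm_lingua = PromptCompressor(
    model_name="NousResearch/Llama-2-7b-chat-hf",
    device_map="cuda",
)

compressed = llm_lingua.compress_prompt(
    "Your long prompt goes here...",
    instruction="",
    question="",
    target_token=200,
)
print(compressed["compressed_prompt"])
```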