Using web-hosted model for inference #44

Open
dnnp2011 opened this issue Jan 5, 2024 · 13 comments


dnnp2011 commented Jan 5, 2024

Currently the NousResearch/Llama-2-7b-chat-hf model appears to be running locally on my machine, which can take quite a while for long prompts. I'd like to use more AI-optimized hardware to speed this process up.

Is it possible to use a web-hosted version of the model, or use a different web-hosted model entirely?


iofu728 commented Jan 8, 2024

Hi @dnnp2011, thank you for your support with LLMLingua.

In fact, you can use any web-hosted model, as long as its API exposes something like per-token log probabilities ('logprobs') that can be used to calculate perplexity.
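For reference, here is a minimal sketch (not LLMLingua's actual code) of how per-token log probabilities turn into perplexity; the function name and input format are illustrative:

import math

def perplexity_from_logprobs(token_logprobs):
    # Perplexity is exp of the negative mean log probability.
    # Skip None entries (e.g., the first token, which has no logprob
    # in OpenAI-style completion output).
    values = [lp for lp in token_logprobs if lp is not None]
    return math.exp(-sum(values) / len(values))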

iofu728 self-assigned this Jan 8, 2024
iofu728 added the question (Further information is requested) label Jan 8, 2024
dnnp2011 (Author) commented

Thanks for getting back to me @iofu728

How exactly do I implement this in practice? I'm not clear on how to pass any HuggingFace or OpenAI API details to define the model host and pass along any API keys. The only reference to something like this I've seen is using OpenAI embeddings for the re-ranking step.


iofu728 commented Jan 11, 2024

Hi @dnnp2011,

Sorry, it's currently not possible to use a web-hosted API for this purpose, as we can't obtain the log probabilities of the prompt part through such APIs. Previously, it was feasible to get the log probabilities of the prompt by calling the OpenAI API with max_tokens=0. Therefore, unless there is an API available that provides the log probabilities for the prompt, we can only implement this through self-deployed models.


snarb commented Feb 7, 2024

@iofu728 Can we use an older OpenAI API version? Do you know in which version it was available?


iofu728 commented Feb 7, 2024

> @iofu728 Can we use an older OpenAI API version? Do you know in which version it was available?

After confirming, we found that some OpenAI models can return log probabilities for the prompt side. You can refer to the following code:

import openai  # openai<1.0 Completions API

# echo=True with max_tokens=0 returns the logprobs of the prompt tokens
# without generating any new tokens.
logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
Out[3]:
<OpenAIObject text_completion id=-at > JSON: {
  "id": "",
  "object": "text_completion",
  "created": 1707295146,
  "model": "davinci-002",
  "choices": [
    {
      "text": "Please return the logprobs",
      "index": 0,
      "logprobs": {
        "tokens": [
          "Please",
          " return",
          " the",
          " log",
          "pro",
          "bs"
        ],
        "token_logprobs": [
          null,
          -6.9668007,
          -2.047512,
          -8.885729,
          -13.960022,
          -5.479665
        ],
        "top_logprobs": null,
        "text_offset": [
          0,
          6,
          13,
          17,
          21,
          24
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}

DumoeDss commented

vllm-project/vllm#1203
Hey, I was wondering if this would be useful? vLLM's OpenAI-compatible interface provides logprobs in its results.
I think this issue could also be addressed through the vLLM interface (letting the user choose which LLM to use for their language).

DumoeDss commented

lm-sys/FastChat#2612
And the FastChat server supports it too.
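To make this concrete, here is a rough sketch of what such a call could look like against an OpenAI-compatible server (vLLM or FastChat), assuming the server supports echo together with max_tokens=0 as discussed in the linked issues; the base URL and model name are placeholders:

import openai  # openai<1.0 client

# Placeholder endpoint and model name for a self-hosted OpenAI-compatible server.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

resp = openai.Completion.create(
    model="local-model",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,   # do not generate new tokens
    echo=True,      # return the prompt tokens with their logprobs
    temperature=0,
)
token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]

The returned token_logprobs could then be folded into perplexity as sketched earlier in this thread.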


iofu728 commented Feb 19, 2024

Hi @DumoeDss,

Thank you for the information. It seems very useful, especially FastChat, which appears to support echo, enabling the return of logprobs from the prompt side. We will consider using the relevant engine in the future. If you are willing to work on the adaptation, we would greatly welcome it.

DumoeDss commented

@iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.


iofu728 commented Feb 20, 2024

> @iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.

Hi @DumoeDss, the core issue involves implementing the self.get_ppl function through web API calls. Please take a look at the relevant code, and if you need any assistance, feel free to reply.
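As an illustration only (not LLMLingua's actual implementation, whose get_ppl signature and return format may differ), a web-API-backed perplexity routine could look roughly like this, reusing the echo-style Completion call shown above:

import math
import openai  # openai<1.0 client

def get_ppl_via_api(text, model="davinci-002"):
    # Ask an echo-capable completions endpoint for the prompt-side logprobs,
    # then fold them into a single perplexity value.
    resp = openai.Completion.create(
        model=model,
        prompt=text,
        logprobs=0,
        max_tokens=0,
        echo=True,
        temperature=0,
    )
    token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
    values = [lp for lp in token_logprobs if lp is not None]  # first token has no logprob
    return math.exp(-sum(values) / len(values))

Note that LLMLingua's compression logic also works at the token level, so a real adaptation would likely need to expose the per-token log probabilities rather than only a single scalar.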

DumoeDss commented

@iofu728
I tried outputting logprobs with FastChat/vLLM and ran into some troublesome situations during the pre-processing.

First of all, the two issues/PRs I mentioned above on GitHub both apply to the completion interface and don't support the chat completion interface. The vLLM PRs do support chat completion, but after trying them out I realized that they don't work very well.

The instruction I use is "Please repeat the following and do not output anything else: content".

I tried the models yi-34B-chat, qwen1.5-0.5B-chat, qwen1.5-1.8B-chat, qwen1.5-4B-chat, and qwen1.5-7B-chat. The output of the models above 4B is slightly more satisfactory, but there are cases where the output does not match the original text, which makes it impossible to calculate the original logprobs.
But even with the 4B model, content of 3000+ tokens takes 20 s to output in full, while calculating it directly with the 0.5B model takes less than 400 ms.
I don't know if I'm doing anything wrong, but I think using the chat model's output to calculate logprobs might not be a step in the right direction.

There is a modification here where I added an interface using FastAPI, which might be an acceptable solution.

I sent you an email so we can continue the discussion~
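For readers following along, a rough sketch of that FastAPI idea (this is not DumoeDss's actual modification; the model name, endpoint, and payload are placeholders) could look like:

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
MODEL_NAME = "NousResearch/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

class PplRequest(BaseModel):
    text: str

@app.post("/ppl")
def ppl(req: PplRequest):
    inputs = tokenizer(req.text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood over the prompt tokens,
    # so exp(loss) is the prompt perplexity.
    return {"ppl": torch.exp(out.loss).item()}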


iofu728 commented Feb 21, 2024

Hi @DumoeDss,

Thank you for your help. However, there seems to be an issue with the API call parameters. You can refer to the following:

import openai  # openai<1.0 Completions API

logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)

By setting max_tokens to 0 and echo to True, the model will not generate new tokens but will return the logprobs of the prompt side.
I briefly checked, and FastChat should support this. If you have more questions, feel free to ask.

codylittle commented

To chime in here, support for models hosted through Azure AI Studio would be fantastic too.
And a TypeScript library too (;
