Using web-hosted model for inference #44

Open
dnnp2011 opened this issue Jan 5, 2024 · 13 comments


dnnp2011 commented Jan 5, 2024

Currently the NousResearch/Llama-2-7b-chat-hf model appears to be running locally on my machine, which can take quite a while for long prompts. I'd like to use more AI-optimized hardware to speed this process up.

Is it possible to use a web-hosted version of the model, or use a different web-hosted model entirely?


iofu728 commented Jan 8, 2024

Hi @dnnp2011, thank you for your support with LLMLingua.

In fact, you can use any web-hosted model, as long as its API exposes something like per-token log probabilities ('logprobs') that can be used to calculate perplexity.
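For reference, here is a minimal sketch (not LLMLingua's actual code) of how per-token log probabilities turn into perplexity; the function name and input format are illustrative:

import math

def perplexity_from_logprobs(token_logprobs):
    # Perplexity is exp of the negative mean log probability.
    # Skip None entries (e.g., the first token, which has no logprob
    # in OpenAI-style completion output).
    values = [lp for lp in token_logprobs if lp is not None]
    return math.exp(-sum(values) / len(values))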

iofu728 self-assigned this Jan 8, 2024
iofu728 added the question (Further information is requested) label Jan 8, 2024
dnnp2011 (Author) commented

Thanks for getting back to me @iofu728

How exactly do I implement this in practice? I'm not clear on how to pass any HuggingFace or OpenAI API details to define the model host and pass along any API keys. The only reference to something like this I've seen is using OpenAI embeddings for the re-ranking step.


iofu728 commented Jan 11, 2024

Hi @dnnp2011,

Sorry, it's currently not possible to use a web-hosted API for this purpose, as we can't obtain the log probabilities of the prompt part through such APIs. Previously, it was feasible to get the log probabilities of the prompt by calling the OpenAI API with max_tokens=0. Therefore, unless there is an API available that provides the log probabilities for the prompt, we can only implement this through self-deployed models.


snarb commented Feb 7, 2024

@iofu728 Can we use an older OpenAI API version? Do you know in which version it was available?


iofu728 commented Feb 7, 2024

> @iofu728 Can we use an older OpenAI API version? Do you know in which version it was available?

After confirming, we found that some OpenAI models can return log probabilities for the prompt side. You can refer to the following code:

import openai  # openai<1.0 Completions API

# echo=True with max_tokens=0 returns the logprobs of the prompt tokens
# without generating any new tokens.
logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)
Out[3]:
<OpenAIObject text_completion id=-at > JSON: {
  "id": "",
  "object": "text_completion",
  "created": 1707295146,
  "model": "davinci-002",
  "choices": [
    {
      "text": "Please return the logprobs",
      "index": 0,
      "logprobs": {
        "tokens": [
          "Please",
          " return",
          " the",
          " log",
          "pro",
          "bs"
        ],
        "token_logprobs": [
          null,
          -6.9668007,
          -2.047512,
          -8.885729,
          -13.960022,
          -5.479665
        ],
        "top_logprobs": null,
        "text_offset": [
          0,
          6,
          13,
          17,
          21,
          24
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}

DumoeDss commented

vllm-project/vllm#1203
Hey, I was wondering if this would be useful? vLLM's OpenAI-compatible interface provides logprobs in its results.
I think this issue could also be addressed through the vLLM interface (letting the user choose which LLM to use for their language).

DumoeDss commented

lm-sys/FastChat#2612
And the FastChat server supports it too.
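To make this concrete, here is a rough sketch of what such a call could look like against an OpenAI-compatible server (vLLM or FastChat), assuming the server supports echo together with max_tokens=0 as discussed in the linked issues; the base URL and model name are placeholders:

import openai  # openai<1.0 client

# Placeholder endpoint and model name for a self-hosted OpenAI-compatible server.
openai.api_base = "http://localhost:8000/v1"
openai.api_key = "EMPTY"

resp = openai.Completion.create(
    model="local-model",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,   # do not generate new tokens
    echo=True,      # return the prompt tokens with their logprobs
    temperature=0,
)
token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]

The returned token_logprobs could then be folded into perplexity as sketched earlier in this thread.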


iofu728 commented Feb 19, 2024

Hi @DumoeDss,

Thank you for the information. It seems very useful, especially FastChat, which appears to support echo, enabling the return of logprobs from the prompt side. We will consider using the relevant engine in the future. If you are willing to work on the adaptation, we would greatly welcome it.

DumoeDss commented

@iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.


iofu728 commented Feb 20, 2024

> @iofu728 I'd be happy to try to do it, but I'd have to dive into the source code first, and I'm not sure how to start yet.

Hi @DumoeDss, the core issue involves implementing the self.get_ppl function through web API calls. Please take a look at the relevant code, and if you need any assistance, feel free to reply.
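As an illustration only (not LLMLingua's actual implementation, whose get_ppl signature and return format may differ), a web-API-backed perplexity routine could look roughly like this, reusing the echo-style Completion call shown above:

import math
import openai  # openai<1.0 client

def get_ppl_via_api(text, model="davinci-002"):
    # Ask an echo-capable completions endpoint for the prompt-side logprobs,
    # then fold them into a single perplexity value.
    resp = openai.Completion.create(
        model=model,
        prompt=text,
        logprobs=0,
        max_tokens=0,
        echo=True,
        temperature=0,
    )
    token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
    values = [lp for lp in token_logprobs if lp is not None]  # first token has no logprob
    return math.exp(-sum(values) / len(values))

Note that LLMLingua's compression logic also works at the token level, so a real adaptation would likely need to expose the per-token log probabilities rather than only a single scalar.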

DumoeDss commented

@iofu728
I tried outputting logprobs with FastChat/vLLM and ran into some troublesome situations during the pre-processing.

First of all, the two issues/PRs I mentioned above on GitHub both apply to the completion interface and don't support the chat completion interface. The vLLM PRs do support chat completion, but after trying them out I realized that they don't work very well.

The instruction I use is "Please repeat the following and do not output anything else: content".

I tried the models yi-34B-chat, qwen1.5-0.5B-chat, qwen1.5-1.8B-chat, qwen1.5-4B-chat, and qwen1.5-7B-chat. The output of the models above 4B is slightly more satisfactory, but there are cases where the output does not match the original text, which makes it impossible to calculate the original logprobs.
But even with the 4B model, content of 3000+ tokens takes 20 s to output in full, while calculating it directly with the 0.5B model takes less than 400 ms.
I don't know if I'm doing anything wrong, but I think using the chat model's output to calculate logprobs might not be a step in the right direction.

There is a modification here where I added an interface using FastAPI, which might be an acceptable solution.

I sent you an email so we can continue the discussion~
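For readers following along, a rough sketch of that FastAPI idea (this is not DumoeDss's actual modification; the model name, endpoint, and payload are placeholders) could look like:

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()
MODEL_NAME = "NousResearch/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

class PplRequest(BaseModel):
    text: str

@app.post("/ppl")
def ppl(req: PplRequest):
    inputs = tokenizer(req.text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood over the prompt tokens,
    # so exp(loss) is the prompt perplexity.
    return {"ppl": torch.exp(out.loss).item()}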


iofu728 commented Feb 21, 2024

Hi @DumoeDss,

Thank you for your help. However, there seems to be an issue with the API call parameters. You can refer to the following:

import openai  # openai<1.0 Completions API

logp = openai.Completion.create(
    model="davinci-002",
    prompt="Please return the logprobs",
    logprobs=0,
    max_tokens=0,
    echo=True,
    temperature=0,
)

By setting max_tokens to 0 and echo to True, the model will not generate new tokens but will return the logprobs of the prompt side.
I briefly checked, and FastChat should support this. If you have more questions, feel free to ask.

codylittle commented

To chime in here, support for models hosted through Azure AI Studio would be fantastic too.
And a TypeScript library too (;
