
Implement /v1/chat/completions endpoint for CPU mode #1979

Merged

2 commits merged into nomic-ai:main on Mar 11, 2024

Conversation

@johannesploetner (Contributor) commented Feb 18, 2024

Describe your changes

The /v1/chat/completions endpoint was not implemented in gpt4all-api (only returned an "Echo" of the original message, as mentioned in #1700). This PR implements the endpoint for CPU mode and adds an appropriate test.

Issue ticket number and link

#1700

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • I have added thorough documentation for my code.
  • I have tagged the PR with relevant project labels. I acknowledge that a PR without labels may be dismissed.
  • If this PR addresses a bug, I have provided both a screenshot/video of the original bug and the working solution.

Demo

Try the openai.ChatCompletion.create() function (as described in the OpenAI cookbook: https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models):

import openai

openai.api_base="http://localhost:4891/v1"
openai.api_key="ABCDEFG"

stream = openai.ChatCompletion.create(
    model="mistral-7b-openorca.Q4_0.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0
)
print(stream.choices[0].message.content)

Before the PR, we would just get an "Echo: " of the last message; after the PR, we actually get a result.
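
For context, the endpoint now returns an OpenAI-style response object rather than an echo string. The snippet below only sketches the rough shape; the field values are illustrative placeholders, not output captured from this PR:

# Purely illustrative placeholder values, not actual output from this PR.
example_response = {
    "object": "chat.completion",
    "model": "mistral-7b-openorca.Q4_0.gguf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<model reply here>"},
            "finish_reason": "stop",
        }
    ],
}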

Steps to Reproduce

  • see test_chat_completion() in gpt4all_api/app/tests/test_endpoints.py
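
A rough sketch of what such a test might look like (illustrative only; the actual test in the repository may differ in detail, and this assumes the API container is reachable on localhost:4891):

# Illustrative sketch only -- see test_chat_completion() in
# gpt4all_api/app/tests/test_endpoints.py for the real test.
import requests

def test_chat_completion_sketch():
    payload = {
        "model": "mistral-7b-openorca.Q4_0.gguf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Knock knock."},
        ],
        "temperature": 0,
    }
    response = requests.post("http://localhost:4891/v1/chat/completions", json=payload)
    assert response.status_code == 200
    body = response.json()
    # The fix means we get a real assistant message back, not an "Echo: ..." string.
    content = body["choices"][0]["message"]["content"]
    assert len(content) > 0
    assert not content.startswith("Echo:")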

Notes

  • GPU support will need to be added later. Worth merging anyway IMHO, as it actually adds functionality.

Signed-off-by: Johannes Plötner <johannes.w.m.ploetner@gmail.com>
gpt4all-api/gpt4all_api/app/api_v1/routes/chat.py (outdated)
Comment on lines +64 to +79
# format system message and conversation history correctly
formatted_messages = ""
for message in request.messages:
    formatted_messages += f"<|im_start|>{message.role}\n{message.content}<|im_end|>\n"

# the LLM will complete the response of the assistant
formatted_messages += "<|im_start|>assistant\n"
response = model.generate(
    prompt=formatted_messages,
    temp=request.temperature
)

# the LLM may continue to hallucinate the conversation, but we want only the first response
# so, cut off everything after first <|im_end|>
index = response.find("<|im_end|>")
response_content = response[:index].strip()
Member

Why is ChatML hard-coded here? Normally GPT4All models have a customizable prompt template that defaults to the value in models2.json.

Contributor Author

This is done due to the specifics of OpenAI's API. OpenAI's ChatCompletions endpoint can receive an array of messages containing the system message, past user questions/statements, and the assistant's replies (see https://platform.openai.com/docs/api-reference/chat/create and https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Knock knock."},
            {"role": "assistant", "content": "Who's there?"},
            {"role": "user", "content": "Orange."},
    ]
  }'

The goal here was to parse the past messages and provide the LLM with the complete history. From the above example, I am creating a result like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Knock knock.<|im_end|>
<|im_start|>assistant
Who's there?<|im_end|>
<|im_start|>user
Orange.<|im_end|>
<|im_start|>assistant

IMHO we would need a different representation of prompt templates in models2.json to be able to reliably parse and use them for this specific use case... Until then, I decided to hardcode ChatML.
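
To illustrate that point with a hypothetical sketch (the template string and helper below are assumptions for illustration, not actual models2.json contents or code from this PR): a single-turn template with one %1-style slot only defines where a user prompt goes, so system and assistant turns from an OpenAI-style message array have no reliable place in it.

# Hypothetical illustration of the limitation described above; the template
# and helper are assumptions, not code from gpt4all or this PR.
single_turn_template = "### Human:\n%1\n### Assistant:\n"

def naive_apply(template: str, messages: list[dict]) -> str:
    parts = []
    for m in messages:
        if m["role"] == "user":
            # the template only defines a slot for the user prompt ...
            parts.append(template.replace("%1", m["content"]))
        else:
            # ... so system/assistant turns have no defined place in it
            parts.append(m["content"] + "\n")
    return "".join(parts)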

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Signed-off-by: johannesploetner <52075191+johannesploetner@users.noreply.github.com>
@manyoso merged commit c951a5b into nomic-ai:main on Mar 11, 2024
1 check passed
Labels: None yet
Projects: None yet
3 participants