
Implement /v1/chat/completions endpoint for CPU mode #1979

Merged

2 commits merged into nomic-ai:main on Mar 11, 2024

Conversation

@johannesploetner (Contributor) commented Feb 18, 2024

Describe your changes

The /v1/chat/completions endpoint was not implemented in gpt4all-api (only returned an "Echo" of the original message, as mentioned in #1700). This PR implements the endpoint for CPU mode and adds an appropriate test.

Issue ticket number and link

#1700

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • I have added thorough documentation for my code.
  • I have tagged the PR with relevant project labels. I acknowledge that a PR without labels may be dismissed.
  • If this PR addresses a bug, I have provided both a screenshot/video of the original bug and the working solution.

Demo

Try the openai.ChatCompletion.create() function (as described in the OpenAI cookbook: https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models):

import openai

openai.api_base="http://localhost:4891/v1"
openai.api_key="ABCDEFG"

stream = openai.ChatCompletion.create(
    model="mistral-7b-openorca.Q4_0.gguf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Knock knock."},
        {"role": "assistant", "content": "Who's there?"},
        {"role": "user", "content": "Orange."},
    ],
    temperature=0
)
print(stream.choices[0].message.content)

Before the PR, we would just get an "Echo: " of the last message; after the PR, we actually get a result.
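
For context, the endpoint now returns an OpenAI-style response object rather than an echo string. The snippet below only sketches the rough shape; the field values are illustrative placeholders, not output captured from this PR:

# Purely illustrative placeholder values, not actual output from this PR.
example_response = {
    "object": "chat.completion",
    "model": "mistral-7b-openorca.Q4_0.gguf",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "<model reply here>"},
            "finish_reason": "stop",
        }
    ],
}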

Steps to Reproduce

  • see test_chat_completion() in gpt4all_api/app/tests/test_endpoints.py
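
A rough sketch of what such a test might look like (illustrative only; the actual test in the repository may differ in detail, and this assumes the API container is reachable on localhost:4891):

# Illustrative sketch only -- see test_chat_completion() in
# gpt4all_api/app/tests/test_endpoints.py for the real test.
import requests

def test_chat_completion_sketch():
    payload = {
        "model": "mistral-7b-openorca.Q4_0.gguf",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Knock knock."},
        ],
        "temperature": 0,
    }
    response = requests.post("http://localhost:4891/v1/chat/completions", json=payload)
    assert response.status_code == 200
    body = response.json()
    # The fix means we get a real assistant message back, not an "Echo: ..." string.
    content = body["choices"][0]["message"]["content"]
    assert len(content) > 0
    assert not content.startswith("Echo:")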

Notes

  • GPU support will need to be added later. Worth merging anyway IMHO, as it actually adds functionality.

Signed-off-by: Johannes Plötner <johannes.w.m.ploetner@gmail.com>
gpt4all-api/gpt4all_api/app/api_v1/routes/chat.py (outdated)
Comment on lines +64 to +79
# format system message and conversation history correctly
formatted_messages = ""
for message in request.messages:
    formatted_messages += f"<|im_start|>{message.role}\n{message.content}<|im_end|>\n"

# the LLM will complete the response of the assistant
formatted_messages += "<|im_start|>assistant\n"
response = model.generate(
    prompt=formatted_messages,
    temp=request.temperature
)

# the LLM may continue to hallucinate the conversation, but we want only the first response
# so, cut off everything after first <|im_end|>
index = response.find("<|im_end|>")
response_content = response[:index].strip()
Member

Why is ChatML hard-coded here? Normally GPT4All models have a customizable prompt template that defaults to the value in models2.json.

Contributor Author

This is done due to the specifics of OpenAI's API. OpenAI's ChatCompletions endpoint can receive an array of messages containing the system message, past user questions/statements, and the assistant's replies (see https://platform.openai.com/docs/api-reference/chat/create and https://cookbook.openai.com/examples/how_to_format_inputs_to_chatgpt_models):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Knock knock."},
            {"role": "assistant", "content": "Who's there?"},
            {"role": "user", "content": "Orange."},
    ]
  }'

The goal here was to parse the past messages and provide the LLM with the complete history. From the above example, I am creating a result like this:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Knock knock.<|im_end|>
<|im_start|>assistant
Who's there?<|im_end|>
<|im_start|>user
Orange.<|im_end|>
<|im_start|>assistant

IMHO we would need a different representation of prompt templates in models2.json to be able to reliably parse and use them for this specific use case... Until then, I decided to hardcode ChatML.
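
To illustrate that point with a hypothetical sketch (the template string and helper below are assumptions for illustration, not actual models2.json contents or code from this PR): a single-turn template with one %1-style slot only defines where a user prompt goes, so system and assistant turns from an OpenAI-style message array have no reliable place in it.

# Hypothetical illustration of the limitation described above; the template
# and helper are assumptions, not code from gpt4all or this PR.
single_turn_template = "### Human:\n%1\n### Assistant:\n"

def naive_apply(template: str, messages: list[dict]) -> str:
    parts = []
    for m in messages:
        if m["role"] == "user":
            # the template only defines a slot for the user prompt ...
            parts.append(template.replace("%1", m["content"]))
        else:
            # ... so system/assistant turns have no defined place in it
            parts.append(m["content"] + "\n")
    return "".join(parts)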

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
Signed-off-by: johannesploetner <52075191+johannesploetner@users.noreply.github.com>
@manyoso merged commit c951a5b into nomic-ai:main on Mar 11, 2024
1 check passed
Labels: None yet
Projects: None yet
3 participants