Segmentation fault #5655
Comments
allow to enable VERBOSE mode
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Added in: issues.feature @ggerganov @ngxson On it 👍 |
…t request. server: tests: add multi users embeddings as fixed
@TruongGiangBT Please confirm you no longer face any error. Feel free to reopen if you do. |
@phymbert The segmentation fault error has been fixed. However, when I simultaneously execute multiple requests with the same input value, the embedding results are different. |
Could you please open a dedicated discussion on the embedding feature? |
What do you think if I fix it like this? |
@ngxson Hi, any idea on the matter, please? |
It's true that there's a bug on the line that @TruongGiangBT pointed out: the The fact that you see different results was because you're using
It's better to check with @ggerganov I think. |
@TruongGiangBT Can you open a PR with the changes you proposed? |
My changes only help me run correctly to get the embedding result of multiple sequences in the decode batch. These modifications will affect the embedding function result. The crux of this issue is that we need to determine the index of the last token of the sequences, while also having a reasonable plan to store the embedding result of the sequences into ctx->embedding. |
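For illustration only, a minimal Python sketch of the "index of the last token of each sequence" idea described above. It assumes a toy batch in which every token position carries exactly one sequence id, which is a simplification; the function names are hypothetical, and this is not the change proposed in this thread.

```python
def last_token_index_per_sequence(seq_ids: list[int]) -> dict[int, int]:
    """Map each sequence id to the batch position of its last token.

    seq_ids[i] is the (hypothetical) sequence id of the token at batch position i.
    """
    last_index: dict[int, int] = {}
    for position, seq_id in enumerate(seq_ids):
        # Later positions overwrite earlier ones, so the final value
        # stored for a sequence is the position of its last token.
        last_index[seq_id] = position
    return last_index


def embeddings_by_sequence(
    seq_ids: list[int], token_embeddings: list[list[float]]
) -> dict[int, list[float]]:
    """Pick, for each sequence, the embedding row of its last token."""
    return {
        seq_id: token_embeddings[position]
        for seq_id, position in last_token_index_per_sequence(seq_ids).items()
    }
```

Keying the stored result by sequence id, rather than writing into a single shared buffer, is one way to keep concurrent sequences from overwriting each other's embeddings.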
Can you add a scenario in issues.feature? It will really help to trace your issue. |
I think this is expected due to the way the KV cache works. But I need to verify before explaining. If you could provide some basic instructions / commands to run this scenario it would be of great help. Otherwise, I have to invest time to create curl queries and / or bash scripts to match what you are doing |
@ggerganov I also think it might be due to the way the KV cache operates. Although the results are somewhat different, they can be acceptable. As discussed above, I encountered a problem with extracting embeddings and I have temporarily fixed it (I also shared it). But it only works well for my requirements; we need to find a general solution.
Docker command:
async def requests_post_async(*args, **kwargs):
model_url = "http://127.0.0.1:6900"
Thank you |
Ok thanks. I was willing to look more into this, but you are not making it easy for me. I copied this code into a file:

import asyncio
import requests

async def requests_post_async(*args, **kwargs):
    return await asyncio.to_thread(requests.post, *args, **kwargs)

model_url = "http://127.0.0.1:6900"
responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
    url= f"{model_url}/embedding",
    json= {"content": "0"*1024}
) for i in range(8)])

for response in responses:
    embedding = response.json()["embedding"]
    print(embedding[-4:])

However I get an error:

$ ▶ python3 test.py
  File "/Users/ggerganov/development/github/llama.cpp/test.py", line 8
    responses: list[requests.Response] = await asyncio.gather(*[requests_post_async(
                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: 'await' outside function

I have no idea what is the issue and don't want to spend time to fix it. Providing simple repro instructions goes a long way to help maintainers help you |
Oh, I am sorry. I am using a Jupyter notebook. Here is the code for a .py file:
|
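A minimal sketch of how the notebook snippet above could be turned into a standalone script. This is not the code originally posted in this comment. It assumes the same endpoint (`/embedding` on `http://127.0.0.1:6900`) and payload as the snippet quoted earlier; `fetch_embeddings`, `max_abs_diff` and `main` are hypothetical helper names introduced for illustration. The `await` now lives inside a coroutine that is driven by `asyncio.run`, which avoids the `'await' outside function` error.

```python
import asyncio
from typing import Any, Callable

MODEL_URL = "http://127.0.0.1:6900"  # server address used earlier in this thread


async def fetch_embeddings(
    post: Callable[..., Any], n_requests: int = 8
) -> list[list[float]]:
    """Send n_requests identical /embedding requests concurrently."""

    async def post_async(**kwargs: Any) -> Any:
        # `post` is a blocking call, so run each request in a worker thread.
        return await asyncio.to_thread(post, **kwargs)

    responses = await asyncio.gather(*[
        post_async(url=f"{MODEL_URL}/embedding", json={"content": "0" * 1024})
        for _ in range(n_requests)
    ])
    return [response.json()["embedding"] for response in responses]


def max_abs_diff(embeddings: list[list[float]]) -> float:
    """Largest element-wise deviation of any embedding from the first one."""
    first = embeddings[0]
    return max(
        (abs(a - b) for other in embeddings[1:] for a, b in zip(first, other)),
        default=0.0,
    )


def main() -> None:
    import requests  # third-party; imported here so the helpers stay stdlib-only

    embeddings = asyncio.run(fetch_embeddings(requests.post))
    for embedding in embeddings:
        print(embedding[-4:])
    # Identical inputs should ideally give a value close to 0.0 here.
    print("max abs diff:", max_abs_diff(embeddings))
```

With the server running, calling `main()` (for example from an `if __name__ == "__main__":` guard) prints the last four values of each embedding, followed by the largest deviation between the responses.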
ggerganov#5699) * server: ggerganov#5655 - continue to update other slots on embedding concurrent request. * server: tests: add multi users embeddings as fixed * server: tests: adding OAI compatible embedding concurrent endpoint * server: tests: adding OAI compatible embedding with multiple inputs
Originally posted by @TruongGiangBT in #3876 (comment)