terminated by signal SIGSEGV (Address boundary error) if import onnxruntime before onnxruntime-genai #257
Comments
Not only onnxruntime, it failed when import |
Could you share more details on this error? And if possible, a stacktrace. That would help me reproduce this. I tried running your script and was able to successfully execute it. Here is the output:
Loading model...
Model loaded in 15.46 seconds
Creating generator ...
Generator created
.
In the end, the friends realized that their journey had not only deepened their understanding of the world but also strengthened their bond. They had learned the importance of embracing diversity, respecting different perspectives, and finding common ground.
As they bid farewell to the Enchanted Forest, they carried with them the memories of their adventure and the wisdom gained from their encounters. They knew that their shared experiences would forever shape their lives and inspire them to continue exploring the wonders of the world.
And so, the three friends returned to their small town, forever changed by their journey. They became advocates for unity, spreading the message of acceptance and understanding wherever they went. Their story became a testament to the power of friendship, curiosity, and the pursuit of knowledge.
As they looked back on their adventure, they realized that the Enchanted Forest had not only taught them about the world but also about themselves. They had discovered their own strengths, overcome their fears, and
Prompt tokens: 3, New tokens: 197, Time to first: 0.25s, New tokens per second: 14.55 tps
I am using Platform: windows, linux...
Thanks for trying, @baijumeswani :) Please let me know if anything is missing. Thanks!!
Ok, I was able to reproduce the problem on my end. I was also able to narrow down the problem to be related to std::filesystem. Summary of the problem:
The problem arises because, when onnxruntime/torch/transformers is imported first, the symbols for std::filesystem are loaded from libstdc++.so.6. These symbols are incompatible with the std::filesystem symbols that a GCC 8 build expects. There is more meaningful information here: https://bugs.launchpad.net/ubuntu/+source/gcc-8/+bug/1824721/comments/6 To resolve this, I think we might need to publish a patch release that is built with a higher GCC version. cc @jchen351
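To make the import-order dependence concrete, here is a minimal sketch of a guard that reports whether any of the modules named above is already loaded before onnxruntime_genai is imported. The helper names are hypothetical and not part of either package, and the module list is taken only from this thread, so it may be incomplete. Reordering imports is at best a workaround suggested by the issue title, not a fix.

```python
import sys

# Modules that, per the discussion above, load the system libstdc++.so.6.
# Assumption: this list comes from the thread only and may be incomplete.
LIBSTDCXX_IMPORTERS = ("onnxruntime", "torch", "transformers")


def modules_loaded_too_early(loaded=None):
    """Return the names from LIBSTDCXX_IMPORTERS that are already imported."""
    loaded = sys.modules if loaded is None else loaded
    return [name for name in LIBSTDCXX_IMPORTERS if name in loaded]


def warn_before_importing_genai(loaded=None):
    """Return a warning string, or None if no listed module is loaded yet."""
    early = modules_loaded_too_early(loaded)
    if not early:
        return None
    return (
        "already imported: " + ", ".join(early)
        + "; importing onnxruntime_genai now may hit the std::filesystem crash"
    )
```

Calling `warn_before_importing_genai()` immediately before `import onnxruntime_genai` would surface the risky order in a log message instead of an unexplained SIGSEGV.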
I can reproduce the problem on Mariner2 Linux. Supposedly the two packages use the same compiler, so there should be no discrepancy between them. I can see that the genai package uses its own fs implementation instead of the one from libstdc++.
I am rebuilding the packages with symbols. |
Now I have an onnxruntime-gpu debug package, but I don't have a genai debug package yet, so I still do not have enough clues. But I see that when it crashed, the callstack had both packages' native code in it. It seems that the onnxruntime-gpu package somehow calls into GenAI, which I didn't expect to happen. Is it possible that the onnxruntime-gpu code constructed a std::filesystem::path object and then passed it to the GenAI package's C++ code?
Do the two packages pass C++ objects around? |
No, there should be no interaction between the two packages. My hypothesis is that this is caused because we statically link against stdc++fs and when importing any python module that requires libstdc++.so, the symbols are not distinguishable (between the statically linked code and the shared lib) and ort-genai invokes std::filesystem symbols from libstdc++.so which would result in the crash. |
I am guessing that if we tried doing this same thing on an ubuntu 18.04 machine, we wouldn't see this crash. |
But in the callstack at the time of the crash I see both packages' native code, which is abnormal.
ort-genai calls into onnxruntime.so library which is embedded inside the ort-genai python package. Are you sure you saw native code from ort python package in the callstack? Or from ort-genai calling into ort.so inside the ort-genai python package? Could you paste your callstack? |
Native code from ort python package in the callstack. I am 100% sure on that. |
Here is the callstack.
The function call at the line "# 12 0x00007ffff528b911 in pybind11::detail::pybind11_meta_call" was from ORT's python package. pybind11_meta_call + 77 in section .text of /home/chasun/.local/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.cpython-39-x86_64-linux-gnu.so |
It is unexpected for ort and ort-genai to share any objects, especially since the python script does not invoke anything on onnxruntime (except for importing it, and the crash does not happen while importing onnxruntime).
I got a genai debug package, but the callstack is weird. However, I found an issue.
It only prints two things:
However, with the same command the GenAI package's binary prints a lot of stdc++ symbols, which means you didn't hide the symbols as suggested in https://gcc.gnu.org/wiki/Visibility
Does GenAI have this file? python/version_script.lds
The symbol exported from GenAI should be PyInit_onnxruntime_genai |
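A rough way to check this expectation is to scan the dynamic symbol table of the extension module and flag everything other than the module init function. The sketch below parses text in the shape that `nm -D --defined-only` usually prints; the function names are hypothetical, and the sample rows (addresses in particular) are made up for illustration rather than copied from a real build. A real binary may legitimately export a few extra symbols, so treat the result as a hint only.

```python
def exported_symbols(nm_output):
    """Parse `nm -D --defined-only` style text into a list of symbol names."""
    names = []
    for line in nm_output.splitlines():
        parts = line.split()
        # Expected shape per row: <address> <one-letter type> <name>
        if len(parts) == 3 and len(parts[1]) == 1:
            # Drop a symbol-version suffix such as "@@GLIBCXX_3.4.26" if present.
            names.append(parts[2].split("@")[0])
    return names


def unexpected_exports(nm_output, allowed_prefix="PyInit_"):
    """Return exported names that do not start with the allowed prefix."""
    return [n for n in exported_symbols(nm_output) if not n.startswith(allowed_prefix)]


# Illustrative sample only: the addresses are invented.
SAMPLE_NM = """\
0000000000012340 T PyInit_onnxruntime_genai
0000000000045670 W _ZNSt10filesystem7__cxx114path14_M_split_cmptsEv
"""
```

With the sample above, `unexpected_exports(SAMPLE_NM)` reports the mangled `std::filesystem` symbol, which is the kind of leak a version script or hidden visibility is meant to prevent.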
Got a new callstack with GenAI's debug symbols.
Update: the above call stack is not very useful, because the program had already gone wrong before that point.
Any updates? :) |
A few solutions for us to explore:
I'll work on finding the right path forward. |
I think I can confirm the root cause is https://stackoverflow.com/questions/63902528/program-crashes-when-filesystempath-is-destroyed , because after I set a breakpoint at "std::filesystem::__cxx11::path::_M_split_cmpts()", I got the following stacktrace:
Though onnxruntime_genai.cpython-39-x86_64-linux-gnu.so contains a copy of the function:
At runtime, the callstack shows that the implementation from /lib/libstdc++.so.6 was actually used. Given that the layout of the std::filesystem::path object differs (incompatibly) between GCC 8 and higher GCC versions, we should never see one implementation calling into the other.
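As a quick sanity check on a Linux machine, one can list which copies of libstdc++ (and which extension modules) are mapped into the Python process. The sketch below parses text in the format of `/proc/<pid>/maps`; the helper name and the sample mappings are hypothetical. It only shows that both libraries are present in the same process, not which implementation a given call resolves to; that still needs a debugger, as in the stack traces above.

```python
def mapped_libraries(maps_text, needle="libstdc++"):
    """Return the unique mapped file paths containing `needle`, in first-seen order."""
    paths = []
    for line in maps_text.splitlines():
        # Rows look like: <range> <perms> <offset> <dev> <inode> [<path>]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if needle in path and path not in paths:
                paths.append(path)
    return paths


# Illustrative sample only: addresses, inodes and the second path are invented.
SAMPLE_MAPS = """\
7f00a0000000-7f00a0200000 r-xp 00000000 08:01 111 /lib/libstdc++.so.6
7f00a0200000-7f00a0210000 rw-p 00200000 08:01 111 /lib/libstdc++.so.6
7f00b0000000-7f00b0100000 r-xp 00000000 08:01 222 /site-packages/onnxruntime_genai/onnxruntime_genai.cpython-39-x86_64-linux-gnu.so
7f00c0000000-7f00c0021000 rw-p 00000000 00:00 0
7ffd00000000-7ffd00021000 rw-p 00000000 00:00 0 [stack]
"""
```

In a live session this could be called as `mapped_libraries(open("/proc/self/maps").read())` after the imports have run.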
Thanks! |
I started upgrading the GCC. |
### Description
Use a common set of prebuilt manylinux base images to build the packages, to avoid building the manylinux part again and again. The base images can be used in GenAI and other projects too. This PR also updates the GCC version for inference python CUDA11/CUDA12 builds from 8 to 11. Later on I will update all other CUDA pipelines to use GCC 11, to avoid the issue described in onnx/onnx#6047 and microsoft/onnxruntime-genai#257 .

### Motivation and Context
To extract the common part as a reusable build infra among different ONNX Runtime projects.
1. Use an internal prebuilt base image instead of a public image, so that we do not need to rebuild the manylinux part again and again.
2. The new CUDA 11 base image is based on almalinux 8 instead of ubi8, so that we can get GCC 11. ubi8 only has GCC 8 and GCC 12, but GCC 12 is not compatible with CUDA 11, so before this change we used GCC 8 in the CUDA 11 build. After this change we will use GCC 11 instead.
3. Drop the support for GCC 10 and below.

This PR provides another solution for #257 .
Cannot create og.Model if the user imports onnxruntime before onnxruntime-genai.
The job is terminated by signal SIGSEGV (Address boundary error)
How to reproduce: