Release 3.6 breaks PyTorch #242

Closed
assert-0 opened this issue Jan 28, 2025 · 4 comments · Fixed by #244

@assert-0

With both torch (any version; tested with 2.5, 2.4 and 2.3) and pulsar-client==3.6.0 installed, the following code crashes:

import pulsar
import torch

Output:
free(): invalid pointer
[1] 107824 IOT instruction (core dumped)

The crash does not occur with pulsar-client==3.5.0, where both imports work normally.

@BewareMyPower
Contributor

I believe there is something wrong with the Linux wheels. It can be reproduced in an ubuntu:22.04 container with Python 3.10.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000fffff7d825d4 in __pthread_once_slow (once_control=0xfffff47dfc78 <dnnl::impl::cpu::aarch64::acl_thread_utils::acl_thread_bind()::flag_once>, 
    init_routine=0xfffff64d1e60 <__once_proxy>) at ./nptl/pthread_once.c:116
#2  0x0000fffff1d1e700 in dnnl::impl::cpu::aarch64::acl_thread_utils::set_acl_threading() ()
   from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#3  0x0000fffff12068d4 in dnnl_engine_create () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#4  0x0000ffffece70668 in ideep::engine::cpu_engine() () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#5  0x0000ffffec45cbb4 in _GLOBAL__sub_I_IDeepRegistration.cpp () from /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so
#6  0x0000fffff7fc7624 in call_init (env=0xaaaaab061c20, argv=0xfffffffff698, argc=2, l=<optimized out>) at ./elf/dl-init.c:70
#7  call_init (l=<optimized out>, argc=2, argv=0xfffffffff698, env=0xaaaaab061c20) at ./elf/dl-init.c:26
#8  0x0000fffff7fc772c in _dl_init (main_map=0xaaaaab2dff90, argc=2, argv=0xfffffffff698, env=0xaaaaab061c20) at ./elf/dl-init.c:117
#9  0x0000fffff7e2d360 in __GI__dl_catch_exception (exception=0x0, operate=0xfffff7fcdd20 <call_dl_init>, args=0xffffffffcb10)
    at ./elf/dl-error-skeleton.c:182
#10 0x0000fffff7fcdf5c in dl_open_worker (a=a@entry=0xffffffffcd58) at ./elf/dl-open.c:808
#11 0x0000fffff7e2d308 in __GI__dl_catch_exception (exception=0xffffffffcd40, operate=0xfffff7fcdeb4 <dl_open_worker>, args=0xffffffffcd58)
    at ./elf/dl-error-skeleton.c:208
#12 0x0000fffff7fce2fc in _dl_open (file=0xfffff66fbed0 "/usr/local/lib/python3.10/dist-packages/torch/_C.cpython-310-aarch64-linux-gnu.so", 
    mode=-2147483646, caller_dlopen=0xaaaaaacc2780, nsid=-2, argc=2, argv=0xfffffffff698, env=0xaaaaab061c20) at ./elf/dl-open.c:883
#13 0x0000fffff7d796e4 in dlopen_doit (a=a@entry=0xffffffffd048) at ./dlfcn/dlopen.c:56
#14 0x0000fffff7e2d308 in __GI__dl_catch_exception (exception=exception@entry=0xffffffffcfa0, operate=0xfffff7d79680 <dlopen_doit>, args=0xffffffffd048)
    at ./elf/dl-error-skeleton.c:208
#15 0x0000fffff7e2d3d0 in __GI__dl_catch_error (objname=0xffffffffd018, errstring=0xffffffffd020, mallocedp=0xffffffffd017, operate=<optimized out>, 
    args=<optimized out>) at ./elf/dl-error-skeleton.c:227
#16 0x0000fffff7d791c0 in _dlerror_run (operate=operate@entry=0xfffff7d79680 <dlopen_doit>, args=args@entry=0xffffffffd048) at ./dlfcn/dlerror.c:138
#17 0x0000fffff7d79784 in dlopen_implementation (dl_caller=<optimized out>, mode=<optimized out>, file=<optimized out>) at ./dlfcn/dlopen.c:71
#18 ___dlopen (file=<optimized out>, mode=<optimized out>) at ./dlfcn/dlopen.c:81

As with #243, the segmentation fault does not happen if torch is imported first. /cc @lhcorralo This issue might be related to the recent change to how the Linux wheels are built (#235). I will check it in detail when I'm back at work in a few days.
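
Because the outcome depends on import order, it can help to run each order in a fresh interpreter so that one crash does not take down the test itself. The following is a minimal sketch, not part of the original report; the helper names `run_import_order`, `describe` and `compare_orders` are hypothetical, and it assumes a POSIX system where a negative return code means the child was killed by a signal.

```python
import itertools
import signal
import subprocess
import sys


def run_import_order(modules):
    """Import `modules` in the given order in a fresh interpreter; return its exit code."""
    code = "; ".join(f"import {name}" for name in modules)
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    return proc.returncode


def describe(returncode):
    """Turn an exit code into a short label (negative means killed by a signal on POSIX)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exit code {returncode}"


def compare_orders(modules):
    """Try every import order and map each order to its outcome."""
    return {
        order: describe(run_import_order(order))
        for order in itertools.permutations(modules)
    }


if __name__ == "__main__":
    # Module names taken from the report above.
    for order, outcome in compare_orders(["pulsar", "torch"]).items():
        print(" -> ".join(order), ":", outcome)
```

On an affected install, the pulsar-first order would be expected to report a signal (the `free(): invalid pointer` abort would typically show up as SIGABRT, the segmentation fault as SIGSEGV), while the torch-first order would report `ok`.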

@BewareMyPower
Contributor

I did some experiments on ubuntu:22.04 today.

1. Install the pre-built DEB packages

curl -O -L https://archive.apache.org/dist/pulsar/pulsar-client-cpp-3.7.0/deb-arm64/apache-pulsar-client-dev.deb
curl -O -L https://archive.apache.org/dist/pulsar/pulsar-client-cpp-3.7.0/deb-arm64/apache-pulsar-client.deb
apt install ./apache-pulsar-client*.deb

Then build the pulsar-client-python library.

cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j8
mv build/lib_pulsar.so .
./setup.py bdist_wheel
python3 -m pip install dist/pulsar_client-3.6.0-cp310-cp310-linux_aarch64.whl --force-reinstall
python3 -c 'import pulsar; import torch'

It works well.

2. Build pulsar-client-cpp from source

# With the pulsar-client-cpp-3.7.0 source code
cmake -B build-cpp -DINTEGRATE_VCPKG=ON -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTS=OFF -DBUILD_DYNAMIC_LIB=ON -DBUILD_STATIC_LIB=ON
cmake --build build-cpp -j8 --target install

Then repeat the build and install steps from the previous section. This time the import crashes with "Segmentation fault".

@BewareMyPower
Contributor

The difference might be that the pre-built library (/usr/lib/libpulsar.so) dynamically links to dl, rt, m, and pthread, but the library built from source (/usr/local/lib/libpulsar.so) does not:

root@0008eeb7c08f:~/pulsar-client-python-3.6.0# ldd /usr/lib/libpulsar.so 
	linux-vdso.so.1 (0x0000ffff9591f000)
	libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff958d0000)
	librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff958b0000)
	libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94d60000)
	libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff95890000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94bb0000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffff958e6000)
root@0008eeb7c08f:~/pulsar-client-python-3.6.0# ldd /usr/local/lib/libpulsar.so 
	linux-vdso.so.1 (0x0000ffffb6cd4000)
	libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb6af0000)
	/lib/ld-linux-aarch64.so.1 (0x0000ffffb6c9b000)
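
The difference between the two `ldd` listings can be made explicit with a small set comparison. This is a rough sketch rather than anything from the repository: `parse_ldd` is a hypothetical helper, the two strings are copied from the output above, and the parser only handles the ordinary `name => path (address)` and `path (address)` line shapes shown there.

```python
import os


def parse_ldd(output):
    """Return the set of library file names listed in `ldd` output."""
    names = set()
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The first field is either a soname ("libm.so.6") or a full path
        # ("/lib/ld-linux-aarch64.so.1"); keep only the file name.
        names.add(os.path.basename(fields[0]))
    return names


PREBUILT = """\
    linux-vdso.so.1 (0x0000ffff9591f000)
    libdl.so.2 => /lib/aarch64-linux-gnu/libdl.so.2 (0x0000ffff958d0000)
    librt.so.1 => /lib/aarch64-linux-gnu/librt.so.1 (0x0000ffff958b0000)
    libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff94d60000)
    libpthread.so.0 => /lib/aarch64-linux-gnu/libpthread.so.0 (0x0000ffff95890000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff94bb0000)
    /lib/ld-linux-aarch64.so.1 (0x0000ffff958e6000)
"""

FROM_SOURCE = """\
    linux-vdso.so.1 (0x0000ffffb6cd4000)
    libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb6af0000)
    /lib/ld-linux-aarch64.so.1 (0x0000ffffb6c9b000)
"""

# Libraries the pre-built libpulsar.so depends on that the source build does not.
only_in_prebuilt = sorted(parse_ldd(PREBUILT) - parse_ldd(FROM_SOURCE))
print(only_in_prebuilt)
# → ['libdl.so.2', 'libm.so.6', 'libpthread.so.0', 'librt.so.1']
```

To compare live libraries instead of pasted text, the same function could be fed the captured output of `ldd <path>` for each file.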

@BewareMyPower
Contributor

I pushed a PR: #244

I verified the patch by repeating the 2nd approach above, but I still need to build the wheels in CI and verify them again.

BewareMyPower added a commit that referenced this issue Feb 8, 2025
…dc++ (#244)

Fixes #242
Fixes #243
Fixes #245

Add a patch to avoid linking to libgcc and libstdc++.
BewareMyPower added a commit that referenced this issue Feb 8, 2025
…dc++ (#244)

Fixes #242
Fixes #243
Fixes #245

Add a patch to avoid linking to libgcc and libstdc++.

(cherry picked from commit 4a4ac3f)