
Problems when instrumenting MPI applications with HDF5 at runtime #989

Open

arcturus5340 opened this issue May 10, 2024 · 11 comments

@arcturus5340

When attempting to instrument DLIO at runtime as follows:

$ env LD_PRELOAD=/home/user/darshan/darshan-runtime/install/lib/libdarshan.so mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow

I get the following error:

/bin/sh: symbol lookup error: /home/user/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5FDperform_init

I installed Darshan as follows:

$ ./configure --with-log-path=/home/user/darshan-logs --with-jobid-env=PBS_JOBID CC=mpicc --prefix=/home/user/darshan/darshan-runtime/install --enable-hdf5-mod --with-hdf5=/cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/HDF5/1.14.0-iimpi-2022a
$ make
$ make install

And in the output I got:

------------------------------------------------------------------------------
   Darshan Runtime Version 3.4.4 configured with the following features:
           MPI C compiler                - icc
           GCC-compatible compiler       - yes
           NULL          module support  - yes
           POSIX         module support  - yes
           STDIO         module support  - yes
           DXT           module support  - yes
           MPI-IO        module support  - yes
           AUTOPERF MPI  module support  - no
           AUTOPERF XC   module support  - no
           HDF5          module support  - yes (using HDF5 1.14.0)
           PnetCDF       module support  - no
           BG/Q          module support  - no
           Lustre        module support  - yes
           MDHIM         module support  - no
           HEATMAP       module support  - yes
           Memory alignment in bytes     - 8
           Log file env variables        - N/A
           Location of Darshan log files - /home/user/darshan-logs
           Job ID env variable           - PBS_JOBID
           MPI-IO hints                  - romio_no_indep_rw=true;cb_nodes=4

So configure clearly recognized HDF5 during installation (otherwise, how would it know the version?).

Next is the output of ldd libdarshan.so, which may prove useful:

    linux-vdso.so.1 (0x00007ffffb752000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x000014d1553bb000)
    librt.so.1 => /lib64/librt.so.1 (0x000014d1551b3000)
    libdl.so.2 => /lib64/libdl.so.2 (0x000014d154faf000)
    liblustreapi.so.1 => /lib64/liblustreapi.so.1 (0x000014d154d6f000)
    libm.so.6 => /lib64/libm.so.6 (0x000014d1549ed000)
    libz.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/zlib/1.2.12-GCCcore-11.3.0/lib/libz.so.1 (0x000014d15577c000)
    libmpifort.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/libmpifort.so.12 (0x000014d154639000)
    libmpi.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/release/libmpi.so.12 (0x000014d152df1000)
    libc.so.6 => /lib64/libc.so.6 (0x000014d152a2c000)
    /lib64/ld-linux-x86-64.so.2 (0x000014d1555db000)
    libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000014d15281c000)
    liblnetconfig.so.4 => /lib64/liblnetconfig.so.4 (0x000014d1525f7000)
    libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x000014d1523d7000)
    libreadline.so.7 => /lib64/libreadline.so.7 (0x000014d152188000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000014d151f84000)
    libgcc_s.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x000014d15575e000)

I will note that running DLIO + HDF5 without Darshan does not cause any problems:

$ mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow
[INFO] 2024-05-01T12:00:00.000000 Running DLIO with 8 process(es) [/rwthfs/rz/cluster/home/user/dlio_benchmark/src/dlio_benchmark.py:102]
...

I also tried running Darshan with a simple program using HDF5 (code here) and had no problems doing so. So the issue may be related to the fact that Darshan does not track H5FDperform_init.
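One way to confirm which HDF5 symbols libdarshan.so expects the runtime to supply is to list its undefined dynamic symbols. A sketch of that diagnostic (the path is a placeholder for your actual install):

```shell
# Hypothetical path — point this at your actual libdarshan.so.
# `nm -D` lists the dynamic symbol table; lines marked "U" are undefined,
# i.e. symbols the dynamic linker must resolve from some other library
# (here, the HDF5 library) at load time.
nm -D /path/to/libdarshan.so | grep ' U H5'
```

If H5FDperform_init appears in this list, the preload fails unless an HDF5 library that defines it is already loaded.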

@shanedsnyder
Contributor

Could you try our latest release (3.4.5) and see if you still have the issue? We reworked something in our HDF5 module that I think may resolve this issue.

@arcturus5340
Author

I repeated the installation process as described above and now a different but similar error occurred:

/bin/sh: symbol lookup error: /home/kr166361/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

@shanedsnyder
Contributor

Hmm, maybe there's something still not quite right with how Darshan's HDF5 module interacts with HDF5 libraries at runtime. We've seen similar issues that we've tried to address in recent releases, but maybe we need to rethink things again. I'll see if I can reproduce this with DLIO and think more about it.

I think you could probably avoid the issue entirely by modifying your setting of LD_PRELOAD to additionally reference the HDF5 library: export LD_PRELOAD=/path/to/libdarshan.so:/path/to/libhdf5.so
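A concrete sketch of that workaround applied to the original command (both paths are placeholders for the actual install locations):

```shell
# Placeholder paths — substitute your real libdarshan.so and libhdf5.so.
DARSHAN_LIB=/path/to/libdarshan.so
HDF5_LIB=/path/to/libhdf5.so
# Preloading HDF5 alongside Darshan ensures the H5* symbols that
# libdarshan.so leaves undefined can be resolved at process startup.
export LD_PRELOAD="${DARSHAN_LIB}:${HDF5_LIB}"
mpirun -np 8 python -m src.dlio_benchmark workload=cosmoflow
```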

@arcturus5340
Author

Thanks for your help!

@hariharan-devarajan

> I repeated the installation process as described above and now a different but similar error occurred:
>
> /bin/sh: symbol lookup error: /home/kr166361/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

So, DLIO installs h5py, which is compiled against a specific HDF5 library, and you compile Darshan against a specific HDF5 library as well. I suspect the HDF5 version h5py uses and the version Darshan expects are different, which causes this issue.

Docs: how to make sure you install h5py with the correct HDF5.

The main idea is to make sure the HDF5 that h5py was compiled against matches the one Darshan was compiled against.
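A sketch of rebuilding h5py from source against a specific HDF5 installation, following the approach in the h5py docs (the HDF5_DIR path is a placeholder):

```shell
# Placeholder path — point HDF5_DIR at the same HDF5 install that
# Darshan was configured with (--with-hdf5=...).
export HDF5_DIR=/path/to/hdf5
# --no-binary forces a source build of h5py, so it compiles and links
# against $HDF5_DIR instead of using a prebuilt wheel's bundled HDF5.
pip install --no-binary=h5py h5py
```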

@arcturus5340
Author

I updated h5py as per the link you provided and reinstalled DLIO, after which the HDF5 version used by the package matched the one used by Darshan:

user@login18-2:~/dlio_benchmark[1011]$ python
Python 3.10.4 (main, Aug  9 2023, 13:18:35) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import h5py
>>> h5py.version.hdf5_version
'1.14.0'

However, in spite of this, the error persisted:

mpiexec: symbol lookup error: /home/user/darshan/darshan-runtime/install/lib/libdarshan.so: undefined symbol: H5Eset_auto2

@hariharan-devarajan

Does ldd on libdarshan.so show an HDF5 .so, and if so, is it the same one you need? If not, you can LD_PRELOAD the HDF5 .so as well, before libdarshan.so.

@arcturus5340
Author

No, there is no hdf5 in the ldd output:

$ ldd libdarshan.so
        linux-vdso.so.1 (0x000014ffadb02000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x000014ffad6b6000)
        librt.so.1 => /lib64/librt.so.1 (0x000014ffad4ae000)
        libdl.so.2 => /lib64/libdl.so.2 (0x000014ffad2aa000)
        liblustreapi.so.1 => /lib64/liblustreapi.so.1 (0x000014ffad06a000)
        libm.so.6 => /lib64/libm.so.6 (0x000014ffacce8000)
        libz.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/zlib/1.2.12-GCCcore-11.3.0/lib/libz.so.1 (0x000014ffada6d000)
        libmpifort.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/libmpifort.so.12 (0x000014ffac934000)
        libmpi.so.12 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/impi/2021.6.0-intel-compilers-2022.1.0/mpi/2021.6.0/lib/release/libmpi.so.12 (0x000014ffab0ec000)
        libc.so.6 => /lib64/libc.so.6 (0x000014ffaad27000)
        /lib64/ld-linux-x86-64.so.2 (0x000014ffad8d6000)
        libjson-c.so.4 => /lib64/libjson-c.so.4 (0x000014ffaab17000)
        liblnetconfig.so.4 => /lib64/liblnetconfig.so.4 (0x000014ffaa8f2000)
        libyaml-0.so.2 => /lib64/libyaml-0.so.2 (0x000014ffaa6d2000)
        libreadline.so.7 => /lib64/libreadline.so.7 (0x000014ffaa483000)
        libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x000014ffaa27f000)
        libgcc_s.so.1 => /cvmfs/software.hpc.rwth.de/Linux/RH8/x86_64/intel/skylake_avx512/software/GCCcore/11.3.0/lib64/libgcc_s.so.1 (0x000014ffada51000)
        libtinfo.so.6 => /lib64/libtinfo.so.6 (0x000014ffaa052000)

However, after adding the path to HDF5 in LD_PRELOAD, everything works fine. Thanks!

@hariharan-devarajan

I think if you compile Darshan with HDF5, the HDF5 .so should be linked into libdarshan.so. Maybe it is still a bug. @shanedsnyder, thoughts?

@shanedsnyder
Contributor

I'll have to dig into it more, but you may be on to something @hariharan-devarajan -- some improper linking of HDF5 could be leading to this error. It is a little tricky though, in that we really don't want the HDF5 library Darshan is using to override what the user wants. E.g., if Darshan was built against a 1.12.x version of HDF5, but the user is trying to build an app against a newer 1.14.x version, then we obviously need to be careful that the 1.12.x libraries aren't used at runtime. I think that's part of the reason that ldd doesn't show HDF5 libraries, as we are intentionally hoping the user provides them at link time. Perhaps this leads to different behavior depending on whether LD_PRELOAD is used or whether Darshan is directly linked into the application (which can't be done with Python).

I'll leave the issue open so I don't forget to investigate. In the meantime, being careful to set LD_PRELOAD to point to both libraries seems to be the way to go.

@hariharan-devarajan

Additionally, consider incorrect linking at runtime. I think you need ABI compatibility (e.g., via libtool versioning) to ensure they match. In general, if you use the C interface of HDF5, a version mismatch won't break things, but I think you should link Darshan against the HDF5 it was compiled with; otherwise it confuses people about which version Darshan needs (or was compiled with). As I remember, HDF5 also has macros to verify the version at runtime. I believe this would need some work to make sure the stack has a consistent view of the libraries to be loaded/needed.
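The runtime version check described here can be sketched in Python (the helper names below are hypothetical; HDF5's own mechanism is the C-level H5check_version(), which aborts on a major/minor mismatch between compile-time and runtime versions):

```python
# Hedged sketch of the consistency check HDF5 performs itself: compare
# the version a component was compiled against with the version of the
# library actually loaded at runtime.
def parse_version(version_string):
    """Split a version string like '1.14.0' into the tuple (1, 14, 0)."""
    return tuple(int(part) for part in version_string.split("."))

def abi_compatible(compiled_against, loaded_at_runtime):
    """Treat a differing major or minor version as incompatible, so only
    the first two components are compared (patch releases are allowed)."""
    return parse_version(compiled_against)[:2] == parse_version(loaded_at_runtime)[:2]

# Darshan built against 1.14.0, h5py's runtime HDF5 at 1.14.3: compatible.
print(abi_compatible("1.14.0", "1.14.3"))  # True
# Darshan built against 1.12.2 but 1.14.0 loaded at runtime: incompatible.
print(abi_compatible("1.12.2", "1.14.0"))  # False
```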
