
pw2qmcpack.x gives a segmentation fault when converting DFT+U+V wavefunctions #5244

Closed
kayahans opened this issue Nov 26, 2024 · 5 comments

@kayahans
Contributor

kayahans commented Nov 26, 2024

Describe the bug
Using QE 7.4, the pw2qmcpack.x executable gives a segmentation fault when it tries to convert DFT+U+V wavefunctions. Using the identical procedure and settings (SCF, NSCF, pw2qmcpack.x), DFT+U wavefunctions are converted with no issues.

pw2qmcpack.x is part of QE right now, but I guess @ye-luo is the only person maintaining it, so I am posting it here.

To Reproduce
Steps to reproduce the behavior:
I tested it on two computers. Please see the attached files to reproduce the issue. If you would like the wavefunctions, please contact me and I will transfer them to a suitable location for you.

dft_uv_io.zip

ORNL-Baseline:

  1. QE 7.4 stable version
  2. module purge; module load DefApps intel/20.0.4 openmpi/4.0.4 hdf5/1.14.3 cmake/3.26.3
  3. cmake -DQE_ENABLE_PLUGINS="pw2qmcpack" -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 -DQE_ENABLE_HDF5=ON ..

NERSC Perlmutter (using the compilation script that @aannabe provided):

  1. QE 7.4 stable version
  2. module load cpu
     module load cray-hdf5-parallel/1.12.2.9
     module load cray-fftw
     module load cmake
     module list

     export HDF5_ROOT=$HDF5_DIR
     export HDF5_LIBRARIES=$HDF5_DIR/lib
     export HDF5_INCLUDE_DIRS=$HDF5_DIR/include
  3. cmake -DCMAKE_C_COMPILER=cc -DCMAKE_Fortran_COMPILER=ftn -DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment -DQE_ENABLE_PLUGINS=pw2qmcpack -DQE_ENABLE_HDF5=ON ..
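
For completeness, the conversion step itself is not listed above; a minimal sketch of how pw2qmcpack.x is typically invoked follows (the prefix and outdir values are inferred from the attached files, the p2q.in file name is hypothetical, and srun/mpirun should be used as appropriate for the machine):

cat > p2q.in << EOF
&inputpp
  prefix = 'pwscf'
  outdir = './pwscf_output'
  write_psir = .false.
/
EOF
mpirun -np 1 pw2qmcpack.x < p2q.in > conv.out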

Expected behavior
Execution of pw2qmcpack.x should not depend on the functional used in QE.

System:

  • ORNL Baseline and NERSC Perlmutter
@prckent
Contributor

prckent commented Dec 4, 2024

This is a very strange problem because the converter pw2qmcpack does not know anything about the functional that is being used. It merely collects the already computed wavefunctions and writes them out in our format.

I was not able to reproduce this problem on a 2x24-core server with a GCC 14 build of QE 7.4 using the U=2, V=4 inputs. Other than loosening the SCF convergence threshold for speed, I kept the other files the same. pw.x was run with 48 MPI tasks and OMP_NUM_THREADS=1. pw2qmcpack.x worked successfully with 1, 48, or even 128 tasks, generating a 13 GiB pwscf.pwscf.h5 file. Note that the 1-task runs were actually the fastest (17 seconds).

Interestingly, your conv.out has some text from routine gamma_only that I do not get:

$diff conv.out_orig conv.out128
2c2
<      Program pw2qmcpack v.7.4 starts on 25Nov2024 at 11:25: 1 
---
>      Program pw2qmcpack v.7.4 starts on  4Dec2024 at 11:39:42 
13c13,15
<      Parallel version (MPI), running on   128 processors
---
>      Parallel version (MPI & OpenMP), running on     128 processor cores
>      Number of MPI processes:               128
>      Threads/MPI process:                     1
17,18c19,20
<      478222 MiB available memory on the printing compute node when the environment starts
<  
---
>      112180 MiB available memory on the printing compute node when the environment starts
> 
245c247
<  
---
> 
252c254
<  
---
> 
254,255c256,277
<  
<      Message from routine gamma_only text too long will be truncated :
---
> 
>      Reading collected, re-writing distributed wavefunctions
> esh5 destory the existing pwscf_output/pwscf.pwscf.h5
> esh5 create pwscf_output/pwscf.pwscf.h5
> 
>      compute_qmcp :     64.47s CPU    139.99s WALL (       1 calls)
> 
>      Called by read_file_lite:
> 
>      Called by compute_qmcpack:
>      big_loop     :     63.53s CPU    138.16s WALL (       1 calls)
>      write_h5     :      0.01s CPU      0.12s WALL (       1 calls)
>      glue_h5      :      0.00s CPU      0.00s WALL (       1 calls)
> 
>      pw2qmcpack   :   1m23.65s CPU   2m51.26s WALL
> 
> 
>    This run was terminated on:  11:42:33   4Dec2024            
> 
> =------------------------------------------------------------------------------=
>    JOB DONE.
> =------------------------------------------------------------------------------=

Suggestions for things to try:

  • Try running with only 1 MPI task and OMP_NUM_THREADS=1
  • Build the OpenMP version in case there is a difference there

In all cases, double check that the directories are appropriately clean.

If this doesn't work, we might try different compilers and libraries. I do wonder if there is an "off by one" kind of memory error associated with (say) HDF5 usage that has been lurking. However, this would not explain why your DFT+U runs are OK but the DFT+U+V runs are not.
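
A minimal sketch of the first suggestion (p2q.in here stands for a hypothetical pw2qmcpack input file with the usual &inputpp namelist, run from the directory that contains pwscf_output):

export OMP_NUM_THREADS=1
mpirun -np 1 pw2qmcpack.x < p2q.in > conv.out_serial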

@kayahans
Contributor Author

kayahans commented Dec 4, 2024

Thank you, Paul, for rerunning my case and for your detailed response. Following your suggestion, I built the OpenMP version of QE 7.4, and the problem appears to be resolved with it. I tested it on ORNL-Baseline using the same scripts and modules as in the files I shared.

I noticed that in the SCF calculations with the non-OpenMP build, when DFT+U+V calculations are run, the wavefunctions and charge densities are written in the .dat format, whereas for the DFT+U calculations they are written in the .hdf5 format. pw2qmcpack requires the HDF5 library to be enabled in QE, so I don't understand why the DFT+U+V wavefunctions are written in .dat format with the non-OpenMP build.

If pw2qmcpack terminates because the .hdf5 files are missing, it would be better if it reported that to the user rather than giving a segmentation fault.

>ls u*/pwscf_output/pwscf.save/charge-density*
u-2.0-v-0.0/pwscf_output/pwscf.save/charge-density.hdf5
u-2.0-v-2.0/pwscf_output/pwscf.save/charge-density.dat
u-2.0-v-4.0/pwscf_output/pwscf.save/charge-density.dat
u-4.0-v-0.0/pwscf_output/pwscf.save/charge-density.hdf5
u-4.0-v-2.0/pwscf_output/pwscf.save/charge-density.dat
u-4.0-v-4.0/pwscf_output/pwscf.save/charge-density.dat
u-6.0-v-0.0/pwscf_output/pwscf.save/charge-density.hdf5
u-6.0-v-2.0/pwscf_output/pwscf.save/charge-density.dat
u-8.0-v-0.0/pwscf_output/pwscf.save/charge-density.hdf5
u-8.0-v-2.0/pwscf_output/pwscf.save/charge-density.dat
u-8.0-v-4.0/pwscf_output/pwscf.save/charge-density.dat
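
As a quick way to check which runs produced a readable HDF5 charge density before calling the converter, something like the following sketch could be used (h5dump is the standard HDF5 command-line tool; the paths follow the listing above):

for d in u-*; do
  f=$d/pwscf_output/pwscf.save/charge-density.hdf5
  if [ -r "$f" ] && h5dump -n "$f" > /dev/null 2>&1; then
    echo "$d: readable charge-density.hdf5 found"
  else
    echo "$d: no readable charge-density.hdf5 (only .dat?), pw2qmcpack.x will likely fail"
  fi
done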

@prckent
Contributor

prckent commented Dec 4, 2024

Good news, but strange!
If pw2qmcpack crashes when the charge-density.hdf5 is not there, it would be good to trap this error. To fully confirm your theory, can you try moving the charge-density.hdf5 file or setting it to unreadable, and see what happens?
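
A sketch of that test, using one of the V=0 runs from the listing above (the backup name, p2q.in, and the launcher are hypothetical):

cd u-2.0-v-0.0
# hide the HDF5 charge density (or make it unreadable with chmod 000 instead)
mv pwscf_output/pwscf.save/charge-density.hdf5 ./charge-density.hdf5.hidden
export OMP_NUM_THREADS=1
mpirun -np 1 pw2qmcpack.x < p2q.in > conv.out_missing_h5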

@kayahans
Contributor Author

kayahans commented Dec 4, 2024

I moved all the wfc*hdf5 files and charge-density.hdf5 under pwscf_output/pwscf.save to a separate directory and reran pw2qmcpack.x from the OpenMP build. I got the same message as in the failed examples I initially reported:

Message from routine gamma_only text too long will be truncated :

followed by the segmentation fault.

@prckent
Contributor

prckent commented Dec 4, 2024

On my system, hiding either the charge-density.hdf5 or some of the wfc files results in an appropriate error.

I think we can close this issue and re-evaluate if anyone else runs into troubles.

@prckent prckent closed this as completed Dec 4, 2024