Illegal instruction crash on x86_64 (probably due to AVX instructions) #235

Closed
vasdommes opened this issue Apr 26, 2024 · 1 comment · Fixed by #237
Description

There have been several cases where SDPB crashed right after printing its parameters:

[r605u15n05:3757218:0:3757218] Caught signal 4 (Illegal instruction: illegal operand)
==== backtrace (tid:3757214) ====
 0 0x000000000004eb50 killpg()  ???:0
 1 0x0000000000168c34 n_sqrt()  /home/rse23/project/flint/src/ulong_extras/sqrt.c:20
 2 0x00000000004fa699 Fmpz_Comb::Fmpz_Comb()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdp_solve/SDP_Solver/run/bigint_syrk/fmpz/Fmpz_Comb.cxx:30
 3 0x00000000004fa699 calculate_primes()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdp_solve/SDP_Solver/run/bigint_syrk/fmpz/Fmpz_Comb.cxx:79
 4 0x00000000004fa699 Fmpz_Comb::Fmpz_Comb()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdp_solve/SDP_Solver/run/bigint_syrk/fmpz/Fmpz_Comb.cxx:91
 5 0x00000000004947b8 Block_Info::read_block_costs()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdp_solve/Block_Info/read_block_costs.cxx:99
 6 0x000000000048735d Block_Info::Block_Info()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdp_solve/Block_Info/Block_Info.cxx:12
 7 0x000000000043e5ea main()  /gpfs/gibbs/project/poland/rse23/sdpb/build/../src/sdpb/main.cxx:81
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000044fa9e _start()  ???:0

Two known cases:

  1. Yale cluster, pi-poland partition, SDPB compiled on cascadelake (with avx512) and run on broadwell (without avx512).
  2. On one cluster, Singularity image built from docker://bootstrapcollaboration/sdpb:master fails, whereas docker://bootstrapcollaboration/sdpb:2.7.0 works fine. We don't have a stacktrace for this case.

A similar crash happened when running the amd64 Docker image on an arm64 CPU, see #222.

Possible cause

The function that crashes is simply sqrt(double) called from FLINT's n_sqrt():
https://github.com/flintlib/flint/blob/213a4cff74d1bff9f0e07b63c57b6f3f8876e3a0/src/ulong_extras/sqrt.c#L20

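When the code is built for a modern CPU, sqrt is compiled to the vsqrtsd instruction from the AVX extensions.
When it is built for an older CPU, it is compiled to sqrtsd.

The problem arises when the FLINT binary is compiled on a new CPU and then run on an older one that does not support the instructions the compiler emitted.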
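
To check whether a build node and a compute node differ in this way, one can compare their CPU feature flags. Below is a minimal sketch, assuming x86 Linux (where /proc/cpuinfo has a "flags" line); the function names cpu_flags and missing_on_run_host are hypothetical helpers, not part of SDPB or FLINT.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SDPB or FLINT). Assumes an x86 Linux
# cpuinfo layout, where a line starting with "flags" lists the CPU features.

# Print the feature flags from a cpuinfo-style file, one per line, sorted.
# Defaults to /proc/cpuinfo when no file is given.
cpu_flags() {
    awk -F: '/^flags/ { n = split($2, f, " "); for (i = 1; i <= n; i++) print f[i]; exit }' \
        "${1:-/proc/cpuinfo}" | LC_ALL=C sort -u
}

# Print the flags that are present in the build node's cpuinfo ($1)
# but absent from the run node's cpuinfo ($2).
missing_on_run_host() {
    build_flags=$(mktemp)
    run_flags=$(mktemp)
    cpu_flags "$1" > "$build_flags"
    cpu_flags "$2" > "$run_flags"
    # comm -23 keeps the lines that appear only in the first sorted file
    LC_ALL=C comm -23 "$build_flags" "$run_flags"
    rm -f "$build_flags" "$run_flags"
}
```

For example, save /proc/cpuinfo from the build node as build-node.cpuinfo (a placeholder name) and run `missing_on_run_host build-node.cpuinfo /proc/cpuinfo` on the compute node. If the output contains flags such as avx512f, a binary tuned for the build node may use instructions that the compute node lacks.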

Fix

When building FLINT, run ./configure with the --host option specifying the target machine.

For the FLINT Docker image (built from https://github.com/vasdommes/flint/tree/docker-main), we should set the --host=amd64 option. This ensures compatibility with all x86_64 CPUs by setting the compiler flag -march=x86-64 (see configure.ac and GCC flags).

On clusters, one can choose a more specific target host, e.g. --host=broadwell.

If a similar problem occurs in SDPB's own code, one can set the compiler flag explicitly, e.g. CXXFLAGS="${CXXFLAGS} -march=broadwell" ./waf configure <...>

@vasdommes vasdommes added this to the 3.0.0 milestone Apr 26, 2024
@vasdommes vasdommes self-assigned this Apr 26, 2024
vasdommes commented Apr 26, 2024

TODO:

  • Update FLINT image.
  • Add notes to documentation.
