outer_limits crashes if too many processes are used #94

Closed
vasdommes opened this issue Jul 14, 2023 · 1 comment · Fixed by #169

@vasdommes
Collaborator

Our test case for outer_limits (test/data/outer_limits/toy_functions.json) works fine for 4 processes, but fails for 6 or 8.

On the Expanse cluster:

$ mpirun -n 6 build/outer_limits --functions test/data/outer_limits/toy_functions.json --out test/out/outer_limits/toy_functions_out.json --checkpointDir test/out/outer_limits/ck --points test/data/outer_limits/toy_functions_points.json --precision=128 --dualityGapThreshold=1e-10 --primalErrorThreshold=1e-10 --dualErrorThreshold=1e-10 --initialMatrixScalePrimal=1e1 --initialMatrixScaleDual=1e1 --maxIterations=1000 --verbosity=1
Outer_Limits started at 2023-Jul-13 22:10:53
functions file  : "test/data/outer_limits/toy_functions.json"
out directory   : "test/out/outer_limits/toy_functions_out.json"

Parameters:
dualityGapReduction          = 1024
meshThreshold                = 0.001
maxIterations                = 1000
maxRuntime                   = 9223372036854775807
checkpointInterval           = 3600
findPrimalFeasible           = false
findDualFeasible             = false
detectPrimalFeasibleJump     = false
detectDualFeasibleJump       = false
precision(actual)            = 128(128)
dualityGapThreshold          = 1e-10
primalErrorThreshold         = 1e-10
dualErrorThreshold           = 1e-10
initialMatrixScalePrimal     = 10
initialMatrixScaleDual       = 10
feasibleCenteringParameter   = 0.1
infeasibleCenteringParameter = 0.3
stepLengthReduction          = 0.7
maxComplementarity           = 1e+100
initialCheckpointDir         = "test/out/outer_limits/ck"
checkpointDir                = "test/out/outer_limits/ck"
writeSolution                = 
verbosity                    = 1

num_constraints: 4
Threshold: 1.1

          time    mu     P-obj       D-obj      gap         P-err       p-err       D-err      P-step   D-step   beta
---------------------------------------------------------------------------------------------------------------------
1            0 1.0e+02  +0.00       +0.00       0.00       +10.0       +1.00       +9.00       0.655    0.778    0.300
2            0 43.      +23.5       -13.2       1.00       +3.45       +0.345      +2.00       0.798    1.00     0.300
3            0 16.      +25.8       -21.4       1.00       +0.699      +0.0699     +6.44e-39   1.00     1.00     0.300
weight: [1.000000000000000000000000000000000000000, 17.92002209012499307550986390081702891602]
optimal: -17.92002209012499307550986390081702891602
terminate called after throwing an instance of 'std::out_of_range'
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
  what():  vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
[exp-4-16:12745] *** Process received signal ***
[exp-4-16:12746] *** Process received signal ***
[exp-4-16:12745] Signal: Aborted (6)
[exp-4-16:12745] Signal code:  (-6)
[exp-4-16:12746] Signal: Aborted (6)
[exp-4-16:12746] Signal code:  (-6)
[exp-4-16:12745] [ 0] [exp-4-16:12746] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x15554f25bcf0]
[exp-4-16:12745] [ 1] /lib64/libpthread.so.0(+0x12cf0)[0x15554f25bcf0]
[exp-4-16:12746] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x15554eed1aff]
[exp-4-16:12745] [ 2] /lib64/libc.so.6(gsignal+0x10f)[0x15554eed1aff]
[exp-4-16:12746] [ 2] /lib64/libc.so.6(abort+0x127)[0x15554eea4ea5]
[exp-4-16:12745] [ 3] /lib64/libc.so.6(abort+0x127)[0x15554eea4ea5]
[exp-4-16:12746] [ 3] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xa259c)[0x15554faa659c]
[exp-4-16:12745] [ 4] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xa259c)[0x15554faa659c]
[exp-4-16:12746] [ 4] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad636)[0x15554fab1636]
[exp-4-16:12745] [ 5] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad636)[0x15554fab1636]
[exp-4-16:12746] [ 5] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad6a1)[0x15554fab16a1]
[exp-4-16:12745] [ 6] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad6a1)[0x15554fab16a1]
[exp-4-16:12746] [ 6] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad935)[0x15554fab1935]
[exp-4-16:12745] [ 7] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xad935)[0x15554fab1935]
[exp-4-16:12746] [ 7] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xa4e9b)[0x15554faa8e9b]
[exp-4-16:12745] [ 8] /cm/shared/apps/spack/cpu/opt/spack/linux-centos8-zen/gcc-8.3.1/gcc-10.2.0-n7su7jf54rc7l2ozegds5xksy6qhrjin/lib64/libstdc++.so.6(+0xa4e9b)[0x15554faa8e9b]
[exp-4-16:12746] [ 8] build/outer_limits[0x4574a2]
[exp-4-16:12746] [ 9] build/outer_limits[0x4574a2]
[exp-4-16:12745] [ 9] build/outer_limits[0x42ceb9]
build/outer_limits[0x42ceb9]
[exp-4-16:12746] [10] [exp-4-16:12745] [10] /lib64/libc.so.6(__libc_start_main+0xe5)[0x15554eebdd85]
[exp-4-16:12746] [11] build/outer_limits[0x44204e]
[exp-4-16:12746] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xe5)[0x15554eebdd85]
[exp-4-16:12745] [11] build/outer_limits[0x44204e]
[exp-4-16:12745] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 4 with PID 0 on node exp-4-16 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
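
A note on reading the error above: `vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)` is the message libstdc++ attaches to the `std::out_of_range` thrown by `std::vector::at`, here for index 0 on an empty vector. The sketch below is a hypothetical illustration of how a rank can end up with an empty vector when there are more processes than work items (the log shows `num_constraints: 4` with 6 processes, and two processes abort). It is not the outer_limits code: `items_for_rank` and `first_item_error` are made-up names, and the round-robin assignment is an assumption chosen only to reproduce the same exception.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical round-robin assignment (an assumption, not SDPB's scheme):
// work item i is owned by rank i % num_ranks.
std::vector<std::size_t>
items_for_rank(std::size_t num_items, std::size_t num_ranks, std::size_t rank)
{
  assert(num_ranks > 0 && rank < num_ranks);
  std::vector<std::size_t> result;
  for(std::size_t i = rank; i < num_items; i += num_ranks)
    result.push_back(i);
  return result;
}

// Tries to read the first item owned by `rank` with bounds checking.
// Returns an empty string on success, or the exception text if the
// rank owns nothing and at(0) throws std::out_of_range.
std::string
first_item_error(std::size_t num_items, std::size_t num_ranks, std::size_t rank)
{
  const std::vector<std::size_t> items
    = items_for_rank(num_items, num_ranks, rank);
  try
    {
      (void)items.at(0); // throws when items is empty
      return "";
    }
  catch(const std::out_of_range &e)
    {
      return e.what();
    }
}
```

With 4 items and 4 ranks every rank owns one item, so `first_item_error(4, 4, r)` returns an empty string for each `r`. With 4 items and 6 ranks, ranks 4 and 5 own nothing and the call returns the exception text instead. In code of this shape, checking `items.empty()` before indexing would avoid the abort.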
@vasdommes vasdommes added the bug label Jul 14, 2023
@davidsd
Owner

davidsd commented Jul 14, 2023

Just a note: outer_limits is experimental and has not been extensively tested or used yet in any research projects. We shouldn't spend too much time fixing it up until someone steps forward and wants to use it in an active research project.

@vasdommes vasdommes self-assigned this Jul 14, 2023
@vasdommes vasdommes added this to the Backlog milestone Nov 14, 2023
@vasdommes vasdommes removed the backlog label Nov 14, 2023
vasdommes added a commit that referenced this issue Jan 3, 2024
Fixes #94 outer_limits crashes if too many processes are used
(outer_limits code did not set procsPerNode correctly)

TODO: get rid of procsPerNode option
vasdommes added a commit that referenced this issue Jan 3, 2024
Running with 6 processes caused crash before.
bharathr98 pushed a commit to bharathr98/sdpb that referenced this issue Mar 1, 2024
Fixes davidsd#94 outer_limits crashes if too many processes are used
(outer_limits code did not set procsPerNode correctly)

TODO: get rid of procsPerNode option
bharathr98 pushed a commit to bharathr98/sdpb that referenced this issue Mar 1, 2024
Running with 6 processes caused crash before.
2 participants