Platypus and MPIEvaluator Issue #370

Open
pollockDeVis opened this issue Sep 11, 2024 · 4 comments

Comments

@pollockDeVis

@quaquel: I encountered the following issue while running an optimization on 60 cores on the HPC. The run crashed when the optimization was 87% complete. A ValueError was raised in Platypus core, and while that exception was being handled, a second one was raised: AttributeError: 'MPIEvaluator' object has no attribute 'logwatcher_thread'. EMA Workbench version: 2.4.1

 87%|████████████████████████▏   | 129901/150000 [83:50:59<11:59:20,  2.15s/it]
 87%|████████████████████████▎   | 130393/150000 [84:08:48<11:44:06,  2.15s/it]Traceback (most recent call last):
  File "/scratch/palokbiswas/Repo/JUSTICE/analysis/analyzer.py", line 213, in run_optimization_adaptive
    results = evaluator.optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 228, in optimize
    return optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 576, in optimize
    return _optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/optimization.py", line 1101, in _optimize
    optimizer.run(nfe)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 410, in run
    self.step()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 1521, in step
    self.algorithm.step()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 182, in step
    self.iterate()
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/algorithms.py", line 212, in iterate
    self.archive.extend(self.population)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 805, in extend
    self.append(solution)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 801, in append
    self.add(solution)
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 979, in add
    flags = [self._dominance.compare(solution, s) for s in self._contents]
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 979, in <listcomp>
    flags = [self._dominance.compare(solution, s) for s in self._contents]
  File "/home/palokbiswas/.local/lib/python3.9/site-packages/platypus/core.py", line 713, in compare
    i1 = math.floor(o1 / epsilon)
ValueError: cannot convert float NaN to integer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scratch/palokbiswas/Repo/JUSTICE/hpc_run.py", line 15, in <module>
    run_optimization_adaptive(n_rbfs=4, n_inputs=2, nfe=nfe, swf=swf, seed=seed)
  File "/scratch/palokbiswas/Repo/JUSTICE/analysis/analyzer.py", line 213, in run_optimization_adaptive
    results = evaluator.optimize(
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/evaluators.py", line 109, in __exit__
    self.finalize()
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/util/ema_logging.py", line 153, in wrapper
    res = func(*args, **kwargs)
  File "/scratch/palokbiswas/Repo/JUSTICE/src/ema-workbench/ema_workbench/em_framework/futures_mpi.py", line 213, in finalize
    self.logwatcher_thread.join(timeout=60)
AttributeError: 'MPIEvaluator' object has no attribute 'logwatcher_thread'

 87%|████████████████████████▎   | 130393/150000 [84:26:28<12:41:50,  2.33s/it]
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[29705,1],0]
  Exit code:    1
--------------------------------------------------------------------------
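
The second traceback shows cleanup code failing while the original ValueError was still propagating, so the AttributeError is what ends up at the bottom of the log. The following is a minimal sketch of that pattern and of one defensive way to write the cleanup. `SketchEvaluator`, its `start_log_thread` flag, and `run_failing_job` are hypothetical names for illustration; this is not the EMA Workbench implementation, and why the attribute was missing in the real run is not established here.

```python
import threading


class SketchEvaluator:
    """Hypothetical stand-in for an evaluator used as a context manager."""

    def __init__(self, start_log_thread):
        self.start_log_thread = start_log_thread

    def __enter__(self):
        # Assumption for this sketch: the thread attribute is only
        # assigned on one code path, so it may not exist at cleanup time.
        if self.start_log_thread:
            self.logwatcher_thread = threading.Thread(target=lambda: None)
            self.logwatcher_thread.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.finalize()
        return False  # do not swallow the exception raised in the with-block

    def finalize(self):
        # getattr with a default avoids an AttributeError when the
        # thread was never created, so the original error stays visible.
        thread = getattr(self, "logwatcher_thread", None)
        if thread is not None:
            thread.join(timeout=60)


def run_failing_job():
    """Raise inside the with-block, as the optimization did in the log."""
    with SketchEvaluator(start_log_thread=False):
        raise ValueError("cannot convert float NaN to integer")
```

With the guarded `finalize`, calling `run_failing_job()` surfaces only the ValueError, which is the error that actually needs investigating.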
@EwoutH
Collaborator

EwoutH commented Sep 11, 2024

Thanks for reporting this potential issue. Could you:

  1. Update to the latest release of the EMAworkbench, 2.5.2
  2. If the issue persists, create a Minimal, Reproducible Example

Edit: A run crashing after 84 hours on an HPC, I feel you. Could you salvage any results?

@quaquel maybe we should add checkpoint functionality that saves results every nth iteration or every n% of runs.

@quaquel
Owner

quaquel commented Sep 11, 2024

Checkpointing and restarts are indeed urgently needed.
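
As a rough illustration of what checkpointing with restart could look like, here is a generic sketch that saves state every `every` iterations and resumes from the last saved state. All names (`save_checkpoint`, `load_checkpoint`, `run_with_checkpoints`, the JSON state layout) are assumptions made for this example; none of them are existing EMA Workbench or Platypus APIs.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the finished temp file into place in one step.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def run_with_checkpoints(step, n_iterations, path, every=10):
    """Call step(state) until n_iterations is reached, saving periodically."""
    state = load_checkpoint(path) or {"iteration": 0, "results": []}
    while state["iteration"] < n_iterations:
        state["results"].append(step(state))
        state["iteration"] += 1
        if state["iteration"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # always persist the final state
    return state
```

Calling `run_with_checkpoints` again with the same `path` picks up at the saved iteration count instead of starting from zero, so a crash costs at most `every` iterations of work.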

@pollockDeVis
Author

The error is random, and it becomes more likely when --ntasks in the HPC job requests more than 50 cores. I have since run a few more jobs with the same experiment, and they worked. At 60 cores, the error starts to appear more often.

@quaquel
Owner

quaquel commented Sep 17, 2024

It's strange. The error seems to occur within Platypus, so it should not be related to the number of cores or the nature of the parallelization. At most, it might relate to the total number of function evaluations.
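
The failing line in the first traceback, `math.floor(o1 / epsilon)`, raises this exact ValueError whenever the objective value is NaN, regardless of how the evaluations were parallelized. A small sketch of that behaviour follows, together with a check that could be run on objective values before they reach the archive. The helper names `epsilon_box_index` and `find_nan_objectives` are hypothetical, and where a NaN objective would come from in this model is not known from the log.

```python
import math


def epsilon_box_index(objective, epsilon):
    """Mimic the failing line in the traceback: floor(objective / epsilon)."""
    return math.floor(objective / epsilon)


def find_nan_objectives(objectives):
    """Return the positions of NaN values in a sequence of objective values."""
    return [i for i, value in enumerate(objectives) if math.isnan(value)]


def demo():
    """Show that a NaN objective triggers the ValueError from the log."""
    try:
        epsilon_box_index(float("nan"), 0.1)
    except ValueError as error:
        return str(error)
    return None
```

Logging the decision variables of any solution flagged by `find_nan_objectives` would be one way to narrow down which inputs make the model return NaN.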
