[BUG] RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists. #2698

Closed
KumoLiu opened this issue Jul 14, 2024 · 4 comments · Fixed by Project-MONAI/MONAI#7916

Contributor

KumoLiu commented Jul 14, 2024

2024-07-14 10:02:17,649 - INFO - Load site-1 weights...
2024-07-14 10:02:17,652 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,654 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,661 - Communicator - INFO - Received from secure_project server. getTask: train size: 19.3MB (19280090 Bytes) time: 0.301346 seconds
2024-07-14 10:02:17,661 - FederatedClient - INFO - pull_task completed. Task name:train Status:True 
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781]: got task assignment: name=train, id=00b8bb4c-1fbd-421d-81b3-19472481fd48
2024-07-14 10:02:17,661 - ClientRunner - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: invoking task executor ClientAlgoExecutor
2024-07-14 10:02:17,662 - INFO - Start site-1 evaluating...
2024-07-14 10:02:17,662 - ClientAlgoExecutor - INFO - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: Client trainer got task: train
2024-07-14 10:02:17,662 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,662 - INFO - Load site-2 weights...
2024-07-14 10:02:17,664 - INFO - Converted 148 global variables to match 148 local variables.
2024-07-14 10:02:17,665 - INFO - 'dst' model updated: 148 of 148 variables.
2024-07-14 10:02:17,672 - INFO - Start site-2 evaluating...
2024-07-14 10:02:17,672 - ignite.engine.engine.SupervisedEvaluator - INFO - Engine run resuming from iteration 0, epoch 0 until 1 epochs
2024-07-14 10:02:17,743 - ignite.engine.engine.SupervisedEvaluator - ERROR - Engine run is terminating due to exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,744 - ERROR - Exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - [identity=site-2, run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, peer=secure_project, peer_run=d6a9f1a8-b7cd-46af-8c9b-f97e4e9d3781, task_name=train, task_id=00b8bb4c-1fbd-421d-81b3-19472481fd48]: client_algo execute exception: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.
2024-07-14 10:02:17,745 - ClientAlgoExecutor - ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 114, in execute
    return self.train(shareable, fl_ctx, abort_signal)
  File "/usr/local/lib/python3.10/dist-packages/monai_nvflare/client_algo_executor.py", line 132, in train
    test_report = self.client_algo.evaluate(exchangeobj_from_shareable(shareable))
  File "/usr/local/lib/python3.10/dist-packages/monai/fl/client/monai_algo.py", line 664, in evaluate
    self.evaluator.run(self.trainer.state.epoch + 1)
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/evaluator.py", line 150, in run
    super().run()
  File "/usr/local/lib/python3.10/dist-packages/monai/engines/workflow.py", line 283, in run
    super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 892, in run
    return self._internal_run()
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 935, in _internal_run
    return next(self._internal_run_generator)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 993, in _internal_run_as_gen
    self._handle_exception(e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 636, in _handle_exception
    self._fire_event(Events.EXCEPTION_RAISED, e)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/stats_handler.py", line 202, in exception_raised
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 946, in _internal_run_as_gen
    self._fire_event(Events.STARTED)
  File "/usr/local/lib/python3.10/dist-packages/ignite/engine/engine.py", line 425, in _fire_event
    func(*first, *(event_args + others), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 209, in start
    self._set_experiment()
  File "/usr/local/lib/python3.10/dist-packages/monai/handlers/mlflow_handler.py", line 241, in _set_experiment
    experiment_id = self.client.create_experiment(self.experiment_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/client.py", line 599, in create_experiment
    return self._tracking_client.create_experiment(name, artifact_location, tags)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/tracking/_tracking_service/client.py", line 251, in create_experiment
    return self.store.create_experiment(
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 101, in create_experiment
    response_proto = self._call_endpoint(CreateExperiment, req_body)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/rest_store.py", line 60, in _call_endpoint
    return call_endpoint(self.get_host_creds(), endpoint, method, json_body, response_proto)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 290, in call_endpoint
    response = verify_rest_response(response, endpoint)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/rest_utils.py", line 173, in verify_rest_response
    raise RestException(json.loads(response.text))
mlflow.exceptions.RestException: RESOURCE_ALREADY_EXISTS: Experiment 'monai_nvflare' already exists.

It looks like there is an issue when running the MONAI real-world example.
When site-2 starts evaluating, the `monai_nvflare` experiment already exists, so `create_experiment` fails. The handler should handle this case.
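
A minimal sketch of the suspected race, assuming the handler looks up the experiment and creates it only when the lookup finds nothing (the traceback shows only the `create_experiment` call). `FakeTrackingServer` and `AlreadyExists` are hypothetical in-memory stand-ins for the shared MLflow server and its `RestException`, not MLflow APIs; the barrier forces both sites to finish the lookup before either one creates.

```python
import threading


class AlreadyExists(Exception):
    """Stand-in for MLflow's RESOURCE_ALREADY_EXISTS error (hypothetical)."""


class FakeTrackingServer:
    """In-memory stand-in for a tracking server shared by all sites (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._experiments = {}  # experiment name -> experiment ID

    def get_experiment_by_name(self, name):
        with self._lock:
            return self._experiments.get(name)

    def create_experiment(self, name):
        with self._lock:
            if name in self._experiments:
                raise AlreadyExists(
                    f"RESOURCE_ALREADY_EXISTS: Experiment '{name}' already exists."
                )
            self._experiments[name] = str(len(self._experiments) + 1)
            return self._experiments[name]


def run_two_sites(name="monai_nvflare"):
    """Run the check-then-create sequence on two sites at once and report each outcome."""
    server = FakeTrackingServer()
    both_checked = threading.Barrier(2)
    outcomes = {}

    def site(site_name):
        exists = server.get_experiment_by_name(name) is not None
        both_checked.wait()  # both lookups complete before either create starts
        try:
            if not exists:
                server.create_experiment(name)
            outcomes[site_name] = "ok"
        except AlreadyExists as e:
            outcomes[site_name] = str(e)

    threads = [threading.Thread(target=site, args=(s,)) for s in ("site-1", "site-2")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Calling `run_two_sites()` leaves one site with `"ok"` and the other with the `RESOURCE_ALREADY_EXISTS` message; which site loses depends on thread scheduling.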

@KumoLiu KumoLiu added the bug Something isn't working label Jul 14, 2024
Contributor Author

KumoLiu commented Jul 14, 2024

cc @YuanTingHsieh @holgerroth

Contributor Author

KumoLiu commented Jul 14, 2024

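I tried adding a try-except block with retries to the `_set_experiment` method in the MLflow handler, but I'm not sure it achieves the desired behavior.

    # Needs `import time` and `from mlflow.exceptions import MlflowException` at module level.
    def _set_experiment(self):
        experiment = self.experiment
        if not experiment:
            for attempt in range(3):
                try:
                    # Look up first; create only if no site has created it yet.
                    experiment = self.client.get_experiment_by_name(self.experiment_name)
                    if not experiment:
                        experiment_id = self.client.create_experiment(self.experiment_name)
                        experiment = self.client.get_experiment(experiment_id)
                    break
                except MlflowException as e:
                    # Re-raise unrelated errors, and give up after the last attempt.
                    if "RESOURCE_ALREADY_EXISTS" not in str(e) or attempt == 2:
                        raise
                    # Another site created it first; wait, then the next lookup should find it.
                    time.sleep(self.retry_delay)  # retry_delay: assumed attribute, in seconds
        # Keep the result so later calls skip the lookup.
        self.experiment = experiment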

Collaborator

YuanTingHsieh commented Jul 16, 2024

@KumoLiu how about we add a line asking people to create this experiment first?

Something like one line of code that uses MLflow to create the experiment?
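
One way to read this suggestion, as a sketch rather than the project's documented setup step: run a small helper once, before the job is submitted, so that every site finds the experiment already in place. `ensure_experiment` is a hypothetical helper; it relies only on `get_experiment_by_name` and `create_experiment`, the two client calls that appear in this thread, and the tracking URI in the comment is a placeholder.

```python
def ensure_experiment(client, name):
    """Return the ID of experiment `name`, creating it if it does not exist yet.

    `client` is expected to behave like mlflow.tracking.MlflowClient.
    Intended to be run once, by a single process, before any site starts.
    """
    existing = client.get_experiment_by_name(name)
    if existing is not None:
        return existing.experiment_id
    return client.create_experiment(name)


# Example usage (requires the mlflow package; the URI is a placeholder):
#
#   from mlflow.tracking import MlflowClient
#   ensure_experiment(MlflowClient(tracking_uri="http://localhost:5000"), "monai_nvflare")
```

This avoids the error only if it really runs before the sites start; it does not protect two sites that create the experiment concurrently.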

Contributor Author

KumoLiu commented Jul 16, 2024

> @KumoLiu how about we add a line asking people to create this experiment first?
>
> Something like one line of code that uses MLflow to create the experiment?

Hi @YuanTingHsieh, thanks for the suggestion! The `MLFlowHandler` is included in the bundle, and the issue is that when two sites try to create the experiment at the same time, one of the calls throws this error. One possible solution is to catch the error when creating the experiment. What do you think?
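
The try/except idea can also be sketched as "create first, fall back to a lookup", which needs no retry delay. This is an illustration under assumptions, not the change that was merged: `FakeClient`, `AlreadyExists`, and `get_or_create_experiment` are hypothetical names, and with real MLflow the `except` clause would catch `MlflowException` and check for `RESOURCE_ALREADY_EXISTS` in the message, as the snippet earlier in this thread does.

```python
from types import SimpleNamespace


class AlreadyExists(Exception):
    """Stand-in for MLflow's RESOURCE_ALREADY_EXISTS error (hypothetical)."""


class FakeClient:
    """In-memory stand-in for a tracking client; all instances share one store."""

    _store = {}  # experiment name -> experiment ID, shared like a remote server

    def create_experiment(self, name):
        if name in self._store:
            raise AlreadyExists(
                f"RESOURCE_ALREADY_EXISTS: Experiment '{name}' already exists."
            )
        self._store[name] = str(len(self._store) + 1)
        return self._store[name]

    def get_experiment_by_name(self, name):
        if name not in self._store:
            return None
        return SimpleNamespace(name=name, experiment_id=self._store[name])


def get_or_create_experiment(client, name):
    """Try to create the experiment; if another site won the race, look it up."""
    try:
        return client.create_experiment(name)
    except AlreadyExists:
        return client.get_experiment_by_name(name).experiment_id


# Two sites, each with its own client, ask for the same experiment.
site_1_id = get_or_create_experiment(FakeClient(), "monai_nvflare")  # creates it
site_2_id = get_or_create_experiment(FakeClient(), "monai_nvflare")  # falls back to lookup
```

Both calls return the same experiment ID, and the order in which the sites arrive no longer matters.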

KumoLiu added a commit to Project-MONAI/MONAI that referenced this issue Jul 18, 2024
Try to fix NVIDIA/NVFlare#2698.


### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not
applicable items -->
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing
functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.
- [ ] Quick tests passed locally by running `./runtests.sh --quick
--unittests --disttests`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated, tested `make html` command in the `docs/`
folder.

---------

Signed-off-by: YunLiu <55491388+KumoLiu@users.noreply.github.com>
Co-authored-by: Eric Kerfoot <17726042+ericspod@users.noreply.github.com>