Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ExperimentData handling of jobs and analysis #599

Merged
merged 7 commits into from
Jan 20, 2022

Conversation

chriseclectic
Copy link
Collaborator

@chriseclectic chriseclectic commented Jan 7, 2022

Summary

Depends on #596
Fixes #573, #592

This reworks how analysis callbacks are run to use separate Futures for each callback and improves handling of futures for jobs and callbacks in ExperimentData.

Details and comments

This also adds some convenience methods to ExperimentData:

  • jobs: returns a list of all jobs added to the experiment. This can be used to monitor specific jobs for example.
  • job_status returns the status of all job execution. This is similar to existing status, but will return DONE if all jobs are finished regardless of current callback status.
  • callback_status returns the status of callback execution. This will be QUEUED if callbacks are still waiting for jobs to finish running, RUNNING if at least one callback is still running, CANCELLED or ERROR if any callbacks were cancelled or had an error, or DONE if all callbacks have finished successfully.
  • cancel_callbacks cancels any queued callback futures that have not started running yet. Since callbacks run on a single-worker thread pool executor only the currently running callback future cannot be cancelled.
  • cancel cancel all jobs and callbacks.
  • callback_errors returns string of any errors encountered during callback execution

Other methods are updated:

  • status: Returns "EMPTY" instead of "DONE" if no jobs, data, or callbacks are contained in the experiment data.
  • errors has improved reporting for job and callback errors encountered.

@chriseclectic chriseclectic added the Changelog: New Feature Include in the "Added" section of the changelog label Jan 7, 2022
@chriseclectic chriseclectic changed the title Improve ExperimentData handling of job futures Improve ExperimentData handling of callbacks and futures Jan 10, 2022
@chriseclectic chriseclectic added the Changelog: Bugfix Include in the "Fixed" section of the changelog label Jan 10, 2022
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Halfway through the review. Feel free to start update. I'll review rest of changes tomorrow.

Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chris this PR looks awesome. The logic looks really clean and easy to read. Do you think you can write unittest for #573 before closing it (currently the test for status is too obvious since it just checks status that is manually overridden)? I think this test is bit tough since the event is hard to reproduce.

qiskit_experiments/database_service/db_experiment_data.py Outdated Show resolved Hide resolved
@@ -779,13 +833,6 @@ def _save_experiment_metadata(self) -> None:
LOG.warning("Experiment cannot be saved because backend is missing.")
return

with self._job_futures.lock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is removed because now _wait_for_analysis method will wait for job?

@@ -862,8 +909,8 @@ def save(self) -> None:

if self.verbose:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also analysis status should be done here (i.e. no error).


If the experiment consists of multiple jobs, the returned status is mapped
in the following order:
with self._analysis_futures.lock and self._job_futures.lock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to shutdown executor when not_done_* is empty?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The executor doesn't need to be shutdown, it can sit around forever waiting to make more futures if needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we expect the assigned resources in the analysis executor of each experiment data instance will be eliminated by garbage collection (we can come back to this later when we start to get memory issue as the size of parallel experiment scales)?

qiskit_experiments/database_service/db_experiment_data.py Outdated Show resolved Hide resolved
@chriseclectic chriseclectic changed the title Improve ExperimentData handling of callbacks and futures Improve ExperimentData handling of jobs and analysis Jan 14, 2022
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments for new commits. Previous updates are really nice.

)
# Add future for cancelling jobs that timeout
if timeout_ids:
self._job_executor.submit(self._timeout_running_jobs, timeout_ids, timeout)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I like this logic 💯


Args:
jobs: The Job or list of Jobs to add result data from.
timeout: Optional, time to wait for jobs to finish before
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add time unit for the value.

if waited.not_done:
LOG.debug("Cancelling running jobs that exceeded add_jobs timeout.")
done_ids = {fut.result()[0] for fut in waited.done}
notdone_ids = [jid for jid in job_ids if jid not in done_ids]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just minor comment; why we cannot simply write self.cancel_jobs([fut.result[0] for fut in waited.not_done])? another comment; do you think this can be simplified with as_complete function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not done futures won't have a result, the job id is only returned when the future finishes. This is why it can be extracted only from the done futures and the reason for the above work around to figure out the not-done job ids.

self._add_job_data(job)
else:
# Add job results asynchronously
self._add_job_future(job)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to retrieve with timeout and call add_job here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing to calling add_jobs would raise a bunch of warnings about existing job ids at the moment. It could be good to think about changing later so that backend+service can be extracted from job for expdata saved/loaded to JSON, since the backend/provider cant be serialized/deserialized, but that might be better to treat in the serialization PR.


If the experiment consists of multiple jobs, the returned status is mapped
in the following order:
with self._analysis_futures.lock and self._job_futures.lock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we expect the assigned resources in the analysis executor of each experiment data instance will be eliminated by garbage collection (we can come back to this later when we start to get memory issue as the size of parallel experiment scales)?

pending analysis callbacks. Note that analysis callbacks that have already
started running cannot be cancelled.
- |
Adds :meth:`.ExperimentData.cancel` to cancel both jobs and analysis.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need description for timeout arg in run method and and add_job method.

The ``timeout`` kwarg of :meth:`.ExperimentData.add_data` has been deprecated.
To timeout waiting for job results use the ``timeout`` kwarg with
:meth:`.ExperimentData.block_for_results` or
:meth:`.ExperimentData.analysis_results` instead.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also need deprecation of adding job to add_data method.

* Add `add_jobs` method to ExperimentData
* Add timeout functionality to add_jobs and Experiment.run that allows an experiment to set a timeout before cancelling all non-finished jobs
* Adds status enum as ExperimentData class variable for easier access for value comparisons
* This PR improves ExperimentData handling of qiskit job data and analysis callbacks to make querying job and analysis statuses and cancelling jobs or analysis processes more robust.
* Changes `ExperimentData.status` to return `ExperimentStatus enum value
* Adds `ExperimentData.job_status` method  and `JobStatus` enum return type
* Adds` ExperimentData.analysis_status` method and `AnalysisStatus enum return type.
* Adds `ExperimentData.cancel_jobs`, `ExperimentData.cancel_analysis` and `ExperimentData.cancel` methods for cancelling running jobs, queued analysis, or both.
* Adds `ExperimentData.add_jobs` method for adding job data. Deprecates adding job data via `add_data` method.
@chriseclectic chriseclectic added Changelog: API Change Include in the "Changed" section of the changelog Changelog: Deprecation Include in "Deprecated" section of changelog and removed Changelog: Bugfix Include in the "Fixed" section of the changelog labels Jan 19, 2022
Copy link
Collaborator

@nkanazawa1989 nkanazawa1989 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, This PR allows us to handle callback and status more safely. Thanks Chris!

@chriseclectic chriseclectic merged commit f229944 into qiskit-community:main Jan 20, 2022
@chriseclectic chriseclectic deleted the job-futures branch March 3, 2022 22:44
paco-ri pushed a commit to paco-ri/qiskit-experiments that referenced this pull request Jul 11, 2022
…y#599)

This PR improves ExperimentData handling of qiskit job data and analysis callbacks to make querying job and analysis statuses and cancelling jobs or analysis processes more robust.

It also includes some API changes to ExperimentData:
* Changes `ExperimentData.status` to return `ExperimentStatus enum value
* Adds `ExperimentData.job_status` method  and `JobStatus` enum return type
* Adds` ExperimentData.analysis_status` method and `AnalysisStatus enum return type.
* Adds `ExperimentData.cancel_jobs`, `ExperimentData.cancel_analysis` and `ExperimentData.cancel` methods for cancelling running jobs, queued analysis, or both.
* Adds `ExperimentData.add_jobs` method for adding job data. Deprecates adding job data via `add_data` method.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Changelog: API Change Include in the "Changed" section of the changelog Changelog: Deprecation Include in "Deprecated" section of changelog Changelog: New Feature Include in the "Added" section of the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ThreadSafe object doesn't work as expected
3 participants