Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[State API] Improve task api #31847

Merged
merged 7 commits into from
Jan 24, 2023
Merged

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Jan 23, 2023

Why are these changes needed?

  • Add worker id & pg id to the task state
  • Add pg id to the actor state
  • Add start / end time for worker state
  • Add start / end time for node state

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
@rkooo567 rkooo567 changed the title [WIP] Improve task api [State API] Improve task api Jan 23, 2023
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Only question is on the gcs worker manager refactoring.

for (auto &listener : worker_dead_listeners_) {
listener(worker_failure_data);
}
GetWorkerInfo(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes mainly on the refactoring of the GetWorkerInfo funcion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. But also there's a new change that we get WorkerData before we update the failureData. Previously, we overwrote the whole data. Now we merge them.

@@ -416,6 +418,12 @@ class NodeState(StateSchema):
node_name: str = state_column(filterable=True)
#: The total resources of the node.
resources_total: dict = state_column(filterable=False)
#: The time when the node (raylet) starts.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we print out to users on CLI, do you think if we should convert all the timestamps to human readable values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should. I think it is same for task events.

I think we need a way to distinguish API return value / display values. Maybe you can consider adding a task to fix this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I guess we could do the formatting upon printing? If I use list_tasks programmatically, I might still have the value as ms, but I think we should make it readable for display.
Tracked here: #31876

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah. I think we can either process that from the CLI, or we can have something like view_callback for each field (which is invoked when it is printed)

@rickyyx rickyyx added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 23, 2023
@rkooo567
Copy link
Contributor Author

rkooo567 commented Jan 23, 2023

cc @iycheng for the worker manager

Previously, we overwrote the whole WorkerTableData with the failure data which has some data loss. I changed it to merge from the original worker data instead when the failure is reported

Signed-off-by: SangBin Cho <rkooo567@gmail.com>
Signed-off-by: SangBin Cho <rkooo567@gmail.com>
@rkooo567
Copy link
Contributor Author

also @scv119 for proto changes

Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@rkooo567 rkooo567 merged commit c464207 into ray-project:master Jan 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants