Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the starting time of the goal expiration timeout #1121

Conversation

Barry-Xu-2018
Copy link
Contributor

Address one issue mentioned #1103

The expire timeout is being calculated from the time the goal was accepted. According to the design https://design.ros2.org/articles/actions.html#result-caching, it should be calculated from the time the goal is completed.

@Barry-Xu-2018
Copy link
Contributor Author

This patch doesn't introduce new interfaces. As a result, the current modification is not very clear for setting goal terminal timestamp.
To achieve a clearer implementation, I am considering adding a new interface to set the goal terminal timestamp for a specific goal handle. However, this would require modifications in both rclcpp and rclpy.
Would you please share your opinions?

Copy link

@SteveMacenski SteveMacenski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is goal_handle->impl->goal_terminal_timestamp set initially?

If my understanding of your code is correct (I don't live in rcl) when calling _recalculate_expire_timer, which I assume is periodically called during an action server's normal operations, we reset to now the clock ticking down when its expired.

I don't understand the goal_terminal_timestamp == INT64_MAX condition then --> shouldn't we always be resetting the clock if the action is still active? That would make the most sense to me; while the goal is still actively being processed, we stave off the timeout to remove it by another full timeout length. Then when the last update / sending of the result comes so we don't call it anymore, the full countdown is considered before removal.

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i have a few comments to confirm.

rcl_action/include/rcl_action/goal_handle.h Outdated Show resolved Hide resolved
rcl_action/include/rcl_action/goal_handle.h Outdated Show resolved Hide resolved
rcl_action/src/rcl_action/goal_handle.c Outdated Show resolved Hide resolved
@Barry-Xu-2018
Copy link
Contributor Author

@SteveMacenski

How is goal_handle->impl->goal_terminal_timestamp set initially?

If my understanding of your code is correct (I don't live in rcl) when calling _recalculate_expire_timer, which I assume is periodically called during an action server's normal operations, we reset to now the clock ticking down when its expired.

I don't understand the goal_terminal_timestamp == INT64_MAX condition then --> shouldn't we always be resetting the clock if the action is still active? That would make the most sense to me; while the goal is still actively being processed, we stave off the timeout to remove it by another full timeout length. Then when the last update / sending of the result comes so we don't call it anymore, the full countdown is considered before removal.

First, let me explain when _recalculate_expire_timer() is called.
Currently, there are two scenarios in which this function is called.

  1. While goal state is changed to terminal state (success or cancel or abort), rcl_action_notify_goal_done() will be called. And _recalculate_expire_timer() will be called in rcl_action_notify_goal_done() for setting a timer for the next expiring goal.

  2. While timer for goal expire is triggered, rcl_action_expire_goals() will be called. And _recalculate_expire_timer() will be called for setting a timer for the next expiring goal.

I use scenario 1 to set goal_terminal_timestamp.
In order to avoid scenario 2, the condition to set goal_terminal_timestamp include

  • The state of goal should be one of success or cancel and abort.

    So the code to set goal_terminal_timestamp should be in this condition. This can avoid wrong setting goal_terminal_timestamp in scenario 2.

if (!rcl_action_goal_handle_is_active(goal_handle)) {

  • goal_terminal_timestamp isn't set before.

    I want to use INT64_MAX as invalid value (unset state).
    So, only if goal_terminal_timestamp is INT64_MAX, goal_terminal_timestamp can be updated to current time.
    But invalid value is unsuitable (From Fujita-san' review comment). I will change it to 0.

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, i will try CI.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@SteveMacenski
Copy link

SteveMacenski commented Dec 7, 2023

setting a timer for the next expiring goal.

In this case, why is there any timer still set for expiring the goals? That seems like the root cause of the problem; our goals are being expired before they're done due to a terminal condition if they're long-running.

In this design, that seems to me should be completely removed and this check in terminal states (success or cancel or abort) is all that should be required. This is where the timer to count down the expiration of the goal handle should be initiated.

Alternatively, the timer for expiring the next goal should be adjusted during action server operations from checking to resetting the clock to the full timeout. That way the timeout would only trigger when its become stale. I think the first way is better though. That removes alot of unnecessary complication that would be removed with starting the countdown where you've made changes alone. God knows simplifying actions would reduce the likelihood of future issues

@fujitatomoya
Copy link
Collaborator

@Barry-Xu-2018 can you rebase your development branch? it is failing build https://ci.ros2.org/job/ci_linux/19950/console since it does not include some changes in the mainline. Barry-Xu-2018/rcl@review/topic-use-goal-terminal-timestamp-to-expire-check...ros2:rcl:rolling

@Barry-Xu-2018 Barry-Xu-2018 force-pushed the review/topic-use-goal-terminal-timestamp-to-expire-check branch from 62dd16f to 0a87c6d Compare December 8, 2023 01:15
@Barry-Xu-2018
Copy link
Contributor Author

can you rebase your development branch? it is failing build https://ci.ros2.org/job/ci_linux/19950/console since it does not include some changes in the mainline. Barry-Xu-2018/rcl@review/topic-use-goal-terminal-timestamp-to-expire-check...ros2:rcl:rolling

Rebase was done. @fujitatomoya

@Barry-Xu-2018
Copy link
Contributor Author

@SteveMacenski

setting a timer for the next expiring goal.

In this case, why is there any timer still set for expiring the goals? That seems like the root cause of the problem; our goals are being expired before they're done due to a terminal condition if they're long-running.

Sorry, I'm not quite clear about the scenarios you described. Could you please describe them again?
The setting of the expire timer begins when a goal enters the terminal state. It is not related to how long the goal has been running.

#1121 (comment)
In the two scenarios described in my comment, it may be necessary to reset the expire timer:

  • When the state of a goal enters the terminal state, if the expire timer has not started yet (e.g., for the first terminal goal, or if the previous expire timer was triggered earlier but there was no subsequent expire time, so the timer was canceled), then the expire timer should be started. If the expire timer already started, nothing need to be done.

  • When the expire timer is triggered, at this point, we should calculate the nearest upcoming expire time and set the expire timer accordingly.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@Barry-Xu-2018
Copy link
Contributor Author

There is a failure in windows test.
projectroot.src.core.ddsc.tests.CUnit_ddsc_whc_check_end_state
After checking the log, it seems that the failure is unrelated to this PR.

@fujitatomoya
Copy link
Collaborator

@sloretz @iuhilnehc-ynos either of you, can you take a look at this? i would like to have another review on this.

@SteveMacenski
Copy link

Your two situations this method is called:

While goal state is changed to terminal state (success or cancel or abort), rcl_action_notify_goal_done() will be called. And _recalculate_expire_timer() will be called in rcl_action_notify_goal_done() for setting a timer for the next expiring goal.

I'll shorthand this as achieving a terminal state

While timer for goal expire is triggered, rcl_action_expire_goals() will be called. And _recalculate_expire_timer() will be called for setting a timer for the next expiring goal.

I'll shorthand the timer for goal expiring as "the ticking clock"


My assertion is that you found a clever place to make modifications to the goal handle's state: at achieving the terminal state. We know that once we achieve this state, the goal is done processing and anytime before this point, the goal is still processing (or was dropped on the floor from a users' crappy application, which no matter what is going to cause problems).

If we have a place (e.g. here) we can modify a ticking clock, that calls into question the need for the ticking clock prior to this point at all. The issue we want to solve is that ticking clock expiring still-processing goals -- thus why even have the ticking clock have the ability to expire still-processing goals? I'd argue that it shouldn't even be possible. We should be waiting until the terminal state to trigger the clock to start its countdown until the handle / result are deleted. Of which, the default of 10s or what-have-you is totally appropriate and fully solves the problem.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Dec 11, 2023

@SteveMacenski

Thank you for the further explanation.

If we have a place (e.g. here) we can modify a ticking clock, that calls into question the need for the ticking clock prior to this point at all. The issue we want to solve is that ticking clock expiring still-processing goals -- thus why even have the ticking clock have the ability to expire still-processing goals? I'd argue that it shouldn't even be possible. We should be waiting until the terminal state to trigger the clock to start its countdown until the handle / result are deleted. Of which, the default of 10s or what-have-you is totally appropriate and fully solves the problem.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

Your current question revolves around "the ticking clock", but I haven't fully caught your idea yet.
To clarify, I use an example for the following discussion.

image

For "the ticking clock", while the timer is triggered (At 2 in above diagram), the next expire time need to be calculated.

It also comes with the added benefit of simplifying the action logic since there (1) doesn't need to be a ticking clock checked while the goal is processing and (2) the modification you require don't need to reset it, it merely needs to start it

There are currently two time points where "the ticking clock" is checked. One is when a goal enters the termination state, and the other is when the ticking clock timer is triggered. When you mentioned "the goal is processing," in what situation does that occur?

If expire timer has been run, a goal enters terminal state to call _recalculate_expire_timer(). At this case, calculating and setting the timer is an unnecessary operation.
While the expire timer is triggered (At 2 in above diagram), I think calculating and setting the timer is a necessary operation.

@SteveMacenski
Copy link

SteveMacenski commented Dec 11, 2023

When you mentioned "the goal is processing," in what situation does that occur?

That's a question you're most suited to answer than me. With the current behavior before this PR, where was the goal being expired resulting in the bug that this is meant to address (which was expiring still-processing goals causing issues)? That was happening which triggered this discussion. My suggestion is removing entirely where-ever that is happening if we're able to start the count down to removing a goal at a terminal state. You've shown the part of the code where that terminal state is found to be able to reset a timer. Instead, we could just start it here and remove the previous problematic countdown that previously caused us issues.

I ack that perhaps I'm missing something and that my suggestion isn't fully informed with the ins-and-outs of rcl. What I'm trying to suggest at its core is this, regardless of the specific details:

  • We previously had a problem where goals were expiring mid-execution and we weren't getting the results back because they were no longer in existence. That required us in Nav2 to take somewhat drastic measures to expose this not-well-documented timeout to users as a temporary work around.
  • You've identified a part of the code where we know the goal has just reached its terminal state
  • Thus: my suggestion is to completely rip out the existing logic that caused the bug in the first place with a timeout existing at all during normal goal processing -- and replace it with logic in the goal-terminal-state-reached section of the code to start the timeout clock before expiring the goal's information.

To my eye, this PR resets some timeouts, but still leaves the original timeout in place that caused the problem in the past - which could still cause a problem moving forward as well if you have a timeout of 10s but an action length of 10 minutes (which, we have).

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Dec 12, 2023

@SteveMacenski

Thank you for detailed explanation.
I understand the concerns you mentioned.
Before this patch, when a goal handle is created, this time point is already recorded as the starting time for expire calculation. Of course, the expire timer is not set at this time. When the goal enters the terminal state, this time point is used to calculate and set the time on the expire timer. In this scenario, if the time it takes to process the goal is longer than the set timeout, when the goal enters the terminal state, the result will not be retained for the timeout duration; instead, it will be immediately removed.

This PR ensures that the current time is recorded as the start time for expiration only when the goal enters the terminal state. The expire timer is set based on this time. Therefore, with the current modifications, it is impossible for the goal to expire while being processed. In rcl_action_expire_goals(), it will get time of goal termination by calling rcl_action_goal_handle_get_goal_terminal_timestamp() to check if expiration occurs.

@SteveMacenski
Copy link

SteveMacenski commented Dec 12, 2023

OK, got it. That makes sense, you can ignore my comments then :-) This is a very important fix for us in Nav2 so I wanted to make sure beyond the shadow of my ignorance that this was going to fully resolve the problem 😉 I very much appreciate your time here!

@fujitatomoya
Copy link
Collaborator

@Barry-Xu-2018 since you added the fix from last CI, i started the full CI.

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@SteveMacenski thank you for taking your time for the review!

@fujitatomoya
Copy link
Collaborator

  • Linux-aarch64 Build Status

@SteveMacenski
Copy link

What's the good word?

@fujitatomoya
Copy link
Collaborator

we would like to have 2nd review on this.

CC: @wjwwood @clalancette @alsora @mjcarroll

Signed-off-by: Barry Xu <barry.xu@sony.com>
rcl_action/src/rcl_action/action_server.c Show resolved Hide resolved
rcl_action/src/rcl_action/action_server.c Show resolved Hide resolved
rcl_action/src/rcl_action/action_server.c Show resolved Hide resolved
rcl_action/src/rcl_action/action_server.c Outdated Show resolved Hide resolved
rcl_action/src/rcl_action/action_server.c Outdated Show resolved Hide resolved
Signed-off-by: Barry Xu <barry.xu@sony.com>
Signed-off-by: Barry Xu <barry.xu@sony.com>
@Barry-Xu-2018
Copy link
Contributor Author

@jmachowinski Thanks a lot for your review. I've updated the code based on your comments.

@fujitatomoya
Copy link
Collaborator

@jmachowinski @alsora thanks for your reviews, i will start the CI.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Barry-Xu-2018
Copy link
Contributor Author

I will check failed test cases.

@Barry-Xu-2018
Copy link
Contributor Author

Commit a07e119 is wrong.
Removed code is useful. It filled goal info for expired_goals.
I will do more testing before running the CI.

This reverts commit a07e119.

Signed-off-by: Barry Xu <barry.xu@sony.com>
@Barry-Xu-2018
Copy link
Contributor Author

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@alsora
Copy link
Contributor

alsora commented Jul 6, 2024

@Barry-Xu-2018 please consider adding some comments around that code that we thought was not needed but broke the tests when removed, so that in the future we will immediately know why we need it.
Besides that, the PR looks good to me.

@Barry-Xu-2018
Copy link
Contributor Author

@alsora I added comments and optimized the code.
Please take a look 03555f5.

rcl/include/rcl/types.h Outdated Show resolved Hide resolved
rcl_action/src/rcl_action/goal_handle.c Outdated Show resolved Hide resolved
rcl_action/src/rcl_action/goal_handle.c Outdated Show resolved Hide resolved
Signed-off-by: Barry Xu <barry.xu@sony.com>
@Barry-Xu-2018 Barry-Xu-2018 force-pushed the review/topic-use-goal-terminal-timestamp-to-expire-check branch from 3ac0079 to 959f39b Compare July 10, 2024 07:07
@fujitatomoya
Copy link
Collaborator

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Barry-Xu-2018
Copy link
Contributor Author

The all unstable cases are related to this commit ros2/message_filters@5551518
It introduces warning macro in header files. But test cases all pass.

For Windows, it is related to this PR.
This change lead to the below error. It is failed to build point_cloud_transport on Windows platform.

07:34:22 C:/ci/ws/install/include/message_filters\message_filters/simple_filter.h(18,1): fatal error C1021: invalid preprocessor command 'warning' (compiling source file C:\ci\ws\src\ros-perception\point_cloud_transport\point_cloud_transport\src\subscriber_filter.cpp) [C:\ci\ws\build\point_cloud_transport\point_cloud_transport.vcxproj]

@Barry-Xu-2018
Copy link
Contributor Author

Re-CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@Barry-Xu-2018
Copy link
Contributor Author

  • Windows Build Status

@Barry-Xu-2018
Copy link
Contributor Author

All CI tests passed.

@fujitatomoya fujitatomoya merged commit 30a047a into ros2:rolling Jul 29, 2024
3 checks passed
@SteveMacenski
Copy link

Will this be backported to Jazzy/Humble?

@fujitatomoya
Copy link
Collaborator

I believe adding new function in C does not break ABI compatibility, so i say we can backport this to jazzy, iron and humble.

@clalancette @alsora what do you think?

@SteveMacenski
Copy link

SteveMacenski commented Aug 1, 2024

This is really great news, I know there's alot of interest in Nav2 users for the backport to remove that timeout! From my review as well, it looks backportable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants