
KeyError in ActionServer._execute_goal #1236

Open
rafael1193 opened this issue Feb 29, 2024 · 8 comments · May be fixed by #1363

Comments

@rafael1193

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • Binaries (.deb package)
  • Version or commit hash:
    • 3.3.12-1jammy.20240217.065750
  • DDS implementation:
    • Fast-RTPS (2.6.7-1jammy.20240217.034748)
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

  • Run an action server node with the use_sim_time parameter as True
  • Have a node emit clock messages faster than wall clock moves (at x60)
  • Call the action

Expected behavior

Action goal result is set without problem.

Actual behavior

The action server fails because self._result_futures[bytes(goal_uuid)].set_result(result_response) in ActionServer._execute_goal raises a KeyError:

Traceback (most recent call last):
  File "/home/rbailonr/adauv_ws/install/adauv_bluebox/lib/adauv_bluebox/vehicle.py", line 333, in main
    executor.spin()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 294, in spin
    self.spin_once()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 794, in spin_once
    self._spin_once_impl(timeout_sec)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/executors.py", line 791, in _spin_once_impl
    future.result()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 94, in result
    raise self.exception()
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/task.py", line 239, in __call__
    self._handler.send(None)
  File "/opt/ros/humble/local/lib/python3.10/dist-packages/rclpy/action/server.py", line 352, in _execute_goal
    self._result_futures[bytes(goal_uuid)].set_result(result_response)
KeyError: b'EM$\xc6\xb3YA:\xa9\xa1_\xd1\xeb\x88\x06\x88'

Additional information

The behavior occurs inconsistently, and only under accelerated time.

It looks like a synchronization issue with ActionServer._execute_expire_goals, which removes the goal result future from the _result_futures dictionary before _execute_goal sets its result. Checking not self._result_futures[goal_uuid].done() before removing the expired goal seems to mitigate the issue, but the actual source of the problem may lie elsewhere.
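The suspected interleaving can be reproduced outside rclpy with plain asyncio. A minimal sketch (all names here are illustrative, not rclpy's actual API): an "expiry" task deletes a goal's result future from a shared dict before the "execute" task sets its result, producing the same KeyError as in the traceback.

```python
import asyncio


async def main() -> list:
    result_futures = {}          # maps goal UUID bytes -> asyncio.Future
    goal_uuid = b"goal-1"
    result_futures[goal_uuid] = asyncio.get_running_loop().create_future()
    events = []

    async def expire():
        # Expiry fires first (accelerated sim time), removing the future.
        del result_futures[goal_uuid]
        events.append("expired")

    async def execute():
        await asyncio.sleep(0)   # yield once so expire() runs first
        try:
            result_futures[goal_uuid].set_result("done")
            events.append("result set")
        except KeyError:
            events.append("KeyError")

    await asyncio.gather(expire(), execute())
    return events


print(asyncio.run(main()))   # ['expired', 'KeyError']
```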

@Barry-Xu-2018
Contributor

This should not be a bug.
There is result caching in the action design (please refer to https://design.ros2.org/articles/actions.html).
Users can decide how long a goal result will be retained. However, retention is not guaranteed: a result may expire even if it has never been accessed.

result_timeout=900
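As a rough sanity check of the sim-time effect (assuming the 900-second result_timeout shown above and the 60x clock from the report): result_timeout is measured in ROS time, so a clock running 60x faster than wall time shrinks the wall-clock retention window 60-fold.

```python
result_timeout_s = 900      # result_timeout in seconds of ROS time
sim_time_factor = 60        # clock published 60x faster than wall clock

# Wall-clock seconds before the cached result expires under sim time.
wall_clock_retention_s = result_timeout_s / sim_time_factor
print(wall_clock_retention_s)   # 900 ROS-seconds elapse in 15 wall-seconds
```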

@fujitatomoya
Collaborator

Have a node emit clock messages faster than wall clock moves (at x60)

@Barry-Xu-2018 this will make the result expire (be removed) 60 times faster in the action server.
I think this situation can indeed happen, but raising a KeyError exception is not the designed behavior, is it?

I think we can check whether the key bytes(goal_uuid) exists in the _result_futures dictionary, and whether goal_handle has also been removed from _goal_handles. It could probably print a warning that says "Goal is already expired before completion". What do you think?
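A standalone sketch of the guard described above, using an illustrative helper and a plain dictionary rather than rclpy's actual internals: only set the result if the goal's future still exists, otherwise warn that the goal expired before completion.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("action_server")


def set_goal_result(result_futures: dict, goal_uuid: bytes, result) -> bool:
    """Return True if the result was delivered, False if the goal expired."""
    future = result_futures.get(goal_uuid)
    if future is None:
        # Goal already expired; warn instead of raising KeyError.
        logger.warning("Goal %r is already expired before completion", goal_uuid)
        return False
    future.set_result(result)
    return True
```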

@Barry-Xu-2018
Contributor

I think this situation can indeed happen, but raising a KeyError exception is not the designed behavior, is it?

Yes. Throwing an exception is not correct behavior in this scenario.

I think we can check whether the key bytes(goal_uuid) exists in the _result_futures dictionary, and whether goal_handle has also been removed from _goal_handles. It could probably print a warning that says "Goal is already expired before completion". What do you think?

Yes. I'll double-check the code logic to see if there are any other places that may need modification when this issue occurs.

@Barry-Xu-2018
Contributor

Barry-Xu-2018 commented Mar 1, 2024

    async def _execute_goal(self, execute_callback, goal_handle):
        goal_uuid = goal_handle.goal_id.uuid
        self._logger.debug('Executing goal with ID {0}'.format(goal_uuid))
        try:
            # Execute user callback
            execute_result = await await_or_execute(execute_callback, goal_handle)
        except Exception as ex:
            # Create an empty result so that we can still send a response to the client
            execute_result = self._action_type.Result()
            self._logger.error('Error raised in execute callback: {0}'.format(ex))
            traceback.print_exc()
        # If user did not trigger a terminal state, assume aborted
        if goal_handle.is_active:
            self._logger.warning(
                'Goal state not set, assuming aborted. Goal ID: {0}'.format(goal_uuid))
            goal_handle.abort()
        self._logger.debug(
            'Goal with ID {0} finished with state {1}'.format(goal_uuid, goal_handle.status))
        # Set result
        result_response = self._action_type.Impl.GetResultService.Response()
        result_response.status = goal_handle.status
        result_response.result = execute_result
        self._result_futures[bytes(goal_uuid)].set_result(result_response)

    async def _execute_expire_goals(self, expired_goals):
        for goal in expired_goals:
            goal_uuid = bytes(goal.goal_id.uuid)
            self._goal_handles[goal_uuid].destroy()
            del self._goal_handles[goal_uuid]
            self.remove_future(self._result_futures[goal_uuid])
            del self._result_futures[goal_uuid]

These two parts involve asynchronous operations, so to completely avoid the issue, a lock (asyncio.Lock) should be held when operating on self._result_futures.
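A minimal sketch of that locking idea with plain asyncio (illustrative names mirroring the rclpy code; not a patch to rclpy itself): both the execute path and the expiry path take the same asyncio.Lock before touching the shared futures dict, so expiry can never remove an entry while a result is being delivered.

```python
import asyncio


async def main() -> str:
    lock = asyncio.Lock()
    result_futures = {b"goal-1": asyncio.get_running_loop().create_future()}

    async def execute_goal(goal_uuid: bytes):
        async with lock:
            future = result_futures.get(goal_uuid)
            if future is not None:
                future.set_result("done")

    async def expire_goal(goal_uuid: bytes):
        async with lock:
            future = result_futures.get(goal_uuid)
            # Mirror the proposed mitigation: keep futures that are done.
            if future is not None and not future.done():
                del result_futures[goal_uuid]

    await asyncio.gather(execute_goal(b"goal-1"), expire_goal(b"goal-1"))
    return result_futures[b"goal-1"].result()


print(asyncio.run(main()))   # done
```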

@rafael1193
Author

Thank you very much for your help.

In hindsight, some expiry-timeout message was to be expected, but not the action server failing. I have increased result_timeout accordingly and the synchronization problem no longer appears.

@fujitatomoya
Collaborator

@Barry-Xu-2018 I am almost sure this has been fixed by ros2/rcl#1121, what do you think?

I think the only thing missing here is to check the goal_uuid key in _result_futures during the _execute_get_result_request call. This could generate the KeyError when the result timeout is too short, or with simulation time as in this issue report.

What do you think?

@fujitatomoya
Collaborator

I think the only thing missing here is to check the goal_uuid key in _result_futures during the _execute_get_result_request call.

I was considering the following patch, but I found it is not needed, because the code already checks the goal_uuid in _goal_handles. We could probably adjust the debug message a bit.

        # If no goal with the requested ID exists, then return UNKNOWN status
        if bytes(goal_uuid) not in self._goal_handles:
            self._logger.debug(
                'Sending result response for unknown goal ID: {0}'.format(goal_uuid))
            result_response = self._action_type.Impl.GetResultService.Response()
            result_response.status = GoalStatus.STATUS_UNKNOWN
            self._handle.send_result_response(request_header, result_response)
            return

diff --git a/rclpy/rclpy/action/server.py b/rclpy/rclpy/action/server.py
index ce60309..06b93d4 100644
--- a/rclpy/rclpy/action/server.py
+++ b/rclpy/rclpy/action/server.py
@@ -472,8 +472,13 @@ class ActionServer(Generic[GoalT, ResultT, FeedbackT], Waitable['ServerGoalHandl

         # There is an accepted goal matching the goal ID, register a callback to send the
         # response as soon as it's ready
-        self._result_futures[bytes(goal_uuid)].add_done_callback(
-            functools.partial(self._send_result_response, request_header))
+        if bytes(goal_uuid) in self._result_futures:
+            self._result_futures[bytes(goal_uuid)].add_done_callback(
+                functools.partial(self._send_result_response, request_header))
+        else:
+            # This can happen when result_timeout is too short, or high-freq simulation time
+            self._logger.warn(
+                'Requested result has been expired already (goal ID: {0})'.format(goal_uuid))

     async def _execute_expire_goals(self, expired_goals: Tuple[GoalInfo, ...]) -> None:
         for goal in expired_goals:

@Barry-Xu-2018
Contributor

@Barry-Xu-2018 I am almost sure this has been fixed by ros2/rcl#1121, what do you think?

If result_timeout is still set to 0, this issue might still occur, because the current design doesn't ensure that the result is retrieved before it gets deleted. So it is the user's responsibility to set a reasonable result timeout.

ros2/rcl#1121 isn't aimed at this issue.

I think the only thing missing here is to check the goal_uuid key in _result_futures during the _execute_get_result_request call. This could generate the KeyError when the result timeout is too short, or with simulation time as in this issue report.

What do you think?

Yes. This check is necessary.
