Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support service 2/2 --- rosbag2 service play #1481

Merged
merged 73 commits into from
Apr 13, 2024

Conversation

Barry-Xu-2018
Copy link
Contributor

@Barry-Xu-2018 Barry-Xu-2018 commented Oct 13, 2023

Split #1414 to 2 PRs.
This is the second PR. This PR implement service play.
Note that: this PR depend on #1480

Known a bug while enabling introspection on both service side and client side.
rmw_fastrtps and rmw_cyclonedds have this issue. rmw_connextdds has no this issue.

ros2bag/ros2bag/verb/play.py Outdated Show resolved Hide resolved
rosbag2_storage_mcap/src/mcap_storage.cpp Outdated Show resolved Hide resolved
rosbag2_transport/src/rosbag2_transport/player.cpp Outdated Show resolved Hide resolved
rosbag2_transport/src/rosbag2_transport/player.cpp Outdated Show resolved Hide resolved
Comment on lines 1549 to 1712
void PlayerImpl::PlayerClient::async_send_request(const rclcpp::SerializedMessage & message)
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments from MichaelOrlov

I have a concern about PlayerClient::async_send_request. We can’t really make it sending request asynchronously at least with the current design. Because we need to wait for future returning from client_->async_send_request(..). Since we can't delete ros_message until the future complete. Otherwise, it will be a dangling pointer to the request_addr and memory could be reused and we have a chance to use corrupted memory during the client_->async_send_request(..). Also according to the notes in the client_→async_send_request(..) API need to call client_→remove_pending_request(future) if we waited for future and timeout happened.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will consider how to resolve this problem based on your suggestion.

rosbag2_transport/src/rosbag2_transport/player.cpp Outdated Show resolved Hide resolved
rosbag2_transport/src/rosbag2_transport/player.cpp Outdated Show resolved Hide resolved
@Barry-Xu-2018
Copy link
Contributor Author

The comments are from MichaelOrlov (#1414 (review))
Comments related to this PR (marked as ✅) will be handled in this PR.


I have made a first pass of the review and here is my major notes and proposals:

  1. Handle in Implement service recording and display info about recorded services #1480
    It seems we don’t have tests for the rosbag2_cpp::Info::read_service_info(...) and Python wrapper for information about services. It would be nice to add at least one sanity check.

  2. Handle in Implement service recording and display info about recorded services #1480
    I have a proposal about new parameters for topic/services exclusions:

    Keep the old parameter exclude but rename it to the exclude_regex. it will be applicable to both topics and services.
    Make exclude_topics and exclude_services as std::vectorstd::string i.e. space separated list of topics and services names for CLI arguments.
    IMO it will be more user-friendly. We have had recently an incoming issue from one of the users when it was expected that the exclude parameter is a list of topics. Like opposite to the topics parameter.
    It looks intuitively reasonable to expect it to be defined this way.


  3. I would like to request to rename PlayerClient to the PlayerServiceClient and consider moving it to its own separate cpp and hpp files.
    The PlayerClient is pretty large and independent from its parent PlayerImpl class. We already have a lot of code inside the Player class and with this inner class inside it will become more cluttered and more difficult to navigate and support.


  4. We are doing deserialization for the same service message twice. Once in the PlayerClient::is_include_request_message(..) and the second time in the PlayerClient::async_send_request(). This is an expensive procedure.
    I would like to request structural changes there.
    Could you please rework is_include_request_message(..) to the get_service_request_message(serialized_message) and return smart pointer to the deserialized message? If there are no valid request message it will return a pointer to null. Then rework async_send_request(deserialized_message) to accept those smart pointer to the already deserialized message.


  5. I also have a concern about PlayerClient::async_send_request. We can’t really make it sending request asynchronously at least with the current design. Because we need to wait for future returning from client_->async_send_request(..). Since we can't delete ros_message until the future complete. Otherwise, it will be a dangling pointer to the request_addr and memory could be reused and we have a chance to use corrupted memory during the client_->async_send_request(..). Also according to the notes in the client_->async_send_request(..) API need to call client_->remove_pending_request(future) if we wait for future and timeout happened.
    As a first iteration, I would suggest to making sending client request synchronous with timeout. Then need to figure out how to design it to be asynchronous.
    I have idea to create some static worker thread and wait in it for static condition variable by timeout and check for future. If another request needs to be sent we can signal to condition variable to interrupt the wait earlier than timeout, remove_pending_request(future) and then try to send the new service request.

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Oct 23, 2023

@MichaelOrlov

I also have a concern about PlayerClient::async_send_request. We can’t really make it sending request asynchronously at least with the current design. Because we need to wait for future returning from client_->async_send_request(..). Since we can't delete ros_message until the future complete. Otherwise, it will be a dangling pointer to the request_addr and memory could be reused and we have a chance to use corrupted memory during the client_->async_send_request(..). Also according to the notes in the client_->async_send_request(..) API need to call client_->remove_pending_request(future) if we wait for future and timeout happened.
As a first iteration, I would suggest to making sending client request synchronous with timeout. Then need to figure out how to design it to be asynchronous.
I have idea to create some static worker thread and wait in it for static condition variable by timeout and check for future. If another request needs to be sent we can signal to condition variable to interrupt the wait earlier than timeout, remove_pending_request(future) and then try to send the new service request.

Your suggestion is to create a worker thread for management. I'm considering whether it's possible to avoid using a thread and, instead, check if the current request future queue has reached its limit before each async_send_request call. During this check, completed futures and timed-out futures would be removed. If, at that point, the queue hasn't reached its limit, the request can be sent and request future is added to the queue; otherwise, an error is returned, and the request won't be sent. Besides, there are 2 parameters The maximum queue size and timeout. How to set these 2 parameters? set them by play command parameters or environment variables ? Related code is 54f82cb

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Oct 23, 2023

Handle review comments for this PR is from commit 1c73a3f.

@Barry-Xu-2018
Copy link
Contributor Author

Due to my mistake, the commit d6ecb16 includes corrections for review comments.

@Barry-Xu-2018
Copy link
Contributor Author

I rebased code.
Since this PR is based on #1480 and #1480 isn't merged, there are many commits for recording & info included.
For this PR, the commit is started from Implement service play 230dfb5.

@fujitatomoya
Copy link
Contributor

@Barry-Xu-2018 can you rebase the code, i will start the review.
CC: @MichaelOrlov

@Barry-Xu-2018
Copy link
Contributor Author

Barry-Xu-2018 commented Dec 22, 2023

@fujitatomoya

can you rebase the code, i will start the review.

Yeah. I start to do rebase.

@Barry-Xu-2018
Copy link
Contributor Author

Now generic client ros2/rclcpp#2358 isn't merged. So build still failed.

Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50% review, I need more time to complete the 1st review.

ros2bag/ros2bag/verb/play.py Outdated Show resolved Hide resolved
ros2bag/ros2bag/verb/play.py Outdated Show resolved Hide resolved
rosbag2_storage/include/rosbag2_storage/storage_filter.hpp Outdated Show resolved Hide resolved
rosbag2_storage_mcap/src/mcap_storage.cpp Show resolved Hide resolved
Comment on lines +93 to +119
std::chrono::seconds request_future_timeout_;
size_t maximum_request_future_queue_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe those are really user dependent configuration, that would be really better to configure when we play the service bag as clients? what do you think?

@MichaelOrlov @Barry-Xu-2018

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

furthermore, if the rosbag2 plays back with really low rate, or server is working with use_sim_time enabled, eventually server would take really long time to respond to the rosbag2 service client for this. that means, user cannot receive the service response from the server. i think this needs to be well explained at least when we use ros2 bag play services. at least, these options should be configurable by user, and they understand that this timeout is based on monotonic clock.

Copy link
Contributor Author

@Barry-Xu-2018 Barry-Xu-2018 Jan 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe those are really user dependent configuration, that would be really better to configure when we play the service bag as clients? what do you think?

Totally agree.
Maybe request_future_timeout should be configured by user. We also don't know regarding the required duration of a service process. Currently, it is set for 5 minutes, which should suffice for some cases. However, for cases with special requirements, we should also provide a configurable method.

At the beginning, Michael suggested this way #1481 (comment). I used another way and wanted to hear opinions. So I neglected to consider allowing users to set the relevant parameters.

Let's hear Michael's opinion on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, having async behavior looks good to me as implemented now with max queue depth and timeout reap the pending request if that takes long time. if we have thread to do sync way, we might have thousands threads running to wait the response from service server, we have no idea how much it takes to complete the service.

i will defer this to @MichaelOrlov , what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. The maximum queue depth and timeout duration for these two configurations should be customizable by the user. Currently, this interface has not been reflected in this PR (Which way is used to configure, command parameter or environment variable ? This need to further discuss), but it will be added in subsequent updates.

Copy link
Contributor

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

besides all comments in the source code, there is one thing i would like to discuss on user experience.

current implementation does, generic clients are created on service type and send the request based on the service introspection event messages. that is good to send the request to the service server from the bag file.
but it does not show any result from the server, internally it caches the feature and results are ready until timeouts, never printed to the user.

i think current behavior is okay that user can do play the requests from the bag file. and results can be checked on the server side.

what do you think?

}

if (request_futures_list_.size() == maximum_request_future_queue_) {
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, print warning to let the user know to tell queue size is not enough, so that user can configure this once this option is ready.

Copy link
Contributor Author

@Barry-Xu-2018 Barry-Xu-2018 Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function was called at PlayerServiceClientManager::register_request_future(). If the queue is full, it leads to failure to register request future. So I will output a warning at that function.

rosbag2_transport/src/rosbag2_transport/player.cpp Outdated Show resolved Hide resolved
type == service_msgs::msg::ServiceEventInfo::REQUEST_RECEIVED) &&
client_side_type_ == introspection_type::METADATA)
{
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it would be better for user to know that the service data that user trying to replay does not have any contents so that it is unable to replay the data via this rosbag2 database?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are multiple clients to the service name. So current logic is incorrect.
I need to reconsider the logic of this part.

Comment on lines 82 to 81
if (message.size() <= rosbag2_cpp::get_serialization_size_for_service_metadata_event()) {
client_side_type_ = introspection_type::METADATA;
RCUTILS_LOG_WARN_ONCE_NAMED(
ROSBAG2_TRANSPORT_PACKAGE_NAME,
"The configuration of introspection for '%s' client is metadata !",
service_name_.c_str());
return false;
} else {
client_side_type_ = introspection_type::CONTENTS;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would move these implementation into rosbag2_cpp service_utils.hpp/cpp, like get_service_event_type() which returns service introspection event type based on the content length?

type == service_msgs::msg::ServiceEventInfo::REQUEST_SENT)
{
if (message.size() <= rosbag2_cpp::get_serialization_size_for_service_metadata_event()) {
client_side_type_ = introspection_type::METADATA;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt this bad assumption? this rosbag2 database could have multiple client's data in the same service name. i do not think we can cache the introspection type in generic client here unless we manage that for each client ID.

the assumption we can make here,

  • we are not sure if server or clients enabled contents or metadata. some did, some did not.
  • there are multiple clients to the service name, this is likely.
  • we do not want to duplicated request from the same transaction with REQUEST_SENT and REQUEST_RECEIVED
  • introspection flag can be changed dynamically, and user application would do so on server and clients.
  • service introspection event does not have unique GID during transaction rmw#357 does not allow us to tell the identical transaction.

how about the following for now with some note?

always try to send request based on service server data with CONTENTS is enabled. expectation is that we can guide the user to enable service introspection on service server side with this feature, and that is gonna be much easier to take advantage of this feature instead of enabling all clients.
when ros2/rmw#357 solves, we can control more flexibly with client ID if the request is played or not as much as possible with server and client data, besides we can also avoid request duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isnt this bad assumption? this rosbag2 database could have multiple client's data in the same service name.

Yes. I need to reconsider the logic of the code.

always try to send request based on service server data with CONTENTS is enabled. expectation is that we can guide the user to enable service introspection on service server side with this feature, and that is gonna be much easier to take advantage of this feature instead of enabling all clients.

If introspection is set on the service side, send the received request sent from the service side. This can simplify the code logic.

when ros2/rmw#357 solves, we can control more flexibly with client ID if the request is played or not as much as possible with server and client data, besides we can also avoid request duplication.

Yes. I will reconsider this part of the logic based on the client ID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After considering, we cannot use "always try to send request based on service server data with CONTENTS is enabled. ".

e.g.
In recorded data

  1. request_sent (for client 1)
  2. request_received (for client 1)
  3. ...

While dealing with 1, code doesn't know whether service enable introspection. But code must decide whether this request is sent this time. At this case, code have to decide to use request data from request_sent to send. Code have to do judgement for each client. So I change the logic of code 05c1e95.

@wjwwood
Copy link
Member

wjwwood commented Jan 6, 2024

current implementation does, generic clients are created on service type and send the request based on the service introspection event messages. that is good to send the request to the service server from the bag file.
but it does not show any result from the server, internally it caches the feature and results are ready until timeouts, never printed to the user.

I didn't have time to review this pr (though I had a look at dependency pr ros2/rclcpp#2358), but I have to wonder at the value of playing back services and that being part of rosbag itself. I personally don't see why you'd want this as conceptually it brings up the problem that @fujitatomoya is pointing out that there's no way to "playback" what happens with the responses rosbag will receive after making requests. And I don't see a general way of handling this other than perhaps printing the responses to the console and I think that has limited value. It really only makes sense when the responses are empty or informational only and don't drive other behavior in your system.

Instead, if you actually wanted to emulate playback of service requests and do something with the responses then I'd imagine it would be better to write a script that is specific to your system, so it can subscribe to the replayed service introspection topic and makes the appropriate requests and do something meaningful for the responses.

I'll acknowledge that just because I can't imagine a good use for this generic (generic in the common sense not specifically talking about type-agnostic in the GenericClient sense) playback feature, doesn't mean there isn't a good use for it, but I'm curious if you guys see yourself actually using it for something concrete?

On the other hand, having the record feature and having the info in the bag file for analysis or custom playback scripts seems really useful, and I have no issues with that one.

@fujitatomoya
Copy link
Contributor

doesn't mean there isn't a good use for it, but I'm curious if you guys see yourself actually using it for something concrete?

originally we think that this is still useful to debug the service servers based on the recorded data for development. that is the major use case that we have. @Barry-Xu-2018 any thoughts?

On the other hand, having the record feature and having the info in the bag file for analysis or custom playback scripts seems really useful, and I have no issues with that one.

👍

@Barry-Xu-2018
Copy link
Contributor Author

One case I can think of is a sequence of requests causing issues in the system. Currently, the recorded requests can accurately simulate the scenario in which the problem occurred. (Don't care response from service.)

Besides, in the past, there have been community members who have raised the need for recording and playing back services. #773

@wjwwood
Copy link
Member

wjwwood commented Jan 9, 2024

Currently, the recorded requests can accurately simulate the scenario in which the problem occurred. (Don't care response from service.)

I guess I'm just wondering how often the responses don't matter... Seems like a topic might be better in that case.

Besides, in the past, there have been community members who have raised the need for recording and playing back services. #773

Right, but I wonder if they thought through the implications of playing back services. It doesn't really make sense. That's why I was suggesting that rather than have this as a feature of rosbag it should be done using user written scripts that understand the responses. I suppose that having this part built into rosbag that's only useful when the responses don't matter is fine, but it's useful in fewer situations than people realized I think.

@fujitatomoya
Copy link
Contributor

IMO, playback service request still can be useful even without checking the response in rosbag2.

  • we can simulate the system behavior including services that control node states and manage endpoint. e.g request the node state change to start subscription.
  • we can still see the service response with ros2 service echo <service name> if we need to see the response.

thoughts / opinion?

@wjwwood
Copy link
Member

wjwwood commented Jan 10, 2024

I suppose playback of state change requests with printing of the results (so you can at least see when a state change fails) is a good use case for this kind of feature. I certainly don't want to stand in the way of the feature, I just wanted to see what concrete use cases we have for it (the state change playback is a good example I think).

ros2 service echo can work, but it would be nice if rosbag could optionally print them I think.

@stewpend0us
Copy link

stewpend0us commented Jan 10, 2024 via email

@Barry-Xu-2018
Copy link
Contributor Author

About test_burst.cpp, I have submitted a commit 87526a7.
About player.cpp, it is a bug. I fix it now.

Signed-off-by: Barry Xu <barry.xu@sony.com>
@Barry-Xu-2018
Copy link
Contributor Author

About player.cpp, it is a bug. I fix it now.
Done. e95a51e

@MichaelOrlov
Copy link
Contributor

@fujitatomoya Could you please try to reproduce the failure with burst on your local setup after the @Barry-Xu-2018 latest fixes?

I was not able to reproduce it with CycloneDDS nor with FastRTPS DDS implementations even with the very heavy load on my machine. e.g. stress -m 70 -c 70 -i 70

@fujitatomoya
Copy link
Contributor

hmmm still flaky, i can reproduce https://ci.ros2.org/job/ci_linux/20847/testReport/ only with connextdds. (never happened with cyclone nor fast-dds)

@Barry-Xu-2018 could you try colcon test --event-handlers console_direct+ --packages-select rosbag2_transport --ctest-args -R test_burst at least 10 times, i can reproduce this flaky result with connextdds? i am not sure if this is my environment problem...

@fujitatomoya
Copy link
Contributor

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@MichaelOrlov
Copy link
Contributor

I have a suspicious that it could be rmw_connextdds specific issue. It could be different queue sizes in QoS settings for services.
Or just a surprise bug inside rti dds.

@Barry-Xu-2018
Copy link
Contributor Author

@Barry-Xu-2018 could you try colcon test --event-handlers console_direct+ --packages-select rosbag2_transport --ctest-args -R test_burst at least 10 times, i can reproduce this flaky result with connextdds? i am not sure if this is my environment problem...

Okay. I try it. But my local environment doesn't set up connextdds. This will take some time.

@fujitatomoya
Copy link
Contributor

@MichaelOrlov i guess so too. it seems that 2nd service request from rosbag2 player burst, cannot be received on palyer side. looks like always missing 1 response...

i guess we do not want to have the flaky test in here, so maybe skip connextdds case in this specific test, and leave TODO would be good if i am connect on this flaky result.

@fujitatomoya
Copy link
Contributor

I try it. But my local environment doesn't set up connextdds. This will take some time.

see https://docs.ros.org/en/rolling/Installation/DDS-Implementations.html#debian-packages-provided-in-the-ros-2-apt-repositories

@MichaelOrlov
Copy link
Contributor

MichaelOrlov commented Apr 13, 2024

Since RTI has very restrictive and company/developer non-friendly licensing, I would refrain installing it on my machine.
(I have to delete it already once from my local setup)

I would suggest - to disabale service support for RTI DDS or just this particular test - and create a follow-up issue and let folks from RTI handle it themselves.

@fujitatomoya
Copy link
Contributor

I am not sure why CI is not run...

@fujitatomoya
Copy link
Contributor

I would suggest - to disabale service support for RTI DDS or just this particular test - and create a follow-up issue and let folks from RTI handle it themselves.

agree. this could be really deep inside of dds implementation. at this moment, we can skip this particular test.

@fujitatomoya
Copy link
Contributor

fujitatomoya commented Apr 13, 2024

@Barry-Xu-2018 can you add the patch to skip connextdds for burst_bursting_only_filtered_services? and then we can run CI once again to make sure everything else is fine?

note, i can test locally tomorrow and start CI in the morning.

@MichaelOrlov
Copy link
Contributor

@fujitatomoya @Barry-Xu-2018 If you don't mind I can push a fix with disabling test_burst for RTI DDS.

- The test `burst_bursting_only_filtered_services` fails only with
rmw_connext_dds for unknown reasons and tends to be flaky. Sometimes
we receive only one service request instead of 2.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@Barry-Xu-2018
Copy link
Contributor Author

I can reproduce this flaky result with connextdds in my local environment.
The test_burst test ran continuously for 6 and a half minutes before the issue occurred once.

@MichaelOrlov
Copy link
Contributor

@Barry-Xu-2018 Likely, you will be able to increase the reproduction rate if you will run your machine in the power save mode and run in parallel in another terminal window stress -m 70 -c 70 -i 70 command.

…g requests"

This reverts commit aaaac74.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
- The test `burst_bursting_only_filtered_services` fails only with
rmw_connext_dds for unknown reasons and tends to be flaky. Sometimes
we receive only one service request instead of 2.

Signed-off-by: Michael Orlov <michael.orlov@apex.ai>
@MichaelOrlov
Copy link
Contributor

Re-running CI after disabling failing test for rmw_connextdds.
BUILD args: --packages-above-and-dependencies ros2bag rosbag2_cpp rosbag2_py rosbag2_storage rosbag2_storage_mcap rosbag2_storage_sqlite3 rosbag2_transport rosbag2_tests
TEST args: --packages-above ros2bag rosbag2_cpp rosbag2_py rosbag2_storage rosbag2_storage_mcap rosbag2_storage_sqlite3 rosbag2_transport rosbag2_tests
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/13695

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@fujitatomoya
Copy link
Contributor

@MichaelOrlov @Barry-Xu-2018 all green 🚀🚀🚀 really appreciate your effort!

@fujitatomoya
Copy link
Contributor

@MichaelOrlov after merge this PR, can you take care of this #1608 too?

i will update the doc to resolve ros2/ros2_documentation#4274

@MichaelOrlov MichaelOrlov merged commit b02142a into ros2:rolling Apr 13, 2024
14 checks passed
@MichaelOrlov
Copy link
Contributor

Uff finally we merged it.
@Barry-Xu-2018 @fujitatomoya thank you for your tremendous effort for this feature.

@fujitatomoya
Copy link
Contributor

honor is ours, lets keep it up 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants