Adds design document for Deadline, Liveliness, and Lifespan. #212
Is this a behavior that DDS ensures (like in RTI's request/reply API)? Or is it something that we would ensure in rcl and above? (Presumably by emulating the behavior of deadline QoS for topics.)
When I looked before, I did not see anything in the DDS-RPC spec about deadline behavior for the request/reply pattern. This is something that would need to be tracked in the ROS layers.
@wjwwood, @nburek: There is LatencyBudget, which is unfortunately just a hint to the DDS implementation and not enforced.
Maybe it makes sense to separate these two semantically different things by naming them differently in ROS 2.
The notion of a latency budget could then be implemented enforceably at the ROS 2 level for pub/sub, RPC, and actions as an additional QoS. It could also be made even more useful by allowing the user to select between the sender timestamp and the data timestamp (stored in ROS 2 stamped types) for determining whether the latency budget was exceeded.
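To make the selectable-timestamp idea concrete, here is a minimal Python sketch (purely illustrative — not ROS or DDS API; the message fields and function name are invented) of a receive-side check for an enforced latency budget:

```python
import time

# Hypothetical receive-side check for an enforced latency budget.
# "sender" uses the middleware send time; "data" uses a timestamp stored
# inside the message itself (as in a ROS 2 stamped type). Both fields
# here are assumptions for the sake of the sketch.

def budget_exceeded(msg, budget_s, source="sender", now=None):
    """Return True if the message is older than the latency budget."""
    now = time.time() if now is None else now
    stamp = msg["send_time"] if source == "sender" else msg["data_stamp"]
    return (now - stamp) > budget_s

msg = {"send_time": 100.0, "data_stamp": 99.5}
# Sent 0.5s ago: within a 0.8s budget.
assert not budget_exceeded(msg, budget_s=0.8, source="sender", now=100.5)
# But the data itself is 1.0s old, so the same budget is exceeded.
assert budget_exceeded(msg, budget_s=0.8, source="data", now=100.5)
```

The point of the two timestamp sources: a message can be freshly sent yet carry stale data, and only the data timestamp catches that case.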
Here the data model and the API elements are conflated. Say I have 5 cars that I am tracking, provided by 5 different publishers on the same topic. What happens if I lose tracking of one of the cars? I want to know about the liveliness of the data, not just of some API entities.
If we merge this as-is, then we need to have an issue which describes the work to be done in order to properly support liveliness with instances.
Right now, we assume the data model is one to one with publishers and subscriptions. If someone were to use a keyed data type with instances, then I'm not sure how this would break. It is already possible for users to create keyed types using DDS style IDL files in ROS.
It may need to be mentioned as a limitation in the API docs as well until we decide what to do differently.
To be clear, I support merging this as-is, but I would like to see one of the authors open up the issue that describes the necessary follow up work and/or the current limitations w.r.t. instances and keyed data.
Here the data model and the API get conflated. As a client, I care about my specific (RPC) request, and I want to know if this request is still alive, meaning the server claims that it still plans to process it. I don't care what API objects exist.
I think defining liveliness for a specific request convolutes the meaning. For Publishers and Subscribers liveliness doesn't inform about a specific message still being alive and receivable by a Subscriber, it informs you that the Publisher itself is still online and active. To me it makes sense to make liveliness for Services mean something similar, that the Service Client and Server are still alive.
I can see the usefulness of knowing the lifecycle of a particular request as well, I just don't think it should be solved by liveliness.
I'm curious what you think needs to be done here? You mean the asynchronous service client needs to give you a callback when the request expired due to lifespan? The sync service client already has a timeout, but it is basically how long you're willing to wait for a service request to be received, and if you run out of time you just ignore the result.
Seems like more work would be on the Service Server side, which would presumably want to let the user stop their work if the lifespan is exceeded, otherwise it would just wait for the callback to handle the service to end and then drop the result (because the lifespan was already exceeded).
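The server-side idea above can be sketched in a few lines of Python (names and structure are hypothetical — this is not rcl/rclcpp API, just the control flow): the handler checks the remaining lifespan between work steps and aborts early rather than computing a result that would only be dropped.

```python
# Hypothetical server-side handling of a request lifespan: the callback
# periodically checks whether the lifespan is exceeded and aborts early
# instead of finishing work whose result would be dropped anyway.

def handle_request(work_steps, lifespan_s, clock):
    """Run work steps; return the result, or None if the lifespan expired."""
    start = clock()
    result = 0
    for step in work_steps:
        if clock() - start > lifespan_s:
            return None  # lifespan exceeded: stop work, drop the response
        result += step()
    return result

# Simulated clock that advances 1 second per call.
ticks = iter(range(100))
clock = lambda: next(ticks)

# Three 1-second steps against a 10-second lifespan: completes normally.
assert handle_request([lambda: 1] * 3, lifespan_s=10, clock=clock) == 3
```

With a short lifespan the same handler would return None partway through, which is the "let the user stop their work" behavior described above.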
From what I was seeing in the existing client code, the timeouts on the client side are handled in rclcpp and nothing in the rcl layer or below tracked the time. It's fine to ignore the result when it arrives late, but what happens when it doesn't arrive at all? If for some reason my service is dying or dropping requests am I leaking memory in the rcl/rmw layers by continuing to wait for responses that will never come in?
Also, as @gbiggs points out, you need a way to notify the application when a timeout occurs. If a request is dropped by either the rmw layer on the client side or by the server the application making the request should be informed of that so that they don't get blocked spinning on a wait to a response that will never come.
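One way to avoid leaking state for responses that never arrive — a sketch only, not what rcl currently implements — is a pending-request table with an expiry sweep that both frees the entry and notifies the application:

```python
# Hypothetical client-side bookkeeping: pending requests are tracked with
# a deadline; an expiry sweep removes stale entries and invokes a timeout
# callback, so no state leaks for responses that never come.

class PendingRequests:
    def __init__(self, on_timeout):
        self._pending = {}          # request_id -> deadline
        self._on_timeout = on_timeout

    def add(self, request_id, deadline):
        self._pending[request_id] = deadline

    def complete(self, request_id):
        """Response arrived; True if the request was still pending."""
        return self._pending.pop(request_id, None) is not None

    def sweep(self, now):
        """Drop expired requests and inform the application of each."""
        expired = [rid for rid, d in self._pending.items() if now > d]
        for rid in expired:
            del self._pending[rid]
            self._on_timeout(rid)   # application is told, so it stops waiting
        return expired

timed_out = []
table = PendingRequests(on_timeout=timed_out.append)
table.add("req-1", deadline=5.0)
table.add("req-2", deadline=20.0)
assert table.sweep(now=10.0) == ["req-1"]   # req-1 expired and was reported
assert table.complete("req-2")              # req-2's response arrived in time
assert timed_out == ["req-1"]
```

This addresses both halves of the concern above: the lower layer reclaims its memory, and the application gets a positive notification instead of blocking forever.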
@nburek: I am not sure that trying to inform the application of data removal due to lifespan is sensible for the common use cases of this policy.
Removing data and then informing the application of a potential problem created by the data removal does not really make sense to me. A more useful approach would be the latency budget, which informs the user that data is older than it should be without removing the data, or a way to query the number of samples currently held in the history per instance.
@deeplearningrobotics For just Publishers and Subscribers I would agree with you that the application doesn't need to be informed of transient removal of data from the history. However, if you're defining the lifespan for a Service request, then a violation of that policy means you never expect to receive a response for the request, and you'll need to be informed both in the ROS library (to clean up state for the request) and in the application (so that it doesn't spin forever waiting for a response that will never come).
DDS QoS settings such as deadline and liveliness map to instances. So this sentence is only true if one keeps the semantics consistent with DDS.
This relates to a couple of your comments above. I need to do a little more research to figure out the best approach for addressing your concerns.
Could you maybe clarify why we would be tied to a particular DDS implementation? Because some (e.g. OpenSplice, RTI Connext Micro) do not implement https://www.omg.org/spec/DDS-RPC/About-DDS-RPC/?
Defining deadline for Services in terms of the underlying topics ties the rmw implementation to the DDS style of using multiple pub/sub channels for RPC. So it's not that it ties you to a particular DDS vendor, but rather that it ties you to a DDS style implementation of RMW because it is defining implementation details when it doesn't need to.
What if there are multiple publishers but each providing a subset of the necessary data model?
In DDS, liveliness information is provided with the SampleInfo.
Also, lifecycle impacts data buffering:
https://community.rti.com/static/documentation/connext-dds/5.2.0/doc/api/connext_dds/api_cpp2/classdds_1_1core_1_1policy_1_1WriterDataLifecycle.html#details
This should also be covered in a follow up issue (the same one as for deadline or in a separate one).
I'm not sure what @deeplearningrobotics is talking about here.
I can create a follow up issue here as well, but this sounds like a message composition feature to me. I know that DDS allows you to compose data from multiple publishers into a single subscription, but does ROS have plans to allow that as well? If not, then I don't think it makes sense to open an issue to update the design for it.
@nburek I'm not sure what you mean by composition, since you can definitely have multiple publishers per subscription (many to one/many).
I think @deeplearningrobotics means that liveliness is based on the instance, so in the case that we have one instance (most of the time, i.e. with key-less types) the liveliness isn't lost unless all the publishers stop asserting it. If at least one is still asserting it, then it will not drop. So it may not be a reliable way to detect that an individual publisher is gone, but rather that all publishers for a particular instance are gone.
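That semantic can be modeled in a few lines (illustrative only — not DDS or rmw API): the instance's liveliness is lost only when the set of publishers still asserting it becomes empty.

```python
# Illustrative model of instance-based liveliness: the instance stays
# alive while at least one publisher keeps asserting it; the "liveliness
# lost" event fires only when the last asserting publisher goes away.

class InstanceLiveliness:
    def __init__(self):
        self._alive = set()     # publishers currently asserting liveliness

    def assert_liveliness(self, publisher_id):
        self._alive.add(publisher_id)

    def publisher_lost(self, publisher_id):
        """Return True if this loss dropped the whole instance."""
        self._alive.discard(publisher_id)
        return not self._alive  # lost only once no asserting publisher remains

inst = InstanceLiveliness()
inst.assert_liveliness("pub_a")
inst.assert_liveliness("pub_b")
assert inst.publisher_lost("pub_a") is False  # pub_b still asserts: no event
assert inst.publisher_lost("pub_b") is True   # last one gone: liveliness lost
```

This is exactly why instance liveliness cannot stand in for per-publisher failure detection: losing pub_a above is invisible to the subscriber.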
I think this may just need rewording here, because we're not doing anything special or different from DDS, we're just using a single instance. But that does not indicate a single publisher, which is kind of what this verbiage seems to be saying to me.
Bump, if we can address this we can merge it.
I've made a change to the wording to indicate that the subscriber receives an event when there are no publishers alive to continue asserting liveliness on the topic. I did not specifically mention instances since it's assumed through the doc that ROS 2 only supports single instances and that multiple instances of a topic based on a key need further work.
This will not work if one wants to use instances, see my comment above.
I find this second option dangerous. Experience shows that if users have the option they will use it, and they will ignore warnings on the status events, out of ignorance or out of not understanding the limitation. I'd vote for raising an exception as the only option.
Raising an exception (or returning a non-SUCCESS error code as the case may be) then requires every rmw implementation to implement every QoS policy or risk nothing being able to run. Yes, you now have the risk of users ignoring the warnings, but it also allows the application to actually run in cases where you don't actually care.
For example, I imagine that a large number of commonly used Nodes are going to provide LIVELINESS_MANUAL_TOPIC to allow the most verbose tracking of liveliness by consumers, even though it isn't really required for them to run. There may be an rmw implementation that a user wants to use for their particular use case that does not implement the liveliness QoS policy. The user may have decided that they don't actually need liveliness tracked in their system but do need some other feature offered by this particular rmw implementation. In that scenario, throwing an exception means the user will not be able to use any of the common Nodes that offer liveliness, because the thrown exception would prevent them from starting up. On the other hand, making this only a warning allows those Nodes to still run without liveliness, and the user can still use them.
I guess the tradeoff here is that by not using exceptions we allow Nodes to be more portable out of the box when the implementor doesn't fully handle the situation when an rmw implementation doesn't support a QoS policy. However, it does push some burden onto the end user of the Node to understand the impact of their rmw choice.
If the feature is not implemented uniformly across middleware implementations, then I question its value in the first place. Making the API optional just makes user code more complicated.
In the case of emitting warnings, I think that's simply not good enough. If I write a program and expect the deadline (for example) to be respected, but it fails to do so, only emitting a warning, then my program may no longer be valid. Instead, I'd like the API to throw (or return an error code), and if you want to write code that can use it optionally, then add an API call which can be used to check and see if it is not implemented, in which case the user can avoid calling it, or configure their system differently. Or they can just call it and check for the error state, catch it / handle it and continue with an alternative code path that doesn't require it.
In general, however, I would prefer to avoid optional components of the API due to the complexity they add at all layers of the system. Instead we should aim to only include features in the API that are useful enough to require all middleware implementations to support them one way or the other. This contributes to my personal preference for including as few QoS settings as we can manage and still do useful stuff. The breadth of configuration is useful, but makes portability difficult and implementations complicated. Having optional APIs tries to address this, but fails to avoid the complexity, in my opinion.
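The check-then-use pattern suggested above might look like this (all names here are hypothetical — no such rmw functions are being claimed): the caller queries support for a policy up front and takes an explicit fallback path, while unchecked use fails loudly rather than degrading silently.

```python
# Hypothetical capability-query pattern: the caller asks whether a QoS
# policy is supported before relying on it, and falls back explicitly
# instead of silently getting weaker behavior than it asked for.

SUPPORTED_POLICIES = {"reliability", "history"}   # what this fake rmw offers

class UnsupportedQoSError(Exception):
    pass

def qos_supported(policy):
    return policy in SUPPORTED_POLICIES

def create_publisher(policies):
    unsupported = [p for p in policies if not qos_supported(p)]
    if unsupported:
        # Fail loudly: no warning-only path that lets invalid programs run.
        raise UnsupportedQoSError(unsupported)
    return {"qos": list(policies)}

# Caller checks first and chooses an alternative code path if needed.
if qos_supported("liveliness"):
    pub = create_publisher(["reliability", "liveliness"])
else:
    pub = create_publisher(["reliability"])    # explicit, deliberate fallback
assert pub["qos"] == ["reliability"]
```

The design point: the error path and the capability query together let portable Nodes opt out deliberately, without making every QoS feature silently optional.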
I think this needs to be one way or the other and guaranteed by rmw, because I don't think it will be acceptable to say "sometimes you'll get all the events, but in other situations you may not". The easy solution is to only ever have the latest event available. Again, we can discuss this in more detail in a live meeting.