Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Obstacle Layer, Voxel Layer, and Costmap Topic Collision Checker not working in Humble/Rolling due to Fast-DDS Regression #3014

Closed
SteveMacenski opened this issue Jun 14, 2022 · 48 comments
Labels
bug Something isn't working

Comments

@SteveMacenski
Copy link
Member

More info #2489 (comment)

Seems to have broken sine #2489 for 22.04 / Humble updates.

CC @jwallace42 @gezp

@SteveMacenski SteveMacenski added the bug Something isn't working label Jun 14, 2022
@gezp
Copy link
Contributor

gezp commented Jun 15, 2022

the problem is that Obstacle Layer not working (the scan data isn't subscribed by Obstacle Layer), becasue observation_subscribers_->subsciber() failed when ObstacleLayer::activate() is called.

it seems that we can't create subscriber in another callback group at runtime since rollling updated in ubuntu22.04, but i don't know why. and i make a simple PR to sovle this problem by removing subscribe().

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Jun 15, 2022

I think that makes sense in the interim, but can you file a ticket describing this issue to the message filters repo? Do you see the same behavior if you create a minimal example or just something we see in Nav2's use of it? The unit test in message filters unsubscribing then later subscribing seems to work https://github.com/ros2/message_filters/blob/master/test/test_subscriber.cpp#L105-L126 so I agree it seems related to callback groups.

I went through the git blame for message filters and don't see any changes in that repo that seem problematic (or really even substantial) in the last year.

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Jun 15, 2022

https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L88-L117

It seems like it might be related to this? It might be using the wrong constructor and even though we pass the options, if the 2nd one is called then its not even sent. These are called in the constructor https://github.com/ros2/message_filters/blob/master/include/message_filters/subscriber.h#L186-L202.

That's new relative in the last year, but may have never been released on 20.04 for us to see. I think (?) that's the only issue at hand. Let me know if you agree

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Jun 16, 2022

I determined root cause of the costmap collision checker issue. Since the obstacle layer isn't getting its callbacks as @gezp reported, the local costmap is empty so it shows always collision free. We use the local costmap in the behavior server, not the global costmap, which does not have the static map set by default in nav2_params.yaml.

When a static map is set or the global costmap/footprint/frame are used, things work fine!

So once we solve the issue described in ros2/rmw_fastrtps#613 / short term patch in #3018 then this should all be good

@ladianchad
Copy link
Contributor

I found same issue. (obstacle_layer doesn't subscribe sensor callback)
So, i tried short term patch in #3018, and it works good

@padhupradheep
Copy link
Member

could we pin the issue on the top?

@SteveMacenski SteveMacenski pinned this issue Jul 5, 2022
@SteveMacenski SteveMacenski changed the title Costmap Topic Collision Checker not working in Humble/Rolling Obstacle Layer, Voxel Layer, and Costmap Topic Collision Checker not working in Humble/Rolling due to Fast-DDS Regression Jul 5, 2022
@SteveMacenski
Copy link
Member Author

ros2/rmw_fastrtps#619 fixes ros2/rmw_fastrtps#613 on Fast-DDS

@Aposhian I believe you were the Fast-DDS user that was blocked especially by this. Can you verify this fixes the issue for you so we can close this out?

@fujitatomoya
Copy link

Just FYI, ros2/rmw_fastrtps#619 has been merged in rolling.

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Aug 25, 2022

Sweet. It should be pretty easy for me to test then when the binaries are released! Until we have confirmation though, leaving this open 😀

I’d normally grab some shadow-fixed binaries but my dev time is very short right now due to different internal priorities. Most the time I can spend right now is used making sure I’m not the blocker in PRs/issue tickets. ~November things get back to normal.

@do335
Copy link

do335 commented Oct 18, 2022

I switch to rmw_cyclonedds and it works! Will doing so has any side effect?

@jwallace42
Copy link
Contributor

I would read concepts. The key line from the documentation is ROS 2 supports multiple DDS/RTPS implementations because it is not necessarily “one size fits all” when it comes to choosing a vendor/implementation.

So depending on your specific use case there might be some side effects.

@dcconner
Copy link

Confirming that Humble binaries as of 18-Oct still have this issue. Switching to Cyclone seems to fix. Just commenting here for others that might be trying to track down issue

@doisyg
Copy link
Contributor

doisyg commented Oct 19, 2022

I still cannot believe that close to 5 months after the last ROS2 LTS release, its default DDS implementation is not yet fully functional !
I moved away to Cyclone a long time ago so I was not aware it is still an issue.
It seems that has been fixed for rolling in ros2/rmw_fastrtps#619, and then back-ported to humble in ros2/rmw_fastrtps#633 but the updated binaries were never released for humble...

@linas-p
Copy link

linas-p commented Oct 21, 2022

@SteveMacenski, confirming @dcconner claim. Issue is still, there it's not possible run proper navigation stack in humble as of 21-Oct

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Nov 2, 2022

Lets keep this conversation on the actual issue at hand here folks, this isn't a place for people working on other project ecosystems to come in and given their 2 cents. This is a specific issue regarding a specific project (nav2). Release requests for updated binaries and branches for fixes to solve this problem are fair game - but generally railing on people/projects are not. That's not how you get folks to help you out and work collaboratively with you on this problem.

@EduPonz
Copy link

EduPonz commented Nov 3, 2022

Hi everyone,

Following up on what @SteveMacenski said, I'd like to try to move this conversation forward on a more constructing manner. As we have show over and over, we are definitely willing to contribute to solve any issue rmw_fastrtps may be manifesting, but to do that we need to understand the actual problems. Mind that we (eProsima) do middleware, not only used in ROS 2 but in many critical systems, so claiming that Fast DDS, or any other implementation for that matter, is broken maybe an over statement (even though for sure there are many ways things to improve. Mind that we do not do navigation nor simulation and these tools are not our strongest suit, so we'd appreciate help on specifically describing what the problems are so we can reproduce on out end.

It seems that so far we have not been able to actually understand what issues you are describing here, as the testing we having been doing seemed successful. I do not think GitHub is the best platform to organize the effort of actually fixing these things so I'd like to create a small group of those of you who are willing to help on getting this solved and move the conversation to the Nav2 slack workspace so we can cease angrily commenting here and rather work towards a fix.

On a side note, I think it is important, although maybe redundant, to point out that the ROS 2 core is a coordinated effort of a lot of parties, and that there are processes to be followed for contributing, releasing and finally getting things onto the binary repositories; this does not just happen overnight. Furthermore, I'd like to say that all of us (not only eProsima, but also OR and other parties involved) are widely available so reach out to us and we may be able to give you a hand with your concerns more effectively than going on a rampage about all the possible improvements that ROS 2 needs, which are many and not only limited to Nav2.

On that thought, it is worth remembering that ROS 2 is a FOSS project, and as such it needs inputs from the community to keep going and grow. As @clalancette pointed out, I think that the Nav2 community specifically could bring a lot of value to the table in the form of integration and system tests, as you guys really leverage most of the "advanced" ROS 2 features, which as this issue surfaces require far more testing.

Please write to me here, send me and email at eduardoponz@eprosima.com, or preferably contact me on the Nav2 slack so we can organize a small chat early next week and move from there.

@bperseghetti
Copy link

bperseghetti commented Nov 4, 2022

We tried Cyclone DDS with Galactic, but when we had latency issues, we were asked to switch to Fast DDS. Fast DDS shared memory looked promising but infinite discovery and service issues started coming up.

@aditya2592 I am curious as to the performance differences you saw in "current CycloneDDS" vs what you are getting with FastDDS. We switched to CycloneDDS with Galactic and moved forward with it on Humble. I am curious as to what limitations/latency issues you ran into and what was specifically effected so we can keep an eye out for it as well.

When you were testing CycloneDDS were you using the newest iceoryx or an older one? Also did you test configuring the payload size, multicast, specific network interfaces, etc? https://github.com/eclipse-cyclonedds/cyclonedds#run-time-configuration

@aditya2592
Copy link

aditya2592 commented Nov 4, 2022

@bperseghetti This is Feb 2022, so our iceoryx was probably the old one. I certainly looked at iceoryx, but at the time, the repo said ROS services are not supported and that was big limitation for us since we were trying to shift to a service drive architecture [looks like we missed it by a month]. The issues we saw were all in line with the TSC report here. This report confirmed our observations from production and then it made a lot more sense to keep pushing on Fast DDS since it seemed a lot more efficient. Feel free to contact me through my Github profile for further discussion.

@ciandonovan
Copy link

@aditya2592 you mentioned ROS services not being supported with Iceoryx, I suspect you're thinking of https://github.com/ros2/rmw_iceoryx#Limitations, which is a non-DDS RMW using Iceoryx exclusively.

However, the Cyclone DDS RMW can also optionally be configured to use Iceoryx for shared-memory on the same host, while still using regular UDP/TCP DDS for inter-host communication. https://github.com/eclipse-cyclonedds/cyclonedds/blob/master/docs/manual/shared_memory.rst

That said, I haven't had much luck with it so far. Enabling Iceoryx with Cyclone broke Nav2 immediately, apparently Iceoryx is compiled only supporting a limited number of "Ports" for its communication and Nav2 blows past that. With a stripped back setup I was able to get the robot teleoping with no traffic for once on the loopback interface :)

Here's a Cyclone DDS config file to get it working, note you'll also have to manually run the iox-roudi daemon if you want shared-memory.

<?xml version="1.0" encoding="UTF-8" ?>
<CycloneDDS xmlns="https://cdds.io/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="https://cdds.io/config https://raw.githubusercontent.com/eclipse-cyclonedds/cyclonedds/master/etc/cyclonedds.xsd">
	<Domain Id="any">
		<SharedMemory>
			<Enable>true</Enable>
			<LogLevel>info</LogLevel>
		</SharedMemory>
		<General>
			<Interfaces>
				<NetworkInterface address="127.0.0.1" multicast="true" />
				<NetworkInterface address="192.168.1.1" multicast="true" /> # address of some other wired or wireless interface you want to use ROS2 on
			</Interfaces>
			<AllowMulticast>spdp</AllowMulticast>
			<MaxMessageSize>65500B</MaxMessageSize>
		</General>
		<Internal>
			<MinimumSocketReceiveBufferSize>10MB</MinimumSocketReceiveBufferSize>
			<Watermarks>
				<WhcHigh>500kB</WhcHigh>
			</Watermarks>
		</Internal>
		<Tracing>
			<Verbosity>config</Verbosity>
			<OutputFile>stdout</OutputFile>
		</Tracing>
	</Domain>
</CycloneDDS>

@Aposhian
Copy link
Contributor

Aposhian commented Nov 7, 2022

Related: issues when trying to enable iceoryx shared mem with CycloneDDS on a basic Nav2 stack (repro included): eclipse-cyclonedds/cyclonedds#1326

@SteveMacenski
Copy link
Member Author

Deleted totally off topic question

@JET00wy
Copy link

JET00wy commented Jan 7, 2023

Deleted totally off topic question

I tried CycloneDDS and it worked. Thanks for your discussion. They are really helpful. Just a quick question, is it possible that nav2 can be used in Fast DDS in the future?

@SteveMacenski
Copy link
Member Author

Presumably so, but that’s not in my control to change

@AlexeyMerzlyakov
Copy link
Collaborator

AlexeyMerzlyakov commented Mar 2, 2023

I had some time to localize the problem and make run-time verification of the bugfix.

What was found:

How I've tested the fix:

  • Used my old ROS2 Rolling (built from 2022.07.22 sources, which has no RCLCPP bugfix or FastRTPS improvement patches) - the problem appears
  • Used newer ROS2 Rolling (built from 2023.01.10 sources with RCLCPP PR1640 and FastRTPS PR619 applied) - problem does not appear
  • I've found that replacing of https://github.com/ros2/rmw_fastrtps as well as https://github.com/eProsima/Fast-DDS does not resolve the problem, but only reduces the probability of a it
  • Using the bisection method, it was found that problem disappeared after ros2/rclcpp@d119157. Testing was proceeded on TB3 standard simulation. When the fail appeared, the empty local costmap was observed:
    Local_costmap_FAIL_cr
  • With RCLCPP tuned on ros2/rclcpp@d119157 (bugfix) v.s. one commit before it was checked the frequency of problem, depending on FastRTPS PR619, which is presented in the table below. The testcase was taken from Callback not reached, using callback group and Fast-RTPS ros2/rclcpp#1611 description (as well as it was checked on standard TB3 simulation):
rclcpp w/o PR1640 rclcpp with PR1640
rmw_fastrtps w/o PR619 FAIL (4/4 runs) PASS (0/15 runs)
rmw_fastrtps with PR619 FAIL (3/17 runs) PASS (0/15 runs)*

(*) TB3 simulation passes, testcase from rclcpp Issue1611 was not checked

  • To additionally confirm the the problem is related to ros2/rclcpp@d119157, this patch was reverted back from the ROS2 Rolling (built from 2023.01.10 sources) and the problem was returned back

Some thoughts:

  • The testcase, described in the Issue1611 header is very similar by its structure to Costmap2D: we are making the mutually exclusive callback group where the layers/filters to be operated at. So, both problems (this and rclcpp Issue1611) should have the same cause.
  • The problem disappears in latest ROS2 Rolling built from sources, but it is questionable: whether the RCLCPP has a back-ported patch to Humble?

TLDR
The problem is related to ros2/rclcpp#1611 issue and being fixed by ros2/rclcpp@d119157.
ros2/rmw_fastrtps#619 although is not fixing the problem, but reduces the frequency of its occurrence from 100% -> to ~20%.

@MiguelCompany
Copy link

@AlexeyMerzlyakov Thanks for such a thourough testing!

  • The problem disappears in latest ROS2 Rolling built from sources, but it is questionable: whether the RCLCPP has a back-ported patch to Humble?

ros2/rclcpp#2033 was the backport for ros2/rclcpp#1640, and was merged on ros2/rclcpp@33cbd76 months ago, so I would say it should also work on Humble.

There have been patch releases since then, so it should be easy to test directly from binaries.

@doisyg
Copy link
Contributor

doisyg commented Mar 2, 2023

Great to see this progressing !

So, this ticket could be closed as of today state.

As this appears to still be a problem on Humble (the LTS version that most people use in production), I believe it should stays open

@SteveMacenski
Copy link
Member Author

Really great analysis @AlexeyMerzlyakov!

As this appears to still be a problem on Humble
ros2/rclcpp#2033 was the backport for ros2/rclcpp#1640, and was merged on ros2/rclcpp@33cbd76 months ago, so I would say it should also work on Humble.

@MiguelCompany says this was backported already, should it not work in Humble too?

@doisyg
Copy link
Contributor

doisyg commented Mar 2, 2023

@MiguelCompany says this was backported already, should it not work in Humble too?

I may have missed that, but @AlexeyMerzlyakov tests were done on Rolling, hence my comment.

@AlexeyMerzlyakov
Copy link
Collaborator

AlexeyMerzlyakov commented Mar 2, 2023

Checked on my Humble VM with latest upgraded packages:
Screenshot_2023-03-02_23-11-02_r
The build is from 20230221, so it should contain the back-ported bugfix.
Also, the problem is confirmed to disappear on latest Humble run-time testing: 0 fails of 10 runs for both composable and convenient TB3 simulation modes on FastDDS.

@SteveMacenski
Copy link
Member Author

Awesome! I think that means we can close this out and unpin it!

Thanks @AlexeyMerzlyakov for your very careful analysis of the issue and bringing this topic to a close!

@SteveMacenski SteveMacenski unpinned this issue Mar 2, 2023
@altineller
Copy link

I was trying to figure out why the local costmap was not showing up on rviz2, when I came across this issue.

Screenshot from 2023-03-20 16-15-02

I am on humble ubuntu 22.04, with the newest rclcpp and local costmap still does not show.

Best Regards,
C.

@SteveMacenski
Copy link
Member Author

SteveMacenski commented Mar 20, 2023

Does switching DDS to Cyclone fix it?

@altineller
Copy link

I have tried it, and no switching to cyclone does not fix it. I think this is another problem.

And apart from the nav2_simple_commander demos, generally, if the localization is running with known map, local costmap shows up, if we are running without known map, then local costmap does not show up.

By 'generally` I mean using the scripts under /opt/ros/humble/share/nav2_bringup

@GPrathap
Copy link

GPrathap commented Feb 8, 2024

After some time, issues start even with cyclone dds. I need to restart it several times to get it working,
I've set these values but it seems to still message for point clouds disappeared

    net.core.rmem_max=5147483647
    net.core.rmem_default=5147483647

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests