Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full System Testing: Nav2 Edition P1: The Exector Strikes Back #1042

Closed
SteveMacenski opened this issue May 3, 2023 · 14 comments
Closed

Full System Testing: Nav2 Edition P1: The Exector Strikes Back #1042

SteveMacenski opened this issue May 3, 2023 · 14 comments

Comments

@SteveMacenski
Copy link
Member

Someone please tag amd debian cyclone jammy. Should there be a second ticket if I'm also going to do fastdds? I'll likely test binary too but I'm not committing to that up front :-)

@clalancette
Copy link
Collaborator

Someone please tag amd debian cyclone jammy. Should there be a second ticket if I'm also going to do fastdds? I'll likely test binary too but I'm not committing to that up front :-)

Yeah, a separate ticket for fastdds would be preferred. That way it is easier to make this a standard part of the testing in the future if we want to.

@SteveMacenski
Copy link
Member Author

  • Localization localizes
  • Gazebo Classic comes up, hooks into plugins, and stays up
  • Rviz comes up and displays the data I would expect for all 30 visualization types
  • BT navigator takes commands and timely cancels or preempts them
  • Controller and Planner servers publish out costmaps at rate to the Behavior server
  • CLI commands for clearing costmap, commanding to point, and similar work
  • Run a couple of canned demos in the simple_commander examples
  • Rviz panel takes and sends commands without crashing the rviz window
  • laser scans and sensor data pass through custom callback groups to populate the main costmap
  • TF trees are complete and publishing at a regular clip
  • Pluginlib plugins are being appropriately identified and loaded
  • Composition nodes are being identified and loaded into the component containers
  • Lifecycle nodes + services are bringing up the system into an active state and tearing down to a clean exit
  • Internal actions appear to be working as expected for smooth navigation

@SteveMacenski SteveMacenski changed the title Full System Testing: Nav2 Edition Full System Testing: Nav2 Edition P1: The Exector Strikes Back May 3, 2023
@SteveMacenski
Copy link
Member Author

SteveMacenski commented May 3, 2023

@clalancette Ticket from Iron testing --> colcon/colcon-core#554 Colcon has warnings when executed. Should I report this anywhere else?

@SteveMacenski
Copy link
Member Author

SteveMacenski commented May 3, 2023

Another: ros2/rosidl#743 --> Previously at ros2/rosidl_typesupport_fastrtps#28 but I think from re-reading it, that's incorrectly attributed to Fast-DDS of another (but related) problem and filed in that rosidl repo. Ours is a more central issue in rosidl_generator_cpp which only appeared after the last Rolling sync and now in Iron.

@clalancette I think this should be looked at soon after release - but I don't think show stopping for release. If it was only nav2_msgs, honestly not a big deal on the scale of things, but sensor_msgs would also exhibit this issue with messages such as http://docs.ros.org/en/api/sensor_msgs/html/msg/BatteryState.html containing a number of const definitions.

@SteveMacenski
Copy link
Member Author

#1044 is the Fast-DDS version (linking to be able to find later)

@SteveMacenski
Copy link
Member Author

SteveMacenski commented May 3, 2023

Woof. Nav2 appears to be broken but I don't think its Iron, though it could I suppose 🤷. We're seeing Nav2 receiving AMCL localization service requests, but not honoring the content of them (and seems to randomly just bounce around) & Gazebo Classic appears to be producing data that doesn't seems sensible for where the robot truly is in the environment + reporting that the robot is outside of the local costmap bounds for update. This feels like TF, services, and/or gazebo issues which I doubt are related to the core packages but maybe some regression we have within Nav2 itself. This happened around the time of the last Rolling sync, but that could be entirely coincidental.

Once its moving, the planning and control stack seems to be functional from really quick testing.

I'm currently building a branch that I nuked all changes back to March 1 to see if it still is a problem - at which point it should be clear. I narrowed down from our CI results the last job I can clearly know it didn't happen in and the first when it appears. I can't see any commits in that range that touch any of this, but I flashed back much further than even those. Not at a point yet I'm asking for any particular help, just giving an update and summary in case this turns into a multi-day excursion

@SteveMacenski
Copy link
Member Author

OK, now I'm asking for some attention, I believe this is not Nav2 related and is instead ROS 2 core regressions.

I've backed out all of the commits since last Christmas on Nav2 and I'm still seeing the same problem unchanged on Iron binaries.

I was able to find the last nightly build on Rolling that succeeded, and it was the night of April 13th (early morning hours of April 14th). This corresponds with the last Rolling release on the 14th before the Iron fork which broke Nav2's CI (rosidl issue, not relevant). Since the first build after resolving that CI issue, we see all system tests failing with this issue described above since.

To confirm from a different direction, I then repeated this exercise using Humble and Nav2's current state without rolling back commits and things work perfectly (after a couple of 2 line changes for deprecated API). So the Iron vs Humble and Pre-last-Rolling-release shows that difference in behavior. So in summary, I think something in the last Rolling release before the fork broke some stuff that we see in Iron now.

I scanned through the list of packages updated in the last sync and I see RMW, DDS, and client library stuff as well as TF2. Regressions here could explain the behavior. Nothing else in the list stood out to me, though I would have added Gazebo to that list if the sync included any updates there.

I know this isn't very specific about where or what, but this is a flag worth raising early 🎏

@SteveMacenski
Copy link
Member Author

Followup - I went through the commits in the time range in rclcpp and TF2 but nothing stood out to me, so I took a more emperical approach.

Building Humble's version of geometry2 (just selected something arbitrarily old but relatively new not to have API issues) in the Nav2 workspace seems to make the problems disappear. While Rviz and other tools are using the binary versions, Nav2 luckily is self contained enough that at least all the Nav2 modules can use that version of TF2 instead when built alongside it.

That seems to resolve the issues, which is welcome news that I don't need to dig into rclXXX or the RMW vendors.

As best I can tell, there's only 2 commits that seem like they might be causing an issue that have been in the ros2 branch after march:

I'm in the process of rebuilding geometry2 + nav2 without these to see if either is the culprit, but perhaps someone involved with those decisions (@clalancette) might have an educated guess looking at the recent commit history https://github.com/ros2/geometry2/commits/rolling

@SteveMacenski
Copy link
Member Author

OK, I've identified it. Compiling TF next to Nav2, I bisected commits from the last time I knew it was working properly for us 0.30.0 and now and found the behavior breaks when ros2/geometry2@3bac620 is introduced (and by extension ros2/geometry2@7afeb6c).

I'm not going to speculate as to why in particular, I don't know, but its without a doubt the cause. This commit breaks ABI so we can't just revert it and ship Iron if its something important. So the options are to revert or try to fix it before release, but I can't make the value call of that.

CC @roncapat, in case anything I've said here sounds familiar to you that you might know the reason for an issue to proposal a solution!

@clalancette it looks like you merged that PR so

@clalancette
Copy link
Collaborator

OK, I've identified it. Compiling TF next to Nav2, I bisected commits from the last time I knew it was working properly for us 0.30.0 and now and found the behavior breaks when ros2/geometry2@3bac620 is introduced (and by extension ros2/geometry2@7afeb6c).

Thanks for running this one to ground, that is really helpful. I'll comment more on ros2/geometry2#600

@Yadunund
Copy link
Member

@SteveMacenski is it safe to mark this test as a success?

@SteveMacenski
Copy link
Member Author

SteveMacenski commented May 19, 2023

I will not be able to complete this. I spent the 2 days I had allocated for Iron testing on debugging the TF issue and unfortunately I need to be conscious of my time

These are good to have around for next year, but unfortunately I just won't have time in the next 2 weeks to come back to this

@Yadunund
Copy link
Member

Understood. Thanks for update!

@clalancette
Copy link
Collaborator

Some of this got tested, which is good. Since we are out of the tutorial party now, I'm going to close this out as untested.

@clalancette clalancette closed this as not planned Won't fix, can't repro, duplicate, stale Jun 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants