
Revert "Unique Client GID for Service Introspectino Event. (#779)" #780

Merged: 1 commit merged into rolling from blast545/revert_779 on Oct 1, 2024

Conversation

@Blast545 (Contributor)

This reverts commit 565bbc2.

We believe the reverted PR might be related to some consistent hangs we have found in the RHEL buildfarm jobs. We will try to reproduce the issue before considering merging this revert.

@Blast545 (Contributor Author)

Reference job:
Build Status

Job with this PR:
Build Status

@fujitatomoya (Collaborator) commented Sep 27, 2024

@Blast545 Sorry about this.

By the way, I cannot see the failure links that you posted here. Could you check that?

Note: CI was all green with that PR; see #779 (comment).

@fujitatomoya (Collaborator)

@Blast545 RHEL CI looks green?

@Blast545 (Contributor Author)

This is the investigation I was running:

osrf/buildfarm-tools#97 (comment). The jobs in https://ci.ros2.org/job/nightly_linux-rhel_debug/ have been getting stuck reliably since 565bbc2 and ros2/rosbag2@cd7bd63 got merged.

I'm not sure that the issue is directly related to this change; I wasn't able to reproduce it without the revert PR. I will try to reproduce it before taking any action.
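For anyone trying to reproduce this kind of hang locally, a minimal sketch is to re-run the suspect test suite in a loop with a hard timeout. This is only an illustration, not the procedure used on the buildfarm: it assumes an already-built ROS 2 workspace with colcon on PATH, and the package selection, attempt count, and timeout are placeholders.

```python
# reproduce_hang.py: re-run a possibly hanging test suite with a hard timeout.
# Assumes a built ROS 2 workspace with colcon on PATH; the package name,
# attempt count, and timeout below are illustrative only.
import subprocess
import sys

ATTEMPTS = 20
TIMEOUT_S = 600  # treat anything slower than 10 minutes as a hang

for attempt in range(1, ATTEMPTS + 1):
    print(f"attempt {attempt}/{ATTEMPTS}")
    try:
        subprocess.run(
            ["colcon", "test", "--packages-select", "test_launch_ros"],
            timeout=TIMEOUT_S,
            check=False,
        )
    except subprocess.TimeoutExpired:
        print("test run exceeded the timeout; hang likely reproduced")
        sys.exit(1)

print(f"no hang observed in {ATTEMPTS} attempts")
```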

@clalancette (Contributor)

So it does indeed look like #779 is the cause of the problem, though it was somewhat tricky to determine since it doesn't always fail.

What I did was to look at when this started failing on nightly_linux-rhel_debug, which is the job that fails most reliably. The last time that job was successful was https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_debug/2050/ . The next job (https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_debug/2051/) was the first one that started "hanging" forever in test_launch_ros.

So the first thing I did was to extract the ros2.repos used in 2050 and 2051. I then ran a test job for each of them, to make sure I could reproduce the issue:

So it does appear I can reproduce the issue. I then diffed the ros2.repos between 2050 and 2051, and came up with the following diff:

--- ros2.repos-working	2024-09-30 08:19:08.884655260 -0400
+++ ros2.repos-broken	2024-09-30 08:20:12.064624051 -0400
@@ -282,7 +282,7 @@
   ros2/rclcpp:
     type: git
     url: https://github.com/ros2/rclcpp.git
-    version: a78d0cbd33b8fe0b4db25c04f7e10017bfca6061
+    version: 63105cd8a6edb6906efa71c7cf56cd56d0bccdda
   ros2/rclpy:
     type: git
     url: https://github.com/ros2/rclpy.git
@@ -318,7 +318,7 @@
   ros2/rmw_fastrtps:
     type: git
     url: https://github.com/ros2/rmw_fastrtps.git
-    version: fc66051e598c49d0f01665e33ee63d3ae83b1e2d
+    version: 565bbc22357413aa7342e9dd0d6518b71f7f10a6
   ros2/rmw_implementation:
     type: git
     url: https://github.com/ros2/rmw_implementation.git
@@ -342,7 +342,7 @@
   ros2/rosbag2:
     type: git
     url: https://github.com/ros2/rosbag2.git
-    version: fd8dd26f840c1634534607364a5dc71d41cf6265
+    version: cd7bd63696604973e23c739afa6387556f3e7781
   ros2/rosidl:
     type: git
     url: https://github.com/ros2/rosidl.git

As you can see, there are exactly three repositories that were updated between those two builds. From there, I created three ros2.repos files: each one copied the working ros2.repos from 2050 and updated just a single one of those repositories. I then ran a CI job with each of them, which led to the following jobs:

As you can see from those jobs, only the one with the rmw_fastrtps update hung (the other two finished).
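For reference, the per-repository bisection described above could be scripted roughly as follows. This is a sketch only: it assumes PyYAML is available and that the two input files are named as in the diff above; the output file names are illustrative.

```python
# make_variants.py: from a known-good ros2.repos and a broken one, write one
# variant per changed repository, each bumping only that single repository.
# Assumes PyYAML is installed; input/output file names are illustrative.
import yaml

with open("ros2.repos-working") as f:
    working = yaml.safe_load(f)
with open("ros2.repos-broken") as f:
    broken = yaml.safe_load(f)

changed = [
    name
    for name, entry in broken["repositories"].items()
    if entry.get("version") != working["repositories"].get(name, {}).get("version")
]
print("changed repositories:", changed)

for name in changed:
    variant = yaml.safe_load(yaml.safe_dump(working))  # cheap deep copy
    variant["repositories"][name] = broken["repositories"][name]
    out_path = "ros2.repos-" + name.replace("/", "_")
    with open(out_path, "w") as f:
        yaml.safe_dump(variant, f, default_flow_style=False)
    print("wrote", out_path)
```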

Thus, the conclusion is that PR #779 caused this problem, and this revert should fix it. With that in mind, I'm going to convert this to a real PR, run regular CI on it, and go ahead and merge it in. When we go to put #779 back in, we should be certain to run full CI on RHEL debug before merging, and figure out what the heck is going on with it. I'm also going to open a PR to test_launch_ros so that the test will eventually time out instead of hanging forever.
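The follow-up change to test_launch_ros isn't shown in this thread. As a generic illustration of the "eventually time out" idea, a step that can hang can be run in a child process and killed after a deadline; the timeout value and the placeholder test body below are assumptions, not the actual fix.

```python
# watchdog_example.py: fail fast instead of hanging forever.
# A generic sketch, not the actual test_launch_ros change; the timeout and
# the placeholder test body are illustrative only.
import multiprocessing
import sys
import time


def possibly_hanging_test():
    # Stand-in for the real test body that sometimes never returns.
    time.sleep(1)


if __name__ == "__main__":
    proc = multiprocessing.Process(target=possibly_hanging_test)
    proc.start()
    proc.join(timeout=120)  # give the test two minutes at most
    if proc.is_alive():
        proc.terminate()
        proc.join()
        print("test timed out; treating this as a failure")
        sys.exit(1)
    sys.exit(proc.exitcode or 0)
```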

@clalancette marked this pull request as ready for review on September 30, 2024 at 18:18
@clalancette (Contributor)

Pulls: #780
Gist: https://gist.githubusercontent.com/clalancette/cdaa2bf4c0dccc474d073f2626dca91b/raw/99312936bd7ab57bb6cbf8dacc35b59f215f9c87/ros2.repos
BUILD args: --packages-above-and-dependencies ament_cmake_core
TEST args: --packages-above ament_cmake_core
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/14632

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@fujitatomoya (Collaborator) left a comment


According to the facts from #780 (comment), I agree to revert this. LGTM with green CI.

I will try to look into the cause of this before the next attempt.

@clalancette (Contributor)

  • The problem is not reproducible in the Ubuntu environment, regardless of --cmake-args -DCMAKE_BUILD_TYPE=Debug.

Unfortunately, that's not the behavior we saw on the CI farm.

In particular, if you look at the history of https://ci.ros2.org/view/nightly/job/nightly_linux_debug/ (which is Ubuntu in our case), you can see that in the last week, 3 of the 7 builds had to be aborted because they were stuck in test_launch_ros. Whatever the problem is, it also affects Ubuntu, just not as often.

@Blast545 (Contributor Author) commented Oct 1, 2024

Retriggering the Windows build to make sure this won't break CI:

Build Status

@clalancette (Contributor)

Although the Windows build failed, I'm fairly certain the failure has nothing to do with this PR. Given how badly this is affecting overnight CI, I'm going to merge this right now anyway. If Windows comes back with another failure, then we can look deeper into what is happening.

@clalancette merged commit 6bdff0b into rolling on Oct 1, 2024 (3 checks passed)
@clalancette deleted the blast545/revert_779 branch on October 1, 2024 at 00:33