Revert "Unique Client GID for Service Introspectino Event. (#779)" #780
Conversation
This reverts commit 565bbc2.
@Blast545 sorry about this. btw, i cannot see the failure links that you posted here, could you check that? Note: CI was all green on the PR, see #779 (comment).
@Blast545 RHEL CI looks green?
This is the investigation I was running: osrf/buildfarm-tools#97 (comment). The jobs in https://ci.ros2.org/job/nightly_linux-rhel_debug/ have been getting stuck reliably since 565bbc2 and ros2/rosbag2@cd7bd63 got merged. I'm not sure the issue is directly related to this; I wasn't able to reproduce it without the revert PR. I will try to reproduce it before taking any action.
So it does indeed look like #779 is the cause of the problem, though it was somewhat tricky to determine since it doesn't always fail. What I did was to look at when this started failing on nightly_linux-rhel_debug, which is the job that fails most reliably. The last time that job was successful was https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_debug/2050/. The next job (https://ci.ros2.org/view/nightly/job/nightly_linux-rhel_debug/2051/) was the first one that started "hanging" forever in `test_launch_ros`.

So the first thing I did was to extract the ros2.repos used in 2050 and 2051. I then ran a test job for each of them, to make sure I could reproduce the issue:
So it does appear I can reproduce the issue. I then diffed the ros2.repos between 2050 and 2051, and came up with the following diff:
As you can see, exactly 3 repositories were updated between those two builds. From there, I created three ros2.repos files: for each, I copied the working ros2.repos from 2050 and updated just a single one of the three repositories. Then I ran a CI job with each of them. That led to the following jobs:
As you can see from those jobs, only the job with the rmw_fastrtps update hung (the other two finished). Thus, the conclusion is that PR #779 caused this problem, and this revert should fix it. With that in mind, I'm going to convert this to a real PR, run regular CI on it, and go ahead and merge it in. When we go to put #779 back in, we should be certain to run full CI on RHEL debug before merging, and figure out what the heck is going on with it. I'm also going to open a PR to test_launch_ros so it will eventually time out.
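For anyone repeating this kind of bisection, the diff step between the two snapshots can be sketched roughly as below. This is a minimal sketch, assuming the standard ros2.repos layout (a top-level `repositories` mapping of name to `type`/`url`/`version`); the snapshot file names are hypothetical placeholders, not files from the CI jobs.

```python
# Minimal sketch: list repositories whose pinned version changed between
# two ros2.repos snapshots. Requires PyYAML. The file names are
# hypothetical placeholders for the repos files extracted from CI.
import yaml

def load_repos(path):
    """Return the name -> entry mapping from a ros2.repos file."""
    with open(path) as f:
        return yaml.safe_load(f)["repositories"]

old = load_repos("ros2_2050.repos")  # last good nightly build
new = load_repos("ros2_2051.repos")  # first hanging nightly build

for name in sorted(set(old) | set(new)):
    before = old.get(name, {}).get("version")
    after = new.get(name, {}).get("version")
    if before != after:
        print(f"{name}: {before} -> {after}")
```

Each changed repository then gets its own copy of the known-good file with only that one `version` bumped, which is the narrowing step described above.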
according to the facts from #780 (comment), i agree to revert this. lgtm with green CI.
i will try to look into the cause of this before the next try.
- The problem is not reproducible in an Ubuntu environment, regardless of `--cmake-args -DCMAKE_BUILD_TYPE=Debug`. `test_communication` and `test_launch_ros` pass without any error.
- i thought that `test_communication` (which runs before the hang-up in `test_launch_ros`) might leave some test logic behind that affects the following tests. but `test_communication` does not have any cross-vendor service request/response test at all; the reason i checked this is that "Unique Client GID for Service Introspectino Event." #779 only affects service request and response communication (see the sketch below).
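For context, the kind of service request/response round trip that #779 touches looks roughly like the rclpy sketch below. It uses the stock `example_interfaces/AddTwoInts` service purely as an illustration; it is not an actual `test_communication` test case, and it exercises a single RMW rather than a cross-vendor pair.

```python
# Minimal sketch of a service request/response round trip, the kind of
# traffic #779 affects. Illustration only, using the stock
# example_interfaces service; not an actual test_communication test.
import rclpy
from rclpy.node import Node
from example_interfaces.srv import AddTwoInts

rclpy.init()
node = Node("service_roundtrip_demo")

# Server side: echo the sum back to the caller.
def handle(request, response):
    response.sum = request.a + request.b
    return response

node.create_service(AddTwoInts, "add_two_ints", handle)

# Client side: send one request and spin until the response arrives.
client = node.create_client(AddTwoInts, "add_two_ints")
client.wait_for_service()
future = client.call_async(AddTwoInts.Request(a=2, b=3))
rclpy.spin_until_future_complete(node, future)
print(future.result().sum)  # -> 5

node.destroy_node()
rclpy.shutdown()
```

Per its title, #779 changed the client GID that rmw_fastrtps attaches to service introspection events, so only this request/response path is plausibly affected.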
Unfortunately, that's not the behavior we saw on the CI farm. In particular, if you look at the history of https://ci.ros2.org/view/nightly/job/nightly_linux_debug/ (which is Ubuntu in our case), you can see that in the last week, 3 of the 7 builds had to be aborted because they were stuck in `test_launch_ros`.
Despite the fact that Windows failed, I'm fairly certain it has nothing to do with this PR. Given how badly this is hurting overnight CI, I'm going to merge this right now anyway. If Windows comes back with another failure, then we can look deeper into what is happening.
We believe this PR might be related to some consistent hangs we have found in the RHEL buildfarm jobs; we will try to reproduce them before considering merging this PR.