Bad gateway message fails some ci.jenkins.io builds #4204
Comments
jenkinsci/acceptance-test-harness#1644 is failing with similar errors even after a retry
Thanks for raising this issue and for the details, folks! Checking the logs in Datadog shows there were a lot of HTTP/502 responses in that time window.
The errors are spread across the 2 ACP services:
|
|
What to do from here:
=> scaling it up won't change anything (shared resource for ingress and outbound)
|
Now that #4206 has been fixed, the ACP in the cluster
Next steps:
|
Update on the ACP private only:
Next steps:
|
Update:
|
Update:
Proposal: let's specify the IP for the internal LB on Terraform side and feed it to both Terraform and Kubernetes Management.
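A minimal sketch of what this proposal could look like on the consuming side, assuming the Terraform side publishes the reserved IP as JSON. The report key (`acp_internal_lb_ip`), the service name, the selector, and the sample IP are all hypothetical. The two annotations are the ones AKS documents for an internal load balancer with a static IP, but they should be checked against the current Azure documentation. Port `8080` is the ACP port mentioned elsewhere in this thread.

```python
import ipaddress
import json

# Hypothetical report content: the real infra report layout is not shown in this thread.
REPORT_JSON = '{"acp_internal_lb_ip": "10.0.0.50"}'


def build_internal_lb_service(report_json: str, name: str = "acp-internal") -> dict:
    """Build a Kubernetes Service manifest (as a dict) for an internal Azure LB."""
    report = json.loads(report_json)
    ip = report["acp_internal_lb_ip"]  # assumed key name
    ipaddress.ip_address(ip)  # raises ValueError if the report holds a malformed IP
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "annotations": {
                # Ask Azure for an internal (private) load balancer.
                "service.beta.kubernetes.io/azure-load-balancer-internal": "true",
                # Pin the frontend IP to the one reserved on the Terraform side.
                "service.beta.kubernetes.io/azure-load-balancer-ipv4": ip,
            },
        },
        "spec": {
            "type": "LoadBalancer",
            "ports": [{"port": 8080, "targetPort": 8080, "protocol": "TCP"}],
            "selector": {"app": name},  # hypothetical selector
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_internal_lb_service(REPORT_JSON), indent=2))
```

With a single source for the IP, the DNS record and the load balancer frontend cannot drift apart, which is the point of feeding the same value to both Terraform and Kubernetes Management.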
|
…etup to reach ACP in the ci.jenkins.io-agents1 cluster (#798)
Related to jenkins-infra/helpdesk#4204 (comment)
This PR introduces the following changes to allow ci.jenkins.io VM agents to access the private ACP in the `ci.jenkins.io-agents1` AKS cluster (instead of the azure.repo.jenkins.io ACP in the `publick8s` cluster):
- Allow the AKS cluster identity to manage Network on the whole Vnet (as per the Azure documentation - see comment)
  - Required to create LB and NIC in both subnets. We could restrict it a bit more, but that wouldn't protect us.
- Create private DNS records in the private DNS zone of the ci.jenkins.io vnet to point to the internal ACP LB.
  - Note: I moved the 2 existing DNS records close to this one. Only visual.
- Add NSG in/out rules in the ci.jenkins.io ephemeral VM agents subnet to allow HTTP requests on port `8080` of the internal ACP load balancer
- Update shared tools
  - Usual "keep up to date"
- Generate an infra report for reports.jenkins.io to export the private IP. It will allow us to automate the Kubernetes Service LB annotations
----
Testing: I applied these changes manually and verified it's working by creating an additional LB with the YAML below on the AKS cluster. Then I was able to emit curl requests to ACP using the DNS on port `8080` \o/
Finally: clean up all of these (both Terraform and AKS) resources.
Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Update: started implementation after a successful manual test.
|
Update:
=> Tests in progress, let's wait 2 days to see the results before deprovisioning the public ACP
Update: more changes
=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151
Next steps:
|
Related to jenkins-infra/helpdesk#4204 (comment) This PR sets up the required Azure Entra permissions and NSG rules to allow ACI agents of ci.jenkins.io to run with a private IP in their dedicated subnet. Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Update:
=> Now the (private) ACP is used but I was able to reproduce the dreaded
It should be improved with jenkins-infra/kubernetes-management#5525 (did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from the local DNS cache) but also keeps using 9.9.9.9 as a fallback.
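The "local DNS first, 9.9.9.9 as a fallback" ordering can be illustrated with a small sketch. It only models the resolution order and is not the configuration from jenkins-infra/kubernetes-management#5525, which is not shown here. The resolver functions, hostnames, and addresses are hypothetical stand-ins so that the example runs without network access.

```python
from typing import Callable, Sequence

Resolver = Callable[[str], str]


def resolve_with_fallback(hostname: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first successful answer."""
    errors = []
    for resolver in resolvers:
        try:
            return resolver(hostname)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            errors.append(exc)
    raise OSError(f"all resolvers failed for {hostname!r}: {errors}")


def cluster_dns(hostname: str) -> str:
    """Stand-in for the cluster-local DNS (hypothetical records)."""
    records = {"acp.internal.example": "10.0.0.50"}
    if hostname not in records:
        raise OSError(f"cluster DNS: no record for {hostname}")
    return records[hostname]


def public_dns(hostname: str) -> str:
    """Stand-in for a public resolver used as a fallback (hypothetical records)."""
    records = {"repo.example.org": "203.0.113.10"}
    if hostname not in records:
        raise OSError(f"public DNS: no record for {hostname}")
    return records[hostname]


if __name__ == "__main__":
    order = [cluster_dns, public_dns]
    print(resolve_with_fallback("acp.internal.example", order))  # answered locally
    print(resolve_with_fallback("repo.example.org", order))      # answered by the fallback
```

Putting the local resolver first lets cached and cluster-internal names resolve without leaving the cluster, while the fallback keeps builds working if the local resolver cannot answer.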
Let's see the result after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we'll see how the new DNS setup behaves. I saw impressive results (Linux build from 50s to 30s) on the plugin
We'll check the errors in the logs (Datadog) and I'll look into adding an alert system for when such errors occur.
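As a rough illustration of the alerting idea, the sketch below counts HTTP 502 responses per service in a time window and flags the services that reach a threshold. The record shape (`time`, `service`, `status`), the service names, and the threshold are hypothetical. It does not use the Datadog API, where an equivalent would more likely be defined as a monitor.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List


def count_502_by_service(records: Iterable[dict], start: datetime, window: timedelta) -> Counter:
    """Count HTTP 502 responses per service within [start, start + window)."""
    end = start + window
    counts: Counter = Counter()
    for rec in records:
        if rec["status"] == 502 and start <= rec["time"] < end:
            counts[rec["service"]] += 1
    return counts


def services_over_threshold(counts: Counter, threshold: int) -> List[str]:
    """Return the services whose 502 count reaches the threshold, sorted by name."""
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    # Hypothetical sample records; real log entries will look different.
    sample = [
        {"time": t0 + timedelta(minutes=1), "service": "acp-a", "status": 502},
        {"time": t0 + timedelta(minutes=2), "service": "acp-a", "status": 502},
        {"time": t0 + timedelta(minutes=3), "service": "acp-b", "status": 200},
        {"time": t0 + timedelta(minutes=4), "service": "acp-b", "status": 502},
        {"time": t0 + timedelta(hours=2), "service": "acp-b", "status": 502},  # outside window
    ]
    counts = count_502_by_service(sample, t0, timedelta(hours=1))
    print(dict(counts))
    print(services_over_threshold(counts, threshold=2))
```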
Update:
While it is an improvement, I still feel there is room for more:
|
For info: #4241
Related to jenkins-infra/helpdesk#4204 Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
Update:
|
Service(s)
ci.jenkins.io
Summary
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with a report
Reproduction steps