
Bad gateway message fails some ci.jenkins.io builds #4204

Closed
MarkEWaite opened this issue Jul 30, 2024 · 18 comments

@MarkEWaite

Service(s)

ci.jenkins.io

Summary

https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with the following report:

13:20:22  Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project org.jenkins-ci:pipeline-steps-doc-generator:jar:1.0-SNAPSHOT
13:20:22  dependency: org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 (compile)
13:20:22  	Could not transfer artifact org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)
13:20:22  dependency: org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 (compile)
13:20:22  	Could not transfer artifact org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)

Reproduction steps

  1. Open the failing build and confirm that it failed due to a failure to resolve a dependency from https://repo.azure.jenkins.io/
MarkEWaite added the triage (Incoming issues that need review) label Jul 30, 2024
@basil
Collaborator

basil commented Jul 30, 2024

jenkinsci/acceptance-test-harness#1644 is failing with similar errors even after a retry

dduportal self-assigned this Jul 31, 2024
dduportal added this to the infra-team-sync-2024-08-13 milestone Jul 31, 2024
dduportal removed the triage (Incoming issues that need review) label Jul 31, 2024
@dduportal
Contributor

dduportal commented Jul 31, 2024

Thanks for raising this issue and for the details folks!
Datadog also indicates that ACP had issues between 06:00 pm UTC and 08:00 pm UTC yesterday (30 July 2024).

Checking the logs in Datadog shows there were a lot of HTTP/502 errors in that time window.
Each HTTP/502 error (651 of them, precisely) reported the following:

22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream

The errors are spread across the 2 ACP services:

@dduportal
Contributor

A few metrics collected for yesterday's time window:

Public ACP

  • Nodes:

    • Disk / net metrics clearly show a period of network activity due to ACP usage in the time window (both in and out), correlated with a peak of disk reads: this confirms it is ACP-related activity.
      (Screenshot 2024-07-31 at 11:07:08)

    • CPU/memory metrics show almost nothing (i.e. ACP performs well on these 2 metrics). Note: the "tiny" peak in requests is due to a service other than ACP, which was updated (rolling upgrade).
      (Screenshot 2024-07-31 at 11:07:29)

  • Pods metrics:

    • ACP alone clearly shows the same activity during the time window, with nominal CPU/memory usage:
      (Screenshot 2024-07-31 at 11:14:53)

    • Ingress metrics show 2 things:

      • Almost all of their outbound network rate is due to ACP transmitting data, while only about half of their inbound rate is passed on to ACP. Not sure what kind of traffic is not transmitted (hard to tell, other than it being roughly half of the net rate, which might be a low value).
      • The peak of network activity follows the same pattern as ACP.
        (Screenshot 2024-07-31 at 11:14:32)

@dduportal
Contributor

Private ACP

  • Pod metrics show the same time-window activity with a lower average rate (which makes sense, as only ci.jenkins.io container agents are using this one today: there are fewer container builds, but still a few)
    (Screenshot 2024-07-31 at 11:15:10)

  • Node metrics (excluding the ci.jenkins.io agent nodes) show the same behavior as for the Public ACP: the CPU/memory impact is close to zero, but we clearly see a network rate correlated with this time window (which is expected).

    (Screenshot 2024-07-31 at 11:42:49)
    (Screenshot 2024-07-31 at 11:43:07)

@dduportal
Contributor

What to do from here:

  • The public ACP service is clearly impaired by 2 things:
    • The current publick8s outbound issue
    • The (shared) usage of the ingress

=> scaling it up won't change anything (shared resource for ingress and outbound)
=> we should find a solution to only use the private ACP and decommission the public ACP

  • The private ACP service had a few errors which need investigating (at the ACP/nginx level), and we have a short-term improvement to make:

@dduportal
Contributor

Now that #4206 has been fixed, the ACP in the cluster publick8s should behave better.

Next steps:

  • Migrate all ACP workload to the private (HTTP only) ACP. Requires setting up an internal LB, associating NSG rules with it, testing from VM agents, and setting up ci.jenkins.io
    • No more ingress in the middle, no more TLS, no more "public" access requirement
  • Deprovision the "public" ACP
    • Less money to spend on Azure CDF
    • Less impact on the public cluster with a stricter partition between public services and ci.jenkins.io itself
  • Watch the (private) ACP activity: if it also has errors, then we'll need to set up DNS caching inside the ci.jenkins.io-agents-1 cluster ([Feature] Support Node Local DNS Cache (AKS Local DNS implementation) Azure/AKS#3673)

@dduportal
Contributor

Update on the ACP private only:

  • I've successfully set up a temporary internal LB with an IP in the ci.jio VM agents subnet to reach the private ACP
    • This required creating a few Azure RM resources:

      • 2 role assignments (to allow the AKS identity to have Network Contributor on both the ephemeral agents and kubernetes subnets)
      • NSG rules (1 in and 1 out) to allow the ephemeral agents to reach the internal LB
    • Then I created the temporary LB with the following YAML:

      apiVersion: v1
      kind: Service
      metadata:
        name: acp
        namespace: artifact-caching-proxy
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-internal: "true"
          service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
          service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
      spec:
        type: LoadBalancer
        ports:
          - name: http
            port: 8080
            protocol: TCP
            targetPort: http
        selector:
          app.kubernetes.io/instance: artifact-caching-proxy
          app.kubernetes.io/name: artifact-caching-proxy
    • Next problem will be to specify the DNS record with the allocated IP. I would like to define the IP somehow in Terraform and pass it to the kubernetes-management (through the ACP chart values with annotations)

Next steps:

  1. Persist the role assignments and NSG rules in Terraform
  2. Check if we can use a Private Link + endpoint defined in Terraform (with a DNS record) as per https://learn.microsoft.com/en-us/azure/aks/internal-lb?tabs=set-service-annotations#create-a-private-endpoint-to-the-private-link-service. At first sight, it looks like the Kubernetes Service takes care of creating and managing the PLS; I need to try creating one in Terraform and specifying it to see if it reconciles or not
  3. Once we have a PLS or a static IP, set up the DNS record (and update the NSG rules if needed, should the IP have changed from step 1)
  4. Set up the ci.jio Azure VM agents to use this new DNS name with HTTP and port 8080 (instead of the https public ACP)
  5. Verify it works. If yes, then deprovision the public ACP from publick8s

@dduportal
Contributor

Update:

@dduportal
Contributor

Update:

  • I failed to solve the chicken-and-egg problem between Terraform and Kubernetes Management:
    • With a PLS (and LB) managed by the Kubernetes Service, Terraform needs to specify the Azure RM permissions (before creating the LB) and the Private Endpoint with DNS configuration and NSG rules (but after LB creation*). If the LB changes its name/setup on Kubernetes, then Terraform will start failing because the Data source won't be updated.
      • I was successful though, with a PLS in the Kubernetes Node Resource group (MC..... RG) along with its NIC, and defining in Terraform a PLS data source, an associated endpoint in the other subnet, a private DNS A record in the private DNS zone of the Vnet, and NSGs. This setup could be useful if we ever need to access this ACP from other vnets in Azure.
    • With a simple "internal LB", we still need Terraform to create the DNS record and NSG rules. And installing external-dns only to manage a private record looks like overkill.

Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and Kubernetes Management.
=> the constraints for selecting the proper IP are:

  • Make sure it is in the same subnet as the ci.jio VM agents (easy peasy: we have the CIDR!)
  • Make sure it is available: I chose the antepenultimate IP of the CIDR, as the Azure VM Jenkins plugin tends to select available IPs from the lower part of the CIDR range. It's not a strict guarantee, but the probability of getting an IP in the upper range is low.
    • Why not the last one? Long-standing networking habit of allocating the last IP of a range to an appliance...
      => This pattern makes it easy (no PLS, no NIC, no private endpoint, etc.). A sketch of the resulting Service is shown below.
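
To illustrate, here is a minimal sketch of what the Service could look like once the Terraform-chosen IP is pinned. The IP value below is a placeholder (not the actual allocated address) and the exact chart-values wiring may differ:

  apiVersion: v1
  kind: Service
  metadata:
    name: acp
    namespace: artifact-caching-proxy
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
      service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
  spec:
    type: LoadBalancer
    # Placeholder value: the antepenultimate IP of the agents subnet CIDR, as chosen in Terraform
    loadBalancerIP: 10.0.0.253
    ports:
      - name: http
        port: 8080
        protocol: TCP
        targetPort: http
    selector:
      app.kubernetes.io/instance: artifact-caching-proxy
      app.kubernetes.io/name: artifact-caching-proxy

The same pinned IP would then be referenced by the Terraform-managed DNS record and NSG rules.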

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 9, 2024
…etup to reach ACP in the ci.jenkins.io-agents1 cluster (#798)

Related to
jenkins-infra/helpdesk#4204 (comment)

This PR introduces the following changes to allow ci.jenkins.io VM
agents to access the private ACP in the `ci.jenkins.io-agents1` AKS
cluster (instead of the azure.repo.jenkins.io ACP in the `publick8s`
cluster):

- Allow the AKS cluster identity to manage Network on the whole Vnet (as
per the Azure documentation - see comment)
- Required to create the LB and NIC in both subnets. We could restrict it a bit
more, but it wouldn't really protect us.
- Create private DNS records in the private DNS zone of the
ci.jenkins.io vnet to point to the internal ACP LB.
- Note: I moved the 2 existing DNS records close to this one. Only
visual.
- Add NSG in/out rules in the ci.jenkins.io ephemeral (VM) agents
subnet to allow HTTP requests on port `8080` of the internal ACP
load balancer
- Update shared tools
  - Usual "keep up to date"
- Generate an infra report for reports.jenkins.io to export the private
IP. It will allow us to automate the Kubernetes Service LB annotations


----

Testing: I applied these changes manually and verified it's working by
creating an additional LB with the YAML below on the AKS cluster.
Then I was able to issue curl requests to ACP using the DNS name on port
`8080` \o/

Finally: clean up all of these (both Terraform and AKS) resources.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor

Update: started implementation after a successful manual test.

@dduportal
Contributor

Update:

=> Tests in progress, let's wait 2 days to see the results before deprovisioning the public ACP

@dduportal
Contributor

Update: more changes

=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151

Next steps:

  • Fix ACI configuration to use private network (prototyped manually and it worked)
    • Requires a dedicated subnet to ensure delegation to ACI
  • Cleanup public ACP

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 10, 2024
Related to
jenkins-infra/helpdesk#4204 (comment)

This PR sets up the required Azure Entra permissions and NSG rules to
allow ACI agents of ci.jenkins.io to run with a private IP in their
dedicated subnet.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor

Update:

=> Now the (private) ACP is in use, but I was able to reproduce the dreaded (110: Operation timed out) error quite quickly with the new workload (example: https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/257/).

It should be improved by jenkins-infra/kubernetes-management#5525 (I did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from its local DNS cache) but also keeps 9.9.9.9 as a fallback.
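
For reference (this is not the exact change in that PR, just a sketch of the general approach), on a Kubernetes pod spec the idea boils down to keeping the in-cluster CoreDNS as the primary resolver and appending 9.9.9.9 so it only acts as a fallback:

  apiVersion: v1
  kind: Pod
  metadata:
    name: acp-dns-example        # illustrative name only
  spec:
    dnsPolicy: ClusterFirst      # resolve through the in-cluster CoreDNS first (and benefit from its cache)
    dnsConfig:
      nameservers:
        - 9.9.9.9                # appended to the pod's resolv.conf as a fallback resolver
    containers:
      - name: main
        image: nginx:stable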

@dduportal
Contributor

Let's see the results after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we can see how the new DNS setup behaves.

I saw impressive results (Linux build from 50s down to 30s) on the jenkins-infra-team plugin, but it is not really a real-life use case.

We'll check the errors in the logs (Datadog) and I'll look into adding an alert for when such errors occur.
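
As a sketch of what such an alert could look like (assuming the Datadog Operator and its DatadogMonitor CRD are available in the cluster, and that ACP logs carry a service:artifact-caching-proxy tag; both are assumptions, the actual monitor may well be created differently):

  apiVersion: datadoghq.com/v1alpha1
  kind: DatadogMonitor
  metadata:
    name: acp-upstream-errors          # hypothetical resource name
    namespace: artifact-caching-proxy
  spec:
    name: "ACP upstream errors (HTTP/502 or upstream timed out)"
    type: "log alert"
    # Hypothetical log query: counts ACP error-level logs over the last 15 minutes
    query: 'logs("service:artifact-caching-proxy status:error").index("*").rollup("count").last("15m") > 0'
    message: "ACP reported upstream errors in the last 15 minutes, please check the ACP pods and the Artifactory upstream."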

@dduportal
Contributor

Update:

  • In the past 48h, the (private) ACP logs show 5 individual errors for ~4M successful requests. The ratio is way better than it used to be.

    • 4 errors are HTTP/503 errors on the upstream (i.e. Artifactory) side. They all happened after an HTTP/302 redirect. => these errors are not on our side, alas.
    • But we still had one upstream timed out (110: Operation timed out) while connecting to upstream error though.
    • Note: we have a huge amount of warning messages (~2M) about buffered-to-file responses (an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream). While expected for huge files bigger than the memory buffer window, it could be interesting to avoid writing these to disk. A nice-to-have improvement?
  • Given the good ratio, let's clean up the public ACP resources (as there is no need to go back):

While this is already an improvement, I still feel there is room for more:
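
For instance, for the buffered-to-file warnings noted above, the nginx proxy buffering could be tuned. A minimal sketch, assuming the ACP nginx configuration can be extended through a ConfigMap (the actual chart wiring may differ):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: acp-nginx-buffering          # hypothetical name
    namespace: artifact-caching-proxy
  data:
    buffering.conf: |
      # Keep more of the upstream response in memory before spilling to disk
      proxy_buffers 16 1m;
      proxy_busy_buffers_size 2m;
      # Or disable temp files entirely (nginx then forwards data at the pace the client reads it)
      # proxy_max_temp_file_size 0;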

@dduportal
Contributor

For info: #4241

@dduportal
Contributor

Update:
