
Bad gateway message fails some ci.jenkins.io builds #4204

Closed
MarkEWaite opened this issue Jul 30, 2024 · 18 comments

@MarkEWaite

Service(s)

ci.jenkins.io

Summary

https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with the following report:

13:20:22  Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project org.jenkins-ci:pipeline-steps-doc-generator:jar:1.0-SNAPSHOT
13:20:22  dependency: org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 (compile)
13:20:22  	Could not transfer artifact org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)
13:20:22  dependency: org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 (compile)
13:20:22  	Could not transfer artifact org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)

Reproduction steps

  1. Open the failing build and confirm that it failed due to a failure to resolve a dependency from https://repo.azure.jenkins.io/
MarkEWaite added the triage (Incoming issues that need review) label Jul 30, 2024
@basil
Collaborator

basil commented Jul 30, 2024

jenkinsci/acceptance-test-harness#1644 is failing with similar errors even after a retry

dduportal self-assigned this Jul 31, 2024
dduportal added this to the infra-team-sync-2024-08-13 milestone Jul 31, 2024
dduportal removed the triage (Incoming issues that need review) label Jul 31, 2024
@dduportal
Contributor

dduportal commented Jul 31, 2024

Thanks for raising this issue and for the details folks!
Datadog also indicates that ACP had issues between 06:00 pm UTC and 08:00 pm UTC yesterday (30 July 2024).

Checking the logs in Datadog shows there were a lot of HTTP/502 errors in that time window.
Each HTTP/502 error (651 of them, precisely) reported the following:

22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream

The errors are spread across the 2 ACP services:

@dduportal
Contributor

A few metrics collected for yesterday's time window:

Public ACP

  • Nodes:

    • Disk / net metrics clearly show a period of network activity due to ACP usage in the time window (both in and out), correlated with a peak of disk reads: this confirms it is ACP-related activity.
      (Screenshot 2024-07-31 at 11:07:08)

    • CPU/memory metrics show almost nothing (i.e. ACP performs well on these 2 metrics). Note: the "tiny" peak in requests is due to a service other than ACP, which was updated (rolling upgrade).
      (Screenshot 2024-07-31 at 11:07:29)

  • Pods metrics:

    • ACP alone clearly shows the same activity during the time window, with nominal CPU/memory usage:
      (Screenshot 2024-07-31 at 11:14:53)

    • Ingress metrics show 2 things:

      • Almost all of their outbound network rate is due to ACP transmitting data, while only about half of their inbound rate is passed on to ACP. Not sure what kind of traffic is not transmitted (hard to tell, other than it being roughly half of the net rate, which might be a low value).
      • The peak of network activity follows the same pattern as ACP.
        (Screenshot 2024-07-31 at 11:14:32)

@dduportal
Contributor

Private ACP

  • Pod metrics show the same time-window activity with a lower average rate (which makes sense, as only ci.jenkins.io container agents are using this one today: there are fewer container builds, but still a few)
    (Screenshot 2024-07-31 at 11:15:10)

  • Node metrics (excluding the ci.jenkins.io agent nodes) show the same behavior as for the Public ACP: the CPU/memory impact is close to zero, but we clearly see a network rate correlated with this time window (which is expected).

    (Screenshot 2024-07-31 at 11:42:49)
    (Screenshot 2024-07-31 at 11:43:07)

@dduportal
Contributor

What to do from here:

  • The public ACP service is clearly impaired by 2 things:
    • The current publick8s outbound issue
    • The (shared) usage of the ingress

=> scaling it up won't change anything (shared resource for ingress and outbound)
=> we should find a solution to only use the private ACP and decommission the public ACP

  • The private ACP service had a few errors which need investigating (at the ACP/nginx level), and we have a short-term improvement to make:

@dduportal
Contributor

Now that #4206 has been fixed, the ACP in the cluster publick8s should behave better.

Next steps:

  • Migrate all ACP workload to the private (HTTP only) ACP. Requires setting up an internal LB, associating NSG rules with it, testing from VM agents, and setting up ci.jenkins.io
    • No more ingress in the middle, no more TLS, no more "public" access requirement
  • Deprovision the "public" ACP
    • Less money to spend on Azure CDF
    • Less impact on the public cluster with a stricter partition between public services and ci.jenkins.io itself
  • Watch the (private) ACP activity: if it also has errors, then we'll need to set up DNS caching inside the ci.jenkins.io-agents-1 cluster ([Feature] Support Node Local DNS Cache (AKS Local DNS implementation) Azure/AKS#3673)

@dduportal
Contributor

Update on the ACP private only:

  • I've successfully set up a temporary internal LB with an IP in the ci.jio VM agents subnet to reach the private ACP
    • This required creating a few Azure RM resources:

      • 2 role assignments (to allow the AKS identity to have Network Contributor on both the ephemeral agents and kubernetes subnets)
      • NSG rules (1 in and 1 out) to allow the ephemeral agents to reach the internal LB
    • Then I created the temporary LB with the following YAML:

      apiVersion: v1
      kind: Service
      metadata:
        name: acp
        namespace: artifact-caching-proxy
        annotations:
          service.beta.kubernetes.io/azure-load-balancer-internal: "true"
          service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
          service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
      spec:
        type: LoadBalancer
        ports:
          - name: http
            port: 8080
            protocol: TCP
            targetPort: http
        selector:
          app.kubernetes.io/instance: artifact-caching-proxy
          app.kubernetes.io/name: artifact-caching-proxy
    • Next problem will be to specify the DNS record with the allocated IP. I would like to define the IP somehow in Terraform and pass it to the kubernetes-management (through the ACP chart values with annotations)

Next steps:

  1. Persist the role assignments and NSG rules in Terraform
  2. Check if we can use a Private Link + endpoint defined in Terraform (with a DNS record) as per https://learn.microsoft.com/en-us/azure/aks/internal-lb?tabs=set-service-annotations#create-a-private-endpoint-to-the-private-link-service. At first sight, it looks like the Kubernetes Service takes care of creating and managing the PLS; I need to try creating one in Terraform and specifying it to see if it reconciles or not
  3. Once we have a PLS or a static IP, set up the DNS record (and update the NSG rules if needed, should the IP have changed from step 1)
  4. Set up the ci.jio Azure VM agents to use this new DNS name with HTTP and port 8080 (instead of the https public ACP)
  5. Verify it works. If yes, then deprovision the public ACP from publick8s

@dduportal
Contributor

Update:

@dduportal
Contributor

Update:

  • I failed to solve the chicken-and-egg problem between Terraform and Kubernetes Management:
    • With a PLS (and LB) managed by the Kubernetes Service, Terraform needs to specify the Azure RM permissions (before creating the LB) and the Private Endpoint with DNS configuration and NSG rules (but after LB creation*). If the LB changes its name/setup on Kubernetes, then Terraform will start failing because the Data source won't be updated.
      • I was successful though, with a PLS in the Kubernetes Node Resource group (MC..... RG) along with its NIC, and defining in Terraform a PLS data source, an associated endpoint in the other subnet, a private DNS A record in the private DNS zone of the Vnet, and NSGs. This setup could be useful if we ever need to access this ACP from other vnets in Azure.
    • With a simple "internal LB", we still need Terraform to create the DNS record and NSG rules. And installing external-dns only to manage a private record looks like overkill.

Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and Kubernetes Management.
=> the constraints for selecting the proper IP are:

  • Make sure it is in the same subnet as the ci.jio VM agents (easy peasy: we have the CIDR!)
  • Make sure it is available: I chose the antepenultimate IP of the CIDR, as the Azure VM Jenkins plugin tends to select available IPs from the lower part of the CIDR range. It's not a strict guarantee, but the probability of getting an IP in the upper range is low.
    • Why not the last one? Long-standing networking habit of allocating the last IP of a range to an appliance...
      => This pattern makes it easy (no PLS, no NIC, no private endpoint, etc.). A sketch of the resulting Service is shown below.
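
To illustrate, here is a minimal sketch of what the Service could look like once the Terraform-chosen IP is pinned. The IP value below is a placeholder (not the actual allocated address) and the exact chart-values wiring may differ:

  apiVersion: v1
  kind: Service
  metadata:
    name: acp
    namespace: artifact-caching-proxy
    annotations:
      service.beta.kubernetes.io/azure-load-balancer-internal: "true"
      service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
      service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
  spec:
    type: LoadBalancer
    # Placeholder value: the antepenultimate IP of the agents subnet CIDR, as chosen in Terraform
    loadBalancerIP: 10.0.0.253
    ports:
      - name: http
        port: 8080
        protocol: TCP
        targetPort: http
    selector:
      app.kubernetes.io/instance: artifact-caching-proxy
      app.kubernetes.io/name: artifact-caching-proxy

The same pinned IP would then be referenced by the Terraform-managed DNS record and NSG rules.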

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 9, 2024
…etup to reach ACP in the ci.jenkins.io-agents1 cluster (#798)

Related to
jenkins-infra/helpdesk#4204 (comment)

This PR introduces the following changes to allow ci.jenkins.io VM
agents to access the private ACP in the `ci.jenkins.io-agents1` AKS
cluster (instead of the azure.repo.jenkins.io ACP in the `publick8s`
cluster):

- Allow the AKS cluster identity to manage Network on the whole Vnet (as
per the Azure documentation - see comment)
- Required to create the LB and NIC in both subnets. We could restrict it a bit
more, but it wouldn't really protect us.
- Create private DNS records in the private DNS zone of the
ci.jenkins.io vnet to point to the internal ACP LB.
- Note: I moved the 2 existing DNS records close to this one. Only
visual.
- Add NSG in/out rules in the ci.jenkins.io ephemeral (VM) agents
subnet to allow HTTP requests on port `8080` of the internal ACP
load balancer
- Update shared tools
  - Usual "keep up to date"
- Generate an infra report for reports.jenkins.io to export the private
IP. It will allow us to automate the Kubernetes Service LB annotations


----

Testing: I applied these changes manually and verified it's working by
creating an additional LB with the YAML below on the AKS cluster.
Then I was able to issue curl requests to ACP using the DNS name on port
`8080` \o/

Finally: clean up all of these (both Terraform and AKS) resources.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor

Update: started implementation after a successful manual test.

@dduportal
Contributor

Update:

=> Tests in progress, let's wait 2 days to see the results before deprovisioning the public ACP

@dduportal
Contributor

Update: more changes

=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151

Next steps:

  • Fix ACI configuration to use private network (prototyped manually and it worked)
    • Requires a dedicated subnet to ensure delegation to ACI
  • Cleanup public ACP

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 10, 2024
Related to
jenkins-infra/helpdesk#4204 (comment)

This PR sets up the required Azure Entra permissions and NSG rules to
allow ACI agents of ci.jenkins.io to run with a private IP in their
dedicated subnet.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor

Update:

=> Now the (private) ACP is in use, but I was able to reproduce the dreaded (110: Operation timed out) error quite quickly with the new workload (example: https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/257/).

It should be improved by jenkins-infra/kubernetes-management#5525 (I did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from its local DNS cache) but also keeps 9.9.9.9 as a fallback.
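
For reference (this is not the exact change in that PR, just a sketch of the general approach), on a Kubernetes pod spec the idea boils down to keeping the in-cluster CoreDNS as the primary resolver and appending 9.9.9.9 so it only acts as a fallback:

  apiVersion: v1
  kind: Pod
  metadata:
    name: acp-dns-example        # illustrative name only
  spec:
    dnsPolicy: ClusterFirst      # resolve through the in-cluster CoreDNS first (and benefit from its cache)
    dnsConfig:
      nameservers:
        - 9.9.9.9                # appended to the pod's resolv.conf as a fallback resolver
    containers:
      - name: main
        image: nginx:stable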

@dduportal
Contributor

Let's see the results after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we can see how the new DNS setup behaves.

I saw impressive results (Linux build from 50s down to 30s) on the jenkins-infra-team plugin, but it is not really a real-life use case.

We'll check the errors in the logs (Datadog) and I'll look into adding an alert for when such errors occur.
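
As a sketch of what such an alert could look like (assuming the Datadog Operator and its DatadogMonitor CRD are available in the cluster, and that ACP logs carry a service:artifact-caching-proxy tag; both are assumptions, the actual monitor may well be created differently):

  apiVersion: datadoghq.com/v1alpha1
  kind: DatadogMonitor
  metadata:
    name: acp-upstream-errors          # hypothetical resource name
    namespace: artifact-caching-proxy
  spec:
    name: "ACP upstream errors (HTTP/502 or upstream timed out)"
    type: "log alert"
    # Hypothetical log query: counts ACP error-level logs over the last 15 minutes
    query: 'logs("service:artifact-caching-proxy status:error").index("*").rollup("count").last("15m") > 0'
    message: "ACP reported upstream errors in the last 15 minutes, please check the ACP pods and the Artifactory upstream."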

@dduportal
Contributor

Update:

  • In the past 48h, the (private) ACP logs show 5 individual errors for ~4M successful requests. The ratio is way better than it used to be.

    • 4 errors are HTTP/503 errors on the upstream (i.e. Artifactory) side. They all happened after an HTTP/302 redirect. => these errors are not on our side, alas.
    • But we still had one upstream timed out (110: Operation timed out) while connecting to upstream error though.
    • Note: we have a huge amount of warning messages (~2M) about buffered-to-file responses (an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream). While expected for huge files bigger than the memory buffer window, it could be interesting to avoid writing these to disk. A nice-to-have improvement?
  • Given the good ratio, let's clean up the public ACP resources (as there is no need to go back):

While this is already an improvement, I still feel there is room for more:
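
For instance, for the buffered-to-file warnings noted above, the nginx proxy buffering could be tuned. A minimal sketch, assuming the ACP nginx configuration can be extended through a ConfigMap (the actual chart wiring may differ):

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: acp-nginx-buffering          # hypothetical name
    namespace: artifact-caching-proxy
  data:
    buffering.conf: |
      # Keep more of the upstream response in memory before spilling to disk
      proxy_buffers 16 1m;
      proxy_busy_buffers_size 2m;
      # Or disable temp files entirely (nginx then forwards data at the pace the client reads it)
      # proxy_max_temp_file_size 0;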

@dduportal
Contributor

For info: #4241

@dduportal
Contributor

Update:
