
Dockerhub rate limit broke the www.jenkins.io CI build #4192

Closed
MarkEWaite opened this issue Jul 23, 2024 · 33 comments · Fixed by jenkinsci/acceptance-test-harness#1645

@MarkEWaite

MarkEWaite commented Jul 23, 2024

Service(s)

ci.jenkins.io

Summary

The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:

You have reached your pull rate limit.
You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

I've restarted the build in hopes that it will not hit the rate limit.

Reproduction steps

  1. Open the ci.jenkins.io job and review the log file
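
For reference, a quick way to check how many anonymous pulls remain for the current outbound IP is to read the ratelimit headers from the Docker Hub registry, as documented by Docker (a minimal sketch, assuming curl and jq are available):

# Fetch an anonymous token for the ratelimitpreview/test repository
TOKEN=$(curl -fsS "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
# HEAD request against the special ratelimitpreview/test manifest; the response
# carries ratelimit-limit and ratelimit-remaining headers for this IP
curl -fsS --head -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" | grep -i '^ratelimit'
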
@MarkEWaite MarkEWaite added the triage Incoming issues that need review label Jul 23, 2024
@basil
Collaborator

basil commented Jul 23, 2024

FYI it broke an https://github.com/jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.

@dduportal dduportal added this to the infra-team-sync-2024-07-30 milestone Jul 23, 2024
@dduportal dduportal removed the triage Incoming issues that need review label Jul 23, 2024

@dduportal
Contributor

FYI it broke an https://github.com/jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.

@basil can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)? The PR builds from 2 days ago have already had their logs purged, but I see some test failures on the master branch (build https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/1178/) that might be related.

I'm asking because I'm considering a potential ACP-like setup for Docker Engine on ci.jenkins.io, i.e. a "pull-through" cache as per https://docs.docker.com/docker-hub/mirror/.

@dduportal
Contributor

The ci.jenkins.io job that builds the www.jenkins.io web site failed its most recent build with the message:

You have reached your pull rate limit.
You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

About www.jenkins.io (I'll focus on ATH in a second step):

* The build scripts are pulling Docker Library (i.e. "Official") images, which ARE subject to the rate limit, unlike the `jenkins/*` and `jenkinsciinfra/*` images
  
  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/ruby#L13
  * https://github.com/jenkins-infra/jenkins.io/blob/387ad60415c543e440cc8be06934559227ab251e/scripts/node#L14
  * etc.

* ci.jenkins.io agents only have 2 outbound IPs as per https://github.com/jenkins-infra/azure-net/blob/6637c0b38bf0614335375f92c385a3da452e45e0/gateways.tf#L91

* The DockerHub documentation tells us that anonymous pulls are limited to 100 pulls per 6 hours per IP, so ~200 pulls total across our 2 outbound IPs (rate limiting counts manifest requests, not layer downloads)

* The pipeline at https://github.com/jenkins-infra/jenkins.io/blob/master/Jenkinsfile never uses any kind of credential to log in to DockerHub (which would increase the available rate limit)

=> this was bound to happen since we moved to NAT gateways a few months ago. Let us open a PR to run docker login and raise the limit (see the sketch below).

@MarkEWaite : jenkins-infra/jenkins.io#7421
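
For context, what the docker login change boils down to is roughly this (a sketch only; the credential variable names are placeholders, not the actual Jenkins credential IDs used by the PR):

# Authenticate the Docker client before pulling Docker Library images so that
# pulls count against the account's rate limit instead of the shared outbound IPs
echo "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USERNAME" --password-stdin
# Subsequent pulls (e.g. the ones done by scripts/ruby and scripts/node) then
# use the authenticated limit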

@timja
Member

timja commented Jul 24, 2024

FYI it broke an jenkinsci/acceptance-test-harness PR build as well, but I was able to successfully retry about an hour and a half later.

@basil can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)? The PR builds from 2 days ago have already had their logs purged, but I see some test failures on the master branch (build ci.jenkins.io/job/Core/job/acceptance-test-harness/job/master/1178) that might be related.

From a quick look it's likely the same issue, though the actual Docker image build logs aren't archived.

@basil
Collaborator

basil commented Jul 24, 2024

Can you confirm that the rate limit issue, with the ATH build, was with the test "additional" Docker images (and not the jenkins/ath image itself)?

Yes, this was a rate limit error while fetching containers for use during tests. I didn't encounter any problems building or fetching the jenkins/ath image itself.

@dduportal
Contributor

Thanks @basil @timja !

I've opened jenkinsci/acceptance-test-harness#1634 to set up an authenticated Docker Engine during the tests.

@dduportal
Contributor

Closing as:

  • The issue was temporary and the short-term fix was to relaunch the pipeline manually
  • Both projects have been updated to use Docker authentication to increase the API rate limit for each.

Thanks folks!

@dduportal dduportal reopened this Jul 25, 2024

@dduportal
Contributor

Reopening as we saw a collection of 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit errors in the builds of jenkinsci/docker-agent and jenkinsci/docker on ci.jenkins.io in the past hour.

Example: https://ci.jenkins.io/job/Packaging/job/docker-agent/job/PR-843/1/pipeline-console/?start-byte=0&selected-node=100#log-170

jenkinsci/docker-agent#844

@dduportal
Contributor

A solution to limit this kind of impact would be for us to run registry pull-through caches (see https://docs.docker.com/docker-hub/mirror/) in the ci.jenkins.io agent networks (all VMs and Linux containers).
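
For illustration, the minimal pull-through cache described in that Docker documentation looks like this (a sketch with placeholder host names and ports, not the actual ci.jenkins.io deployment):

# Run a registry configured as a pull-through cache of DockerHub
# (upstream credentials can be added via REGISTRY_PROXY_USERNAME / REGISTRY_PROXY_PASSWORD)
docker run -d --restart=always --name registry-mirror -p 5000:5000 \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  registry:2

# Point each agent's Docker Engine at the mirror, then restart it
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["http://registry-mirror.internal:5000"]
}
EOF
sudo systemctl restart docker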

@dduportal
Contributor

@basil @timja I'm continuing the discussion from jenkinsci/acceptance-test-harness#1640 (comment) here:

I'm not sure how to identify the failure; I would need help navigating the ATH build and test results. With that, I should be more autonomous in finding failures, understanding them, and providing solutions.

@timja
Member

timja commented Jul 31, 2024

Yes, it is set up as DinD

@dduportal
Contributor

Yes, it is set up as DinD

🤔 What is the reason to use nested containers?

(BTW, DinD is a nightmare to configure with regard to docker login, but at least it explains why my PR did not seem to work, as it only sets up the outer Docker Engine.)

@dduportal
Contributor

Yes, it is set up as DinD

I see https://github.com/jenkinsci/acceptance-test-harness/blob/4904fec29f49dedca64214757f8a7898ffa9a329/ath-container.sh#L37 and it looks like it is not DinD (i.e. a nested container engine) but DonD (Docker on Docker, i.e. sharing the socket). Is my understanding correct?
=> I'm not sure the authentication is expected to work though (as the credentials might be handled on the client side).
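
To illustrate the difference (a rough sketch; the image name in the second command is a placeholder):

# DinD: a full nested Docker Engine runs inside the container (requires --privileged);
# its pulls and credentials are completely separate from the host engine
docker run --privileged -d docker:dind

# DonD: the host socket is shared, so "nested" containers actually run on the host engine
docker run -it -v /var/run/docker.sock:/var/run/docker.sock some-dev-image
# With the shared socket, registry credentials are still read from the client's
# ~/.docker/config.json *inside* the container, which would explain why a login
# done only on the outer engine is not picked up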

@timja
Member

timja commented Jul 31, 2024

Yes your understanding is correct.

I'm not sure either; it would need testing.

@dduportal
Contributor

Yes your understanding is correct.

I'm not sure either; it would need testing.

If it is DonD, then the ACR will be a good solution, as the pull-through cache setup is on the engine side \o/

@dduportal
Contributor

Actually it does look like they are archived:

You can get there from the test report: e.g. https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/

https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1645/1/testReport/plugins/AntPluginTest/latest_linux_jdk21_firefox_split5___testWithAntPipelineBlock/attachments/docker-SshAgentContainer.build.log

ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied
ERROR: permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Head "http://%2Fvar%2Frun%2Fdocker.sock/_ping": dial unix /var/run/docker.sock: connect: permission denied

All the failed tests in that build seem to have the same error

But I fail to see the relation between these errors and the rate limit :|
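
For the record, a few generic checks (not taken from this build) that usually narrow down this kind of socket error:

ls -l /var/run/docker.sock   # who owns the socket on the agent, and which group?
id                           # is the current user a member of that group?
docker version               # any CLI call fails the same way if the socket is unreadable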

@timja
Member

timja commented Jul 31, 2024

Might be this fix that was just pushed: jenkinsci/acceptance-test-harness@04f64ef

@dduportal
Contributor

Might be this fix that was just pushed: jenkinsci/acceptance-test-harness@04f64ef

Ow yeah, this change might fix it!

@basil
Collaborator

basil commented Jul 31, 2024

https://ci.jenkins.io/job/Core/job/acceptance-test-harness/job/PR-1660/2/testReport/junit/plugins/LdapPluginTest/lts_linux_jdk17_firefox_split1___enable_cache/

#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
 > [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
   1 |     # Sets up
   2 | >>> FROM debian:bullseye
   3 |     
   4 |     # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:7aef2e7d061743fdb57973dac3ddbceb0b0912746ca7e0ee7535016c38286561: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 3.19kB done
#1 DONE 0.0s

#2 [internal] load metadata for docker.io/library/debian:bullseye
#2 ERROR: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
------
 > [internal] load metadata for docker.io/library/debian:bullseye:
------
Dockerfile:2
--------------------
   1 |     # Sets up
   2 | >>> FROM debian:bullseye
   3 |     
   4 |     # Viewvc is not part of bullseye repos anymore but oldstable https://github.com/viewvc/viewvc/issues/310
--------------------
ERROR: failed to solve: debian:bullseye: failed to resolve source metadata for docker.io/library/debian:bullseye: failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/library/debian/manifests/sha256:907e428c7d1dd4e3a2458d22da8193e69878d3a23761d12ef9cd1a1238214798: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit

@basil basil reopened this Jul 31, 2024
@dduportal
Contributor

Thanks @basil for the details. In order to tackle these HTTP 429 errors, I propose the following course of action:

  • Short term: add 2 more outbound IPs for ci.jenkins.io to immediately spread the workload to DockerHub (quick, no impact, immediate effect)
  • Medium term: add an ACR registry only for ci.jenkins.io agents (priority on VMs as they have Docker Engine, possibly on the AKS ci.jenkins.io cluster later) and set up the Azure VM agents' Docker Engines to use it as a pull-through mirror
    • Then remove the docker login pipeline steps on ATH

=> once these setups are in place, we'll look at the results

@timja
Member

timja commented Aug 1, 2024

@dduportal and I got the ACR option working and have tested it on ci.jenkins.io.

@dduportal is going to finish off the Terraform automation and update the JCasC config.

It looks like our users aren't actually rate limited and were probably hitting some anti-abuse protection; this should help with that and is expected to get rid of any rate-limiting issues.

It will also mean that anything on ci.jenkins.io Azure doesn't need to log in anymore, as the Docker daemons are going to have a registry mirror configured to point at the ACR cache (see the sketch below).
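
Roughly, "having a registry mirror configured" on the agents means something like the following in the engine configuration (a sketch; the actual rollout is handled by the infra automation, and the endpoint name is the one referenced in the commit below):

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "registry-mirrors": ["https://dockerhubmirror.azurecr.io"]
}
EOF
sudo systemctl restart docker
# A plain `docker pull debian:bullseye` is then served through the ACR cache and
# only falls back to registry-1.docker.io if the mirror cannot serve the image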

dduportal added a commit to jenkins-infra/azure that referenced this issue Aug 6, 2024
… inside the Jenkins Azure infrastructure (#794)

Related to jenkins-infra/helpdesk#4192

Fixup of
91cf2dc

Reference Azure documentation:
https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal

This PR introduces an Azure Container Registry set up as a DockerHub
mirror using a "Cache Rule" which mirrors `docker.io/*` to `*` (note: it
forbids us from using other caching mechanisms!).

This registry has the following properties:

- Only available in the "sponsorship" subscription
- Anonymous pull access (constraint due to Docker pull through cache -
moby/moby#30880)
- Private network only: since we have an anonymous pull policy (see above),
we restrict access to a subset of private networks. It uses
["Azure Private Endpoints"](https://learn.microsoft.com/en-us/azure/private-link/private-endpoint-overview)
for this
- Note: it implies using Private DNS zones linked to the networks. These
zones might need to be reused in the future for other private links if
required


The registry is available to the following services (heavy DockerHub
users) through a combination of a private endpoint with a NIC in the
subnet, a private DNS zone with automatic records, and inbound and
outbound NSG rules (I've only set up the Azure ephemeral VM agents
subnets for now):
- ci.jenkins.io
- cert.ci.jenkins.io
- trusted.jenkins.io
- infra.jenkins.io

Azure makes it mandatory to log in to DockerHub for such a mirror
system. As such, we use a distinct "Public Images Read Only" token,
stored in an Azure Key Vault and associated with the `jenkinsciinfra`
organization, to avoid the "application" rate limit (e.g. 5k pulls / day /
IP) and only have the DockerHub anti-abuse system as the upper limit
(which seems to be a combination of request count and amount of data).

![Capture d’écran 2024-08-05 à 16 31 38](https://github.com/user-attachments/assets/f04e4c49-3500-4589-b0fc-42b5b1792066)

----

*Testing and approving*

This PR is expected to have no changes in the plan as it was applied
manually:

- End-to-end testing was done on each controller by:
  - Starting an Azure VM ephemeral agent using a pipeline replay with the correct label
  - The pipeline tries to resolve the DNS name `dockerhubmirror.azurecr.io`, which should resolve to an IP local to the VM subnet
  - Once the VM is up, checking the connectivity in the Azure UI portal (`Network Watcher` -> `Connection troubleshoot`)
    - Source VM is the agent VM, whose name is retrieved from the build log
    - Destination is `https://dockerhubmirror.azurecr.io`

<img width="1185" alt="Capture d’écran 2024-08-06 à 10 42 25"
src="https://github.com/user-attachments/assets/11d762a6-119c-4e03-b7f0-91072364aaa2">


- The bootstrap must be done in 2 `terraform apply` commands as
documented, because the ACR component `CredentialSet` is not supported
by Terraform yet (see comments in TF code).

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
@dduportal
Contributor

Update:

We can now roll back the docker login steps, along with all the "pullonly" logic

@dduportal
Contributor

All changes have been reverted, but I'll keep this issue open until the 13th in case we see other issues.

@dduportal
Contributor

Update:

  • We were able to build and release 3 core releases for the Jenkins Security 2024-08-07 advisory using the ACR caching
  • Checked today: the cache grew from ~12.9 GB (Tuesday 06 at 13:00 UTC) to ~31.2 GB right now
  • Monitoring shows around 6.6k successful pulls in the past 24h:
(screenshot: ACR monitoring dashboard, 2024-08-07 16:41)

My only concern is that some images or tags are still absent unless we explicitly docker pull them via a ci.jenkins.io pipeline replay. I don't see any errors in the Docker Engine logs, so I guess there is a slight delay: the initial request for a new image reference fails (as it's not cached yet) and Docker CE falls back to DockerHub, then a second attempt succeeds once the "ACR cache rule" routine has collected the image tags into the ACR.

Does it make sense @timja? Have you already seen this behavior in your own infrastructure?

@timja
Member

timja commented Aug 7, 2024

Hmm, not sure; we use it slightly differently and explicitly pull the cached version.

It won’t show up in the cache unless one pull has been completed:
https://learn.microsoft.com/en-us/azure/container-registry/container-registry-artifact-cache?pivots=development-environment-azure-portal#limitations

But if it's increasing in size, it's definitely caching something.
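
Given that limitation, a hypothetical warm-up step (the image list below is purely illustrative) could pre-populate the cache so later builds never hit an un-cached tag:

# One completed pull per reference is enough for the ACR cache rule to retain it
for image in debian:bullseye node:lts ruby:3; do
  docker pull "$image"
done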

@dduportal
Contributor

Closing as we have not had any more errors. Feel free to reopen if you see any.
