Random download failures - 403 errors [hetzner] #138
That endpoint works fine from here:

```shell
$ curl -IL https://registry.k8s.io/v2/pause/manifests/3.7
HTTP/2 307
content-type: text/html; charset=utf-8
location: https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.7
x-cloud-trace-context: 9e8f3405a102bf4332d81593461d200a
date: Thu, 12 Jan 2023 20:13:55 GMT
server: Google Frontend
via: 1.1 google
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000

HTTP/2 200
content-length: 2761
content-type: application/vnd.docker.distribution.manifest.list.v2+json
docker-content-digest: sha256:bb6ed397957e9ca7c65ada0db5c5d1c707c9c8afc80a94acbe69f3ae76988f0c
docker-distribution-api-version: registry/2.0
date: Thu, 12 Jan 2023 20:13:55 GMT
alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
```

Is there a proxy involved? Can nerdctl produce more verbose results? That path should have served a redirect to some other backend. |
We don't even have code to serve 403 in the registry.k8s.io application, so that would be coming from the backing store we redirect to, but from the logs above we can't see that part. |
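One way to see which hop serves the 403 is to request each hop separately without following redirects. This is a diagnostic sketch only; the tag `pause:3.7` and the redirect target are taken from the transcript above, and results will vary from the affected IPs:

```shell
# Ask registry.k8s.io itself for the manifest headers (-I sends a HEAD request)
# and show where it redirects to:
curl -sI https://registry.k8s.io/v2/pause/manifests/3.7 | grep -i '^location:'

# Then request that Location URL directly. If this hop (or an edge in front of
# it) returns the 403, the block is not in the registry.k8s.io app itself:
curl -I "https://us-west2-docker.pkg.dev/v2/k8s-artifacts-prod/images/pause/manifests/3.7"
```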
Thanks Ben. As a temp fix, I've looked at the kubespray logs, downloaded the missing images on a working instance, exported them to a local file, copied them over, and imported them back into the instance that is being blocked. I now have a working cluster, but it is concerning that access seems to be blocked in such an arbitrary fashion. Have a good weekend, Mike |
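That manual workaround can be sketched roughly as follows. The hostname `kube-2` and file paths are hypothetical placeholders, and this assumes nerdctl is installed on both machines:

```shell
# On a working instance: pull the image and export it to a tarball.
nerdctl -n k8s.io pull registry.k8s.io/pause:3.7
nerdctl -n k8s.io save -o pause-3.7.tar registry.k8s.io/pause:3.7

# Copy the tarball to the blocked instance (hostname is hypothetical)...
scp pause-3.7.tar root@kube-2:/tmp/

# ...and import it there, bypassing the registry entirely.
ssh root@kube-2 'nerdctl -n k8s.io load -i /tmp/pause-3.7.tar'
```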
I'm seeing the same behavior in a similar context. I'm trying to install the kube-prometheus-stack helm chart on a k3s cluster in Hetzner Cloud (hosted in their Oregon location) and getting a 403 when pulling
For me the 403 is appearing without following the redirect:
|
Thanks for the additional logs. cc @ameukam maybe cloud armor? I forgot about that dimension in the actual deployment. This definitely looks like it's coming from the infra in front of the app, we also don't serve HTML, only redirects (or simple API errors). |
@ameukam and I discussed this yesterday. This appears to be coming from the cloud loadbalancer security policy (we're using cloud armor, configured here: I don't think we're doing anything special here, best guess is hetzner IPs have been flagged for abuse? I actually can't seem to find these particular requests in the loadbalancer logs, otherwise we could see what preconfigured rule this is hitting. |
I can see other 403s served by the security policy for more obviously problematic incoming requests like |
Folks, I can confirm this issue shows randomly when pulling CSI images. It seems that some IPs are blacklisted or something! This has been a huge issue this last month for us! It started in late December. |
That would make absolute sense! Somehow, some Hetzner IPs seem to be blacklisted. For our Kube-Hetzner project, it's been a real pain. Please fix 🙏 kube-hetzner/terraform-hcloud-kube-hetzner#524 |
@mysticaltech can you please drop a few IP address(es) of boxes that seem to have trouble? |
@dims Definitely, I can try to get some. @aleksasiriski Could you fetch some of the 10 IPs that you had reserved as static IPs because they were blocked by registry.k8s.io when used for nodes? |
@dims I just deployed a test cluster of 10 nodes, and got "lucky" on one of them. The one affected IP is |
I had like 3 IPs that were blacklisted, I'll try to fetch them later today (UTC+1) when I'm home. |
[Attachment: downloaded-logs-20230126-065347.json.txt] I see 4 hits, all with a valid redirect using HTTP 307s, no 403s at all. The code it hits is here: |
@dims Thanks for looking into this. The 403s are most probably appearing further down the request chain. As stated by @BenTheElder, it could be your LB security policy (Cloud Armor) configured here: https://github.com/kubernetes/k8s.io/blob/f858f4680ada6385eaa4c76b2a295e33ec0ed51c/infra/gcp/terraform/k8s-infra-oci-proxy-prod/network.tf#L112 |
Also @dims, something interesting discovered by one of our users: when they tried to pull the image manually, sometimes it works after 100 tries, sometimes it just does not work. So it's kind of a hit-or-miss situation! All this to say, there's something up with your LB IMHO. |
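That hit-or-miss behavior is why blind re-pulling sometimes eventually succeeds. A generic client-side sketch of that pattern, retrying with exponential backoff plus jitter; the `pull` callable and the `RuntimeError` are hypothetical stand-ins for whatever actually performs the pull and for the 403 failure:

```python
import random
import time

def pull_with_retry(pull, attempts=5, base_delay=1.0):
    """Call pull() until it succeeds, backing off exponentially with jitter."""
    for i in range(attempts):
        try:
            return pull()
        except RuntimeError:  # stand-in for a 403 from the edge
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** i + random.random() * base_delay)

# Simulated flaky pull: fails twice with a 403, then succeeds.
calls = {"n": 0}
def fake_pull():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("403 Forbidden")
    return "pulled"
```

With `base_delay=0.0` the simulated pull returns `"pulled"` on the third attempt without sleeping.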
@dims I have created another small test cluster and the IP above 5.75.240.113 has been reused and it does it again. I will leave it on for 24h so that you can have more logs.
|
@dims Also an interesting finding. If I simply issue |
@mysticaltech yeah, looks like there is very little tolerance for your range of IPs from Hetzner |
Exactly! Which is really a pain when working with Kubernetes. If possible to fix, it would be awesome. |
@dims Did you do something? Because it started to work. |
@mysticaltech nope. theory is still the same - cloud armor! |
Oh my! Maybe some kind of form to request whitelisting? That would be kind of good. But not great for autoscaling nodes for instance. |
Hi, I'm getting an HTTP 404 code on this request:
from IP 78.47.222.2 (same project and provider as @mysticaltech). Does posting IPs and info like this help? Or is there something else I can provide? |
@valkenburg-prevue-ch
This actually points to the issue with Hetzner IPs existing already with plain GCR. k8s.gcr.io is a special alias domain provided by GCR, but it has the same allow-listing etc. as any other gcr.io registry. Kubernetes doesn't run that infra, it just populates the images.
I'm not sure how well this would scale given the relatively volunteer staffing we have for this sort of free image host ... It seems registry.k8s.io has no regression here vs k8s.gcr.io, though I can't recall ever having seen a similar issue reported to Kubernetes previously. |
At present I would recommend mirroring images, which also helps us reduce our massive distribution costs and reallocate resources towards testing etc. |
@BenTheElder Thanks for clarifying. But Hetzner Cloud is still a major European cloud, and not fully supporting it is a shame IMHO; for a young open-source project like ours, we don't yet have the resources to deploy a full-blown mirror. However, if we were to do that, how would you recommend we proceed? This is something we have obviously thought about, and we have already considered both https://docs.k3s.io/installation/private-registry and https://github.com/goharbor/harbor. Would you recommend anything else that is an easy fix for this particular issue? |
Again, this is not a valid API path, so the 404s are expected; the request is invalid. The 403s are seemingly due to the security mechanism(s). I recommend
I hear that, but even as a large open source project we have constrained resources to host things and we're not actively choosing to block these IPs, some security layer on our donated hosting infrastructure is blocking these IPs. At the moment keeping things online and trying to bring our spend back within the budget is a bigger priority than resolving an issue present in the previous infrastructure, and even that is a bit of a stretch. Open source staffing is hard :( Perhaps you could ask your users to mirror for themselves if they encounter issues like this. Hetzner might also have thoughts about this issue? It seems in their best interest to avoid what seems to be an IP reputation issue. Searching online I see similar discussions for Amazon CloudFront and CloudFlare with respect to hetzner IP ban issues.
Mirroring guides are something I hope to get folks to contribute. Options will depend on the tools involved client-side (like the container runtime). For consuming a mirror, I usually recommend containerd's mirroring config (as dockershim is deprecated); cri-o has something similar, I believe. For hosting a mirror, I recommend roughly: populate images with |
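For containerd's mirroring config mentioned above, a minimal sketch might look like the following. The mirror hostname is a hypothetical placeholder for wherever you host your mirror; the file lives under containerd's registry config path (`certs.d`):

```toml
# /etc/containerd/certs.d/registry.k8s.io/hosts.toml
# Fallback upstream when no mirror host can serve the request.
server = "https://registry.k8s.io"

# Try the local mirror first (hypothetical host).
[host."https://mirror.example.internal"]
  capabilities = ["pull", "resolve"]
```

Containerd will try the listed mirror host for pulls of `registry.k8s.io/...` images and fall back to the upstream registry if the mirror cannot serve them.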
Hi! I installed peerd in k3s (had to build the image with a changed containerd socket path and it's now running) but how to use it? Thanks |
After that images will automatically be pulled from other nodes in your cluster if they are present. |
Thanks :) In the meantime I ended up using https://github.com/spegel-org/spegel since it doesn't require me to open any ports in the firewall. Peerd does if I am not mistaken, right? |
@valkenburg-prevue-ch FYI above, the landscape of solutions for this has evolved fast! 🤯 |
@vitobotta Any tips on the config for k3s (or any other cluster), is it straightforward? |
@phillebaba Your project is resolving a big need, thank you for that 🙏 |
Yeah, I've been following this discussion closely! Very interested. |
Yes, there are some settings that differ on k3s, so here's how I ended up configuring it after some investigation. Hope it can save you and/or others some time:

```shell
helm upgrade --install \
  --version v0.0.22 \
  --create-namespace \
  --namespace spegel \
  --set spegel.containerdSock=/run/k3s/containerd/containerd.sock \
  --set spegel.containerdContentPath=/var/lib/rancher/k3s/agent/containerd/io.containerd.content.v1.content \
  --set spegel.containerdRegistryConfigPath=/var/lib/rancher/k3s/agent/etc/containerd/certs.d \
  --set spegel.logLevel="DEBUG" \
  spegel oci://ghcr.io/spegel-org/helm-charts/spegel
```

The only problem I have encountered with Spegel is that, despite being a DaemonSet, for some reason at most exactly 100 pods are up and running. With larger clusters (I tried with 400- and 500-node clusters) it always maxes out at 100 pods, and the pods on all the other nodes remain in a non-running state. I opened an issue about it here: spegel-org/spegel#459. Other than that it seems to work pretty well. Like I mentioned, I tested with clusters of up to 500 nodes to increase the likelihood of getting some problematic IPs, and in fact every time there were many among that large number of nodes; thanks to Spegel, all pods that require images from problematic registries started without any issue. I also see a nice boost in the time it takes for a node to acquire an image from other nodes, so deployments scale more quickly, which is awesome. |
Thanks @vitobotta, appreciate it. I guess 100 IPs should be enough to successfully do the job. FYI, @valkenburg-prevue-ch just found out that Spegel is already integrated within k3s and can be enabled with the |
I know, and I forgot to mention it. Maybe it's because it's still experimental and perhaps buggy, but I couldn't get the embedded Spegel support to work after many attempts. It also seems that with that version you need to open a port in the firewall. Perhaps I will try again when I have some more time. |
@vitobotta Ah ok, good to know! So maybe for now your helm setup will be best to get the latest and greatest 🙏 @valkenburg-prevue-ch FYI |
I am trying the embedded spegel now again. Let's see if I can figure it out. |
Just to add to this, if you are running k3s I would suggest using the embedded Spegel. It works just as well without having to deal with daemonsets. |
Like I mentioned in a previous comment, I couldn't get it to work, so I tried the Helm installation of Spegel and it worked. I am trying the embedded one again now and still can't get it to work. I have also tried with port 5001 open for the peer-to-peer exchange, but it's not working. Nothing is even listening on that port on the nodes, as if the embedded registry is not configured at all, but I am indeed using the |
I think I know what the problem might be: when I create a cluster in Hetzner without a private network and then install the Hetzner Cloud Controller Manager, the CCM populates the external IP field of the nodes, leaving the internal IP unset. But the k3s documentation about the embedded registry mirror talks about communication over internal IPs, so perhaps that's why it's not working. |
@phillebaba have you actually gotten the embedded registry mirror working? |
Got it working! My mistake was that I enabled the embedded registry on existing clusters without restarting the agents. When I restart the agents or create a new cluster then all is working. It seems that it requires the port 5001 to be opened in the firewall though... will test this more. |
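The restart requirement makes sense because the embedded mirror is only configured at startup. A minimal sketch of the setup per the k3s docs follows; paths are the k3s defaults, and the choice of `registry.k8s.io` as the mirrored registry is just an example:

```yaml
# /etc/rancher/k3s/config.yaml — enable the embedded registry mirror on
# servers and agents, then restart k3s (flag form: --embedded-registry)
embedded-registry: true

# /etc/rancher/k3s/registries.yaml (separate file) — registries to mirror;
# an entry with no endpoints means "serve this registry from the embedded
# peer-to-peer mirror"
mirrors:
  registry.k8s.io:
```

With this in place, nodes exchange image content over the P2P port (5001), which is why that port must be reachable between nodes when no private network is in use.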
Awesome news! Thanks for doing all this and reporting here. In which firewall do you have to open port 5001 though? The hetzner firewall only applies to public internet, right? Isn't the private network always all open? Or am I missing something in our setup of microos, is there a firewall too? |
Yep, the public firewall. There are no restrictions within the private network afaik. The reason I am testing without private networks is that they support at most 100 nodes, so it's impossible to create a large cluster with them. I have tested (using hetzner-k3s, my tool) clusters of up to 500 nodes over the public network, and I could probably scale into the thousands now that I have added support for Cilium as CNI and for external datastores like Postgres instead of etcd. I wish I had the money to experiment with more nodes lol. |
Thanks for clarifying. Do I understand correctly that for up to 100 nodes, one does not need to open anything on the firewall, and that your use-case with everything over public ip's might be "beyond the scope of the default supported setups"? |
Correct. If you use the private network you don't need to open anything in the firewall provided you configure everything to use the private interface. |
Thanks @vitobotta and @phillebaba for sharing, really appreciate it. |
Hi,
Attempting to build a 3-node Kubernetes cluster using kubespray (latest) on Hetzner Cloud instances running Debian 11.
First attempt failed due to a download failure for kubeadm on 1 of the 3 instances. Confirmed using a local download: 1 failure, 2 successes.
Swapped in a replacement instance and moved past this point; assumed possible IP blacklisting, though not confirmed.
All 3 instances then downloaded 4 Calico networking containers and came to the pause 3.7 download, using a command like this:

```shell
root@kube-3:~# /usr/local/bin/nerdctl -n k8s.io pull --quiet registry.k8s.io/pause:3.7
root@kube-3:~# nerdctl images
REPOSITORY               TAG    IMAGE ID      CREATED        PLATFORM     SIZE       BLOB SIZE
registry.k8s.io/pause    3.7    bb6ed397957e  4 seconds ago  linux/amd64  700.0 KiB  304.0 KiB
```
On the failing instance, we see the following error when run by hand; via kubespray it retries 4 times and then fails the whole install at that point:

```shell
root@kube-2:~# /usr/local/bin/nerdctl -n k8s.io pull --quiet registry.k8s.io/pause:3.7
FATA[0000] failed to resolve reference "registry.k8s.io/pause:3.7": unexpected status from HEAD request to https://registry.k8s.io/v2/pause/manifests/3.7: 403 Forbidden
```
Do you have any idea why the download from this registry might be failing, and is there an alternative source I could try?
The IP address starts and ends as shown below, and the command was run a couple of minutes ago:
Thu 12 Jan 2023 02:52:21 PM UTC
65.x.x.244
Many thanks
Mike