-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nil pointer dereference in main-2022-10-02-d533abf8 forward #6021
Comments
@RohitKochhar was kind enough to try the image tags built from the commits merged with
The only difference between these two images is the presence of the commit 7968915 (from #5716) in the image that has the bug. Can anyone else confirm the DNS resolution is working fine on v0.30? This might require a release 0.30.2 with a hotfix or rollback of that commit. |
Interesting 🤔 according to the stacktrace, crash happens here i.e. |
@RohitKochhar what is resolving DNS in your cluster: CoreDNS or KubeDNS? Which k8s version are you running? |
@GiedriusS are you sure that this is the correct reading stacktrace? I think tracing code will always be at the top of the stack because it wraps everything. But down there we can see thanos/pkg/query/endpointset.go Line 382 in 3327c51
|
Oops, yes, my bad 😞 you are correct 👍 |
Pinging for an update on this @douglascamata @fpetkovski |
@RohitKochhar we need some extra information to try to reproduce the issue, investigate and work on a fix. See my previous comment: #6021 (comment). |
Oh, sorry I didn't see that. We use CoreDNS @douglascamata |
@RohitKochhar do you have tracing enabled? Could we see the config? |
I am not able to reproduce this in a prod-like cluster also using CoreDNS. We need your help @RohitKochhar to investigate this cluster and look for anything out of the ordinary. |
Ok, I can form a better summary and share it soon |
@douglascamata after looking into it further, I am unsure why I didn't catch this originally, but the breaking version is v0.29.0. I am able to upgrade to v0.28.0 without problem |
There's some conflicting information then. I asked you to try two different images tags straight from the |
Yeah, I think I incorrectly synced the resources through ArgoCD. I am currently retrying |
I can now confirm that |
Okay. So that rules out what we thought caused your issue. 😄 If you have some time, please try to bisect the snapshot images to find exactly where it works. But seems like your pods do not like something in the miekdns resolver. It has been the default in Query since v0.23 and in Rule since v0.30. |
Between v0.28 and v0.29 though we had no DNS changes. |
Ok, I'll start working my way through the snapshot images between v0.28 and v0.29 to find the last version I can use before breaking |
@douglascamata I added more information into the initial issue comment, including the logs for both ruler and query pods that are failing and the configuration used for both. The latest snapshot I could use was |
That's pretty easy. The image tag format is From 21aa1db to d533abf we have only 64e7e3b. Are you sure that |
|
Here is our tracing configuration:
With the jaeger endpoints pointing to our grafana-agent-trace urls |
Yeah I think that's the same issue. We just never released v0.29.1 and never merge back those commits to main. |
@douglascamata Yeah I think we just needs to do the release and that's it. I think we didn't release v0.29.1 previously because of #5862 IIRC. @GiedriusS |
We need to release v0.29.1, then cherry pick the commit df047ef into |
Won't work df047ef#diff-eb3987ed9aba572232997e0b7878dbe798ccd79ca31bf8e933963e89d2901723R143 sampler still will be nil because "default" case will never be evaluated. |
A solution would be removing "default:" or adding "fallthrough" |
@xBazilio is this a theory or you tested it out? |
Yes, I've Tested. Deployed on dev, caught panic, searched and found this issue. Then i cherry-picked the commit, panic still existed. Then with debugger I found this switch/case issue and fixed it by removing "default:" line. Colleague suggested using "fallthrough" keyword instead, but right now it isn't necessary as "SamplerTypeRateLimiting" is the default. |
@xBazilio would you like to open a PR with this fix? |
I created the PR https://github.com/thanos-io/thanos/pull/6066/files |
@xBazilio just missing the changelog entry and then it'll be perfect. Thanks for the contribution! |
This seems to be solved by upgrading to v0.30.2, closing. |
Thanos, Prometheus and Golang version used: main-2022-10-02-d533abf8
Object Storage Provider: AWS
What happened:
Query pods fail with the following logs:
Ruler pods fail with the following logs:
What you expected to happen:
Query and Ruler pods to behave as expected
How to reproduce it (as minimally and precisely as possible):
Deploy a query pod with the following STS configuration:
Deploy a ruler pod using the
monitoring.coreos/v1
CRDFull logs to relevant components:
Anything else we need to know:
The text was updated successfully, but these errors were encountered: