nil pointer dereference in main-2022-10-02-d533abf8 forward #6021

RohitKochhar · 2023-01-05T17:01:29Z

Thanos, Prometheus and Golang version used: main-2022-10-02-d533abf8

Object Storage Provider: AWS

What happened:

Query pods fail with the following logs:

level=info ts=2023-01-09T19:05:58.690951095Z caller=factory.go:43 msg="loading tracing configuration"
level=info ts=2023-01-09T19:05:58.696610373Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-01-09T19:05:58.697727856Z caller=query.go:759 msg="starting query node"
level=info ts=2023-01-09T19:05:58.697991079Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2023-01-09T19:05:58.69803157Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2023-01-09T19:05:58.698204012Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2023-01-09T19:05:58.698275323Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2023-01-09T19:05:58.698316933Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=error ts=2023-01-09T19:05:58.700889734Z caller=resolver.go:99 msg="failed to lookup SRV records" host=_grpc._tcp.thanos-ruler-operated err="no such host"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13e0ee8]

goroutine 296 [running]:
go.opentelemetry.io/otel/sdk/trace.parentBased.ShouldSample({{0x0, 0x0}, {{0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}, {0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}}}, ...)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/sampling.go:280 +0x1c8
github.com/thanos-io/thanos/pkg/tracing/migration.samplerWithOverride.ShouldSample(...)
	/app/pkg/tracing/migration/sampler.go:42
go.opentelemetry.io/otel/sdk/trace.(*tracer).newSpan(0xc0006e1bc0, {0x267d0d0, 0xc0007ec120}, {0xc000778630, 0x16}, 0xc0007eb470)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:90 +0x456
go.opentelemetry.io/otel/sdk/trace.(*tracer).Start(0xc0006e1bc0, {0x267d0d0, 0xc0007ec120}, {0xc000778630, 0x16}, {0xc0007bdc00?, 0x4?, 0xc0007eb5a0?})
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:47 +0x10e
go.opentelemetry.io/otel/bridge/opentracing.(*WrapperTracer).Start(0xc000782228, {0x267d0d0?, 0xc0007ec120?}, {0xc000778630?, 0x1e437a0?}, {0xc0007bdc00?, 0x7f52aecb03d8?, 0xc000a209c0?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/wrapper.go:79 +0x4b
go.opentelemetry.io/otel/bridge/opentracing.(*BridgeTracer).StartSpan(0xc000770960, {0xc000778630, 0x16}, {0xc0007ec0c0, 0x3, 0x39346d0?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/bridge.go:430 +0x3f4
github.com/thanos-io/thanos/pkg/tracing/migration.(*bridgeTracerWrapper).StartSpan(0x1f9a9a0?, {0xc000778630?, 0x39346d0?}, {0xc0007ec0c0?, 0x38c9480?, 0xc0007eb818?})
	/app/pkg/tracing/migration/bridge.go:89 +0x26
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.newClientSpanFromContext({0x267d098, 0xc0007b5b60}, {0x2678120, 0xc000125748}, {0xc000778630, 0x16})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:92 +0x244
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/tracing.(*opentracingClientReportable).ClientReporter(0xc0007824e0, {0x267d098, 0xc0007b5b60}, {0x0?, 0x0?}, {0x2197e1a, 0x5}, {0x21bcafa, 0x10}, {0x21bcb0b, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/tracing/client.go:51 +0x127
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.UnaryClientInterceptor.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0xc000100960?, 0x227e1b8, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/client.go:19 +0x195
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x38?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, 0x0?, {0xc000a13e80, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-prometheus.(*ClientMetrics).UnaryClientInterceptor.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0x8?, 0xc00000d020, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/client_metrics.go:112 +0x117
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1.1.1({0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x203000?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, 0x1?, {0xc000a13e80, ...})
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:74 +0x86
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainUnaryClient.func1({0x267d098, 0xc0007b5b60}, {0x21bcaf9, 0x16}, {0x209f7e0, 0x39346d0}, {0x209f920, 0xc0007b5c80}, 0x0?, 0x227e1b8, ...)
	/go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:83 +0x157
google.golang.org/grpc.(*ClientConn).Invoke(0xc0005c3900?, {0x267d098?, 0xc0007b5b60?}, {0x21bcaf9?, 0x16?}, {0x209f7e0?, 0x39346d0?}, {0x209f920?, 0xc0007b5c80?}, {0xc00071b680, ...})
	/go/pkg/mod/google.golang.org/grpc@v1.45.0/call.go:35 +0x223
github.com/thanos-io/thanos/pkg/info/infopb.(*infoClient).Info(0xc000a04610, {0x267d098, 0xc0007b5b60}, 0x8?, {0xc00071b680, 0x1, 0x1})
	/app/pkg/info/infopb/rpc.pb.go:422 +0xc9
github.com/thanos-io/thanos/pkg/query.(*endpointRef).Metadata(0xc00016fa80, {0x267d098, 0xc0007b5b60}, {0x2663ca0, 0xc000a04610}, {0x267e368, 0xc000a04618})
	/app/pkg/query/endpointset.go:66 +0x11e
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).updateEndpoint(0xc0000b4980, {0x267d098, 0xc0007b5b60}, 0xc00000ced0, 0xc00016fa80)
	/app/pkg/query/endpointset.go:414 +0x105
github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update.func2(0xc00000ced0)
	/app/pkg/query/endpointset.go:354 +0x2cb
created by github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update
	/app/pkg/query/endpointset.go:343 +0x60a

Ruler pods fail with the following logs:

level=info ts=2023-01-09T19:06:55.17554435Z caller=intrumentation.go:56 component=rules msg="changing probe status" status=ready
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x13e0ee8]

goroutine 1725 [running]:
go.opentelemetry.io/otel/sdk/trace.parentBased.ShouldSample({{0x0, 0x0}, {{0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}, {0x2672118, 0x39346d0}, {0x26720f0, 0x39346d0}}}, ...)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/sampling.go:280 +0x1c8
github.com/thanos-io/thanos/pkg/tracing/migration.samplerWithOverride.ShouldSample(...)
	/app/pkg/tracing/migration/sampler.go:42
go.opentelemetry.io/otel/sdk/trace.(*tracer).newSpan(0xc00084ebc0, {0x267d0d0, 0xc001b3e2d0}, {0x21a41c9, 0xc}, 0xc04dae9ab8)
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:90 +0x456
go.opentelemetry.io/otel/sdk/trace.(*tracer).Start(0xc00084ebc0, {0x267d0d0, 0xc001b3e2d0}, {0x21a41c9, 0xc}, {0xc020e56a40?, 0x4?, 0xc04c3b4be8?})
	/go/pkg/mod/go.opentelemetry.io/otel/sdk@v1.9.0/trace/tracer.go:47 +0x10e
go.opentelemetry.io/otel/bridge/opentracing.(*WrapperTracer).Start(0xc000516870, {0x267d0d0?, 0xc001b3e2d0?}, {0x21a41c9?, 0x1e437a0?}, {0xc020e56a40?, 0x0?, 0xc045d54c00?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/wrapper.go:79 +0x4b
go.opentelemetry.io/otel/bridge/opentracing.(*BridgeTracer).StartSpan(0xc00039ede0, {0x21a41c9, 0xc}, {0x0, 0x0, 0x33450?})
	/go/pkg/mod/go.opentelemetry.io/otel/bridge/opentracing@v1.10.0/bridge.go:430 +0x3f4
github.com/thanos-io/thanos/pkg/tracing/migration.(*bridgeTracerWrapper).StartSpan(0x267d0d0?, {0x21a41c9?, 0x10000000043b247?}, {0x0?, 0xc045c1c780?, 0xc045d54e90?})
	/app/pkg/tracing/migration/bridge.go:89 +0x26
github.com/thanos-io/thanos/pkg/tracing.StartSpan({0x267d0d0, 0xc03ace95c0}, {0x21a41c9, 0xc}, {0x0, 0x0, 0x0})
	/app/pkg/tracing/tracing.go:72 +0x18f
github.com/thanos-io/thanos/pkg/tracing.DoInSpan({0x267d0d0?, 0xc03ace95c0?}, {0x21a41c9?, 0x7a?}, 0xc04dae9f60, {0x0?, 0xc0469fa370?, 0xc0461c2840?})
	/app/pkg/tracing/tracing.go:93 +0x59
main.runRule.func7()
	/app/cmd/thanos/rule.go:532 +0x96
github.com/oklog/run.(*Group).Run.func1({0xc03ace95f0?, 0xc000293a40?})
	/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38 +0x2f
created by github.com/oklog/run.(*Group).Run
	/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:37 +0x22a

What you expected to happen:

Query and Ruler pods to behave as expected

How to reproduce it (as minimally and precisely as possible):

Deploy a query pod with the following STS configuration:

containers:
      - name: thanos-query
        args:
        - query
        - --grpc-address=[REDACTED]
        - --http-address=[REDACTED]
        - --log.level=info
        - --log.format=logfmt
        - --query.replica-label=prometheus_replica
        - --query.replica-label=rule_replica
        - --query.replica-label=thanos_ruler_replica
        - --endpoint=dnssrv+_grpc._tcp.thanos-ruler-operated
        - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar
        - --endpoint=dnssrv+_grpc._tcp.thanos-store
        - --query.auto-downsampling
        - --tracing.config=$(TRACING_CONFIG)

Deploy a ruler pod using the monitoring.coreos/v1 CRD

Full logs to relevant components:

Anything else we need to know:

The text was updated successfully, but these errors were encountered:

douglascamata · 2023-01-05T17:38:20Z

@RohitKochhar was kind enough to try the image tags built from the commits merged with main (see conversation at https://cloud-native.slack.com/archives/CL25937SP/p1672937478880859) and the results are:

main-2022-10-31-7968915 has the bug.
main-2022-10-30-0e0f4fb doesn't have the bug.

The only difference between these two images is the presence of the commit 7968915 (from #5716) in the image that has the bug.

Can anyone else confirm the DNS resolution is working fine on v0.30?

This might require a release 0.30.2 with a hotfix or rollback of that commit.

GiedriusS · 2023-01-06T08:22:53Z

Interesting 🤔 according to the stacktrace, crash happens here i.e. pb.root is nil. I wonder how this DNS stuff relates to the stacktrace 😄

douglascamata · 2023-01-06T10:29:40Z

@RohitKochhar what is resolving DNS in your cluster: CoreDNS or KubeDNS? Which k8s version are you running?

douglascamata · 2023-01-06T10:40:48Z

Interesting 🤔 according to the stacktrace, crash happens here i.e. pb.root is nil. I wonder how this DNS stuff relates to the stacktrace 😄

@GiedriusS are you sure that this is the correct reading stacktrace? I think tracing code will always be at the top of the stack because it wraps everything. But down there we can see github.com/thanos-io/thanos/pkg/query.(*EndpointSet).Update, that should be using the data produced by the code modified in 7968915. I think the nil pointer derefence comes from the new code somehow returning a nil and it blows up exactly at the 2nd innermost LoC in the stack (the 1st innermost is an anonymous go-function call):

thanos/pkg/query/endpointset.go

Line 382 in 3327c51

e.updateEndpoint(ctx, spec, newRef)

GiedriusS · 2023-01-06T10:49:18Z

Oops, yes, my bad 😞 you are correct 👍

RohitKochhar · 2023-01-09T13:01:36Z

Pinging for an update on this @douglascamata @fpetkovski

douglascamata · 2023-01-09T13:59:35Z

@RohitKochhar we need some extra information to try to reproduce the issue, investigate and work on a fix. See my previous comment: #6021 (comment).

RohitKochhar · 2023-01-09T15:12:44Z

Oh, sorry I didn't see that. We use CoreDNS @douglascamata

GiedriusS · 2023-01-09T15:44:00Z

@RohitKochhar do you have tracing enabled? Could we see the config?

douglascamata · 2023-01-09T15:45:58Z

I am not able to reproduce this in a prod-like cluster also using CoreDNS. We need your help @RohitKochhar to investigate this cluster and look for anything out of the ordinary.

RohitKochhar · 2023-01-09T15:54:58Z

Ok, I can form a better summary and share it soon

RohitKochhar · 2023-01-09T18:00:33Z

@douglascamata after looking into it further, I am unsure why I didn't catch this originally, but the breaking version is v0.29.0. I am able to upgrade to v0.28.0 without problem

douglascamata · 2023-01-09T18:28:03Z

There's some conflicting information then. I asked you to try two different images tags straight from the main branch container builds: main-2022-10-31-7968915 (has the bug) and main-2022-10-30-0e0f4fb (doesn't have the bug). Both of these images contain all the commits from v0.29.0 as they are part of the v0.30 release.

RohitKochhar · 2023-01-09T18:28:45Z

Yeah, I think I incorrectly synced the resources through ArgoCD. I am currently retrying main-2022-10-30-0e0f4fb

RohitKochhar · 2023-01-09T18:29:25Z

I can now confirm that main-2022-10-30-0e0f4fb does not work for our query and ruler pods

douglascamata · 2023-01-09T18:32:57Z

Okay. So that rules out what we thought caused your issue. 😄

If you have some time, please try to bisect the snapshot images to find exactly where it works. But seems like your pods do not like something in the miekdns resolver. It has been the default in Query since v0.23 and in Rule since v0.30.

douglascamata · 2023-01-09T18:34:13Z

Between v0.28 and v0.29 though we had no DNS changes.

RohitKochhar · 2023-01-09T18:35:22Z

Ok, I'll start working my way through the snapshot images between v0.28 and v0.29 to find the last version I can use before breaking

RohitKochhar · 2023-01-09T19:14:05Z

@douglascamata I added more information into the initial issue comment, including the logs for both ruler and query pods that are failing and the configuration used for both.

The latest snapshot I could use was main-2022-09-30-21aa1db0, anything from main-2022-10-02-d533abf8 causes the logs above. Can you point me to commits that were added between these images?

douglascamata · 2023-01-10T10:44:24Z

The latest snapshot I could use was main-2022-09-30-21aa1db0, anything from main-2022-10-02-d533abf8 causes the logs above. Can you point me to commits that were added between these images?

That's pretty easy. The image tag format is <BRANCH>-<DATE>-<COMMIT_HASH>.

From 21aa1db to d533abf we have only 64e7e3b.

Are you sure that main-2022-10-02-64e7e3bb worked? This came right after main-2022-09-30-21aa1db0 and is migrating tracing from OpenTracing to OTLP. This makes me think @GiedriusS's initial hunch was correct.

RohitKochhar · 2023-01-10T12:57:39Z

main-2022-10-02-64e7e3bb does not work, I think you are right that @GiedriusS initial hunch was correct. Any ideas for solutions?

RohitKochhar · 2023-01-10T13:34:33Z

Here is our tracing configuration:

type: jaeger
config:
  service_name: thanos-query
  endpoint: $(JAEGER_ENDPOINT)
  sampler_manager_host_port: $(JAEGER_SAMPLING_ENDPOINT)

With the jaeger endpoints pointing to our grafana-agent-trace urls

matej-g · 2023-01-10T15:12:21Z

Hmm this looks like #5872 which is fixed in #5887, so it should be fine, but it seems as if this was never merged back to main from 0.29, can you confirm @yeya24?

yeya24 · 2023-01-10T16:39:51Z

Yeah I think that's the same issue. We just never released v0.29.1 and never merge back those commits to main.

douglascamata · 2023-01-10T17:11:50Z

Great find, @matej-g! 👏

@yeya24 @matej-g: shall we merge it back then and release 0.29.1 and 0.30.2?

yeya24 · 2023-01-10T18:45:19Z

@douglascamata Yeah I think we just needs to do the release and that's it. I think we didn't release v0.29.1 previously because of #5862 IIRC. @GiedriusS

douglascamata · 2023-01-11T15:05:00Z

We need to release v0.29.1, then cherry pick the commit df047ef into main and release v0.30.2 with it.

xBazilio · 2023-01-20T16:02:56Z

Won't work df047ef#diff-eb3987ed9aba572232997e0b7878dbe798ccd79ca31bf8e933963e89d2901723R143

sampler still will be nil because "default" case will never be evaluated.

xBazilio · 2023-01-20T16:03:53Z

A solution would be removing "default:" or adding "fallthrough"

douglascamata · 2023-01-20T18:30:08Z

@xBazilio is this a theory or you tested it out?

xBazilio · 2023-01-20T19:02:00Z

@xBazilio is this a theory or you tested it out?

Yes, I've Tested. Deployed on dev, caught panic, searched and found this issue. Then i cherry-picked the commit, panic still existed. Then with debugger I found this switch/case issue and fixed it by removing "default:" line. Colleague suggested using "fallthrough" keyword instead, but right now it isn't necessary as "SamplerTypeRateLimiting" is the default.

douglascamata · 2023-01-23T11:04:52Z

@xBazilio would you like to open a PR with this fix?

xBazilio · 2023-01-23T12:26:54Z

I created the PR https://github.com/thanos-io/thanos/pull/6066/files
it's beeing reviewed

douglascamata · 2023-01-23T14:15:37Z

@xBazilio just missing the changelog entry and then it'll be perfect. Thanks for the contribution!

RohitKochhar · 2023-02-01T14:01:42Z

This seems to be solved by upgrading to v0.30.2, closing.

RohitKochhar changed the title ~~nil pointer dereference in v0.30.0~~ nil pointer dereference in main-2022-10-02-d533abf8 forward Jan 9, 2023

RohitKochhar closed this as completed Feb 1, 2023

nil pointer dereference in main-2022-10-02-d533abf8 forward #6021

nil pointer dereference in main-2022-10-02-d533abf8 forward #6021

Comments

RohitKochhar commented Jan 5, 2023 • edited Loading

douglascamata commented Jan 5, 2023 • edited Loading

GiedriusS commented Jan 6, 2023

douglascamata commented Jan 6, 2023

douglascamata commented Jan 6, 2023

GiedriusS commented Jan 6, 2023

RohitKochhar commented Jan 9, 2023

douglascamata commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

GiedriusS commented Jan 9, 2023

douglascamata commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

douglascamata commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023 • edited Loading

douglascamata commented Jan 9, 2023

douglascamata commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

RohitKochhar commented Jan 9, 2023

douglascamata commented Jan 10, 2023

RohitKochhar commented Jan 10, 2023

RohitKochhar commented Jan 10, 2023

matej-g commented Jan 10, 2023

yeya24 commented Jan 10, 2023

douglascamata commented Jan 10, 2023 • edited Loading

yeya24 commented Jan 10, 2023

douglascamata commented Jan 11, 2023

xBazilio commented Jan 20, 2023

xBazilio commented Jan 20, 2023 • edited Loading

douglascamata commented Jan 20, 2023

xBazilio commented Jan 20, 2023 • edited Loading

douglascamata commented Jan 23, 2023

xBazilio commented Jan 23, 2023

douglascamata commented Jan 23, 2023

RohitKochhar commented Feb 1, 2023

RohitKochhar commented Jan 5, 2023 •

edited

Loading

douglascamata commented Jan 5, 2023 •

edited

Loading

RohitKochhar commented Jan 9, 2023 •

edited

Loading

douglascamata commented Jan 10, 2023 •

edited

Loading

xBazilio commented Jan 20, 2023 •

edited

Loading

xBazilio commented Jan 20, 2023 •

edited

Loading