[BUG] Values of new metrics not renderable with realtime-index enabled. #395

Closed · timw6n opened this issue Jan 19, 2021 · 13 comments

timw6n commented Jan 19, 2021

Describe the bug

Quite a similar situation to that reported in #372. We have a large cluster (~3 million metrics) and so make heavy use of the cache, with max-updates-per-second throttled down to keep disk activity within sensible limits.

We recently upgraded to 0.15.5 and at that point enabled realtime-index and concurrent-index. The issue we're having at present is that the names of new metrics are visible immediately (and so can be found by Graphite find queries), but their datapoints cannot be retrieved from the cache (render queries) until the whisper files end up on disk. New datapoints on existing metrics pull through from the cache fine.
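
Roughly how we observe it (hypothetical host and metric names, and assuming carbonserver's find/render endpoints accept format=json):

# the name of a brand-new metric is findable straight away ...
curl 'http://go-carbon-host:80/metrics/find/?format=json&query=new.metric.path'
# ... but rendering its datapoints from the cache returns a 500 until the whisper file hits disk
curl 'http://go-carbon-host:80/render/?format=json&target=new.metric.path&from=-1h'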

The logged error seems to be:

[2021-01-19T16:58:32.475Z] ERROR [access] fetch failed {"handler": "render", "url": "/render/?format=protobuf&from=1611064712&target=[redacted]&until=1611075512", "peer": "10.11.173.11:44604", "carbonapi_uuid": "04877f92-6d7c-4ae4-b593-86d3123f7f12", "carbonzipper_uuid": "04877f92-6d7c-4ae4-b593-86d3123f7f12", "format": "carbonapi_v2_pb", "targets": ["[redacted]], "runtime_seconds": 0.001309091, "reason": "panic during serving the request", "stack": "github.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).renderHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:201\nruntime.gopanic\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/panic.go:969\nruntime.panicmem\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/panic.go:212\nruntime.sigpanic\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/runtime/signal_unix.go:720\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchFromCache\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchfromcache.go:56\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchSingleMetric\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchsinglemetric.go:141\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchSingleMetricV2\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/fetchsinglemetric.go:162\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchDataPB\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:543\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).prepareDataProto\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:394\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).fetchWithCache\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:311\ngithub.com/go-graphite/go-carbon/carbonserver.(*CarbonserverListener).renderHandler\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/render.go:209\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\ngithub.com/go-graphite/go-carbon/carbonserver.TraceHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/carbonserver/trace.go:140\ngithub.com/dgryski/httputil.TimeHandler.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/dgryski/httputil/times.go:26\ngithub.com/dgryski/httputil.TrackConnections.func1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/dgryski/httputil/track.go:40\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\nnet/http.(*ServeMux).ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2417\ngithub.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1\n\t/home/travis/gopath/src/github.com/go-graphite/go-carbon/vendor/github.com/NYTimes/gziphandler/gzip.go:287\nnet/http.HandlerFunc.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2042\nnet/http.serverHandler.ServeHTTP\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:2843\nnet/http.(*conn).serve\n\t/home/travis/.gimme/versions/go1.15.3.linux.amd64/src/net/http/server.go:1925", "error": "runtime error: invalid memory address or nil pointer 
dereference", "http_code": 500}

Go-carbon Configuration:

[common]
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "tcp://[redacted]:2003"
max-cpu = 4
metric-interval = "1m0s"

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 400
sparse-create = false
flock = true
enabled = true
hash-filenames = true

[cache]
max-size = 500000000
write-strategy = "noop"

[tcp]
listen = ":2003"
enabled = true
buffer-size = 2000000
compression = ""

[carbonlink]
enabled = false

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
listen = ":80"
enabled = true
query-cache-enabled = true
query-cache-size-mb = 4096
find-cache-enabled = true
buckets = 10
max-globs = 1000
fail-on-max-globs = false
metrics-as-counters = false
trie-index = true
concurrent-index = true
realtime-index = 1000000
trigram-index = false
cache-scan = false
graphite-web-10-strict-mode = true
internal-stats-dir = "/opt/go-carbon/carbonserver"
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "10m0s"
stats-percentiles = [99, 95, 75, 50]

[dump]
enabled = false

[pprof]
enabled = false

[[logging]]
logger = ""
file = "stdout"
level = "error"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

Storage schemas

Everything relevant to this issue is on a 1m:30d retention.
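
For reference, a 1m:30d retention corresponds to a storage-schemas.conf stanza roughly like the one below (stanza name and pattern are illustrative, not our exact config):

[everything]
pattern = .*
retentions = 1m:30d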

timw6n added the bug label Jan 19, 2021

bom-d-van (Member) commented:

Hi @timw6n-thg, can you try enabling cache-scan as well, to see if the panic is resolved?

realtime-index and concurrent-index reuse part of its implementation.
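
That would be roughly this change in the [carbonserver] section of the config you posted (everything else unchanged):

[carbonserver]
trie-index = true
concurrent-index = true
realtime-index = 1000000
# flip this from false to true to test whether the panic goes away
cache-scan = true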

bom-d-van (Member) commented:

I have also pushed a fix; with the new changes, we don't have to enable cache-scan when using realtime-index.

timw6n commented Jan 20, 2021

Thanks for that fix @bom-d-van. I've applied the patch to my cluster and it seems to have improved the situation.

What's happening now seems to be that the metrics are being added to the index immediately (and are now renderable at that time 😀) but are being removed the next time the disk scan runs, presumably as I have cache-scan = false.

I will give enabling cache-scan a go tomorrow. I've previously had difficulty with that on older versions, though, presumably because of the volume of cached metrics (130 million data points at the moment, to give you an idea of scale).

bom-d-van (Member) commented:

> but are being removed the next time the disk scan runs, presumably as I have cache-scan = false

@timw6n-thg Hmm, this sounds odd. Was concurrent-index still enabled in the test?

timw6n commented Jan 20, 2021

Yes, that was enabled. Config unchanged from above.

bom-d-van commented Jan 20, 2021

Just in case, was there any restart before the disk scan during your test? If so, I might have an explanation.

timw6n commented Jan 20, 2021

No restarts, I'm afraid.

timw6n commented Jan 20, 2021

Other than the one when I deployed the new go-carbon version, of course, but that was before I started sending the new metrics.

bom-d-van (Member) commented:

Oh, I think I have a good explanation now. Yeah, it's a bug in the realtime index.

But your scan-frequency is 10m, so that means even after 10 minutes the new metrics are still not flushed to disk.

timw6n commented Jan 20, 2021

Yeah, we're running a queue writeout time of about 2 hours at the moment, which I'm aware is probably not a normal configuration. Writes are throttled down a lot to protect the underlying storage.

bom-d-van (Member) commented:

I'll see how I can fix the issue. cache-scan is probably not helpful for this case, but you can still give it another shot.

If storage is a big factor for your cluster, have you tried out the compressed feature? There should be some performance gains with it, but you would lose out-of-order updates on the metrics.

https://github.com/go-graphite/go-whisper#compressed-format
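
If I remember correctly it's toggled in the [whisper] section, something like the snippet below (flag name from memory, so please double-check it against the go-carbon README):

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
# assumed flag: enables the go-whisper compressed (cwhisper) format for newly created files
compressed = true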

bom-d-van added a commit that referenced this issue Jan 20, 2021
… after disk scan #395

New unflushed metrics might still be unreachable due to restart though.
But it is a different case that happens less frequently and could be
addressed in a different commit.
bom-d-van commented Jan 20, 2021

@timw6n-thg I was wrong again. Actually, enabling cache-scan would help avoid the issue.
But again, I have pushed a fix to make sure that new un-flushed metrics can be re-indexed after a disk scan: #396

(I also handled a bug related to cache-scan: new metrics weren't being removed from memory after they were persisted. Not sure if it's related to the problem you had when testing cache-scan.)

timw6n commented Jan 22, 2021

That last patch seems to have done the trick. New metrics are pulling through correctly straight away. Thank you.

deniszh added a commit that referenced this issue Jan 24, 2021
trie/realtime: set wspConfigRetriever when realtime index is enabled #395
timw6n closed this as completed May 17, 2021