fix(prometheus): http server exit on error #590

Merged 5 commits on Mar 3, 2025

Conversation

@dnut (Contributor) commented Feb 28, 2025

Currently, if the Prometheus HTTP server hits an error, it just prints the name of the error to stderr and exits the thread, so metrics stop being served. All kinds of errors can happen in here, but that doesn't mean we should stop serving metrics. It's better to log the error and keep running. I used `runService` from the service manager to accomplish this, with a bit of refactoring in how the thread is spawned.
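
For readers unfamiliar with the pattern, the "log the error and keep running" idea can be sketched roughly as below. This is an illustration in Python, not the Zig code in this PR. `run_supervised`, `serve`, `stop`, and `retry_delay_s` are hypothetical names, and nothing here is meant to describe how `runService` is actually implemented.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger("metrics")


def run_supervised(
    serve: Callable[[], None],
    stop: threading.Event,
    retry_delay_s: float = 1.0,
) -> int:
    """Call `serve` repeatedly until `stop` is set.

    Errors raised by `serve` are logged and the call is retried, so one
    failure does not end the thread. A normal return from `serve` also
    leads to another call unless `stop` has been set in the meantime.
    Returns the number of failures that were logged.
    """
    failures = 0
    while not stop.is_set():
        try:
            serve()  # expected to block while the server is healthy
        except Exception as err:
            failures += 1
            logger.error("metrics server failed: %r; restarting", err)
            # Event.wait acts as a sleep that ends early if `stop` is set.
            stop.wait(retry_delay_s)
    return failures
```

A caller would typically run this on its own thread, for example `threading.Thread(target=run_supervised, args=(serve, stop), daemon=True).start()`, and set `stop` during shutdown. A loop like this only helps if each restart produces a server that actually responds again.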

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/utils/service.zig | 93.75% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/prometheus/http.zig | 0.00% <ø> (ø) |
| src/utils/service.zig | 85.93% <93.75%> (+0.45%) ⬆️ |

... and 3 files with indirect coverage changes

@dnut dnut added this pull request to the merge queue Mar 3, 2025
Merged via the queue into main with commit 2349e19 Mar 3, 2025
17 checks passed
@dnut dnut deleted the dnut/metrics/server-resilience branch March 3, 2025 15:38
github-merge-queue bot pushed a commit that referenced this pull request Mar 8, 2025
## Problems

Prometheus does not consistently get metrics from our metrics endpoint.
The metrics are intermittent.

![image (2)](https://github.com/user-attachments/assets/b0a34b63-7a9a-487e-8c51-a37a9c3fa223)

Also, the server often crashes due to errors. In #590 I added a wrapper
to catch these errors and restart the server. Sometimes, though, an
error is handled and a restart is attempted, but the server never
responds to any HTTP requests again, so the metrics stay completely
inaccessible until sig itself is restarted. This is usually caused by
`error.BrokenPipe`.

## Solution

> Revert "fix(prometheus): remove httpz again and fix prometheus metrics
> (#555)"
> 
> This reverts commit 3502333.

I understand the desire to eliminate unnecessary dependencies. In the
grand scheme of things, eliminating httpz is not that important. Having
a working metrics endpoint is critical though. We have a solution that
works with httpz. We should test more thoroughly before replacing it.