
[batch] collection of issues arising from Grafana alerts in January #14240

Closed
danking opened this issue Feb 1, 2024 · 1 comment

danking commented Feb 1, 2024

What happened?

Still a lot of deadlock errors, largely from MJC: https://cloudlogging.app.goo.gl/N8hoXPWYYWLiDPPi9

Looks like workers are leaving tasks running when they shut down: https://cloudlogging.app.goo.gl/JFYoACF9qcDvCaqk8

Looks like we need to set the severity correctly in the worker logs. I'm also seeing a lot of this:

WARNING: Published ports are discarded when using host network mode

Also, it looks like we incorrectly log a ContainerTimeoutError at error severity even though that's a user error: https://cloudlogging.app.goo.gl/TUGWNxnFiBiEdsDo9
Moved to #14262

And we log ImageCannotBePulled as an error even though that's also a user error: https://cloudlogging.app.goo.gl/TchqwUKNCrd6qqmh7 (see the severity-mapping sketch below).

Also a few like this: Unknown child process pid 12331, will report returncode 255
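
The two user-error items above (ContainerTimeoutError and ImageCannotBePulled) come down to severity classification in the worker logs. Below is a minimal sketch of one way to keep user errors out of error-severity alerts, assuming a hypothetical UserError base class and a JSON formatter that emits the severity field Cloud Logging parses; only the two exception names come from this issue, the rest is illustrative and not the actual worker code.

```python
import json
import logging
import sys


class UserError(Exception):
    """Failures caused by user input (bad image, user-set timeout), not the system."""


class ContainerTimeoutError(UserError):
    pass


class ImageCannotBePulled(UserError):
    pass


class CloudLoggingSeverityFormatter(logging.Formatter):
    """Emit JSON lines with a `severity` field so Cloud Logging picks up the level."""

    LEVEL_TO_SEVERITY = {
        logging.DEBUG: 'DEBUG',
        logging.INFO: 'INFO',
        logging.WARNING: 'WARNING',
        logging.ERROR: 'ERROR',
        logging.CRITICAL: 'CRITICAL',
    }

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            'severity': self.LEVEL_TO_SEVERITY.get(record.levelno, 'DEFAULT'),
            'message': record.getMessage(),
        })


log = logging.getLogger('worker')
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(CloudLoggingSeverityFormatter())
log.addHandler(handler)
log.setLevel(logging.INFO)


def log_job_failure(exc: Exception) -> None:
    if isinstance(exc, UserError):
        # Expected operationally: keep these below ERROR so they don't trip error alerts.
        log.warning('job failed due to user error: %s', exc)
    else:
        log.error('job failed due to system error: %s', exc, exc_info=exc)
```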

Version

0.2.127

Relevant log output

No response


daniel-goldstein commented Feb 6, 2024

> Looks like we need to set the severity correctly in the worker logs. I'm also seeing a lot of this:
>
> WARNING: Published ports are discarded when using host network mode
>
> Also, it looks like we incorrectly log a ContainerTimeoutError at error severity even though that's a user error: https://cloudlogging.app.goo.gl/TUGWNxnFiBiEdsDo9

For what it's worth, this was showing up as an info log because it is a Docker log message rather than one from our code, so it doesn't go through our logging filters. It appeared in the Google Cloud Logging query because the query included this line:

severity=ERROR OR WARNING

which means "logs whose severity is ERROR, or whose log entry contains the text WARNING". That is not equivalent to severity=ERROR OR severity=WARNING, which does not match that log entry. Either way, #14252 gets rid of that log message entirely.

danking pushed a commit that referenced this issue Feb 6, 2024
As discussed in #14240, we emit warnings on database deadlocks, which
there are enough of to trigger noisy alerts. Since there's nothing to be
done operationally (and there's no current work underway to get rid of
them), these alerts only contribute to alert fatigue and hide potential
problems in the system that could be addressed. This demotes a deadlock
to the `info` level so we can still see how often they occur but are not
alerted by them. In the future when we resolve the current deadlock we
can re-escalate this error so that we can catch new deadlocks that are
introduced.
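
A minimal sketch of the kind of demotion described above, assuming pymysql-style errors where args[0] carries the MySQL error code (1213 is "Deadlock found when trying to get lock"); this is illustrative, not the actual change.

```python
import logging

import pymysql

log = logging.getLogger('db')

MYSQL_DEADLOCK = 1213  # "Deadlock found when trying to get lock; try restarting transaction"


def log_operational_error(exc: pymysql.err.OperationalError) -> None:
    code = exc.args[0] if exc.args else None
    if code == MYSQL_DEADLOCK:
        # Deadlocks are retried and currently expected, so record them at info:
        # still visible in dashboards, but below the warning-based alert threshold.
        log.info('transaction deadlocked, will retry: %s', exc)
    else:
        log.warning('operational error: %s', exc, exc_info=exc)
```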