
WIP: doc how to find out the bottleneck resources for troubleshooting high latency issues #1927

Draft
st1page wants to merge 3 commits into main

Conversation

st1page (Contributor) commented Mar 5, 2024

Info

  • Description

    • [ What's changed? Which parts of the docs are affected? ]
  • Notes

    • [ Include any supplementary context or references here. ]
  • Related code PR

    • [ Provide a link to the relevant code PR here, if applicable. ]
  • Related doc issue

    Resolves [ Provide a link to the relevant doc issue here, if applicable. ]

For reviewers

  • Preview

    • [ Paste the preview link to the updated page(s) here. Edit this item after the preview site is ready. To find the updated pages, scroll down to locate and open the Amplify preview link and select the dev version of the documentation. ]
  • Key points

    • [ Parts that may need revision or extra consideration. ]

Before merging

  • I have checked the doc site preview, and the updated parts look good.

  • I have acquired the approval from the owner (and optionally the reviewers) of the code PR and at least one tech writer (CharlieSYH, emile-00, & hengm3467).

This pull request is automatically being deployed by Amplify Hosting.

Access this pull request here: https://pr-1927.d2fbku9n2b6wde.amplifyapp.com

fuyufjh (Member) commented Mar 6, 2024

LGTM 👍. It would be even better if we could add some example screenshots to illustrate what the normal and abnormal cases look like.


**Grafana dashboard (dev)** > **Cluster Node** > **Node CPU** panel, and find the "cpu usage (avg per core) - compute" time series
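
For readers without Grafana access, a rough way to approximate the same "cpu usage (avg per core) - compute" number is to query Prometheus directly. This is only a sketch: the Prometheus URL, the `job` label matcher, and the per-node core count below are assumptions that may differ per deployment; the Grafana panel definition remains the authoritative query.

```python
import requests

PROMETHEUS_URL = "http://localhost:9500"   # assumed Prometheus endpoint
CORES_PER_COMPUTE_NODE = 8                 # assumed; set to your node size

# Total CPU seconds consumed per second by compute-node processes.
# process_cpu_seconds_total is the standard Prometheus process metric;
# the job label matcher is an assumption about how the scrape jobs are named.
QUERY = 'sum(rate(process_cpu_seconds_total{job=~".*compute.*"}[1m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    usage = float(sample["value"][1])
    # A per-core average persistently close to 1.0 means compute CPU is saturated.
    print(f"avg per core: {usage / CORES_PER_COMPUTE_NODE:.2f}")
```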

### State bottleneck (write & compaction)
A reviewer (Contributor) commented on the diff:

IMO the key metric should be the backpressure rate.
If I understand correctly, when it's "slow" and hits some bottleneck, there must be some backpressured streaming jobs.
So the first step should be identifying the jobs with the highest backpressure value.

st1page (Contributor, Author) replied on Mar 25, 2024:

Yes, it is covered in the previous chapter "## Diagnosis —— find out the bottleneck streaming job", which has already been published in our docs.
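
As a rough illustration of that step, one might rank fragments by their output-buffer blocking ratio straight from Prometheus. The metric name `stream_actor_output_buffer_blocking_duration_ns` and its labels are assumptions based on older RisingWave dashboards and may differ across versions; the Grafana backpressure panel described in the published chapter is the authoritative reference.

```python
import requests

PROMETHEUS_URL = "http://localhost:9500"  # assumed Prometheus endpoint

# Blocking nanoseconds per second == the fraction of time an actor's output is
# backpressured (1.0 means fully blocked). Metric and label names are
# assumptions and may differ by RisingWave version.
QUERY = (
    "topk(10, avg by (fragment_id, downstream_fragment_id) ("
    "rate(stream_actor_output_buffer_blocking_duration_ns[1m])) / 1e9)"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    ratio = float(sample["value"][1])
    # The downstream of the most backpressured fragments is the likely bottleneck.
    print(sample["metric"], f"backpressure ratio: {ratio:.2f}")
```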

st1page changed the title from "doc how to find out the bottleneck resources for troubleshooting high latency issues" to "WIP: doc how to find out the bottleneck resources for troubleshooting high latency issues" on Mar 25, 2024
st1page marked this pull request as ready for review on March 25, 2024 at 06:19
st1page marked this pull request as draft on March 25, 2024 at 06:19