Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-cursor: pending task count monitoring & mitigation #3275

Merged
merged 4 commits into from
Aug 31, 2022

Conversation

yycptt
Copy link
Member

@yycptt yycptt commented Aug 29, 2022

What changed?

  • Add monitoring and mitigation logic for # of pending task in a queue.
  • For mitigation, basic idea is to find the namespace with the highest # of pending tasks, split it into the next reader, and unload pending tasks.

Why?

  • Multi-cursor policy

How did you test it?

  • Testing locally.

Potential risks

Is hotfix candidate?

  • No.

@yycptt yycptt requested review from yux0 and yiminc August 29, 2022 18:59
@yycptt yycptt requested a review from a team as a code owner August 29, 2022 18:59
common/dynamicconfig/constants.go Outdated Show resolved Hide resolved
service/history/queues/action_pending_task_count.go Outdated Show resolved Hide resolved
service/history/queues/action_pending_task_count.go Outdated Show resolved Hide resolved
} else {
nextReader = readerGroup.NewReader(readerID+1, splitSlices...)
}
nextReader.Pause(clearSliceThrottleDuration)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pause here? (for 10s?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we just did an unloading, without pause those tasks may be loaded back immediately.
Although nextReader will use a lower max task count so it won't trigger another unloading, but I think it will still increase the chance of another unloading. So pausing for a while and let some in memory tasks get processed first before attempting another load.

AlertType AlertType
AlertAttributesReaderStuck *AlertAttributesReaderStuck
AlertType AlertType
AlertAttributesQueuePendingTaskCount *AlertAttributesQueuePendingTaskCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove Queue from the type name and variable name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Queue in the names since we may/will monitor # of pending tasks on a host level in the future, so want to avoid naming confusion when we do that.

service/history/queues/monitor.go Outdated Show resolved Hide resolved
@yycptt yycptt merged commit 5ab8ab8 into temporalio:master Aug 31, 2022
@yycptt yycptt deleted the pending-task-action branch August 31, 2022 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants