Fix slow shutdown #2113

Merged
merged 4 commits into from
Sep 9, 2024
Conversation

kradalby
Collaborator

@kradalby kradalby commented Sep 9, 2024

Turns out we were waiting for connections to close, without telling them to close...

Now:

  • Shut down the HTTP server, so no new client connections are accepted
  • Shut down the notifier, telling all the clients to disconnect
  • Wait for all polls to close
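
The three steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: `shutdown`, `closeNotifier`, and `runDemo` are hypothetical names, the polls are simulated by goroutines waiting on a channel, and the `http.Server` is never started.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

// shutdown is a hypothetical helper that runs the three steps in order.
func shutdown(ctx context.Context, srv *http.Server, closeNotifier func(), polls *sync.WaitGroup) error {
	// 1. Stop the HTTP server so no new client connections are accepted.
	if err := srv.Shutdown(ctx); err != nil {
		return fmt.Errorf("shutting down HTTP server: %w", err)
	}
	// 2. Tell every connected client to disconnect.
	closeNotifier()
	// 3. Only now wait for the poll handlers to return.
	polls.Wait()
	return nil
}

// runDemo simulates `clients` long-polling handlers and returns how many exited.
func runDemo(clients int) (int32, error) {
	done := make(chan struct{}) // closed in step 2 (assumed signalling mechanism)
	var polls sync.WaitGroup
	var exited int32
	for i := 0; i < clients; i++ {
		polls.Add(1)
		go func() {
			defer polls.Done()
			<-done // a poll stays open until it is told to close
			atomic.AddInt32(&exited, 1)
		}()
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	err := shutdown(ctx, &http.Server{}, func() { close(done) }, &polls)
	return atomic.LoadInt32(&exited), err
}

func main() {
	exited, err := runDemo(3)
	fmt.Println("polls exited:", exited, "error:", err)
}
```

Without step 2, `polls.Wait()` would block until the clients happened to disconnect on their own, which is the kind of hang this PR describes.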

Fixes #1968

Signed-off-by: Kristoffer Dalby <kristoffer@tailscale.com>

Summary by CodeRabbit

  • New Features

    • Introduced improved error handling for server shutdown scenarios, enhancing clarity in error reporting.
    • Added a mechanism in the Notifier to track its lifecycle, preventing operations after closure.
  • Bug Fixes

    • Resolved an issue where shutting down the headscale application would hang, ensuring a smoother shutdown process.
  • Documentation

    • Updated the changelog to reflect the recent fixes and improvements.

Contributor

coderabbitai bot commented Sep 9, 2024

Walkthrough

The changes in this pull request enhance error handling in the serveCmd command, streamline the shutdown process in the Headscale struct, and improve the lifecycle management of the Notifier struct. The modifications include differentiating between server closure and other errors, reorganising shutdown procedures for clarity, and introducing a boolean field to track the notifier's state. These updates aim to improve the robustness and clarity of the application's error reporting and resource management.

Changes

Files | Change Summary
cmd/headscale/cli/serve.go | Improved error handling in serveCmd to differentiate server closure from other errors. Added necessary imports.
hscontrol/app.go | Reordered and consolidated shutdown procedures for better clarity and efficiency.
hscontrol/notifier/notifier.go | Introduced closed boolean in Notifier to manage its lifecycle and prevent operations after closure.
CHANGELOG.md | Added entry for fix addressing hang issue during shutdown of headscale.

Assessment against linked issues

Objective Addressed Explanation
Improve shutdown speed for systemctl stop headscale (#1968)

Possibly related PRs

Poem

🐇 In the garden where headscale plays,
Errors now dance in clearer ways.
With notifiers closed, they won't misbehave,
A smoother shutdown, our code we save!
Hops of joy as we tidy the flow,
In the world of servers, watch us grow! 🌼

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5597eda and 1ac07cd.

Files selected for processing (3)
  • cmd/headscale/cli/serve.go (2 hunks)
  • hscontrol/app.go (3 hunks)
  • hscontrol/notifier/notifier.go (7 hunks)
Additional comments not posted (10)
cmd/headscale/cli/serve.go (2)

4-5: Approved import additions.

The new imports for errors and net/http are necessary for the updated error handling logic and are correctly placed.


28-29: Improved error handling in server shutdown.

The updated error handling logic now correctly differentiates between a normal server closure and other errors, preventing the logging of misleading fatal errors during a normal shutdown. This is a significant improvement in the robustness of the application.

hscontrol/notifier/notifier.go (7)

39-39: Approved addition of closed field.

The addition of the closed boolean field to the Notifier struct is a good practice for managing the lifecycle of the notifier. It helps ensure that no operations are performed on a closed notifier, which can prevent resource leaks and other errors.


47-47: Approved initialization of closed field in NewNotifier.

The initialization of the closed field to false in the NewNotifier function is correctly implemented. This ensures that the notifier starts in an active state, ready to handle operations.


85-87: Approved early return in AddNode method.

The addition of an early return in the AddNode method when the notifier is closed is a good practice. It prevents the addition of new nodes to a closed notifier, maintaining the integrity of the notifier's state and avoiding potential errors.


115-117: Approved early return in RemoveNode method.

The implementation of an early return in the RemoveNode method when the notifier is closed is correctly done. This prevents modifications to the notifier's state after it has been closed, which is essential for maintaining the robustness of the system.


177-179: Approved check for closed state in NotifyWithIgnore.

The addition of a check for the closed state in the NotifyWithIgnore method is a good practice. It ensures that no updates are sent from a closed notifier, which helps maintain the stability and predictability of the notification system.


197-199: Approved consistency in checking closed state in NotifyByNodeID.

The implementation of a check for the closed state in the NotifyByNodeID method is consistent with other methods in the Notifier struct. This consistency is crucial for ensuring that the notifier does not perform operations after it has been closed, maintaining the robustness of the system.


236-238: Approved check for closed state in sendAll.

The implementation of a check for the closed state in the sendAll method is a good practice. It ensures that updates are not sent to all nodes from a closed notifier, which is crucial for preventing errors that could affect the entire network of nodes.
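
The closed-flag pattern described in these comments can be sketched as below. This is an illustrative reduction, not the code in notifier.go: the `update` type, the `bool` return values, and the method signatures are assumptions made so the behaviour is observable, and only two of the guarded methods are shown.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical payload type used only for this sketch.
type update struct{ msg string }

// Notifier tracks per-node channels and whether it has been closed.
type Notifier struct {
	mu     sync.Mutex
	nodes  map[uint64]chan<- update
	closed bool
}

func NewNotifier() *Notifier {
	return &Notifier{nodes: make(map[uint64]chan<- update), closed: false}
}

// AddNode registers a node; it returns early once the notifier is closed.
func (n *Notifier) AddNode(id uint64, c chan<- update) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.closed {
		return false
	}
	n.nodes[id] = c
	return true
}

// NotifyByNodeID sends to one node; it also returns early when closed.
func (n *Notifier) NotifyByNodeID(id uint64, u update) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.closed {
		return false
	}
	c, ok := n.nodes[id]
	if !ok {
		return false
	}
	select {
	case c <- u:
		return true
	default: // drop the update rather than block while holding the lock
		return false
	}
}

// Close marks the notifier closed and closes every node channel, which is
// how connected clients learn that they should disconnect.
func (n *Notifier) Close() {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.closed {
		return
	}
	n.closed = true
	for id, c := range n.nodes {
		close(c)
		delete(n.nodes, id)
	}
}

func main() {
	n := NewNotifier()
	fmt.Println(n.AddNode(1, make(chan update, 1)))
	fmt.Println(n.NotifyByNodeID(1, update{"hello"}))
	n.Close()
	fmt.Println(n.AddNode(2, make(chan update, 1)))
	fmt.Println(n.NotifyByNodeID(1, update{"late"}))
}
```

Holding the same mutex in `Close` and in the guarded methods means no send can race with a channel being closed.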

hscontrol/app.go (1)

797-802: Ensure proper resource management during shutdown.

The changes made in lines 797-802 address the slow shutdown issue by ensuring that resources are properly managed and released during the shutdown process. The reordering of the shutdown sequence, specifically the closing of the node notifier and waiting for the netmap stream to close, is a critical improvement.

However, it's important to ensure that all resources are accounted for and that there are no potential leaks or deadlocks. The use of sync.WaitGroup for managing concurrent operations is appropriate, but thorough testing is necessary to ensure that all goroutines are properly synchronized and that the WaitGroup counter is correctly managed throughout the application lifecycle.

Consider adding more detailed logging at each step of the shutdown process to help diagnose issues and ensure that each component is shut down in the correct order and timely manner. This could be particularly useful in production environments where shutdown behaviours might vary due to different loads or configurations.

Additionally, review the entire shutdown sequence to ensure that all network connections, file handles, and other resources are properly closed. This includes verifying that all goroutines are terminated, which can be done by checking for active goroutines before and after the shutdown sequence in a test environment.

Run the following script to verify the proper management of resources during shutdown:

Verification successful

Shutdown sequence is well-managed and logged.

The shutdown sequence in hscontrol/app.go is comprehensive and includes appropriate use of sync.WaitGroup and logging mechanisms. The sequence ensures that resources are closed in a specific order, and the logging provides visibility into the process, aiding in diagnosing potential issues. The original review comment's suggestion for detailed logging is already implemented in the codebase.

  • The sync.WaitGroup is used effectively to manage concurrent operations during shutdown.
  • Logging statements are present throughout the shutdown process, providing detailed insights.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that all resources are properly managed during shutdown.

# Test: Check for resource leaks or deadlocks during shutdown. Expect: No leaks or deadlocks.
# This script assumes the presence of a testing framework that can simulate shutdown scenarios.
# Replace `simulate_shutdown` with the actual command or script that triggers the shutdown process.
simulate_shutdown
# Check system logs or monitoring tools to verify that all resources are released and no errors are reported.

Length of output: 104


Script:

#!/bin/bash
# Search for the usage of sync.WaitGroup in the codebase to understand its role in the shutdown process.
ast-grep --lang go --pattern 'sync.WaitGroup' 

# Search for functions or methods related to shutdown processes.
rg 'shutdown|close|terminate' -A 5

# Check for logging mechanisms related to shutdown processes.
rg 'log|trace' -A 5

Length of output: 221008

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1ac07cd and 71ef689.

Files selected for processing (1)
  • CHANGELOG.md (1 hunks)
Additional comments not posted (1)
CHANGELOG.md (1)

77-77: Approve the changelog entry with a suggestion.

The entry clearly documents the fix for the slow shutdown issue and correctly links to PR #2113. Ensure that the release date 2023-XX-XX for version 0.23.0 is updated to the actual release date upon finalising the version.

@kradalby kradalby marked this pull request as ready for review September 9, 2024 08:54
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 71ef689 and f7444d7.

Files selected for processing (1)
  • hscontrol/app.go (1 hunks)
Files skipped from review as they are similar to previous changes (1)
  • hscontrol/app.go

@kradalby kradalby enabled auto-merge (squash) September 9, 2024 10:19
@kradalby kradalby merged commit 60b94b0 into juanfont:main Sep 9, 2024
117 checks passed
Successfully merging this pull request may close these issues.

[Bug] systemctl stop headscale is very slow!