[Enhancement] WAL Disk Usage Metrics #2043

Open
CtrlAltDft opened this issue Sep 27, 2024 · 0 comments
Labels
enhancement New feature or request

Who is this for and what problem do they have today?

Target Audience:

  • Engineers who manage and maintain AutoMQ clusters, particularly those deployed on AWS infrastructure using EBS volumes for Write-Ahead Log (WAL) storage.

Problem Statement:

  • Lack of Visibility into WAL Disk Usage:
    • AutoMQ uses EBS volumes mounted as block devices for WAL storage to optimize performance.
    • Standard disk usage monitoring tools (for example, df or file-system-level exporters) cannot report free space on raw block devices, because there is no mounted file system to query.
    • The only metrics currently available are IOPS and read/write throughput, which say nothing about how much of the WAL device is actually in use.
  • Operational Challenges:
    • Risk of Disk Exhaustion: Without accurate WAL disk usage metrics, the device can fill up unexpectedly, which can lead to system crashes or data loss.
    • Capacity Planning Difficulties: Inability to forecast when additional storage is needed hampers proactive resource management.
    • Alerting Limitations: Lack of thresholds and alerts for disk usage prevents timely intervention before critical issues arise.
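
To make the capacity planning point concrete, here is a rough sketch of the forecast that a WAL usage metric would enable. It assumes a hypothetical series of `(timestamp_seconds, used_bytes)` samples and a known device capacity; neither is exposed by AutoMQ today, which is the gap this issue describes. It also assumes roughly linear growth over the sample window, which may not hold if WAL space is reclaimed periodically.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_full(
    samples: Sequence[Tuple[float, float]],
    capacity_bytes: float,
) -> Optional[float]:
    """Estimate seconds until the WAL device is full.

    samples: hypothetical (timestamp_seconds, used_bytes) pairs, oldest first.
    Returns None when no forecast is possible (too few samples, no time
    spread, or usage that is flat or shrinking).
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None

    # Least-squares slope: growth in bytes per second.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None

    _, last_used = samples[-1]
    return max(0.0, (capacity_bytes - last_used) / slope)
```

For example, samples of 100, 200, and 300 bytes taken 10 seconds apart on a 1000-byte device give a growth rate of 10 bytes per second and an estimate of 70 seconds until full.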

Why is solving this problem impactful?

  • Ensures System Reliability and Stability:
    • Monitoring WAL disk usage helps prevent service interruptions caused by full disks.
    • Enables proactive maintenance, reducing the risk of data loss or corruption.
  • Improves Operational Efficiency:
    • Provides administrators with the necessary insights to make informed decisions about scaling storage resources.
    • Facilitates capacity planning, ensuring that resources are allocated efficiently and cost-effectively.
  • Enhances Monitoring and Alerting Capabilities:
    • Allows integration with existing monitoring tools to set up alerts and notifications when disk usage reaches critical levels.
    • Empowers teams to respond quickly to potential issues, minimizing downtime.
  • Aligns with Best Practices:
    • Adheres to industry standards for system monitoring and observability.
    • Helps maintain high availability and performance of AutoMQ clusters.
  • Supports Autoscaling Efforts:
    • Accurate metrics are essential for implementing event-driven autoscaling, ensuring that scaling actions are based on reliable data.
    • Enhances the effectiveness of auto-balancing mechanisms by providing comprehensive system insights.
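
As an illustration of the alerting use case, the sketch below maps a usage reading to an alert level. The function name and the 75% / 90% thresholds are assumptions chosen for the example, not AutoMQ defaults; in practice this logic would more likely live in the alerting rules of whatever monitoring system consumes the metric.

```python
def classify_wal_usage(
    used_bytes: int,
    capacity_bytes: int,
    warn_ratio: float = 0.75,      # assumed threshold, tune per cluster
    critical_ratio: float = 0.90,  # assumed threshold, tune per cluster
) -> str:
    """Return "ok", "warning", or "critical" for a WAL usage reading."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")

    ratio = used_bytes / capacity_bytes
    if ratio >= critical_ratio:
        return "critical"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```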

Additional Notes

  • Possible Solutions:
    • Expose WAL Disk Usage Metrics:
      • AutoMQ could provide built-in metrics for WAL disk utilization accessible via standard monitoring interfaces (e.g., JMX, Prometheus exporters).
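
A minimal sketch of what the exposed metrics could look like in the Prometheus text exposition format. The metric names `automq_wal_used_bytes` and `automq_wal_capacity_bytes` and the `node_id` label are hypothetical and only meant to illustrate the request; the actual names and the way AutoMQ would obtain the values are up to the maintainers.

```python
def render_wal_metrics(used_bytes: int, capacity_bytes: int, node_id: str) -> str:
    """Render two gauges in the Prometheus text exposition format.

    All metric and label names here are hypothetical placeholders.
    """
    # Escape characters that are special inside a label value.
    label = (
        node_id.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )

    gauges = (
        ("automq_wal_used_bytes", "Bytes currently used on the WAL device.", used_bytes),
        ("automq_wal_capacity_bytes", "Total capacity of the WAL device in bytes.", capacity_bytes),
    )

    lines = []
    for name, help_text, value in gauges:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{node_id="{label}"}} {value}')
    return "\n".join(lines) + "\n"
```

With both gauges available, a utilization ratio (used divided by capacity) can be computed on the monitoring side, so alert thresholds do not need to be hard-coded in the broker.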

CtrlAltDft added the enhancement label on Sep 27, 2024.