
Dashboard Spec - Volume Dashboard #224

Open
julienlim opened this issue Aug 8, 2017 · 10 comments

Comments


julienlim commented Aug 8, 2017

Dashboard Spec - Volume Dashboard

Display a default dashboard for a Gluster volume present in Tendrl that provides at-a-glance information about a single Gluster volume, including health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw the Tendrl user's (e.g. Gluster Administrator's) attention to potential issues in the volume, bricks, and disks.

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the volume dashboard:

  • Is my volume up and running, is it healthy?
  • Is there a problem with my volume?
  • What’s actually wrong with my storage volume, and why is it slow?
  • Is my storage volume filling up too fast?
  • When will my storage volume run out of capacity?
  • If something is down / broken / failed (e.g. brick down, disk failure, etc.), where and what is the issue, and when did it happen?
  • Have the number of clients (indicated via connections) increased (which may possibly be the reason for the performance degradation that the clients / applications are observing)?

Use Cases

Use Cases in the form of user stories:

  • As a Gluster Administrator, I want to view at-a-glance information about my Gluster volume that includes health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw my attention to potential issues in the volume, bricks, and disks.

  • As a Gluster Administrator, I want to compare 1 or more metrics (e.g. IOPS, CPU, Memory, Network Load) across bricks within the volume

  • Compare utilization (e.g. IOPS, capacity, etc.) across bricks within a volume

  • Look at performance by brick (within a volume) to diagnose poor performance on one brick caused by RAID 6 disk failure/rebuild/degradation

Proposed change

Provide a pre-canned, default volume dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded into the Tendrl UI) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed:

The Dashboard is composed of individual Panels (dashboard widgets) arranged on a number of Rows.

Note: The cluster and volume name or unique identifier should be visible at all times, and the user should be able to switch to another volume.
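As a rough illustration of what "pre-canned" could mean in practice, the sketch below assembles a minimal Grafana dashboard JSON document in Python. Everything here is hypothetical: the panel fields, `span` values, and the `volumes.*` template query are placeholders, and the exact schema depends on the Grafana version and installed panel plugins.

```python
import json

def make_panel(panel_id, title, panel_type):
    """Return a minimal Grafana panel definition (real panels carry many more fields)."""
    return {"id": panel_id, "title": title, "type": panel_type, "span": 2}

def make_volume_dashboard(cluster, volume):
    """Assemble rows of panels; a templating variable lets the user switch volumes."""
    row1 = {"title": "Row 1", "panels": [
        make_panel(1, "Health", "singlestat"),
        make_panel(2, "Subvolumes", "singlestat"),
        make_panel(3, "Bricks", "singlestat"),
    ]}
    return {
        # Cluster and volume identifiers stay visible in the dashboard title.
        "title": "Volume Dashboard - {}/{}".format(cluster, volume),
        "rows": [row1],
        "templating": {"list": [
            # Hypothetical volume-switcher variable; the query is a placeholder.
            {"name": "volume", "type": "query", "query": "volumes.*"}
        ]},
    }

dashboard = make_volume_dashboard("cluster1", "vol0")
payload = json.dumps(dashboard, indent=2)  # body for a Grafana dashboard-import call
```

A generator along these lines could run at cluster-import time (see "Other deployer impact" below) to produce one such JSON document per volume.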

Row 1

Panel 1: Health

  • volume status: up, down, up (partial), or up (degraded) - per gstatus

Panel 2: Subvolumes

  • n total - total number (n) of subvolumes in the volume
  • n up - count (n) of subvolumes in the volume that are up
  • chart type: Stacked Card

Panel 3: Bricks

  • n total - total number (n) of bricks in the volume
  • n up - count (n) of bricks in the volume that are up
  • n down - count (n) of bricks in the volume that are down
  • chart type: Stacked Card

[FUTURE] Panel 4: Disks

  • n total - total number (n) of disks in the volume
  • n up - count (n) of disks in the volume that are up
  • n down - count (n) of disks in the volume that are down
  • chart type: Stacked Card

Panel 5: Geo-Replication Sessions

  • n total - total number (n) of geo-replication sessions for the volume
  • n Up - count (n) of geo-replication sessions for the volume that are up
  • n Up (Partial) - count (n) of geo-replication sessions for the volume that are up(partial)
  • n Down - count (n) of geo-replication sessions for the volume that are down
  • chart type: Stacked Card

Panel 6: Healing

  • n healed - count (n) of entries healed
  • n split-brain - count (n) of entries in split-brain
  • n heal failed - count (n) of heal failed entries

Panel 7: Rebalance

  • Status: In Progress, Completed, Stopped, or Paused
  • n Rebalanced Files - total number (n) of files rebalanced so far
  • n Size - size of files on the volume needing rebalancing
  • n Scanned - total number (n) of scanned files
  • n Failures - total number (n) of failures during rebalancing
  • chart type: Stacked Card

Panel 8: Snapshots

  • n total - count (n) of active snapshots for the volume
  • chart type: Singlestat

Panel 9: Connections Trend

  • count (n) of client connections to the bricks in the volume over a period of time
  • chart type: Line Chart / Spark

Row 2

Panel 10: Capacity Utilization

  • Disk space used for the volume
  • chart type: Gauge

Panel 11: Capacity Available

  • Disk space free for the volume
  • chart type: Singlestat

Panel 12: Growth Rate

  • growth rate estimated from the first and last capacity-utilization data points in the displayed time range
  • chart type: Singlestat

Panel 13: Time Remaining (Weeks)

  • based on the projected growth rate in Panel 12, provide the estimated # of weeks remaining before the volume runs out of capacity
  • chart type: Singlestat
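The Panel 12/13 estimation can be sketched as follows, assuming capacity samples arrive as (timestamp in seconds, bytes used) pairs; the sample format and function names are illustrative, not Tendrl's actual API.

```python
# Hypothetical sketch of the Panel 12/13 math: growth rate from the first and
# last capacity samples, then weeks remaining until the volume fills.

SECONDS_PER_WEEK = 7 * 24 * 3600

def growth_rate(samples):
    """Bytes per second, estimated from the first and last data points only."""
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    return (used1 - used0) / (t1 - t0)

def weeks_remaining(samples, capacity_total):
    """Estimated weeks until full; None if usage is flat or shrinking."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    free = capacity_total - samples[-1][1]
    return free / (rate * SECONDS_PER_WEEK)

# One week of growth from 100 GiB to 110 GiB on a 200 GiB volume:
samples = [(0, 100 * 2**30), (SECONDS_PER_WEEK, 110 * 2**30)]
print(weeks_remaining(samples, 200 * 2**30))  # 9.0 weeks at 10 GiB/week
```

A two-point estimate like this is cheap but noisy; a regression over all samples would be a reasonable refinement if the data supports it.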

Panel 14: Inodes Utilization

  • Inodes used for the volume over a period of time
  • chart type: Line Chart / Spark

Panel 15: Inodes Available

  • Inodes free for the volume
  • chart type: Singlestat

Panel 16: Quotas???

Row 3

Panel 17: IOPS Trend

  • show the IOPS for the volume over a period of time
  • chart type: Line Chart / Spark

[FUTURE] Panel 18: IO Size

  • show IO Size
  • chart type: Singlestat

Panel 19: Throughput Trend

  • show throughput for the volume over a period of time
  • chart type: Line Chart / Spark

Panel 20: LVM thin pool metadata %

  • LVM thin pool metadata %
  • infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted then, depending on the configuration, I/O operations are either queued or fail. If metadata space is exhausted, you will observe I/O errors until the LVM pool is taken offline and a repair is performed to fix potential inconsistencies. Moreover, because the metadata transaction is aborted and the pool performs caching, there may be uncommitted (to disk) I/O operations that were already acknowledged to the upper storage layers (file system), so those layers will need checks/repairs performed as well.
  • chart type: Line Chart / Spark

Panel 21: LVM thin pool data usage %

  • LVM thin pool data usage %
  • infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted then, depending on the configuration, I/O operations are either queued or fail. If metadata space is exhausted, you will observe I/O errors until the LVM pool is taken offline and a repair is performed to fix potential inconsistencies. Moreover, because the metadata transaction is aborted and the pool performs caching, there may be uncommitted (to disk) I/O operations that were already acknowledged to the upper storage layers (file system), so those layers will need checks/repairs performed as well.
  • chart type: Line Chart / Spark

Row 4

Panel 22: Top Connections

  • Top 5 connections based on IO
  • infotip: Top 5 connections based on IO
  • chart type: Bar Chart / Histogram

Panel 23: Top Utilized Bricks

  • Top 5 bricks by capacity utilization
  • infotip: Top 5 bricks by capacity utilization
  • chart type: Bar Chart / Histogram

Panel 24: Top busiest bricks

  • Top 5 bricks by IOPS
  • infotip: Top 5 bricks by IOPS
  • chart type: Bar Chart / Histogram

Row 5 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)

Panel 25: Top File Operation (% Latency)

  • show the top 5 FOP (file operations) with the highest % latency
  • infotip: % latency is the fraction of the File Operation (FOP) response time that is consumed by the FOP
  • chart type: Bar Chart / Histogram
  • Example: FSYNC 47.35%, FXATTROP 11.88%, LOOKUP 11.35%, READDIRP 5.70%, WRITE 4.82%

Panel 26: Reads and Writes by Block Size

  • chart type: Histogram
    Example:
    Block Size (x-axis): 1b+  32b+  64b+  128b+  256b+
    Read (y-axis):       0    0     0     0      6
    Write (y-axis):      908  28    8     5      23

Row 6 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)

Panel 27: File Operations for Locks Trend

  • show average latency, maximum latency, call rate for each FOP for Locks[1]
  • x-axis: time
  • y-axis: average latency, maximum latency, call rate
  • chart type: Line Chart / Spark

Panel 28: File Operations for Read/Write Operations Trend

  • show average latency, maximum latency, call rate for each FOP for Read/Write Operations [1]
  • x-axis: time
  • y-axis: average latency, maximum latency, call rate
  • chart type: Line Chart / Spark

Row 7 (part of volume storage profile - should this be its own Volume Storage Profile dashboard for performance reasons?)

Panel 29: File Operations for Inode Operations Trend

  • show average latency, maximum latency, call rate for each FOP for Inode Operations
  • x-axis: time
  • y-axis: average latency, maximum latency, call rate
  • chart type: Line Chart / Spark

Panel 30: File Operations for Entry Operations Trend

  • show average latency, maximum latency, call rate for each FOP for Entry Operations [1]
  • x-axis: time
  • y-axis: average latency, maximum latency, call rate
  • chart type: Line Chart / Spark

[1] There exist approximately 46 File Operations (FOPs) that would need to be mapped into 4 categories for the data to be consumable for troubleshooting, in order to identify patterns:

  • L = Locks
  • D = Data Read/Write Operations
  • I = Inode Operations
  • E = Entry Operations

List of [FOP Categories] to FOPs:

  • [I] ACCESS - ?
  • [D] CREATE - create a file
  • [D] DISCARD - support for trim?
  • [L] ENTRYLK - lock a directory given its pathname?
  • [D] FALLOCATE - allocate space for file without actually writing to it
  • [L] FENTRYLK - lock a file given its handle
  • [I] FGETXATTR - get named extended attribute value for a file (handle)
  • [L] FINODELK - lock a file/directory for write/read
  • [D] FLUSH - ensure all written data is persistently stored
  • [I] FREMOVEXATTR - remove a named extended attribute from a file handle
  • [I] FSETATTR - set value of metadata field (which ones?) for a file (handle)
  • [I] FSETXATTR - set value of a named extended attribute for a file handle
  • [I] FSTAT - get standard metadata about a file given its file handle
  • [D] FSYNC - ensure all written data for a file is persistently stored
  • [D] FSYNCDIR - ensure all directory entries in directory are persistently stored
  • [I] FTRUNCATE - set file size to specified value, deallocating data beyond this point
  • [I] FXATTROP - used by AFR replication?
  • [I] GETXATTR - get value of named extended attribute
  • [L] INODELK - lock a directory for write or for read
  • [E] LINK - create a hard link
  • [L] LK - lock?
  • [I] LOOKUP - lookup file within directory
  • [E] MKDIR - create directory
  • [E] MKNOD - create device special file
  • [I] OPEN - open a file
  • [I] OPENDIR - open a directory (in preparation for READDIR)
  • [D] RCHECKSUM - ?
  • [D] READ - read data from a file
  • [D] READDIR - read directory entries from a directory
  • [D] READDIRP - read directory entries with standard metadata for each file (readdirplus)
  • [I] READLINK - get the pathname of a file that a symlink is pointing to
  • [D] READY - ?
  • [I] REMOVEXATTR - remove a named extended attribute from a pathname?
  • [E] RENAME - rename a file
  • [E] RMDIR - remove a directory (assumes it is already empty)
  • [I] SEEK - ?
  • [I] SETATTR - set field in standard file metadata for pathname
  • [I] SETXATTR - set named extended attribute value for file given pathname
  • [I] STAT - get standard metadata for file given pathname
  • [I] STATFS - get metadata for the filesystem
  • [E] SYMLINK - create a softlink to specified pathname
  • [I] TRUNCATE - truncate file at pathname to specified size
  • [E] UNLINK - delete file
  • [D] WRITE - write data to file
  • [I] XATTROP - ?
  • [D] ZEROFILL - write zeroes to the file in specified offset range
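To illustrate how the category mapping above might be consumed by the Row 6/7 panels, here is a sketch that rolls per-FOP call counts up into the four categories. Only a subset of the ~46 FOPs is included, the FOPs marked "?" above are deliberately left out, and the input format is an assumption, not Tendrl's actual data model.

```python
from collections import defaultdict

# Subset of the FOP -> category mapping from the list above (L/D/I/E).
FOP_CATEGORY = {
    "ENTRYLK": "L", "INODELK": "L", "FINODELK": "L",
    "READ": "D", "WRITE": "D", "CREATE": "D", "FSYNC": "D",
    "LOOKUP": "I", "STAT": "I", "GETXATTR": "I", "SETATTR": "I",
    "MKDIR": "E", "RENAME": "E", "UNLINK": "E", "SYMLINK": "E",
}

def calls_by_category(fop_calls):
    """Sum per-FOP call counts per category; unknown FOPs land in '?' for triage."""
    totals = defaultdict(int)
    for fop, calls in fop_calls.items():
        totals[FOP_CATEGORY.get(fop, "?")] += calls
    return dict(totals)

sample = {"READ": 500, "WRITE": 900, "LOOKUP": 300, "MKDIR": 20}
print(calls_by_category(sample))  # {'D': 1400, 'I': 300, 'E': 20}
```

The same grouping would apply to average/maximum latency series, aggregating 46 per-FOP series down to 4 per-category trends per panel.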

Note: The dashboard layout for the rows and the panels within them may need to change based on implementation and actual visualization, especially when certain metrics need to be aligned together, whether vertically or horizontally.

Alternatives

Create similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and Tendrl API is possible, but would require API calls to provide similar information.

Performance impact:

TBD

Other deployer impact:

  • Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.

  • This (default) volume dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl.

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given volume are correct and whether the information stays up-to-date as failures or other changes occur on the volume.

Documentation impact:

Documentation should explain what's being displayed, for clarity, where it is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what the metrics refer to, the measurement units, and how to apply them when troubleshooting problems, e.g. healing / split-brain issues, loss of quorum, etc.

References and Related GitHub Links:


julienlim commented Aug 8, 2017

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. has to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

  • FEATURE:Monitoring
  • INTERFACE:Dashboard
  • INTERFACE:GUI


ltrilety commented Aug 8, 2017

I have a question about the panel numbers.
IIANM Grafana doesn't use any numbers for panels in its configuration. Moreover, I noticed that some panels have the same number as others. Is that intended or just a coincidence? Is it even possible to have the same panel on multiple rows?

@julienlim

@ltrilety I put the panel numbers for specification purposes (so that if someone comments, they can specify panel #), and it's not to be implemented with a panel #. If some panels have the same panel numbers, it's a typo on my end. I'll fix it. Thanks.


nthomas-redhat commented Aug 10, 2017

Row 1
Panel 1: Health
show host status?
Valid volume states are up, down, up(partial) and up(degraded)

Panel 4: Disks
No platform support for disk status as such. This won't be supported now

Panel 5: Geo-Replication Sessions
Valid states are : up, down, up(partial)

Panel 6: Healing
Aren't n healing needed and n split brain the same?

Panel 7: Rebalance
Chart type is not specified

Row 3
Panel 18: IO Size
Not MVP

Panel 20: LVM thin pool metadata
Panel 21: LVM thin pool data usage
Does it make sense to aggregate these stats at the volume level? These are LVM-level stats specific to bricks; what value does aggregating them at the volume level add?

Items specified in Row 4,5,6,7 are not MVP.

@julienlim

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi

(1) Volume health, I read https://github.com/gluster/gstatus to see what up(partial) vs. up(degraded) meant. The "show host status" is a typo (from cut-n-paste) and it is really "show volume status" (I've fixed it above). I've updated Panel 1 in the spec above with the statuses per gstatus. How about quorum lost (in the case of an arbiter volume), or is this not supported or not possible?

(2) Ack. No support for Disk status. Will mark as FUTURE in the spec above (so it gets noted as a placeholder for future consideration).

(3) Geo-Replication Sessions - I've updated the statuses in Panel 5 in the above spec.

(4) Healing: healing and split brain are not the same. Healing is something that typically happens automatically (it does not require user intervention but gives an indication of how "healthy" the files are), and healing can happen after a split brain. Anything caused by a split brain (without parameters/policy configured to trigger self-heal) will require manual user action to resolve.

(5) Rebalance -- I was waiting to check with Jeff regarding if we're doing rebalancing in the Tendrl UI or Grafana. Based on some recent conversations, I think I will assume the latter and will update this soon.

(6) IO Size -- not MVP. Ack. Will mark as FUTURE. This typically goes hand-in-hand with IOPS in storage management/monitoring applications.

(7) Row 4 is not clearly called out in the MVP. Ack.

(8) Rows 5, 6, and 7 are all from volume storage profiling that we will be enabling/disabling during Import Cluster. This was called out in the Gluster Metrics discussion that we would collect this information. If we collect it, I was assuming we would be visualizing them.

@japplewhite @jjkabrown1 - please comment.

@nthomas-redhat

> @nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi
>
> (1) Volume health, I read https://github.com/gluster/gstatus to see what up(partial) vs. up(degraded) meant. The "show host status" is a typo (from cut-n-paste) and it is really "show volume status" (I've fixed it above). I've updated Panel 1 in the spec above with the statuses per gstatus. How about quorum lost (in the case of an arbiter volume), or is this not supported or not possible?

get-state cli provides the quorum status and tendrl syncs this into etcd.
Quorum status is derived from brick status(get-state). Brick status is also used in volume health computation and will be reflected there.

> (2) Ack. No support for Disk status. Will mark as FUTURE in the spec above (so it gets noted as a placeholder for future consideration).
>
> (3) Geo-Replication Sessions - I've updated the statuses in Panel 5 in the above spec.
>
> (4) Healing: healing and split brain are not the same. Healing is something that typically happens automatically (and does not require user intervention but gives indication about how "healthy" the files are), and healing can happen after a split brain. Anything caused by a split brain (without parameters/policy configured to trigger self-heal) will require user action to manually resolve.

My whole point is that <<n healing needed - total number (n) of entries that need healing based on healinfo>> is not reported by healinfo (gluster).
healinfo provides the below information:

  • No. of entries healed
  • No. of entries in split-brain
  • No. of heal failed entries

> (5) Rebalance -- I was waiting to check with Jeff regarding if we're doing rebalancing in the Tendrl UI or Grafana. Based on some recent conversations, I think I will assume the latter and will update this soon.
>
> (6) IO Size -- not MVP. Ack. Will mark as FUTURE. This typically goes hand-in-hand with IOPS in storage management/monitoring applications.
>
> (7) Rows 4 not clearly called out in MVP. Ack.
>
> (8) Rows 5, 6, and 7 are all from volume storage profiling that we will be enabling/disabling during Import Cluster. This was called out in the Gluster Metrics discussion that we would collect this information. If we collect it, I was assuming we would be visualizing them.
>
> @japplewhite @jjkabrown1 - please comment.

@julienlim
Copy link
Member Author

julienlim commented Aug 11, 2017

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano

> (1) Volume health, I read https://github.com/gluster/gstatus to see what up(partial) vs. up(degraded) meant. The "show host status" is a typo (from cut-n-paste) and it is really "show volume status" (I've fixed it above). I've updated Panel 1 in the spec above with the statuses per gstatus. How about quorum lost (in the case of an arbiter volume), or is this not supported or not possible?
>
> get-state cli provides the quorum status and tendrl syncs this into etcd.
> Quorum status is derived from brick status(get-state). Brick status is also used in volume health computation and will be reflected there.

I’ll take this to mean quorum is either not applicable or should not be shown at the volume level. I’ve removed/updated it.

> (4) Healing: healing and split brain are not the same. Healing is something that typically happens automatically (and does not require user intervention but gives indication about how "healthy" the files are), and healing can happen after a split brain. Anything caused by a split brain (without parameters/policy configured to trigger self-heal) will require user action to manually resolve.
>
> My whole point is that <<n healing needed - total number (n) of entries that need healing based on healinfo>> is not reported by healinfo (gluster).
> healinfo provides the below information:
> No. of entries healed
> No. of entries in split-brain
> No. of heal failed entries

I meant for n healing needed - total number (n) of entries that need healing == no. of heal failed entries. This is meant to indicate that action is required to investigate. I’ll update it to make it clearer and also include entries that were healed:

  • No. of entries healed

  • No. of entries in split-brain

  • No. of heal failed entries

Updated Panel 7: Rebalance

@julienlim

@nthomas-redhat @japplewhite @jjkabrown1 @Tendrl/qe @Tendrl/tendrl-core @anmolbabu @cloudbehl @anivargi @mcarrano

Here's a mockup of the Volume Dashboard:
[Mockup image: grafana dashboard - volume 1]


fbalak commented Oct 11, 2017

@julienlim the design differs by one extra panel, Capacity Utilization Trend. Is this expected behaviour?
[Screenshot: 1011_capacity_ut]

@julienlim

Noting that the IOPS Trend panel is not present yet. BZ has been created to track this as well at https://bugzilla.redhat.com/show_bug.cgi?id=1514054.
