
Dashboard Spec - Host Dashboard #223

Closed
julienlim opened this issue Aug 7, 2017 · 9 comments


julienlim commented Aug 7, 2017

Dashboard Spec - Host Dashboard

Display a default dashboard for a single Gluster storage node present in Tendrl that provides at-a-glance information about a single Gluster node that is part of a Gluster trusted storage pool. The dashboard includes health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw the attention of the Tendrl user (e.g. a Gluster Administrator) to potential issues in the host, volume, brick, and disk.

Problem description

A Gluster Administrator wants to be able to answer the following questions by looking at the host dashboard:

  • Is my storage node up and running, and is it healthy?
  • Is there a problem with my storage node?
  • What’s actually wrong with my storage node, and why is it slow?
  • Is my storage node filling up too fast?
  • When will my storage node run out of capacity?
  • If something is down / broken / failed (e.g. brick down, disk failure, etc.), where and what is the issue, and when did it happen?
  • Has the number of clients (indicated via connections) increased, which may be the reason for the performance degradation that clients / applications are observing?

Use Cases

Use cases in the form of user stories:

  • As a Gluster Administrator, I want to view at-a-glance information about my Gluster host that includes health and status information, key performance indicators (e.g. IOPS, throughput, etc.), and alerts that can draw my attention to potential issues in the host, volume, brick, and disk.

  • As a Gluster Administrator, I want to compare one or more metrics (e.g. IOPS, CPU, memory, network load) across bricks within the host.

  • As a Gluster Administrator, I want to compare utilization (e.g. IOPS, capacity, etc.) across bricks within a host.

Proposed change

Provide a pre-canned, default host dashboard in Grafana (initially launchable from the Tendrl UI, and eventually embedded into the Tendrl UI) that shows the following metrics, rendered either as text or as a chart/graph depending on the type of metric being displayed:

The dashboard is composed of individual panels (dashboard widgets) arranged on a number of rows.

Note: The cluster and host name or unique identifier should be visible at all times, and the user should be able to switch to another host.
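
As a concrete illustration only, the sketch below shows how such a pre-canned dashboard definition might be generated as Grafana-style JSON (row-based layout as in Grafana 4.x). The `host_dashboard` helper and all field values are illustrative assumptions, not the actual monitoring-integration implementation:

```python
import json

# Hypothetical generator for the pre-canned host dashboard; panel and
# row titles follow the spec, everything else is illustrative.
def host_dashboard(cluster_name, host_name):
    def panel(title, panel_type, span=2):
        # Minimal Grafana (v4-style) panel stub; real panels also need a
        # datasource, targets (queries), and display options.
        return {"title": title, "type": panel_type, "span": span}

    return {
        "title": "Host Dashboard - %s / %s" % (cluster_name, host_name),
        "rows": [
            {"title": "Row 1", "panels": [
                panel("Health", "singlestat"),
                panel("Bricks", "singlestat"),
                panel("Connections Trend", "graph"),
                panel("IOPS Trend", "graph"),
            ]},
            # Rows 2-5 would be built the same way.
        ],
    }

print(json.dumps(host_dashboard("cluster1", "host1"), indent=2))
```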

Row 1

Panel 1: Health

Panel 2: Bricks

  • n total - total number (n) of bricks in the host
  • n up - count (n) of bricks in the host that are up
  • n down - count (n) of bricks in the host that are down
  • chart type: Stacked Card

[FUTURE] Panel 3: Disks

  • n total - total number (n) of disks in the host
  • n up - count (n) of disks in the host that are up
  • n down - count (n) of disks in the host that are down
  • chart type: Stacked Card

Panel 4: Connections Trend

  • count (n) of client connections to the bricks in the volume over a period of time
  • chart type: Line Chart / Spark

Panel 5: IOPS Trend

  • show the IOPS for the host over a period of time
  • chart type: Line Chart / Spark

[FUTURE] Panel 6: IO Size

  • show IO Size
  • chart type: Singlestat

Row 2

Panel 7: CPU Utilization Trend

  • show CPU used for the host over a period of time
  • chart type: Line Chart / Spark

[OPTIONAL] Panel 8: CPU Available

  • show CPU that is available for the host
  • chart type: Gauge

Panel 9: Memory Used Trend

  • show memory used for the host over a period of time
  • chart type: Line Chart / Spark

Panel 10: Memory Free

  • show memory that is available for the host
  • chart type: Gauge

Panel 11: Swap Used Trend

  • show swap space used for the host over a period of time
  • chart type: Line Chart / Spark

Panel 12: Swap Free

  • show swap space that is available for the host
  • chart type: Gauge

Row 3

Panel 11: Capacity Utilization

  • Disk space used
  • chart type: Gauge

Panel 12: Capacity Available

  • Disk space free
  • chart type: Singlestat

Panel 13: Growth Rate

  • growth rate estimated from the first and last data points in the selected time range
  • chart type: Singlestat

Panel 14: Time Remaining (Weeks)

  • based on the projected growth rate in Panel 13, provide the estimated number of weeks of capacity remaining (see the sketch below)
  • chart type: Singlestat
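
A minimal sketch of the Panel 13/14 estimation, assuming capacity samples arrive as (timestamp in seconds, bytes used) pairs from the time-series store; the function and names are illustrative, not the actual implementation:

```python
# Estimate growth rate from the first and last samples (Panel 13) and
# project the number of weeks until capacity is exhausted (Panel 14).
def growth_rate_and_weeks_remaining(samples, total_capacity_bytes):
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    rate_per_sec = (used1 - used0) / float(t1 - t0)  # bytes per second
    if rate_per_sec <= 0:
        return rate_per_sec, float("inf")  # not growing: no exhaustion date
    seconds_left = (total_capacity_bytes - used1) / rate_per_sec
    return rate_per_sec, seconds_left / (7 * 24 * 3600)  # rate, weeks

# Example: usage grows from 100 GiB to 120 GiB over 4 weeks on a 500 GiB host.
GiB = 1024 ** 3
WEEK = 7 * 24 * 3600
rate, weeks = growth_rate_and_weeks_remaining(
    [(0, 100 * GiB), (4 * WEEK, 120 * GiB)], 500 * GiB)
print(round(weeks, 1))  # 76.0 weeks remaining at ~5 GiB/week
```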

[FUTURE] Panel 15: Services Trend

  • based on Gluster svc events (connected, disconnected, failed) over a period of time
  • similar to what was available with the previous Gluster Console with the Nagios plug-in
  • chart type: Line Chart / Spark

Panel 15A: LVM thin pool metadata %

  • LVM thin pool metadata %
  • infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted, then, depending on the configuration, I/O operations are either queued or fail. If metadata space is exhausted, you will observe I/O errors until the thin pool is taken offline and repaired to fix potential inconsistencies. Moreover, because the metadata transaction is aborted while the pool is caching, there might be uncommitted (to disk) I/O operations that were acknowledged to the upper storage layers (file system), so those layers will need to have checks/repairs performed as well.
  • chart type: Line Chart / Spark

Panel 15B: LVM thin pool data usage %

  • LVM thin pool data usage %
  • infotip: Monitoring the utilization of LVM thin pool metadata and data usage is important to ensure they don't run out of space. If the data space is exhausted, then, depending on the configuration, I/O operations are either queued or fail. If metadata space is exhausted, you will observe I/O errors until the thin pool is taken offline and repaired to fix potential inconsistencies. Moreover, because the metadata transaction is aborted while the pool is caching, there might be uncommitted (to disk) I/O operations that were acknowledged to the upper storage layers (file system), so those layers will need to have checks/repairs performed as well.
  • chart type: Line Chart / Spark
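
A minimal sketch of where the Panel 15A/15B numbers could come from on a node, assuming lvm2 with JSON report output (`lvs --reportformat json`); the actual Tendrl collection path may differ:

```python
import json
import subprocess

# Report data/metadata usage for LVM thin pools via lvs JSON output.
def thin_pool_usage():
    out = subprocess.check_output([
        "lvs", "--reportformat", "json",
        "-o", "vg_name,lv_name,data_percent,metadata_percent",
    ])
    for lv in json.loads(out)["report"][0]["lv"]:
        # Plain LVs report empty percentages; thin pools report strings
        # like "40.00" for both data_percent and metadata_percent.
        if lv["metadata_percent"]:
            yield (lv["vg_name"], lv["lv_name"],
                   float(lv["data_percent"]), float(lv["metadata_percent"]))

for vg, lv, data_pct, meta_pct in thin_pool_usage():
    print("%s/%s data=%.1f%% metadata=%.1f%%" % (vg, lv, data_pct, meta_pct))
```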

Row 4 (Disk Metrics) - combined into Row 3

Row 4 assumptions (for the disk metrics): aggregation should depend on how many disks we expect on the host. If we expect <= 8 disks per host, no aggregation is needed; if we expect more, then aggregation per host is recommended.

Panel 16: Disk Load Trend

  • shows disk_octets (read, write) on y-axis over a period of time (x-axis)
  • chart type: Line Chart / Spark

Panel 17: Disk Operations Trend

  • shows disk_ops (read, write) on y-axis over a period of time (x-axis)
  • chart type: Line Chart / Spark

Panel 18: Disk IO Trend - Line Chart / Spark

  • shows disk_io_time (read, write) on y-axis over a period of time (x-axis)
  • Note: similar to Panel 5: IOPS Trend (which is aggregated); this one is broken out by individual disks
  • chart type: Line Chart / Spark
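
Per the Row 4 assumptions above, hosts with more than ~8 disks would have their per-disk series aggregated per host. A minimal sketch of that aggregation, assuming time-aligned (timestamp, read, write) samples per disk (e.g. collectd disk_ops or disk_octets values); names are illustrative:

```python
from collections import defaultdict

# Sum time-aligned per-disk read/write samples into one host-level series.
def aggregate_per_host(per_disk_series):
    totals = defaultdict(lambda: [0.0, 0.0])  # timestamp -> [read, write]
    for samples in per_disk_series.values():
        for ts, read, write in samples:
            totals[ts][0] += read
            totals[ts][1] += write
    return sorted((ts, r, w) for ts, (r, w) in totals.items())

per_disk = {
    "sda": [(0, 120.0, 80.0), (10, 130.0, 90.0)],
    "sdb": [(0, 100.0, 60.0), (10, 110.0, 70.0)],
}
print(aggregate_per_host(per_disk))
# [(0, 220.0, 140.0), (10, 240.0, 160.0)]
```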

Row 5 (Network Metrics)

  • Row 5 assumptions: show all network interfaces, including cluster traffic. If the user sees other network interfaces, they can optionally remove the interfaces they are not interested in.

Panel 20: Throughput Trend

  • show throughput for the host over a period of time
  • chart type: Line Chart / Spark

Panel 22: Dropped Packets Trend

  • show dropped packets for the host over a period of time
  • infotip: Dropped Packets indicates network congestion, e.g. the queue on the switchport your host is connected to is full and packets are dropped because it can’t transmit data fast enough.
  • chart type: Line Chart / Spark

Panel 23: Errors Trend

  • show errors for the host over a period of time
  • infotip: the error count indicates issues that occurred while transmitting packets, due to carrier errors (duplex mismatch, faulty cable), FIFO errors, heartbeat errors, window errors, CRC errors, too-short frames, and/or too-long frames. In short, errors typically result from faulty hardware and/or a speed mismatch.
  • chart type: Line Chart / Spark

Panel 24: Overruns Trend

  • show overruns for the host over a period of time
  • infotip: overruns indicate that the ring buffer of the network interface is full and the network interface doesn’t seem to get any kernel time to send out the frames stuck in the ring buffer.
  • chart type: Line Chart / Spark
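
For reference, a minimal sketch of where the raw counters behind Panels 22-24 live on a Linux host: /proc/net/dev exposes per-interface error, drop, and FIFO (overrun) counters, which a collector would sample periodically and graph as deltas:

```python
# Read per-interface drop/error/overrun counters from /proc/net/dev.
def read_net_counters(path="/proc/net/dev"):
    stats = {}
    with open(path) as f:
        for line in f.readlines()[2:]:  # skip the two header lines
            iface, data = line.split(":", 1)
            fields = [int(x) for x in data.split()]
            # RX fields 2-4 and TX fields 10-12 are errs, drop, fifo.
            stats[iface.strip()] = {
                "rx_errs": fields[2], "rx_drop": fields[3], "rx_fifo": fields[4],
                "tx_errs": fields[10], "tx_drop": fields[11], "tx_fifo": fields[12],
            }
    return stats

for iface, counters in sorted(read_net_counters().items()):
    print(iface, counters)
```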

Note: The dashboard layout for the panels, and the panels within the rows, may need to change based on implementation and the actual visualization, especially when certain metrics need to be aligned together, whether vertically or horizontally.

Alternatives

Create a similar dashboard using PatternFly (www.patternfly.org) or d3.js components to show similar information within the Tendrl UI.

Data model impact:

TBD

Impacted Modules:

TBD

Tendrl API impact:

TBD

Notifications/Monitoring impact:

TBD

Tendrl/common impact:

TBD

Tendrl/node_agent impact:

TBD

Sds integration impact:

TBD

Security impact:

TBD

Other end user impact:

Users will mostly interact with this feature via the Grafana UI. Access via the Grafana API and Tendrl API is possible, but would require API calls to provide similar information.

Performance impact:

TBD

Other deployer impact:

  • Plug-ins required by Grafana will need to be packaged and installed with tendrl-ansible.

  • This (default) host dashboard will need to be automatically generated whenever a cluster is imported to be managed by Tendrl.
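
A minimal sketch of the auto-generation step, assuming Grafana's dashboard HTTP API (`POST /api/dashboards/db`) and an API key; it reuses the hypothetical `host_dashboard()` helper sketched under "Proposed change" and is not the actual monitoring-integration code:

```python
import json
import urllib.request

# Push a generated dashboard definition to Grafana, overwriting any
# existing dashboard with the same title.
def upload_dashboard(grafana_url, api_key, dashboard):
    payload = json.dumps({"dashboard": dashboard, "overwrite": True}).encode()
    req = urllib.request.Request(
        grafana_url.rstrip("/") + "/api/dashboards/db",
        data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example (hypothetical endpoint and key):
# upload_dashboard("http://grafana.example:3000", "API_KEY",
#                  host_dashboard("cluster1", "host1"))
```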

Developer impact:

TBD

Implementation:

TBD

Assignee(s):

Primary assignee: @cloudbehl

Other contributors: @anmolbabu, @anivargi, @julienlim, @japplewhite

Work Items:

TBD

Estimate:

TBD

Dependencies:

TBD

Testing:

Test whether the health, status, and metrics displayed for a given host are correct, and whether the information stays up to date as failures or other changes occur on the host.

Documentation impact:

Documentation should explain what is being displayed, for clarity, if it is not immediately obvious from looking at the dashboard. This may include, but is not limited to, what each metric refers to, its unit of measurement, and how to use or apply it when troubleshooting problems, e.g. healing / split-brain issues, loss of quorum, etc.

References and Related GitHub Links:


julienlim commented Aug 7, 2017

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

This dashboard proposal is ready for review. Note: API impact, module impact, etc. have to be filled out by someone else -- maybe @cloudbehl, @anmolbabu, or @anivargi.

Suggested Labels (for folks who have permissions to label the spec):

  • FEATURE:Monitoring
  • INTERFACE:Dashboard
  • INTERFACE:GUI


nthomas-redhat commented Aug 10, 2017

Row 1
Panel 3: Disks
No platform support for disk status as such. This won't be supported now

Panel 6: IO Size
I haven't seen a mention of this in the MVP. Also, these stats are not provided by the collectd plugin, so I prefer to defer this.

Row 2
Panel 8: CPU Available?
Does this really make any sense?

Row 3
Panel 13: Growth Rate
Panel 14: Time Remaining (Weeks)
Does this really make any sense to display at the host level? Also, the MVP talks about projections at the volume level only.

Panel 15: Disk Load Trend
This is already covered in Row 4 - typo?

Panel 16: Services Trend
Can we get some clarity around this? Is it part of MVP?

Row 4
(Assumption: all stats in row 4 are aggregated per host)

Panel 18: Disk Operations Trend
Isn't this the same as IOPS mentioned in Row 1? Is there a difference?

Row 5
(Assumption: collected and displayed only for cluster network)

Panel 21: Bytes Sent and Received Trend
Throughput is calculated from Bytes Sent and Received. Do we need to graph this again?

@julienlim

@nthomas-redhat

Row 1 
Panel 3: Disks - Marked as FUTURE

Panel 6: IO Size - Marked as FUTURE


Row 2 
Panel 8: CPU Available - This was for consistency with the others (memory, swap, capacity). It’s not critical and can be removed or made optional. I’ll mark it as optional.


Row 3 
Panel 13: Growth Rate and 
Panel 14: Time Remaining (Weeks)

Since disks are typically on a host (that get consumed for use by a volume), knowing when the volume runs out of space is not sufficient in helping Admins know where to plan to add capacity to and what kind of runway is needed. This is also related to forecast / projections for bricks that I mentioned in #230 (comment).

Panel 15: Disk Load Trend - typo (removed)


Panel 16: Services Trend
I raised this a few times in BLR, and I'm suggesting it to have parity with the old Console (via the Nagios plugin). This was the only one we didn't address. The usage scenario is that there's no easy way for Admins to know if their services/daemons have died or are still OK, and this is a means for monitoring their health. I will defer to @japplewhite on whether this is part of the MVP.

Row 4 
(Assumption: all stats in row 4 are aggregated per host)

  • It depends on how many disks we expect on the host. If we expect <= 8 disks per host, no aggregation is needed, but if we expect more, then aggregation per host is recommended.

Panel 18: Disk Operations Trend - Isn't this the same as IOPS mentioned in Row 1? Is there a difference?

  • They are related. For IOPS, I was hoping it is aggregated; for the Disk Operations Trend, I wanted it broken out by individual disks.

Row 5 - Network stats
If we assume network stats are collected and displayed only for the cluster network, is the same true when another network is being used by the cluster, e.g. replication network, CTDB, etc.? Why can’t we show all the interfaces and let the user filter out what they don’t want (vs. making them manually add interfaces in)? It’s easier for the user to remove interfaces+metrics than to add them.

Panel 21: Bytes Sent and Received Trend
Typo / redundant, not needed. Removed.

I've updated the spec above based on comments.


julienlim commented Aug 14, 2017

@sankarshanmukhopadhyay @brainfunked @r0h4n @nthomas-redhat @Tendrl/qe @Tendrl/tendrl_frontend @japplewhite @rghatvis@redhat.com @mcarrano

Here's a rough mockup of what this might look like:
[Mockup image: grafana dashboard - host 2]


fbalak commented Nov 2, 2017

@julienlim Would it make sense to merge Brick Capacity and Brick Capacity Used panels together and provide some information like 2.0 GiB / 19.0 GiB? It would free a lot of space.
[Screenshot: 1102_capacity]


r0h4n commented Jan 30, 2018

@julienlim any updates on this? Need to close this issue.

@julienlim

@r0h4n @fbalak @cloudbehl @nthomas-redhat @mbukatov

In reviewing the current Host Dashboard, there are multiple panels showing brick utilization and capacity, i.e. Total Brick Capacity Utilization Trend, Total Brick Capacity Utilization, Total Brick Capacity Available, Brick Utilization, Brick Capacity, and Brick Capacity Used -- making it confusing, as it appears that some of the data may be redundant if a Tendrl host has only 1 brick.

[Screenshot of the current Host Dashboard, 2017-10-18]

Looking at monitoring-integration/etc/tendrl/monitoring-integration/grafana/dashboards/tendrl-gluster-hosts.json, it isn't clear what the 3 panels (Brick Utilization, Brick Capacity, and Brick Capacity Used) are actually trying to showcase (i.e. what problems they are trying to solve) based on the current labels. Specifically, are they intended to show all the bricks on the host, the top n bricks on the host, or something else? The panel labels don't tell a user one way or the other what the intent was, nor is there an accompanying description to help explain it.

My thinking is that it is probably the Top Brick Consumers that should be shown (and in the following left-to-right order):

  • Brick Capacity --> Top Bricks by Total Capacity
  • Brick Capacity Used --> Top Bricks by Capacity Utilized
  • Brick Utilization --> Top Bricks by Capacity Percent Utilized

With this updated definition of the 3 brick panels, it no longer makes sense to combine the 2 panels as @fbalak suggested, as the results will vary based on the criteria (see the sketch below).
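
A quick illustration (hypothetical numbers) of why the three rankings can surface different bricks: given per-brick (total, used) capacity, each panel sorts by a different key:

```python
# brick -> (total GiB, used GiB); values are made up for illustration.
bricks = {"brick1": (100, 90), "brick2": (500, 200), "brick3": (200, 150)}

def top_n(key, n=2):
    return sorted(bricks, key=key, reverse=True)[:n]

print(top_n(lambda b: bricks[b][0]))                 # by total capacity: brick2, brick3
print(top_n(lambda b: bricks[b][1]))                 # by capacity utilized: brick2, brick3
print(top_n(lambda b: bricks[b][1] / bricks[b][0]))  # by percent utilized: brick1, brick3
```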

Having made this proposal on how to address these 3 brick panels, is this considered an RFE, a bug, or both (I believe it to be both)? This should be a fairly straightforward fix. Can we get this fixed for the upcoming release (if not the next one, the one following)?

@julienlim

@r0h4n I created a separate issue for the suggested changes to the 3 brick panels mentioned in my previous comments -- this is more of a bug and RFE that probably should be tracked as a separate issue. With the creation of the Tendrl/monitoring-integration#324 issue, you can probably close this spec/issue.


r0h4n commented Feb 1, 2018

Thanks, I'll close this one.

r0h4n closed this as completed Feb 1, 2018