
Gluster metrics #206 (Closed; wants to merge 1 commit)
Conversation

anmolbabu (Contributor):

Gluster metrics

tendrl-bug-id: #188
Signed-off-by: anmolbabu anmolbudugutta@gmail.com

anmolbabu (Contributor Author) commented Jul 24, 2017:

This is a WIP spec with some unknowns:

Status-wise counters for bricks:
Brick status is available in the get-state output only for the bricks belonging to the node on which the get-state command is fired, so there is no way the collectd instance residing on a node can push this counter by itself. The best that can be done from collectd's perspective is to push each brick's status as an integral number (as ceph-metrics does) to graphite, and a way to aggregate this at grafana still needs to be found.
Note: this is not a mere count query. The actual status is number-encoded in the metric value, not in the metric name. Encoding it in the metric name makes no sense, because when the status changes, the previous metric and the current metric would co-exist, leading to unnecessary confusion.
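To make the value-encoding concrete, here is a minimal sketch of a push under the scheme described above, assuming a hypothetical status map and metric naming and the plain-text carbon protocol:

```python
import socket
import time

# Hypothetical encoding: the status lives in the metric VALUE, not the
# metric name, so a status change just updates the same series.
STATUS_MAP = {"Started": 1, "Stopped": 0}  # illustrative mapping

def push_brick_status(node, brick, status,
                      carbon_host="localhost", carbon_port=2003):
    value = STATUS_MAP.get(status, -1)  # -1 for states outside the map
    metric = "tendrl.nodes.%s.bricks.%s.status" % (node, brick)  # assumed path
    line = "%s %d %d\n" % (metric, value, int(time.time()))
    sock = socket.create_connection((carbon_host, carbon_port))
    try:
        sock.sendall(line.encode("utf-8"))
    finally:
        sock.close()
```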

@cloudbehl Please add the grafana side of things, either to the same spec or to a different one.

@r0h4n @brainfunked @shtripat @nthomas-redhat @anivargi Please review


== Problem description

Admin using tendrl for gluster cluster management, should be able to see the
Member:

s/see the a easy/see an easy/

Contributor Author:

Done

**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
*** Status -- gluster get-state glusterd odir /var/run file glusterd-state
Member:

When you say get-state do you mean sds sycn from gluster-integration or collectd as well runs this every few seconds?

Contributor Author:

Collectd also runs this, so I believe it's better to use a different file for the get-state output, something like:
gluster get-state glusterd odir /var/run file collectd-glusterd-state
so that the files read and deleted by gluster-integration and collectd differ and the two don't end up in a race condition.
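A minimal sketch of what that separation looks like from the collectd side (the helper name is illustrative; the command is the one quoted above):

```python
import os
import subprocess

# collectd's own copy of the get-state output; gluster-integration keeps
# using /var/run/glusterd-state, so the two never race on one file.
STATE_FILE = "/var/run/collectd-glusterd-state"

def fetch_get_state():
    subprocess.check_call([
        "gluster", "get-state", "glusterd",
        "odir", "/var/run", "file", "collectd-glusterd-state",
    ])
    with open(STATE_FILE) as f:
        content = f.read()
    os.remove(STATE_FILE)  # consume and delete only our own copy
    return content
```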

Failure: 90
brick_utilization:
Warning: 75
Failure: 90
Member:
@rishubhjain @GowthamShanmugam I think you will need these threshold values while implementing grafana's alerting.

Contributor:

Yeah, the thresholds will be part of Grafana configuration.
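For illustration, a Grafana 4.x-style panel alert condition of the kind these thresholds would translate into (the query letter, reducer, and alert name are assumptions, not the final dashboards):

```json
"alert": {
  "name": "brick_utilization warning",
  "conditions": [
    {
      "type": "query",
      "query": { "params": ["A", "5m", "now"] },
      "reducer": { "type": "avg", "params": [] },
      "evaluator": { "type": "gt", "params": [75] },
      "operator": { "type": "and" }
    }
  ]
}
```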

anmolbabu (Contributor Author):

#213 (comment)

shtripat (Member) left a comment:

LGTM. No specific comments.
@Tendrl/tendrl-core plz review.

r0h4n previously approved these changes Jul 31, 2017
nthomas-redhat (Contributor) left a comment:

I understand that a lot of the aggregation is done at the grafana level, but we need to capture those metrics here and clearly state that they will be handled at grafana.


xml attribute hierarchy in cli response:
cliOutput -> volProfile -> brick -> cumulativeStats -> totalRead
cliOutput -> volProfile -> brick -> cumulativeStats -> totalWrite
Contributor:

I understand that read and write are collected, but please mention how IOPS is calculated.

Contributor Author:

Done
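A sketch of the intended derivation, assuming the rates are computed as deltas of the cumulative counters between two samples of gluster volume profile <vol> info --xml (element names follow the hierarchy quoted above; brickName and the sampling interval are assumptions):

```python
import subprocess
import time
import xml.etree.ElementTree as ET

def read_cumulative_totals(volume):
    # Walk cliOutput -> volProfile -> brick -> cumulativeStats.
    out = subprocess.check_output(
        ["gluster", "volume", "profile", volume, "info", "--xml"])
    root = ET.fromstring(out)  # root is <cliOutput>
    totals = {}
    for brick in root.findall("./volProfile/brick"):
        name = brick.findtext("brickName")  # assumed element name
        stats = brick.find("cumulativeStats")
        totals[name] = (int(stats.findtext("totalRead")),
                        int(stats.findtext("totalWrite")))
    return totals

def sample_rates(volume, interval=60):
    # Rate = (counter delta) / (sampling interval), per brick.
    first = read_cumulative_totals(volume)
    time.sleep(interval)
    second = read_cumulative_totals(volume)
    return {brick: ((second[brick][0] - first[brick][0]) / float(interval),
                    (second[brick][1] - first[brick][1]) / float(interval))
            for brick in second if brick in first}
```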

Contributor:

How is the drill-down from brick to disk addressed? Are you planning to collect iostat at the disk level or at the mount-point level? If the brick is created on RAID, how is that handled? Please provide the details here.

Contributor Author:

Drill-down is something we still need to figure out; it is slated for implementation in subsequent drops. Most likely the relation from a brick to its disk needs to be communicated to grafana by monitoring-integration; the details of how this happens have not been worked out yet. @anivargi, can you please add anything I am missing?
As of now, what we have is:

Also, the spec currently deals only with whatever is available today and with the upcoming immediate drop; it is expected to evolve as we progress.

Note:
** This is of particular importance in case of certain operations where
the command execution too frequently can affect the performance.
* A generic flow as in:
Contributor:

You can link #219 here

Contributor Author:

Done

* The aggregations that were previously done by the performance-monitoring
application will now be done at grafana.
** The rules of aggregation will be communicated to grafana by
monitoring-integration. Details of this are out of scope of this spec
Contributor:

Can you link to a spec/issue if one exists?

Contributor Author:

Done

Based on "peer in cluster" and "connected" in get-state output
*** Number of volumes
**** Total -- gluster get-state glusterd...
**** Status wise -- gluster get-state glusterd...
Contributor:

near full?

Contributor Author:

Regarding "metrics or entities that crossed a threshold" kind of information: @cloudbehl, since the thresholds are configured at the grafana end, is this feasible via grafana? Also, is a "top 5 most used" kind of query possible in grafana? @cloudbehl Cc @anivargi

Contributor:

This spec should list all the metrics and specifically state that this will be done at grafana. Also link to the relevant spec, if any.

Contributor Author:

Done

the collectd plugin is currently in execution, will be accessible and
info regarding the other bricks in cluster is not available and plugins
are stateless. This if required will not be found and updated to
graphite by monitoring-integration.
Contributor:

These are required; can grafana do this?

Contributor Author:

Nope. We need some other module (like monitoring-integration) to do this.

Contributor:

Why can't grafana do this?

Contributor Author:

Because the individual brick statuses are mapped to integral values, and I am not sure whether grafana can look at the value of a metric, classify series based on those values, and count each classification. @cloudbehl, is this facilitated by grafana?
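One query-time possibility, assuming statuses are integer-encoded (say 1 for started) and plain graphite render functions; whether grafana's query editor exposes this cleanly is exactly the open question above:

```
# Count of bricks whose status value equals 1 (the metric path is assumed):
sumSeries(isNonNull(removeAboveValue(removeBelowValue(tendrl.nodes.*.bricks.*.status, 1), 1)))
```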

*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
** Cluster level:
Contributor:

Please mention how these metrics are collected/calculated (grafana?):

  1. Cluster utilization
  2. Cluster IOPS, latency
  3. Disk stats at cluster level (aggregate: iops, latency, throughput, nearfull etc.)
  4. Network stats
  5. Hosts with high memory and CPU utilization

Contributor Author:

Aggregations are planned to be done by grafana, in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl, can you add details in this regard, for example how those rules can be communicated?

None of these are slated for the upcoming drop, hence the spec will evolve to cover them.

Regarding "hosts that crossed a threshold" kind of information: @cloudbehl, since the thresholds are configured at the grafana end, is this feasible? Also, is a "top 5 most used" kind of query possible in grafana? @cloudbehl Cc @anivargi

Contributor:

This spec should list all the metrics and specifically state that this will be done at grafana. Also link to the relevant spec, if any.

Contributor Author:

Done

**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
Contributor:

Utilization?
IOPS/latency?
Disk stats at cluster level (aggregate: iops, latency, throughput, nearfull etc.)?

Contributor Author:

Aggregations are planned to be done by grafana, in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl, can you add details in this regard, for example how those rules can be communicated?

None of these are slated for the upcoming drop, hence the spec will evolve to cover them.

Contributor:

This spec should list all the metrics and specifically state that this will be done at grafana. Also link to the relevant spec, if any.

Contributor Author:

Done

*** Status -- gluster get-state glusterd odir /var/run file collectd-glusterd-state
**** Counters will be made available via collectd plugins based on get-state result
*** Bricks counter
**** Total -- gluster get-state glusterd...
Contributor:

Brick status-wise counts?

Contributor Author:

The collectd plugin does not have access to the statuses of bricks on other nodes (get-state provides brick status only for bricks on the node where it is executed). So monitoring-integration, which has access to etcd (where the gluster-integration instances on all gluster nodes sync their respective brick statuses), can push this stat to graphite as an exception.

@nthomas-redhat @anivargi @brainfunked Please suggest
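A minimal sketch of that exception path, assuming a python-etcd style client, a hypothetical etcd key layout, and the plain-text carbon protocol:

```python
import socket
import time
from collections import Counter

import etcd  # python-etcd style client, assumed available

def push_brick_status_counts(cluster_id,
                             carbon_host="localhost", carbon_port=2003):
    client = etcd.Client(host="127.0.0.1", port=2379)
    # Hypothetical layout: every gluster-integration syncs
    # /clusters/<id>/Bricks/<brick>/status for its local bricks.
    counts = Counter()
    result = client.read("/clusters/%s/Bricks" % cluster_id, recursive=True)
    for leaf in result.leaves:
        if leaf.key.endswith("/status"):
            counts[leaf.value] += 1
    now = int(time.time())
    sock = socket.create_connection((carbon_host, carbon_port))
    try:
        for status, count in counts.items():
            metric = "tendrl.clusters.%s.bricks.count.%s" % (cluster_id, status)
            sock.sendall(("%s %d %d\n" % (metric, count, now)).encode("utf-8"))
    finally:
        sock.close()
```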

Member:

I agree. As part of the performance improvements, something similar is planned for the utilization, status, and connections aggregations in gluster sds sync as well. In that case we would have a separate thread which executes a flow once the cluster is imported/created to sync the aggregated details, and which is then scheduled to run at regular intervals.
Something similar would be required here as well, I feel.

Contributor:

@anmolbabu, what was discussed earlier was to collect the individual status per node and aggregate at grafana, right?

Contributor Author:

@nthomas-redhat Yes, that is still the case. The only caveat is the aggregation logic: for example, aggregating brick utilization into volume utilization should ideally not be a mere summation over all bricks, because the correct rule differs by volume type (see the sketch below). These aggregation rules need to be communicated to grafana by monitoring-integration.
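To illustrate why a plain sum is wrong, a sketch of type-aware aggregation under deliberately simplified assumptions (pure distribute vs. pure replicate; real volumes can combine both):

```python
def volume_used_bytes(brick_used, volume_type, replica_count=1):
    """Aggregate per-brick used bytes into one volume-level figure.

    In a pure distribute volume every brick holds distinct data, so the
    sum is correct; in a pure replicate volume every byte is stored
    replica_count times, so the raw sum over-counts.
    """
    total = sum(brick_used)
    if volume_type == "Distribute":
        return total
    if volume_type == "Replicate":
        return total / replica_count
    raise NotImplementedError("no aggregation rule for %s" % volume_type)
```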

*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
*** SWAP -- collectd plugin
*** Disk IOPS -- collectd plugin
Contributor:

Is this at the individual disk level or an aggregate for the node (iops, latency, throughput, nearfull etc.)?
Also, are we planning to collect brick status-wise counts here?

Contributor Author:

Aggregations are planned to be done by grafana, in accordance with rules configured by monitoring-integration. @anivargi or @cloudbehl, can you add details in this regard, for example how those rules can be communicated?

None of these are slated for the upcoming drop, hence the spec will evolve to cover them.

Contributor:

Agreed, but this spec should list all the metrics and specifically state that this will be done at grafana. Also link to the relevant spec, if any.

Contributor Author:

Done

** Node level:
*** CPU -- collectd plugin
*** Memory -- collectd plugin
*** Storage(Mount point) -- collectd plugin
Contributor:

We should probably decide on a set of predefined mounts that we need to monitor, right? Like /root, /boot, /home.

Contributor Author:

Why?

I feel we can expose all the data and let the UI decide what it wants to show. If we limit the mount points to be monitored, there is every chance we lose some or many important mount points created from the cli (e.g. volume mounts, disk mounts etc.).

Otherwise, the moment we detect via any tendrl service that a new mount point needs monitoring, the collectd config file responsible for mount-point stats has to be reconfigured and collectd restarted, every time, on any (all) nodes where this happens.

Contributor:

My only concern with monitoring all mount points on the node is that all the glusterfs bricks are mount points, and we already monitor them as part of brick utilization collection. If we monitor all mount points, we will be duplicating the monitoring of the gluster brick mount points.

Contributor Author:

I acknowledge this. But if monitoring-integration can bury this info in the templates it communicates to grafana, I would prefer that over limiting collectd to a predefined set of mount points. The reason is, as I said, that if we ever dynamically discover a new mount point that needs monitoring, that involves the following (a sketch of the config in question follows this list):

  1. a collectd config change, and
  2. a collectd restart to pick up the change,
     and this on every node where the change happened.
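For reference, the kind of collectd df stanza this trade-off is about; the first form picks up new mounts without reconfiguration, while the second forces the change-and-restart cycle above (the listed paths are illustrative):

```
LoadPlugin df
# With no selection configured, the df plugin reports every mounted
# filesystem, so new mounts are picked up without reconfiguration.

# Alternative: an explicit allow-list, which must be regenerated and
# collectd restarted whenever a new mount point appears.
<Plugin df>
  MountPoint "/boot"
  MountPoint "/home"
  IgnoreSelected false
</Plugin>
```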

Contributor:

ack.

**** Total -- gluster get-state glusterd...
*** Pending Heal -- gluster volume heal all statistics
*** Rebalance status -- gluster get-state glusterd..
*** Number of connections -- gluster volume status all clients --xml
Contributor:

I guess this is a costly operation. However, get-state gives you per-brick client info for the bricks present on that particular node. I feel that can be used; what do you think?

Contributor Author:

I don't currently see this in the get-state output; is it planned for some time soon?

Contributor:

This is available in the latest version (it was added recently). You can check the gluster nightly builds.

Contributor Author:

Ack

**** Bytes sent and received(rx, tx) -- collectd plugin
**** Dropped packets -- collectd plugin
**** Errors/overruns -- collectd plugin
** Volume level:
Contributor:

Hey, aren't the snapshot count for a volume and the number of geo-rep sessions for a volume (with ok and faulty counts) needed?

Contributor Author:

Hmm, I think we narrowed the scope to https://docs.google.com/spreadsheets/d/1X2JREn0TybSqLHgx9vcv-zjn4S9X9_Xz0JdhADew5-Q/edit?ts=5965bec8#gid=1819780204 and I missed this, but it shouldn't be too difficult to add from the get-state output.
I'll add them.
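Since get-state output is a flat key: value listing, additions like these usually reduce to reading a few more keys; a minimal sketch (the snap_count key name is an assumption, check the output of the gluster version in use):

```python
def parse_get_state(path="/var/run/collectd-glusterd-state"):
    """Parse gluster get-state output into a flat {key: value} dict."""
    state = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("["):  # skip blanks and [Section] headers
                continue
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state

# Hypothetical usage, assuming a "volume<N>.snap_count" style key:
state = parse_get_state()
snap_count = int(state.get("volume1.snap_count", "0"))
```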

nnDarshan (Contributor):

Looks good to me

anmolbabu (Contributor Author):

@r0h4n @nthomas-redhat @shtripat @brainfunked As discussed, I have trimmed this spec to deal with the milestone 1 part.
I'll raise a Gluster Metrics milestone 2 spec once this is merged.
Please review.

and a generic command to generate a config file from a template, as in:

----
https://github.com/Tendrl/node-monitoring/blob/master/tendrl/node_monitoring/commands/config_manager.py
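A sketch of what such a template-driven generator does, in the spirit of the config_manager.py linked above (the template contents and parameters here are illustrative, not the actual tool's interface):

```python
from string import Template

# Illustrative collectd stanza for loading a python plugin; the real
# templates ship with node-monitoring/monitoring-integration.
COLLECTD_TEMPLATE = Template("""LoadPlugin python
<Plugin python>
  ModulePath "$module_path"
  Import "$module"
  <Module "$module">
    interval $interval
  </Module>
</Plugin>
""")

def write_config(dest, **params):
    with open(dest, "w") as f:
        f.write(COLLECTD_TEMPLATE.substitute(**params))

write_config("/etc/collectd.d/tendrl_gluster.conf",
             module_path="/usr/lib64/collectd",
             module="tendrl_gluster",
             interval=60)
```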
Member:

Change the name of the binary, as it is tendrl_monitoring_config_manager.py.



==== Notifications/Monitoring impact:

The configuration method would now change slightly in accordance with
Member:

Can we add more details here, or does the Proposed Change section have enough detail?


Member:

Ack


=== Other deployer impact:

The configuration of collectd from monitoring-integration will slightly change
Member:

Same as above, because "The configuration of collectd from monitoring-integration will slightly change" creates confusion, I feel.

Contributor Author:

Yes, removing it, as this configuration happens internally and requires no user intervention.

shtripat (Member) commented Aug 8, 2017:

Ack. Can we have some details of these internal configurations?


=== Work Items:

Github issues will be raised and linked here once the repo that hosts the
Member:

I think this is available now as monitoring-integration. Can we add specific issues here?

Contributor Author:

Done. I just needed to link the issue on node-agent: Tendrl/node-agent#549

Gluster metrics

tendrl-bug-id: Tendrl#188
Signed-off-by: anmolbabu <anmolbudugutta@gmail.com>
Note:
The aggregated/derived metrics that will be made available at grafana
level will be dealt with in detail by spec issues:
* https://github.com/Tendrl/specifications/issues/179


If possible, could you provide a link instead of text?

r0h4n closed this Jan 29, 2018