
various alerts numbers #373

Open
ltrilety opened this issue Mar 13, 2018 · 16 comments

@ltrilety

The number of cluster alerts does not always match the other alert counts that are displayed. Moreover, those numbers do not always correspond to the actual state of the cluster.

E.g.

  • I had 2 as the number of alerts on the Clusters page, while the related Volumes and Hosts all showed 0. The same held for the alerts list.
  • In another case I had 1 alert on the Clusters page, the related Volumes and Hosts all showed 0, and the Alerts list showed 3 active alerts.

In both cases the cluster was completely up and running.

Reproduction steps:

  1. Kill some bricks and let tendrl notice (in particular, there should be an alert about the degraded state of a volume)
  2. Stop the glusterd services on all nodes and wait a few seconds
  3. Start the glusterd services on all nodes so all bricks are running again
  4. If needed, repeat the previous steps (in my case just stopping and starting glusterd was enough)
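
For reference, a minimal Python sketch of these reproduction steps; the hostnames are placeholders and it assumes passwordless ssh to the storage nodes:

import subprocess
import time

# Placeholder hostnames of the storage nodes; adjust to the actual cluster.
NODES = ["gluster-node-1", "gluster-node-2", "gluster-node-3"]

def run_on(node, command):
    """Run a shell command on a node over ssh, raising if it fails."""
    subprocess.run(["ssh", node, command], check=True)

# 1. Kill the brick processes on one node so tendrl raises a "volume degraded" alert.
run_on(NODES[0], "pkill -f glusterfsd")
time.sleep(120)  # give tendrl time to notice and raise the alert

# 2. Stop glusterd on all nodes and wait a few seconds.
for node in NODES:
    run_on(node, "systemctl stop glusterd")
time.sleep(30)

# 3. Start glusterd on all nodes so all bricks are running again.
for node in NODES:
    run_on(node, "systemctl start glusterd")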

Tendrl version:
tendrl-commons-1.6.1-1.el7.centos.noarch
tendrl-api-1.6.1-1.el7.centos.noarch
tendrl-ui-1.6.1-1.el7.centos.noarch
tendrl-grafana-selinux-1.5.4-2.el7.centos.noarch
tendrl-ansible-1.6.1-1.el7.centos.noarch
tendrl-notifier-1.6.0-1.el7.centos.noarch
tendrl-node-agent-1.6.1-1.el7.centos.noarch
tendrl-api-httpd-1.6.1-1.el7.centos.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-grafana-plugins-1.6.1-1.el7.centos.noarch
tendrl-monitoring-integration-1.6.1-1.el7.centos.noarch

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 20, 2018

@ltrilety I can't reproduce this with the latest master; I need more info

@GowthamShanmugam
Collaborator

If you are able to reproduce it, please also share a screenshot

@nthomas-redhat
Contributor

@ltrilety, please provide the requested information. If it is not reproducible, close the issue

@GowthamShanmugam
Collaborator

@ltrilety need more info

@ltrilety
Author

ltrilety commented Mar 23, 2018

@GowthamShanmugam I tried it once more. You're right, the scenario didn't trigger the issue right away. However, after letting tendrl run for about 4 days, the issue was hit again.
I performed the same steps, but additionally stopped one of the available volumes and started it again after a while. The result is as follows:
clusters page - 3 alerts
hosts page - 0 alerts
volumes page - 4 alerts
alerts side window - 1 alert

[screenshots showing the differing alert counts attached]

Checked version:
tendrl-ansible-1.6.1-2.el7rhgs.noarch
tendrl-commons-1.6.1-1.el7rhgs.noarch
tendrl-ui-1.6.1-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.1-1.el7rhgs.noarch
tendrl-api-httpd-1.6.1-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.1-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.0-1.el7rhgs.noarch
tendrl-node-agent-1.6.1-1.el7rhgs.noarch
tendrl-grafana-plugins-1.6.1-1.el7rhgs.noarch

@GowthamShanmugam
Collaborator

@ltrilety did you unmanage the cluster at any point during those 4 days? I'm asking because I suspect we are missing something during cluster unmanage.
@shtripat ^

@ltrilety
Author

@GowthamShanmugam nope, not this time. I didn't perform any un-manage of the cluster.

@GowthamShanmugam
Collaborator

@ltrilety @r0h4n @shtripat The problem is that when glusterd is stopped for some time, the TTL on the volume object deletes the entire volume details, including the alert count for that volume. When glusterd is started again, the alert count is re-initialized to zero, but some warning alerts are still present for that particular volume. So the alert count on the cluster no longer matches the volume alert count, and the entire alert count calculation goes wrong.

Shall we put a TTL only on deleted volumes?
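
To illustrate the race, a hedged sketch with the python-etcd client; the key names are made up for illustration and are not the exact Tendrl schema:

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

# The volume details are written with a TTL, and the alert count lives under
# the same volume key, so it expires together with the volume object.
client.write("/clusters/c1/Volumes/v1/name", "vol1", ttl=60)
client.write("/clusters/c1/Volumes/v1/alert_count", 2, ttl=60)

# If glusterd stays down longer than the TTL, etcd expires these keys.
# When glusterd comes back, the volume is re-synced and the counter is
# re-initialised to zero, even though warning alerts for the volume still exist:
client.write("/clusters/c1/Volumes/v1/alert_count", 0)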

@shtripat
Member

I understand that the alert counter is a different object from the volume. Even if the volume gets deleted due to the TTL, I feel the alert counter should still be retained, and that should work out. What do you say?

@GowthamShanmugam
Collaborator

@r0h4n @shirshendu I discussed this problem with @shtripat. We increment the alert count whenever a warning alert comes in and decrement it when an info (clearing) alert comes in. The problem is that if a single increment or decrement goes wrong, the entire alert calculation is wrong from then on. So @shtripat suggested that instead of storing the alert count in etcd, we calculate the alert count in the API and send it to the UI for the particular cluster, which would solve this issue.

Suggestions please

@shtripat
Member

@r0h4n @shirshendu @nthomas-redhat The reason I say/suggest this is that maintaining counters can always be problematic with TTLs in place. What I suggest is to also maintain the alerts for volumes and bricks at locations like the ones below in etcd

/alerting/alerts/volumes/{vol-id}
/alerting/alerts/bricks/{brick-path}

There is no need to maintain counters at the individual entity level. The object model which REST uses to return details in the GET output for these entities can have additional fields for the alert counters, and during a GET call the API layer can make two get calls to etcd: one to fetch the entity and one to count the number of alerts for that entity by looking at the /alerting/alerts directory. Then merge the two and return the details to the API client.

@shirshendu thoughts??
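
A rough Python sketch of this "count on read" idea, assuming the python-etcd client and placeholder key paths (the real API layer is a separate service and may be implemented quite differently):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def get_volume_with_alert_count(vol_id):
    # Call 1: fetch the volume object itself.
    entity = client.read("/clusters/c1/Volumes/%s" % vol_id, recursive=True)
    volume = {leaf.key.rsplit("/", 1)[-1]: leaf.value for leaf in entity.leaves}

    # Call 2: count the alerts filed under the per-volume alerts directory.
    try:
        alerts = client.read("/alerting/alerts/volumes/%s" % vol_id)
        alert_count = len(list(alerts.leaves))
    except etcd.EtcdKeyNotFound:
        alert_count = 0

    # Merge the two results and return them to the API client;
    # no counter is ever persisted in etcd.
    volume["alert_count"] = alert_count
    return volume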

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 27, 2018

@shtripat are you suggesting something like:
/alerts/alerts/{aid}
/alerts/nodes/{nid}/{aid}
/alerts/cluster/{cid}/nodes/{nid}/{aid}
/alerts/cluster/{cid}/clusters/{aid}
/alerts/cluster/{cid}/volumes/{vid}/{aid}

and I don't think /alerts/cluster/{cid}/bricks/{b_path}/{aid} is required

But @shtripat, in this case there will be a lot of redundancy of alerts. E.g. a brick alert would need to be stored in /alerting/alerts,
/alerts/cluster/{cid}/clusters/{aid},
/alerts/cluster/{cid}/volumes/{vid}/{aid},
and also /alerts/cluster/{cid}/nodes/{nid}/{aid}

The same alert would have to be stored in a lot of places. @shtripat, this is not the correct structure once ceph comes into the picture.

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 27, 2018

The current alert directory structure does not fit the Concept B design: in Concept A we displayed clusters and nodes separately, but in the Concept B design we display clusters and, inside a cluster, its nodes. The problem is that when an alert relates to a node, only the alert count for the node is increased, while the cluster alert count still shows 0. Since the node lives inside the cluster, I feel a node alert should also increment the cluster alert count. The structure I suggest is

  1. /alerting/clusters/{cid}/all/{aid} // all alerts for that cluster are stored here
    (this directory holds the full details of each alert, and this is what is displayed in the UI)
  2. /alerting/clusters/{cid}/nodes/{nid}/{aid} // only stores the alert id related to a particular node
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/nodes/{nid}) gives the alert count for that particular node)
  3. /alerting/clusters/{cid}/volumes/{vid}/{aid} // only stores the alert id related to a particular volume
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/volumes/{vid}) gives the alert count for that particular volume)
  4. /alerting/clusters/{cid}/cluster_alerts/{aid} // only stores the alert id related to the cluster itself
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/alerts) gives the alert count for the cluster)

Using this structure, deleting the alerts for a deleted volume or node becomes possible, and the alert count logic is no longer required.

I don't think much effort is required to make this change.

@shirshendu @shtripat @nthomas-redhat @r0h4n suggestions please
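
A hedged sketch of how the bookkeeping could work under this proposed layout (python-etcd, with assumed severity labels; not the actual Tendrl implementation):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def on_volume_alert(cluster_id, volume_id, alert_id, severity):
    """Store or clear a volume alert id; the directory length is the count."""
    key = "/alerting/clusters/%s/volumes/%s/%s" % (cluster_id, volume_id, alert_id)
    if severity in ("WARNING", "CRITICAL"):
        client.write(key, severity)        # record the active alert id
    else:                                  # clearing (INFO) alert
        try:
            client.delete(key)             # drop it so len() stays correct
        except etcd.EtcdKeyNotFound:
            pass

def volume_alert_count(cluster_id, volume_id):
    """len() of the per-volume directory is the alert count; nothing is stored."""
    try:
        result = client.read("/alerting/clusters/%s/volumes/%s" % (cluster_id, volume_id))
        return len(list(result.leaves))
    except etcd.EtcdKeyNotFound:
        return 0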

@shirshendu

shirshendu commented Mar 27, 2018

Any changes to the alerts structure will need a corresponding change in the API as well, because we need to read the new structure to be able to present it through the API.

At this point in the release cycle, I would recommend against changing the fundamental way of saving alerts. @r0h4n I would like to hear your thoughts here as well.
From what I could gather from @GowthamShanmugam on this issue, this bug is an edge case that seems to crop up with delete/remove/process-restart type actions. If possible, wouldn't it just be easier to add a job that recalculates all our cached alert counts? This job could be added to the queue whenever the edge cases occur, like on node-agent startup, or if necessary, maybe even periodically.

Edit: That said, changes to the API are definitely possible, if the final consensus is that we should change the alerts structure.
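
For comparison, a hedged sketch of such a recalculation job (the key paths and the counter location are assumptions, not the real Tendrl schema):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def recalculate_volume_alert_count(cluster_id, volume_id):
    """Recount active alerts for one volume and rewrite the cached counter."""
    try:
        active = len(list(client.read(
            "/alerting/alerts/volumes/%s" % volume_id).leaves))
    except etcd.EtcdKeyNotFound:
        active = 0
    client.write("/clusters/%s/Volumes/%s/alert_count" % (cluster_id, volume_id),
                 active)
    return active

# Run on node-agent startup (or periodically) for every known volume, so a
# wrong increment/decrement heals itself instead of drifting forever.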

@shtripat
Member

@GowthamShanmugam as discussed, let's keep the dir structure as below

/alerts/clusters/{int-id}/all/{list of alerts}
/alerts/clusters/{int-id}/volumes/{vol-id}/warn_and_critical/{list of alert ids}
/alerts/clusters/{int-id}/nodes/{node-id}/warn_and_critical/{list of alerts ids}
/alerts/clusters/{int-id}/warn_and_critical/{list of alert ids}

This way the last three would be used for counting purposes only, and the first one for maintaining all the alerts across a cluster.

@GowthamShanmugam ^^^
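
With this layout the counts never have to be stored; listing the warn_and_critical directory gives the number directly. A minimal sketch (python-etcd, placeholder ids):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def alert_count(int_id, entity=None, entity_id=None):
    """Count warn/critical alerts for a cluster, or for one volume/node in it."""
    if entity:   # e.g. entity="volumes" or "nodes", entity_id the vol-id / node-id
        key = "/alerts/clusters/%s/%s/%s/warn_and_critical" % (int_id, entity, entity_id)
    else:        # cluster-level count
        key = "/alerts/clusters/%s/warn_and_critical" % int_id
    try:
        return len(list(client.read(key).leaves))
    except etcd.EtcdKeyNotFound:
        return 0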

@GowthamShanmugam
Collaborator

Please verify and close this issue
