
various alerts numbers #373

Open
ltrilety opened this issue Mar 13, 2018 · 16 comments

@ltrilety

The number of cluster alerts does not always match the other alert counts that are displayed. Moreover, those numbers do not always correspond to the actual state of the cluster.

E.g.

  • I had 2 as the number of alerts on the Clusters page, while the related Volumes and Hosts all showed 0. The same held for the alerts list.
  • In another case I had 1 alert on the Clusters page, the related Volumes and Hosts all showed 0, and the Alerts list showed 3 active alerts.

In both cases the cluster was completely up and running.

Reproduction steps:

  1. Kill some bricks and let tendrl notice (in particular, there should be an alert about the degraded state of a volume)
  2. Stop the glusterd services on all nodes and wait a few seconds
  3. Start the glusterd services on all nodes so all bricks are running again
  4. If needed, repeat the previous steps (in my case just stopping and starting glusterd was enough)
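
For reference, a minimal Python sketch of these reproduction steps; the hostnames are placeholders and it assumes passwordless ssh to the storage nodes:

import subprocess
import time

# Placeholder hostnames of the storage nodes; adjust to the actual cluster.
NODES = ["gluster-node-1", "gluster-node-2", "gluster-node-3"]

def run_on(node, command):
    """Run a shell command on a node over ssh, raising if it fails."""
    subprocess.run(["ssh", node, command], check=True)

# 1. Kill the brick processes on one node so tendrl raises a "volume degraded" alert.
run_on(NODES[0], "pkill -f glusterfsd")
time.sleep(120)  # give tendrl time to notice and raise the alert

# 2. Stop glusterd on all nodes and wait a few seconds.
for node in NODES:
    run_on(node, "systemctl stop glusterd")
time.sleep(30)

# 3. Start glusterd on all nodes so all bricks are running again.
for node in NODES:
    run_on(node, "systemctl start glusterd")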

Tendrl version:
tendrl-commons-1.6.1-1.el7.centos.noarch
tendrl-api-1.6.1-1.el7.centos.noarch
tendrl-ui-1.6.1-1.el7.centos.noarch
tendrl-grafana-selinux-1.5.4-2.el7.centos.noarch
tendrl-ansible-1.6.1-1.el7.centos.noarch
tendrl-notifier-1.6.0-1.el7.centos.noarch
tendrl-node-agent-1.6.1-1.el7.centos.noarch
tendrl-api-httpd-1.6.1-1.el7.centos.noarch
tendrl-selinux-1.5.4-2.el7.centos.noarch
tendrl-grafana-plugins-1.6.1-1.el7.centos.noarch
tendrl-monitoring-integration-1.6.1-1.el7.centos.noarch

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 20, 2018

@ltrilety I can't reproduce this with the latest master; I need more info

@GowthamShanmugam
Collaborator

If you are able to reproduce it, please also share a screenshot

@nthomas-redhat
Contributor

@ltrilety, please provide the requested information. If it is not reproducible, close the issue

@GowthamShanmugam
Collaborator

@ltrilety need more info

@ltrilety
Author

ltrilety commented Mar 23, 2018

@GowthamShanmugam I tried it once more. You're right, the scenario didn't trigger the issue right away. However, after letting tendrl run for about 4 days, the issue was hit again.
I performed the same steps, but additionally stopped one of the available volumes and started it again after a while. The result is as follows:
clusters page - 3 alerts
hosts page - 0 alerts
volumes page - 4 alerts
alerts side window - 1 alert

[screenshots showing the differing alert counts attached]

Checked version:
tendrl-ansible-1.6.1-2.el7rhgs.noarch
tendrl-commons-1.6.1-1.el7rhgs.noarch
tendrl-ui-1.6.1-1.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-api-1.6.1-1.el7rhgs.noarch
tendrl-api-httpd-1.6.1-1.el7rhgs.noarch
tendrl-monitoring-integration-1.6.1-1.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-notifier-1.6.0-1.el7rhgs.noarch
tendrl-node-agent-1.6.1-1.el7rhgs.noarch
tendrl-grafana-plugins-1.6.1-1.el7rhgs.noarch

@GowthamShanmugam
Collaborator

@ltrilety did you unmanage the cluster at any point during those 4 days? I'm asking because I suspect we are missing something during cluster unmanage.
@shtripat ^

@ltrilety
Author

@GowthamShanmugam nope, not this time. I didn't perform any un-manage of the cluster.

@GowthamShanmugam
Collaborator

@ltrilety @r0h4n @shtripat The problem is that when glusterd is stopped for some time, the TTL on the volume object deletes the entire volume details, including the alert count for that volume. When glusterd is started again, the alert count is re-initialized to zero, but some warning alerts are still present for that particular volume. So the alert count on the cluster no longer matches the volume alert count, and the entire alert count calculation goes wrong.

Shall we put a TTL only on deleted volumes?
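
To illustrate the race, a hedged sketch with the python-etcd client; the key names are made up for illustration and are not the exact Tendrl schema:

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

# The volume details are written with a TTL, and the alert count lives under
# the same volume key, so it expires together with the volume object.
client.write("/clusters/c1/Volumes/v1/name", "vol1", ttl=60)
client.write("/clusters/c1/Volumes/v1/alert_count", 2, ttl=60)

# If glusterd stays down longer than the TTL, etcd expires these keys.
# When glusterd comes back, the volume is re-synced and the counter is
# re-initialised to zero, even though warning alerts for the volume still exist:
client.write("/clusters/c1/Volumes/v1/alert_count", 0)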

@shtripat
Member

I understand that the alert counter is a different object from the volume. Even if the volume gets deleted due to the TTL, I feel the alert counter should still be retained, and that should work out. What do you say?

@GowthamShanmugam
Collaborator

@r0h4n @shirshendu I discussed this problem with @shtripat. We increment the alert count whenever a warning alert comes in and decrement it when an info (clearing) alert comes in. The problem is that if a single increment or decrement goes wrong, the entire alert calculation is wrong from then on. So @shtripat suggested that instead of storing the alert count in etcd, we calculate the alert count in the API and send it to the UI for the particular cluster, which would solve this issue.

Suggestions please

@shtripat
Member

@r0h4n @shirshendu @nthomas-redhat The reason I say/suggest this is that maintaining counters can always be problematic with TTLs in place. What I suggest is to also maintain the alerts for volumes and bricks at locations like the ones below in etcd

/alerting/alerts/volumes/{vol-id}
/alerting/alerts/bricks/{brick-path}

There is no need to maintain counters at the individual entity level. The object model which REST uses to return details in the GET output for these entities can have additional fields for the alert counters, and during a GET call the API layer can make two get calls to etcd: one to fetch the entity and one to count the number of alerts for that entity by looking at the /alerting/alerts directory. Then merge the two and return the details to the API client.

@shirshendu thoughts??
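
A rough Python sketch of this "count on read" idea, assuming the python-etcd client and placeholder key paths (the real API layer is a separate service and may be implemented quite differently):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def get_volume_with_alert_count(vol_id):
    # Call 1: fetch the volume object itself.
    entity = client.read("/clusters/c1/Volumes/%s" % vol_id, recursive=True)
    volume = {leaf.key.rsplit("/", 1)[-1]: leaf.value for leaf in entity.leaves}

    # Call 2: count the alerts filed under the per-volume alerts directory.
    try:
        alerts = client.read("/alerting/alerts/volumes/%s" % vol_id)
        alert_count = len(list(alerts.leaves))
    except etcd.EtcdKeyNotFound:
        alert_count = 0

    # Merge the two results and return them to the API client;
    # no counter is ever persisted in etcd.
    volume["alert_count"] = alert_count
    return volume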

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 27, 2018

@shtripat are you suggesting something like:
/alerts/alerts/{aid}
/alerts/nodes/{nid}/{aid}
/alerts/cluster/{cid}/nodes/{nid}/{aid}
/alerts/cluster/{cid}/clusters/{aid}
/alerts/cluster/{cid}/volumes/{vid}/{aid}

and I don't think /alerts/cluster/{cid}/bricks/{b_path}/{aid} is required

But @shtripat, in this case there will be a lot of redundancy of alerts. E.g. a brick alert would need to be stored in /alerting/alerts,
/alerts/cluster/{cid}/clusters/{aid},
/alerts/cluster/{cid}/volumes/{vid}/{aid},
and also /alerts/cluster/{cid}/nodes/{nid}/{aid}

The same alert would have to be stored in a lot of places. @shtripat, this is not the correct structure once ceph comes into the picture.

@GowthamShanmugam
Collaborator

GowthamShanmugam commented Mar 27, 2018

The current alert directory structure does not fit the Concept B design: in Concept A we displayed clusters and nodes separately, but in the Concept B design we display clusters and, inside a cluster, its nodes. The problem is that when an alert relates to a node, only the alert count for the node is increased, while the cluster alert count still shows 0. Since the node lives inside the cluster, I feel a node alert should also increment the cluster alert count. The structure I suggest is

  1. /alerting/clusters/{cid}/all/{aid} // all alerts for that cluster are stored here
    (this directory holds the full details of each alert, and this is what is displayed in the UI)
  2. /alerting/clusters/{cid}/nodes/{nid}/{aid} // only stores the alert id related to a particular node
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/nodes/{nid}) gives the alert count for that particular node)
  3. /alerting/clusters/{cid}/volumes/{vid}/{aid} // only stores the alert id related to a particular volume
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/volumes/{vid}) gives the alert count for that particular volume)
  4. /alerting/clusters/{cid}/cluster_alerts/{aid} // only stores the alert id related to the cluster itself
    (we only store the alert_id of warning/critical alerts; if a clearing alert comes we remove the alert_id from
    here, because len(/alert/cluster/{cid}/alerts) gives the alert count for the cluster)

Using this structure, deleting the alerts for a deleted volume or node becomes possible, and the alert count logic is no longer required.

I don't think much effort is required to make this change.

@shirshendu @shtripat @nthomas-redhat @r0h4n suggestions please
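
A hedged sketch of how the bookkeeping could work under this proposed layout (python-etcd, with assumed severity labels; not the actual Tendrl implementation):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def on_volume_alert(cluster_id, volume_id, alert_id, severity):
    """Store or clear a volume alert id; the directory length is the count."""
    key = "/alerting/clusters/%s/volumes/%s/%s" % (cluster_id, volume_id, alert_id)
    if severity in ("WARNING", "CRITICAL"):
        client.write(key, severity)        # record the active alert id
    else:                                  # clearing (INFO) alert
        try:
            client.delete(key)             # drop it so len() stays correct
        except etcd.EtcdKeyNotFound:
            pass

def volume_alert_count(cluster_id, volume_id):
    """len() of the per-volume directory is the alert count; nothing is stored."""
    try:
        result = client.read("/alerting/clusters/%s/volumes/%s" % (cluster_id, volume_id))
        return len(list(result.leaves))
    except etcd.EtcdKeyNotFound:
        return 0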

@shirshendu

shirshendu commented Mar 27, 2018

Any changes to the alerts structure will need a corresponding change in the API as well, because we need to read the new structure to be able to present it through the API.

At this point in the release cycle, I would recommend against changing the fundamental way of saving alerts. @r0h4n I would like to hear your thoughts here as well.
From what I could gather from @GowthamShanmugam on this issue, this bug is an edge case that seems to crop up with delete/remove/process-restart type actions. If possible, wouldn't it just be easier to add a job that recalculates all our cached alert counts? This job could be added to the queue whenever the edge cases occur, like on node-agent startup, or if necessary, maybe even periodically.

Edit: That said, changes to the API are definitely possible, if the final consensus is that we should change the alerts structure.
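
For comparison, a hedged sketch of such a recalculation job (the key paths and the counter location are assumptions, not the real Tendrl schema):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def recalculate_volume_alert_count(cluster_id, volume_id):
    """Recount active alerts for one volume and rewrite the cached counter."""
    try:
        active = len(list(client.read(
            "/alerting/alerts/volumes/%s" % volume_id).leaves))
    except etcd.EtcdKeyNotFound:
        active = 0
    client.write("/clusters/%s/Volumes/%s/alert_count" % (cluster_id, volume_id),
                 active)
    return active

# Run on node-agent startup (or periodically) for every known volume, so a
# wrong increment/decrement heals itself instead of drifting forever.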

@shtripat
Member

@GowthamShanmugam as discussed, let's keep the dir structure as below

/alerts/clusters/{int-id}/all/{list of alerts}
/alerts/clusters/{int-id}/volumes/{vol-id}/warn_and_critical/{list of alert ids}
/alerts/clusters/{int-id}/nodes/{node-id}/warn_and_critical/{list of alerts ids}
/alerts/clusters/{int-id}/warn_and_critical/{list of alert ids}

This way the last three would be used for counting purposes only, and the first one for maintaining all the alerts across a cluster.

@GowthamShanmugam ^^^
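
With this layout the counts never have to be stored; listing the warn_and_critical directory gives the number directly. A minimal sketch (python-etcd, placeholder ids):

import etcd

client = etcd.Client(host="127.0.0.1", port=2379)

def alert_count(int_id, entity=None, entity_id=None):
    """Count warn/critical alerts for a cluster, or for one volume/node in it."""
    if entity:   # e.g. entity="volumes" or "nodes", entity_id the vol-id / node-id
        key = "/alerts/clusters/%s/%s/%s/warn_and_critical" % (int_id, entity, entity_id)
    else:        # cluster-level count
        key = "/alerts/clusters/%s/warn_and_critical" % int_id
    try:
        return len(list(client.read(key).leaves))
    except etcd.EtcdKeyNotFound:
        return 0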

@GowthamShanmugam
Collaborator

Please verify and close this issue
