
MicroServices throw UnboundLocalError when sending email alerts from kubernetes/docker containers #10234

Closed
amaltaro opened this issue Jan 21, 2021 · 27 comments · Fixed by #10244


@amaltaro
Contributor

amaltaro commented Jan 21, 2021

Impact of the bug
MSTransferor (but it applies to any service trying to send alerts via SMTP)

Describe the bug
This issue is meant to report/address two problems:

  1. some configuration missing from our kubernetes/docker setup, which prevents creating a connection to the SMTP host:port
  2. and, likely, treating this alert failure as a soft failure, still allowing the workflow to be moved to staging.

The second issue isn't really a problem at the moment, but it gives MSTransferor a "silent" behaviour: when the service fails to send that alert notification, the workflow in question is skipped in that cycle, even though the Rucio rule has already been created. In the next cycle, MSTransferor finds an existing rule (likely in INJECTED/REPLICATING status), so the service assumes the data is already available and simply moves that workflow to staging without persisting any rule id to be monitored by MSMonitor. Thus, MSMonitor will bypass this workflow right away, because there are no rules to be monitored.

How to reproduce it
trigger an email notification from within a pod

Expected behavior
Microservices - or any other WMCore service - should be able to send email notifications.

Regarding the MSTransferor behaviour, I think we can log the exception, make a record of the content that was supposed to be sent via email (I think MSTransferor already does that), and move on with the workflow processing as if there had been no problem sending the email.
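For illustration, a minimal sketch of that soft-failure behaviour, assuming an emailAlert object with a send() method like the one in Utils/EmailAlert.py (the helper name and logger setup are purely illustrative, not the actual WMCore code):

import logging

logger = logging.getLogger("MSTransferor")

def sendSoftAlert(emailAlert, subject, body):
    """
    Record the alert content and try to email it, but swallow any failure,
    so that the workflow processing can continue as if nothing went wrong.
    """
    logger.info("Alert content (subject: %s): %s", subject, body)
    try:
        emailAlert.send(subject, body)
    except Exception:
        logger.exception("Failed to send email alert; treating it as a soft failure")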

Additional context and error message
Some sections of the MSTransferor log under
amaltaro@vocms0750:/cephfs/product/dmwm-logs $ less ms-transferor-20210121-ms-transferor-7744c99cd8-5p6m8.log

2021-01-21 09:41:53,542:INFO:MSTransferor: Checking secondary data location for request: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987, TrustPUSitelists: False, request white/black list PNNs: set([u'T2_CH_CERN'])
2021-01-21 09:41:53,543:INFO:MSTransferor: it has secondary: /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX, total size: 697625.14 GB, current disk locations: set([u'T1_FR_CCIN2P3_Disk', u'T1_US_FNAL_Disk'])
2021-01-21 09:41:53,543:INFO:MSTransferor: secondary: /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX will need data placement!!!
2021-01-21 09:41:53,543:INFO:MSTransferor: Finding final pileup destination for request: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987
2021-01-21 09:41:53,543:INFO:MSTransferor:   found a PSN list: set([u'T2_CH_CERN']), which maps to a list of PNNs: set([u'T2_CH_CERN'])
2021-01-21 09:41:53,543:INFO:MSTransferor: Handling data subscriptions for request: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987
2021-01-21 09:41:53,543:INFO:MSTransferor: Have whole PU dataset: /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX (697625.14 GB)
2021-01-21 09:41:53,558:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/?expression=T2_CH_CERN HTTP/1.1" 200 None
2021-01-21 09:41:53,612:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "GET /rses/T2_CH_CERN/attr/ HTTP/1.1" 200 None
2021-01-21 09:41:53,613:INFO:MSTransferor: Placing whole container, picked RSE: T2_CH_CERN out of an RSE list: T2_CH_CERN
2021-01-21 09:41:53,613:INFO:MSTransferor: Creating rule for workflow haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987 with 1 DIDs in container /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX, RSEs: T2_CH_CERN, grouping: ALL
2021-01-21 09:41:53,664:DEBUG:connectionpool: http://cms-rucio.cern.ch:80 "POST /rules/ HTTP/1.1" 201 None
2021-01-21 09:41:53,664:INFO:MSTransferor: Rules successful created for /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX : [u'a383a3c0a23348b8a489b9296a09e454']
2021-01-21 09:41:53,665:ERROR:EmailAlert: Error sending alert email.
Details: [Errno 99] Cannot assign requested address
Traceback (most recent call last):
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/Utils/EmailAlert.py", line 37, in send
    smtp = smtplib.SMTP(self.serverName)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/external/python/2.7.13-comp/lib/python2.7/smtplib.py", line 256, in __init__
    (code, msg) = self.connect(host, port)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/external/python/2.7.13-comp/lib/python2.7/smtplib.py", line 316, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/external/python/2.7.13-comp/lib/python2.7/smtplib.py", line 291, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/external/python/2.7.13-comp/lib/python2.7/socket.py", line 575, in create_connection
    raise err
error: [Errno 99] Cannot assign requested address
2021-01-21 09:41:53,667:ERROR:MSTransferor: Unknown exception while making Transfer Request for haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987       Error: local variable 'smtp' referenced before assignment
Traceback (most recent call last):
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/WMCore/MicroService/Unified/MSTransferor.py", line 207, in execute
    success, transfers = self.makeTransferRequest(wflow)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/WMCore/MicroService/Unified/MSTransferor.py", line 625, in makeTransferRequest
    blocks, dataSize, nodes, idx)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/WMCore/MicroService/Unified/MSTransferor.py", line 723, in makeTransferRucio
    self.notifyLargeData(aboveWarningThreshold, transferId, wflow.getName(), dataSize, dataIn)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/WMCore/MicroService/Unified/MSTransferor.py", line 748, in notifyLargeData
    self.emailAlert.send(emailSubject, emailMsg)
  File "/data/srv/HG2101d/sw/slc7_amd64_gcc630/cms/reqmgr2ms/0.4.4.pre5/lib/python2.7/site-packages/Utils/EmailAlert.py", line 43, in send
    smtp.quit()
UnboundLocalError: local variable 'smtp' referenced before assignment
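The UnboundLocalError above comes from EmailAlert.send() calling smtp.quit() even though the smtplib.SMTP(...) constructor itself raised, so smtp was never assigned. A minimal sketch of a possible guard (a simplified standalone function, for illustration only):

import logging
import smtplib

logger = logging.getLogger("EmailAlert")

def send(serverName, fromAddr, toAddrs, message):
    """Send a message over SMTP, guarding against a failed connection."""
    smtp = None
    try:
        smtp = smtplib.SMTP(serverName)
        smtp.sendmail(fromAddr, toAddrs, message)
    except Exception as ex:
        logger.error("Error sending alert email.\nDetails: %s", str(ex))
    finally:
        # only close the connection if it was actually established
        if smtp is not None:
            smtp.quit()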

and in the subsequent MSTransferor cycle, this is what happened to that workflow:

2021-01-21 09:53:39,553:INFO:MSTransferor: Checking secondary data location for request: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987, TrustPUSitelists: False, request white/black list PNNs: set([u'T2_CH_CERN'])
2021-01-21 09:53:39,553:INFO:MSTransferor: it has secondary: /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX, total size: 697625.14 GB, current disk locations: set([u'T1_FR_CCIN2P3_Disk', u'T2_CH_CERN', u'T1_US_FNAL_Disk'])
2021-01-21 09:53:39,553:INFO:MSTransferor: secondary dataset: /Neutrino_E-10_gun/RunIISummer19ULPrePremix-UL18_106X_upgrade2018_realistic_v11_L1v1-v2/PREMIX already in place. Common locations with site white/black list is: set([u'T2_CH_CERN'])
2021-01-21 09:53:39,553:INFO:MSTransferor: Request haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987 does not have any further data to transfer
2021-01-21 09:53:39,553:INFO:MSTransferor: Transfer requests successful for haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987. Summary: []
2021-01-21 09:53:39,656:INFO:ReqMgrAux: Update in-place: False for transfer doc: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987 was successful.
2021-01-21 09:53:39,656:INFO:MSTransferor: Transfer document successfully created in CouchDB for: haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987
2021-01-21 09:53:39,656:INFO:MSCore: MSTransferor updating haozturk_task_PPS-RunIISummer19UL18pLHEGEN-00007__v1_T_210121_092904_8987 status to: staging
@amaltaro
Contributor Author

For reference, a similar issue has been reported here:
DefectDojo/django-DefectDojo#1423

which might be useful to debug this problem.

@amaltaro self-assigned this on Jan 25, 2021
@amaltaro
Contributor Author

I have temporarily disabled these alerts in the MSTransferor pod - in the production kubernetes cluster - by applying this change:

data.warningTransferThreshold = -1  # 100. * (1000 ** 4)  # 100 TB (terabyte)

Of course, if that service/pod gets restarted, it will come back with the default transferor configuration, with email notifications enabled again.

Regarding the issue itself, I did some googling and tried to create an SMTP connection following a few suggestions, like:

import smtplib
# each of the following was tried as a separate attempt:
server = smtplib.SMTP()                        # no host, connect later
server = smtplib.SMTP("localhost")             # MTA inside the container
server = smtplib.SMTP("127.0.0.1")             # container loopback
server = smtplib.SMTP("0.0.0.0")
server = smtplib.SMTP("host.docker.internal")  # Docker host alias
server = smtplib.SMTP("eth0 inet address")     # placeholder for the container's eth0 IP

but none of them worked. From what I understood, we would need our application to connect to the host network namespace (not the container one); for that I also tried the --net=host option, which did not work either.

@goughes Erik, would you have any suggestion here? Or perhaps, do you think you could work on this issue once MSOutput is deployed in production k8s?

@vkuznet
Contributor

vkuznet commented Jan 25, 2021

Alan, did you verify from the pod itself that you can send mail from the CLI, e.g. mail -s "test" ...? Please remember that the pod is not a full Linux OS and may or may not contain the libraries you rely on in your stack.

@amaltaro
Contributor Author

Good point, I forgot to mention it in my previous reply.
Yes, I did test it from the ms-transferor pod, but it looks like the connection to the SMTP server is made with an anonymous client (from what I quickly read as well); here is an example:

[root@ms-transferor-7744c99cd8-5p6m8 current]# mail -vv -s "Subject test" MY_EMAIL <<< 'Alan message'
[<-] 220 smtp.cern.ch Microsoft ESMTP MAIL Service ready at Mon, 25 Jan 2021 15:18:35 +0100
[->] HELO ms-transferor-7744c99cd8-5p6m8
[<-] 250 smtp.cern.ch Hello [REAL_IP_ADDRESS]
[->] MAIL FROM:<root@ms-transferor-7744c99cd8-5p6m8>
[<-] 530 5.7.1 Client was not authenticated
send-mail: 530 5.7.1 Client was not authenticated

@amaltaro
Contributor Author

The problem with the WMCore source code has been resolved in #10244.

However, we still have to recover the ability to send email notifications from the kubernetes infrastructure.
@goughes Erik, once you have some free cycles, can you please look into this issue?

I talked to Eric a couple of days ago; they use sendmail to take care of email notifications. Here are a few pointers that he provided (but I do hope we don't need something like that):
https://github.com/rucio/containers/blob/fef758e0f90f9cc4e359078cbd9af8c95db46393/daemons/Dockerfile#L46
and
https://github.com/rucio/containers/blob/fef758e0f90f9cc4e359078cbd9af8c95db46393/daemons/start-daemon.sh#L18

@goughes
Contributor

goughes commented Feb 15, 2021

Did a bit of research to determine whether there was a better way to get email alerts for this. Most examples require modifying the sendmail/postfix config on the host or installing your own SMTP server. The simplest method within our control is installing sendmail, similar to what Eric has done. I have tested this method via a reqmgr2ms container on my VM and it works fine.

Don't these pod logs get scraped and sent off somewhere for parsing? MONIT? I'm not familiar with that, but I was thinking of a scenario where that system would be responsible for the alerting instead of WMCore and sendmail in a pod. Maybe @vkuznet can comment.

@vkuznet
Contributor

vkuznet commented Feb 15, 2021

Once again (I think I commented on this issue in different places): please use amtool (it is available on CVMFS at /cvmfs/cms.cern.ch/cmsmon/amtool, or you can grab the executable directly from GitHub: https://github.com/prometheus/alertmanager/releases/tag/v0.21.0). Then use it within a pod. It will send alerts to our AlertManager (meaning they will be visible in the CMS Monitoring infrastructure) and we can configure a channel for you to route these alerts to either Slack or email.

amtool is a tiny static executable and does not require any configuration or pod/service tweaking. If you need an example of how to use it, please refer to our crons, e.g. https://github.com/dmwm/CMSKubernetes/blob/master/docker/sqoop/run.sh#L24
Here is an example of how to get it from a Dockerfile:
https://github.com/dmwm/CMSKubernetes/blob/master/docker/sqoop/Dockerfile#L37

@amaltaro
Contributor Author

amaltaro commented Feb 16, 2021

I'm sure there are many pros/cons to both approaches, but given that I don't know either of them, I'm afraid I have no preference at the moment. I'd say the most important thing is to pick the most robust and maintainable tool.

Just some random thoughts: sendmail gives us full flexibility and control over the messages and destination, but it's not integrated with the CMS Monitoring setup, thus limiting us to email-only notifications.
On the other hand, amtool could be extended to other types of alerts, but then we become dependent on the CMS Monitoring setup as well.

Erik, it's up to you ;)

@klannon

klannon commented Feb 16, 2021

We're kind of tied to CMS by definition, and I think CMS Monitoring has sufficient long-term support that it's not a concern. I think that we should use amtool.

@vkuznet
Contributor

vkuznet commented Feb 16, 2021

Here are the pros/cons of each approach (in my view):

using sendmail tool(s)

  • pros:
    • standard Unix tool, available for (but must be explicitly installed on) any Linux distribution
    • does a single task: sending email
  • cons:
    • limited to emails only
    • requires explicit installation and configuration
    • may require a list of dependencies, e.g. to send emails you may need a mail daemon, an SMTP server, etc.
    • no integration with the CMS Monitoring tools
    • will increase the docker image size, since all required dependencies must be installed

using amtool

  • pros:
    • static executable, does not depend on the Linux distribution
    • does not require any dependencies or configuration
      • can be run from any other script (bash or python)
      • does not increase the docker image size
    • integrated with the CMS Monitoring tools
      • sends notifications to AlertManager, which can route them to various channels, e.g. email, Slack
      • notifications can be used to annotate your favorite dashboard
      • customization in terms of tags, expiration timestamps, severity levels, etc.
  • cons:
    • a new tool that you have never used and may be hesitant to adopt
    • does not send direct emails; it needs the AlertManager URL where the message will go (part of the CMS Monitoring infrastructure)

@todor-ivanov
Contributor

Thanks @vkuznet
If I may give my 2c here. This one in particular:

  • does not send direct emails; it needs the AlertManager URL where the message will go (part of the CMS Monitoring infrastructure)

can easily turn out to be a positive feature in the long term.

@vkuznet
Contributor

vkuznet commented Feb 16, 2021

Anything can be a point of failure: k8s, AlertManager (AM), SMTP, etc. It depends on what is critical for you and how you treat the infrastructure. The AM runs on k8s, therefore k8s will ensure it is restarted if necessary.

I'm not sure what you are trying to solve here. If you care about the stability of the MS itself, again k8s ensures it will be restarted in case of failure. If you need a notification about it, either sending email or using amtool will do the job; if you save logs, you can manually check them if something happens. How critical it is for you is a different story. If you want to be paranoid, you may use both.

@goughes
Contributor

goughes commented Feb 17, 2021

Thanks everyone for the comments. I will use amtool for these alerts.

One question for @vkuznet, what info do you need from us to configure this on the Alertmanager side?

Also, @amaltaro do you want these to go to just email or a slack channel as well?

@amaltaro
Contributor Author

Also, @amaltaro do you want these to go to just email or a slack channel as well?

I think we could have it in the #alerts-dmwm channel as well.

@vkuznet
Contributor

vkuznet commented Feb 17, 2021

Erik, for alert routing you should decide which labels to use in your alerts. The labels may include tag, severity, service, etc. Please have a look at how alerts are defined; e.g. for reqmgr2 we have these rules:
https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules

In the rules you'll see different labels. When you use amtool you can define any set of labels; e.g. see how we define different labels (like severity, tag, etc.) in the CMSSpark crons: https://github.com/dmwm/CMSSpark/blob/master/bin/cron4aggregation#L33
You can also add any set of annotations to the alert, which will help to understand its nature.

So, in AlertManager we'll use labels to identify your alerts. For example, if you use tag=bla, then you'll need to tell us this tag. You should also tell us to which channels you want the alert propagated based on your tags. The channel can be email or Slack. For instance, you can tell us: route alerts with tag=bla to email, but if severity=high route them to Slack as well.

@goughes
Contributor

goughes commented Feb 17, 2021

Thanks Valentin. I see three cases where reqmgr2 microservices would need to send a mail. Does it make sense to make a single tag called microservices with these routes:

severity=medium would send a mail to cms-service-production-admins AT cern.ch
severity=high would send a mail and send to #alerts-dmwm

Then I can provide the additional error message contents through the summary and description annotations.

Do I get any benefit from separating the service label by component (ms-output, ms-transferor, etc.), or can it just be reqmgr2ms?

emailSubject = "[MSOutput] Campaign '{}' not found in central CouchDB".format(dataItem['Campaign'])

tag: microservices
service: ms-output
severity: high

emailSubject = "[MSOutput] Datatier not found in the Unified configuration: {}".format(dataTier)

tag: microservices
service: ms-output
severity: high

emailSubject = "[MS] Large pending data transfer under request id: {}".format(transferId)

tag: microservices
service: ms-transferor
severity: medium or high? @amaltaro?

@vkuznet
Contributor

vkuznet commented Feb 17, 2021

Erik,
the separation of service values is desired, since we can use the tag to cover all services, but use the service value to pick one or multiple. This is useful if you want to integrate alerts on any dashboard. For instance, I can use tag=microservices to display all alerts on dashboard plots, or I can use service=ms-transferor to display only transferor alerts. This allows the flexibility to build dashboards/plots for your needs. For instance, we may build a status board for individual services, such that users can see which one is misbehaving.

I'll go ahead and configure the necessary pieces in AlertManager with your information, then I'll ask you to test it from your end.

@vkuznet
Contributor

vkuznet commented Feb 17, 2021

I put new changes in place for our AM instance. You can see them here:
https://cms-monitoring.cern.ch/alertmanager/#/status
and search for dmwm-admins and dmwm-slack. You'll see that the rules are applied sequentially, i.e. if tag=microservices the receiver would be dmwm-admins (in other words, an email will be sent), while if tag=microservices and severity=high the receiver will be dmwm-slack. Since in both cases the tag is the same, you'll get emails, but only if your alert has high severity will it also be routed to Slack.

Therefore, you may test your alert using amtool and --alertmanager.url=http://cms-monitoring.cern.ch:30093 from within the CERN network. So far I have only deployed this to the AM instance running in our monitoring cluster; if everything is fine I'll update the configuration of the AMs in the High-Availability clusters and you may send alerts to them as well. Please remember to specify a proper expiration timestamp for your alert, otherwise it will keep firing forever.

I suggest that you use the following script (with whatever adjustments you may want to have) for testing purposes (as I usually do when testing alerts):

#!/bin/bash
# AlertManager endpoint to send the test alert to
amurl=http://cms-monitoring.cern.ch:30093
# expire the alert 5 minutes from now (RFC 3339 timestamp)
expire=`date -d '+5 min' --rfc-3339=ns | tr ' ' 'T'`
/cvmfs/cms.cern.ch/cmsmon/amtool alert add test_alert \
    alertname=test_alert severity=medium tag=test alert=amtool \
    --end=$expire \
    --annotation=date="`date`" \
    --annotation=hostname="`hostname`" \
    --alertmanager.url=$amurl

@goughes
Contributor

goughes commented Feb 17, 2021

Thanks for the script, Valentin. I just ran two alerts, one medium and one high, and can see them here: https://cms-monitoring.cern.ch/alertmanager/#/alerts.

I'll work on adding an amtool wrapper to WMCore to replace the current email functionality.
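For what it's worth, a rough sketch of such a wrapper, shelling out to amtool via subprocess (the amtool path and AlertManager URL are the ones mentioned in this thread; the function name, label set and default expiration are purely illustrative, not the final WMCore implementation):

import subprocess
from datetime import datetime, timedelta

AMTOOL = "/cvmfs/cms.cern.ch/cmsmon/amtool"
AM_URL = "http://cms-monitoring.cern.ch:30093"

def sendAlert(alertName, severity, tag, service, summary, expirationMinutes=60):
    """Push an alert to AlertManager by shelling out to amtool."""
    # alert expires expirationMinutes from now, in UTC (RFC 3339 style)
    endTime = (datetime.utcnow() + timedelta(minutes=expirationMinutes)).strftime("%Y-%m-%dT%H:%M:%SZ")
    command = [AMTOOL, "alert", "add", alertName,
               "severity={}".format(severity),
               "tag={}".format(tag),
               "service={}".format(service),
               "--end={}".format(endTime),
               "--annotation=summary={}".format(summary),
               "--alertmanager.url={}".format(AM_URL)]
    return subprocess.call(command)

For example, sendAlert("ms_output_missing_campaign", "high", "microservices", "ms-output", "Campaign not found in central CouchDB") would map onto the labels proposed earlier in this thread.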

@vkuznet
Contributor

vkuznet commented Feb 17, 2021

Erik, please confirm whether you receive emails for your test alerts and whether you see them in the Slack channel. As I wrote, if everything is fine, then I'll update our HA clusters with this configuration, and after that you can use http://cms-monitoring-ha1.cern.ch:30093 and/or http://cms-monitoring-ha2.cern.ch:30093 for your needs. For reliability purposes we suggest sending notifications to both, but since they are configured as a cluster, clients will only receive a single notification.

My suggestion is that you keep the URL configurable in your stack, such that we can easily change it if we move things around.

@amaltaro
Contributor Author

Trying to compile many comments/questions in this reply.

Please have a look how alerts are defined, e.g. for reqmgr2 we have this rules
https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb/monitoring/prometheus/rules/reqmgr2.rules

Valentin, does it mean we need to create a new "*rules" template under that repository for every type of alert our service needs to generate?

Valentin/Erik, does it make sense to group all our WMCore alerts under the tag=wmcore tag? We would then use the service label to indicate which service generated the alert, thus service=reqmgr2|workqueue|ms-transferor|and so on?

severity=medium would send a mail to cms-service-production-admins AT cern.ch
severity=high would send a mail and send to #alerts-dmwm

How about:
severity=medium|low : simply add the alert to the Slack #alerts-dmwm channel
severity=high : add the alert to the Slack #alerts-dmwm channel AND send an email to a given group / list of recipients (I don't think production-admins is the best choice for those)
?

and search for dmwm-admins and dmwm-slack

I must be doing something wrong over there, because I cannot filter any alerts with those strings. Is there any specific syntax for that (env=production does filter stuff, for instance)?

Valentin, can you please clarify how the alert expiration flag works? If there is an alert that has not yet expired and our service generates another one, does it get fired? What happens when it expires: does it get RESOLVED in Slack, or what? Anything else important that we need to know about this property?

@vkuznet
Contributor

vkuznet commented Feb 18, 2021

Alan, let's walk through all your questions:

  • *.rules files are for creating alert rules for metrics in Prometheus; they have nothing to do with the alerts you send. If you read the rules files you'll see how it is defined: it takes some metric condition and generates an alert in Prometheus with labels, a description and a summary.
  • I don't really care how you assign the tag, label, etc. values; you should think about the global picture of your alerts and assign them appropriately. If you examine the rules files you'll see that over there we use
      severity: high
      tag: cmsweb
      service: reqmgr2
      host: "{{ $labels.host }}"
      kind: dmwm

which assigns the tag cmsweb (the alert comes from cmsweb), the service reqmgr2, and the kind dmwm. For alerts you generate within your code you may have a completely different set of values for those; it is not required to have identical values in the rules and in your own alerts. We use tag=cmsweb in our Prometheus rules for all systems running on cmsweb.

  • About severity, again, it should be your decision. I can adjust the configuration to any values.
  • Regarding dmwm-admins and dmwm-slack: I referred to the AM configuration under https://cms-monitoring.cern.ch/alertmanager/#/status and not to the alerts. For alerts (the main page of the AM) you can use different filters based on the labels your alerts carry. For instance, if you use tag=wmcore and severity=high you can enter these values in a filter. The example there, env=production, represents a tag-value pair used in an alert.
  • The lifetime of an alert fired by a Prometheus rule can be understood from the diagram (see page 8). The lifetime of an alert generated by amtool is defined only by the --end=$expire flag you provide; once the alert expires it will be marked as resolved. If your system generates an alert with the same labels/attributes before its expiration, the alert will be updated and will stay active up to the new expiration time.

Please note that you can also use amtool as a CLI tool to query and silence your alerts. For instance:

/cvmfs/cms.cern.ch/cmsmon/amtool alert query tag=test --alertmanager.url=http://cms-monitoring.cern.ch:30093
Alertname      Starts At                Summary
vk_test_alert  2021-02-18 14:59:12 CET

You can query your alerts using either amtool, as shown above, or the web interface; e.g. here is a query for a specific receiver:

https://cms-monitoring.cern.ch/alertmanager/#/alerts?receiver=test

and here is one with different filters:

https://cms-monitoring.cern.ch/alertmanager/#/alerts?silenced=false&inhibited=false&active=true&filter={severity="medium", tag="test"}

You can go to our AM web page and play with filters to understand their behavior using different alerts.

@goughes
Contributor

goughes commented Feb 19, 2021

Hi @vkuznet, could you update the email associated with the two AM rules you created to point to my CERN mail (egough AT cern.ch) so I can test my wrapper without mailing the larger group?

@vkuznet
Contributor

vkuznet commented Feb 19, 2021

Erik, instead of changing the existing dmwm channels, I created a new one for you. So, if you use the user=erik tag-value pair instead of tag=microservices, then the AM will route the alert to your email address. If you use both user=erik and tag=microservices, then the alert will be routed to both. Feel free to test it with your own channel.

I can create as many individual channels as necessary, including Slack, but you should explicitly tell me what you want. So far I have only added the email channel.

@vkuznet
Contributor

vkuznet commented Feb 22, 2021

The AM API endpoint is /api/v1/alerts, and you can post to it the JSON format documented here:
https://www.prometheus.io/docs/alerting/latest/clients/
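For reference, a minimal sketch of posting an alert directly to that endpoint from Python, following the JSON payload format described in the Prometheus clients documentation linked above (the function name and arguments are illustrative):

import json
try:
    from urllib.request import Request, urlopen   # py3
except ImportError:
    from urllib2 import Request, urlopen           # py2

def postAlert(amUrl, labels, annotations, startsAt, endsAt):
    """POST a single alert to the AlertManager /api/v1/alerts endpoint."""
    payload = [{"labels": labels,            # e.g. {"alertname": "test", "tag": "microservices"}
                "annotations": annotations,  # e.g. {"summary": "..."}
                "startsAt": startsAt,        # RFC 3339 timestamps
                "endsAt": endsAt}]
    req = Request(amUrl + "/api/v1/alerts",
                  data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    return urlopen(req).read()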

@vkuznet
Contributor

vkuznet commented Feb 22, 2021

And I don't really see a problem with making a custom function for py2 to create RFC 3339 timestamps.
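For example, a py2/py3-compatible sketch along those lines (assuming a UTC timestamp with a trailing "Z" is acceptable to the AM):

from datetime import datetime, timedelta

def rfc3339Timestamp(minutesFromNow=5):
    """Return a UTC timestamp in RFC 3339 format, e.g. 2021-02-22T15:04:05.123456Z"""
    ts = datetime.utcnow() + timedelta(minutes=minutesFromNow)
    return ts.strftime("%Y-%m-%dT%H:%M:%S.%f") + "Z"

This mirrors what the date -d '+5 min' --rfc-3339=ns call does in the bash script earlier in this thread.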

@goughes changed the title from "MicroServices fail to send email alerts from kubernetes/docker containers" to "MicroServices throw UnboundLocalError when sending email alerts from kubernetes/docker containers" on Mar 8, 2021
@goughes
Contributor

goughes commented Mar 8, 2021

Following our discussion in the team meeting earlier today, I have renamed this issue to reflect the initial bug and will close it. The fix was in Alan's PR: #10244

A new issue to track the work on the transition to AlertManager for alerting is here: #10340
