This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

DWR Monitoring, Alerting and Issue Resolution Strategy

George Scheer edited this page Jun 10, 2019 · 1 revision

Goals

Enable DWR to monitor, identify and rectify most, if not all, of the DWR GOES17 data issues.

Justification

Because GOES17 data is significantly larger and arrives more frequently than GOES15 data, data processing for Spatial CIMIS carries a significantly higher probability of data corruption. For this reason, prematurely promoting the DWR GOES17 Spatial CIMIS processes to production / live status would unnecessarily put the team (DWR & UCD) on endless alert and could delay the delivery of ETo data to customers.

What is the strategy?

Before the DWR GOES17 Spatial CIMIS processes are promoted to production / live status, DWR should demonstrate that it can go 2 weeks without a major processing issue while addressing live data delivery issues in a timely manner, so as to avoid data loss and interruptions to its ETo delivery responsibilities.

To accomplish this, the following strategy should be considered:

  • Designate DWR personnel for the alert team
  • Create an alert email list populated with the alert team
  • DWR alert team members should set up Pushover
  • DWR alert team members should actively participate in the Spatial CIMIS Slack Channel
  • Develop a monitoring system such as the UCD status page
  • There should be basic monitoring of critical systems (ping, http, ssh, etc.)
  • There should be monitoring of the quality of real-time data, as is done at UCD, because poor-quality input could produce erroneous data
  • Receive training on how to identify and resolve issues once an alert is sent out.
  • Currently there is documentation on how to resolve every known issue at UCD.
  • Work on solutions to better handle existing issues is ongoing.
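
As an illustration of the basic monitoring item above (ping, http, ssh), the following is a minimal sketch in Python using only the standard library. It is not the UCD status page implementation. The host name, the check names, and every function name are hypothetical and chosen for this example. A TCP connection attempt stands in for ping here, since sending ICMP packets usually requires elevated privileges.

```python
import socket
import urllib.error
import urllib.request


def check_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_http(url, timeout=5.0):
    """Return True if the URL (after any redirects) answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP 4xx/5xx, network errors and malformed URLs all count as down.
        return False


def run_checks(checks):
    """Run each named check; a check that raises is recorded as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def failed_checks(results):
    """Return the sorted names of all checks that did not pass."""
    return sorted(name for name, ok in results.items() if not ok)


def build_checks(host):
    """Assemble example checks for one host (the check set is an assumption)."""
    return {
        "ssh": lambda: check_tcp(host, 22),
        "http": lambda: check_http(f"http://{host}/"),
    }
```

A scheduled job could call `failed_checks(run_checks(build_checks("dwr-host.example.org")))`, where the host name is a placeholder, and notify the alert email list or Pushover whenever the returned list is not empty. Checks on the quality of the real-time data itself would need domain-specific rules and are not covered by this sketch.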