Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alarm events should be able to take on multiple values (+add other missing Alarm metadata) #57

Open
edwardchapin opened this issue Mar 13, 2018 · 7 comments

Comments

@edwardchapin
Copy link

My understanding from the CSW Final Design documentation is that an alarm should be able to take on multiple values (e.g., good, ill, bad, critical). TMT then monitors all of the alarms associated with a software component, and uses the worst alarm level to derive health (e.g., good, ill, bad).

At the moment, it looks like one can only define an alarm event with a particular severity, e.g.,

    {
      name = temperature
      description = "EE air temperature is outside acceptable limits"
      severity = major
      archive = true
    }

Presumably the "temperature" alarm state would be considered good until this event is published.

It seems like it should be possible to define an alarm: (i) without specifying a severity; or (ii) specify an array of possible values that it might take on, e.g.,

    {
      name = temperature
      description = "EE air temperature is outside acceptable limits"
      possibleSeverity = "good,warning,major,critical"
      archive = true
    }
@abrighton
Copy link
Contributor

Yes, I think you are right. You should be able to specify a list of possible severities.
The other thing I noticed is that the alarm schema has [none, minor, major, invalid] as possible values, while the CSW prototype uses Disconnected, Indeterminate, Okay, Warning, Major, Critical (although only the last three would need to be declared).

We could also interpret the severity value to mean "the most critical severity".

Otherwise, I can change the alarm schema to require an array for severity (and maybe also change the name to something like possibleSeverity).
Existing ICDs would then need to be updated to use the array syntax: [ warning, major, critical ].
(We could also use a comma separated string, but an array would make more sense there.)

@edwardchapin
Copy link
Author

Is there a place where the intended meaning of each alarm level is defined?

We are currently updating a bunch of ICDs for the NFIRAOS final design review anyway, so we're fine with fixing up everything to handle a potential schema change.

I think I like the idea of: no possibleSeverity => alarm can take on any of the values specified in the schema, otherwise you specify an explicit array of the possible values.

@abrighton
Copy link
Contributor

In the prototype there are comments in the AlarmModel class (The alarm service has not been implemented in csw-prod yet):

    /**
     * The component associated with the alarms is currently not executing.
     * (i.e.: The severity value expired because it was not refreshed within the defined amount of time)
     */
    case object Disconnected extends SeverityLevelBase(-2, "Disconnected")

    /**
     * The conditions for the alarm can not be determined with any reliability by the component.
     */
    case object Indeterminate extends SeverityLevelBase(-1, "Indeterminate")

    /**
     * Normal operations
     */
    case object Okay extends SeverityLevelBase(0, "Okay")

    /**
     * An alarm condition that essentially has no required response during an observing night,
     * but should be handled by day staff during the following day.
     */
    case object Warning extends SeverityLevelBase(1, "Warning")

    /**
     * An alarm condition requires action within 30- 60 minutes. Alarm may not impact
     * observing but may degrade system performance until handled. May get worse over time.
     */
    case object Major extends SeverityLevelBase(2, "Major")

    /**
     * An alarm condition requires action soon. Observing cannot continue with critical alarms.
     */
    case object Critical extends SeverityLevelBase(3, "Critical")

@abrighton
Copy link
Contributor

The comments were based on the specification that Kim wrote:
OSW TN019 - ALARM SERVICE PROTOTYPE DESIGN
TMT.SFT.TEC.16.004.REL01.DRF02

@tmtsoftwareadmin
Copy link

tmtsoftwareadmin commented Mar 15, 2018 via email

@abrighton
Copy link
Contributor

OK, in that case, please continue to use the master branch for now. We will probably still add the additional fields, but we can discuss that.

@jasonweiss
Copy link

We have chosen to implement the Alarm Service as originally designed, that is, with each alarm having a set of supported/allowed severities. This doesn't preclude developers to create alarms such that they can only have one severity, if desired, to align with the design described in Kim's comment above from March 15, 2018.

With that decision, Ed's suggestion still stands. In addition, the rest of the Alarm metadata fields from the CSW implementation should be added to the models.

@jasonweiss jasonweiss changed the title Alarm events should be able to take on multiple values Alarm events should be able to take on multiple values (+add other missing Alarm metadata) Oct 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants