Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incidents may get stuck open #935

Open
2 tasks
elfjes opened this issue Nov 7, 2024 · 2 comments · May be fixed by #946
Open
2 tasks

Incidents may get stuck open #935

elfjes opened this issue Nov 7, 2024 · 2 comments · May be fixed by #946
Assignees

Comments

@elfjes
Copy link
Collaborator

elfjes commented Nov 7, 2024

We have a couple of incidents that the source system cannot close (ie END event). When sending an END event, the event is happily accepted, and nothing happens. Upon further investigation, I tracked it to the following code in EventViewSet.perform_create

        try:
            self.validate_event_type_for_incident(event_type, incident)
        except serializers.ValidationError as e:
            # Allow any event from source systems (as long as the user is allowed to post the event type),
            # even if the posted type is invalid for the incident - like if an `INCIDENT_END` event
            # is sent after the incident has been manually closed
            if not user.is_source_system:
                raise e

In case of an error (ie The incident already has a related event of type 'END'.) the end event is created, but not applied. The associated exception is swallowed. It would be really nice if instead of just dropping the exception, at least a message would be logged.

The incidents that can't be ENDed currently have an end_time == infinity and should be able to be close. Apparently something went wrong during the first END event, although I haven't figured out what yet. Unfortunately our logs don't go back long enough to see what happened the first time the incident was supposed to be ENDed so I can't pinpoint what was the issue there

I think that sending an END event now should still close the incident. Maybe it's better to look at the current end time, and end the incident if end_time==infinity, regardless of whether an END event was previously created. (Perhaps the same holds true for START events, but I'm not sure). Now we've ended up (pun intended) in a situation where an incident cannot be closed by the source system.

So, my suggestions

  • log a message when an event is accepted but ignored
  • update the logic on when an end event is excecuted based on its current end_time
@hmpf
Copy link
Contributor

hmpf commented Nov 13, 2024

Absolutely there should be logs. I can see several reasons why this happens but we need to log to find others.

A cause I can see: Both END and CLOSED sets end_time. An event can be CLOSED before a source system sets END so then I would say: let the END be created but keep the end_time as is, if it isn't already like that.

Another cause I can see: If end_time is still infinity even if there is an END- or CLOSE-event then we have been bitten by not having been atomic enough (those will be easier once we are rid of the current way of sending notifications. Not this year).

A third cause: race-conditions.

Fix: on a new END-event where there is already an END/CLOSED and end_time is still "infinity":

If there are no REOPEN events, set end_time to the timestamp of the oldest END/CLOSED. If there are REOPEN events, ignore END/CLOSE happening before the latest REOPEN and look at the oldest remaining END/CLOSED. There's hopefully a better way to set a write-lock on the incident-row instead of the "atomic" decorator/context manager.

We could maybe have a data migration to repair existing such in the database, and the actual code doing it should be a utils function so that it can also be run from elsewhere, like a management command or admin.

@hmpf
Copy link
Contributor

hmpf commented Nov 13, 2024

If however there is already a CLOSED-event, end_time not infinity and another CLOSED-event arrives, that sounds like a bug or a hack-attempt (probably trivial via the admin, it has not been locked down I suspect). Log, including user pk, and toss/raise error.

@hmpf hmpf linked a pull request Nov 15, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this to 📋 Backlog in HTMXify Argus Jan 6, 2025
@lunkwill42 lunkwill42 self-assigned this Jan 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏗 In progress
Development

Successfully merging a pull request may close this issue.

3 participants