Command Workflow: Incident Response/Communication #698

Closed
imbriaco opened this issue May 17, 2016 · 0 comments

This issue contains a proposed workflow and set of commands that would be built to support the workflow. This should be decomposed into a number of additional issues, likely one per command, which cover the specific details related to implementing those commands.

The example command lines included are just that, examples. We should feel free to iterate on the UX around the command interface and make improvements as we use them in practice while they're being built.

Scenario Assumptions:

  • Pingdom monitor: http://example.com
    • Alert: PagerDuty - Service = Web
    • Alert: Webhook - Cog Trigger
  • PagerDuty
    • Service: Web - Jane on-call
    • Service: Support - Fred on-call
  • StatusPage.io
    • Component: website
    • Component: email
  • Twitter: @ExampleComSupport

Scenario:

Example.com site goes down. The on-call engineer is paged and fixes the problem while managing status updates and incident state in PagerDuty.

  • Pingdom detects that example.com is DOWN.
    • A PagerDuty incident is created, and Jane is paged.
    • A webhook is sent to a trigger in Cog, which updates a statuspage.io component and writes a message in #ops and #support.
    • filter -p check_params.hostname -m example.com | filter -p current_state -m DOWN | statuspage:component website yellow *> #ops #support
  • Jane enters #ops and acknowledges that she has received the page and is investigating.
    • pagerduty:ack 123
  • Jane investigates the problem and confirms that the site is down.
    • statuspage:incident new -s investigating -c website "Site Outage" "We are currently investigating problems with example.com. We expect to resolve the issue shortly and will post updates as additional information is available. Thanks for your patience." *> here #support
      • Incident Id: 911
  • Jane fixes the problem outside of chat.
    • statuspage:incident update 911 The example.com site is back online. We will monitor the site closely to ensure that the problems are fully resolved. *> here #support
  • ... time passes ...
  • Jane marks the PagerDuty event resolved and closes out the status incidents.
    • statuspage:component status website green
    • statuspage:incident update -s resolved 911 All systems go. *> here #support
    • pagerduty:resolve 123
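
To make the trigger pipeline above concrete, here is a minimal Python sketch of what the two `filter -p ... -m ...` stages are meant to do with the incoming Pingdom webhook. It assumes that `-p` names a dotted path into the payload and `-m` is a regular expression tested against the value at that path; the helper names (`get_path`, `filter_items`, `should_update_status`) are hypothetical, and the payloads contain only the two fields the pipeline references. It is an illustration of the intended behavior, not how Cog implements `filter`.

```python
import re
from typing import Any, Iterable, Iterator


def get_path(item: dict, path: str) -> Any:
    """Walk a dotted path such as 'check_params.hostname'; None if missing."""
    current: Any = item
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def filter_items(items: Iterable[dict], path: str, pattern: str) -> Iterator[dict]:
    """Yield only the items whose value at `path` matches `pattern`."""
    regex = re.compile(pattern)
    for item in items:
        value = get_path(item, path)
        if value is not None and regex.search(str(value)):
            yield item


def should_update_status(payload: dict) -> bool:
    """True when the payload survives both filter stages of the trigger."""
    by_host = filter_items([payload], "check_params.hostname", r"example\.com")
    by_state = filter_items(by_host, "current_state", r"DOWN")
    return any(True for _ in by_state)


# Made-up payloads for illustration only.
down = {"check_params": {"hostname": "example.com"}, "current_state": "DOWN"}
up = {"check_params": {"hostname": "example.com"}, "current_state": "UP"}
other = {"check_params": {"hostname": "other.test"}, "current_state": "DOWN"}
```

Only a payload that passes both stages would reach `statuspage:component`, so an UP notification or an alert for a different host would not flip the component to yellow.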

Other Use Cases:

  • Support team notices a problem, needs help:
    • pagerduty:alert website Customers complaining the site is down. Need help in #support
  • Support team tweets from @ExampleComSupport to let customers know of an issue:
    • twitter:tweet Our engineers are currently investigating problems connecting to example.com. Watch status.example.com for more information
  • Developer wants to know who is on-call for web so they can ask for help with a non-emergency issue without paging:
    • pagerduty:oncall web
  • Developer pushed changes to the website and wants to make sure that it didn't generate any monitoring issues:
    • pingdom:check list | filter -p hostname -m example.com | pingdom:check results $id
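
The last pipeline relies on each check that survives the filter being handed to the next command, with `$id` bound to that check's `id` field. A rough Python sketch of that binding follows; it assumes one invocation of the downstream command per surviving item, and the check records and `fake_results` function are made up for illustration rather than taken from Pingdom.

```python
from typing import Callable, Iterable, List


def run_for_each(items: Iterable[dict], command: Callable[[int], str]) -> List[str]:
    """Invoke `command` once per item, binding the item's 'id' field like $id."""
    return [command(item["id"]) for item in items]


def fake_results(check_id: int) -> str:
    """Hypothetical stand-in for `pingdom:check results $id`."""
    return f"results for check {check_id}"


# Hypothetical output of `pingdom:check list`.
checks = [
    {"id": 1, "hostname": "example.com"},
    {"id": 2, "hostname": "other.test"},
]

# Equivalent of `filter -p hostname -m example.com`.
matching = [check for check in checks if check["hostname"] == "example.com"]

output = run_for_each(matching, fake_results)
```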

Bundles
