Add incident commander role + more steps to support process #422

Merged
choldgraf merged 17 commits into main from service-process on Jun 27, 2022

Conversation

Member

@choldgraf choldgraf commented May 16, 2022

Context

Our current support process is under-specified given the different kinds of support issues that we may get. This PR adds a few major concepts to our Support process documentation:

  • Change Requests, Incidents, and Guidance Requests are a way of categorizing and prioritizing support tickets
  • Incident Response process is a process we'll follow to resolve incidents
  • Non-incident response process is a process for triaging and prioritizing Change Requests

It also adds some extra contextual information, terminology, etc to help us get on the same page.

What are we missing

  • There is not a clear picture of how support-related work items get routed to specific experts on our team
  • ...anything else we're missing?

ref: 2i2c-org/infrastructure#1068 (comment)

also related to: 2i2c-org/docs#143 and 2i2c-org/infrastructure#1118

closes 2i2c-org/infrastructure#1154 closes 2i2c-org/infrastructure#1155

- They assess whether they can resolve it quickly, and potentially do so.
- If they cannot resolve it, then we raise this support issue with our engineering team.
- If the issue is an {term}`Incident`
- We will prioritize resolving it over everything else.
Member

I like this sentence!

:::
:::{note}
We currently keep this term intentionally vague, and ask that communities are respectful of our time when making change requests.
We are investigating the support budget that we should give to each community, and will update here when we have specific numbers in mind.
Member

<3


1. **Acknowledge the incident**. Communicate with the Community Representative that there is an incident. Here is a template to get started:

```
Member

We can make this a freshdesk template too?

Member Author

If that is possible then I agree! I haven't looked into this but if it exists, then we should make a template and that should be the "source of truth"

@choldgraf
Member Author

choldgraf commented May 17, 2022

Update: added incident commander

After some feedback I've made some more edits with the following main changes:

  • Added the Incident Commander role
  • Separated out separate pages for Support and Incident Response
  • Removed our dedicated "Roles for the service" page, and instead defined a "Roles / team structure" section in more specific sections
  • General content improvement and clean-up

@yuvipanda want to take a look and we can discuss?

EDIT: argh, the commit didn't get pushed because i'm tethering on my phone, one sec I will try to find wifi

EDIT EDIT: muahahah I have finally gotten the wifi code for the cafe beneath my apartment,

Member

@GeorgianaElena GeorgianaElena left a comment

I think I like the direction this is going and I am curious to see how this will play out in practice, with this process formalized. I left a few suggestions and questions about things that I didn't understand. Thanks for working on this @choldgraf

Co-authored-by: Georgiana Elena <georgiana.dolocan@gmail.com>
@choldgraf
Member Author

Thanks for those comments @GeorgianaElena ! I believe I've addressed them all and also added in a section about handing off IC status to others. Let me know if you have other thoughts or suggestions!

Co-authored-by: Sarah Gibson <44771837+sgibson91@users.noreply.github.com>
Member

@GeorgianaElena GeorgianaElena left a comment

🚀

Member

@yuvipanda yuvipanda left a comment

Thank you very much for writing this up, @choldgraf! I absolutely love it and I think it's an improvement over our current process.

I've left some inline comments about reducing the burden on the support steward when the IC is a different person from the support steward. One of the important outcomes here is to make sure the support stewards don't burn out, and I think the default state when there is a separate IC should be that the support steward is no longer responsible for that (unless they are asked to). For example, if there are multiple ongoing incidents, this puts a particularly big burden on the support steward. The communication overhead may also be significant, as there are now two extra places where communication needs to happen by default. I recognize that this might vary with individual IC style, but I think the default should be that we don't require the support steward to do this. Instead, the IC can call in someone to help with communications - this person can be the support steward, or someone else. I'd rather have us codify that than default to adding another duty to the support steward role.


### External communication

- The {term}`Support Steward` team acts as the primary point of communication with external stakeholders like the {term}`Community Representative`s.
Member

IMO, specifically for incidents, this should be optional. This adds an additional load for the support steward and the incident commander (if they are different people). The IC can request the support steward to act as this if necessary, but otherwise I think we should not require this of the support steward. They should be able to delegate an incident to the incident commander, and then by default continue with their existing role. I think if the IC is the source of truth, they should by default be the person who communicates too - otherwise we're adding an entirely new person to this chain, and that is often extremely frustrating during an incident process for everyone involved.

So my suggestion is that the IC can ask someone else to be the point of contact (support steward or someone else) if they so choose to, but that is not the default.

Member Author

Your reasoning makes sense to me - one concern I have is that the incident commander needs to spend their cycles on resolving the incident, not necessarily also communicating it externally. Maybe the answer is to say the incident commander does this by default, but if they must log off or are otherwise overwhelmed, they should delegate external communication to another team member?

Member

Exactly right! The default is that they do it, but if in their judgement delegating is the right thing (if the extra communication overhead is worth it, as it often is), they can. They just delegate it to someone, not necessarily the support steward.

Contributor

As long as the IC introduces themselves (or is introduced by the support steward) to the ticket submitter, I think it makes sense NOW to make this optional. I also like the idea of delegating the communication to someone else who is not necessarily the support steward (because we do not have a lot of people in support to address other tickets).

Why do I highlight the word NOW? Because when we get a dedicated support team, there should be a clear separation of boxes, IMHO. The support team should be handling the communication with the ticket submitters because they are trained and specialized to interact with people under stress looking for answers. An IC coming from the eng team is not well prepared for that interaction... and that might be a source of issues.

Member Author

yeah - thinking of the Incident Commander docs from PagerDuty, I think this is the "external liaison" role. They also have a dedicated person/team to do that work. In our case, I think there are two things I worry about there:

  • We don't have the staffing capacity for this now, but maybe this will change in the future as @damianavila suggests.
  • If people are not awake at the same time, we pay a big communication penalty when we have bottlenecks of information. If one person must be the one to communicate externally, and that person just went to sleep, then it means no communication can occur until they return to work. This feels like a stressful situation given that we don't have the staffing to ensure seamless handoffs between time-zones all the time.

Contributor

Yep, this is why I think this is fine for now but we should change it in the future when we have enough capacity.

Member Author

OK I've updated this so that the Incident Commander is the communicator by default, and may delegate if they wish.

For each {term}`Incident` we create a dedicated issue to track its progress. [{bdg-primary}`open an incident issue`](https://github.com/2i2c-org/infrastructure/issues/new?assignees=&labels=type%3A+Hub+Incident%2Csupport&template=3_incident-report.md&title=%5BIncident%5D+%7B%7B+TITLE+%7D%7D) and notify our engineering team via Slack.
3. **Try resolving the issue** and take notes while you gather information about it.
4. **If after 30 minutes the issue is not solved or you know you cannot resolve it**, ping our engineering team and our Project Manager in the {guilabel}`#support-freshdesk` channel so that they are aware of the incident.
5. **Designate an {term}`Incident Commander`**. If the Support Steward wishes to designate someone other than themselves as Incident Commander, do this in the Incident issue.
Member

i like this.

Contributor

So, this line implies the support steward has the "power" to designate the IC. How will that power be exercised?
I think there should be some known expectations/details around this. For instance, can the designation be rejected, and what do we do if that happens?

Member Author

This is a good point - also important to think about the power dynamics here. If the support steward is a brand new team member with less experience than others, they may not feel comfortable just "delegating" to somebody else.

Is this a role that the Project Manager could play? I recall from 2i2c-org/infrastructure#1068 that one of the case studies there used a workflow like:

  • Support person tries to resolve themselves first
  • If they can't, they open an issue about this and discuss with the team manager (in our case, I think this would be the project manager)
  • Team manager then routes that work item to somebody else on the team.
  • Or if it is more complex, they discuss in their next team standup (I believe it is daily for them) and somebody is assigned to that work item out of that meeting

Contributor

Is this a role that the Project Manager could play?

Yep, probably.
I would suggest still keeping the power to designate in the support steward's hands for the sake of simplicity and quickness... but putting the tie-breaker "power" in the PM's hands if some conflict arises.
I would also encourage the support steward to have a conversation and agreement with the future IC before the designation actually happens.

Member Author

I'll add a note about asking the Project Manager

5. **Designate an {term}`Incident Commander`**. If the Support Steward wishes to designate someone other than themselves as Incident Commander, do this in the Incident issue.
6. **Investigate and resolve the incident**. The Incident Commander should follow the structure of the incident issue opened in the step above.
7. **Delegate to Subject Matter Experts as-needed**. The Incident Commander is empowered to delegate actions to Subject Matter Experts in order to investigate and resolve the incident quickly.
8. **Communicate every few hours**. The {term}`Incident Commander` is expected to communicate incident status and plan with the {term}`Support Steward`s, and the Support Stewards are expected to communicate to the {term}`Community Representative`s. They should provide periodic updates to communities as we attempt to resolve the incident. Here is a template to get started:
Member

See comment above about removing the support steward from the line of fire here by default. I think it is especially important for us, given our diverse timezones, that we reduce communication overhead by default.

@choldgraf
Member Author

OK I believe that I've addressed each of the comments above! Please let me know if that makes sense or if you'd like to see any other edits!


To designate another team member as the Incident Commander, follow these steps:

1. **Confirm with them** that they are able and willing to serve as the Incident Commander
Contributor

Suggested change
1. **Confirm with them** that they are able and willing to serve as the Incident Commander
1. **Confirm with them** that they are able and willing to serve as the Incident Commander.

When a support request is made that requires an action from a 2i2c engineer, a Support Steward should describe this change in a GitHub issue, and add it to the [Sprint Board](coordination:sprint-board).
Think about an engineering team member that likely has the skills and capacity needed, and ask them if they are willing to take on resolving this issue.
Try not to ask the same person for support help many times in a row - we should spread the work needed to address support issues across the team.
Here is a rough idea of our rationale to follow for arriving at a specific number:
Contributor

You mention arriving at "a specific number" but I think you did not mention that specific number above, is that intended?

From your previous comment, that number would be:

we could say that those various hub types correspond to 34/20 ~= 1 hours, or 34 / 8 ~= 4 hours of support each month.

I know you can derive that from the rationale you have included but I feel it needs a conclusion like the above one.

Member Author

I'll try to make it clear that this number is not yet specific, and that we want to keep it as a rationale only, but no precise numbers.

@choldgraf
Member Author

choldgraf commented Jun 16, 2022

Hey all - I've addressed @damianavila's comment above, and I've also replaced the response template text with links to our FreshDesk templates, which I have just created (that addressed the other comment from @yuvipanda):

https://2i2c.freshdesk.com/helpdesk/canned_responses/folders/80000143608/responses/80000247490/edit

I think that this should now be relatively ready to go unless there are further comments. There might be some link failures that are dependent on another PR to get in, but I'd prefer if we solve them in follow-ups rather than block this PR if that's OK


```{button-link} https://2i2c.freshdesk.com/helpdesk/canned_responses/folders/80000143608/responses/80000247490/edit
Contributor

This link is "not found" for me, it redirects to https://2i2c.freshdesk.com/a/notfound (and I am logged in).

Member Author

oops, I think I fixed this

@choldgraf choldgraf changed the title from "Update our support process" to "Add incident commander role to support process and define more explicit steps" on Jun 20, 2022
@choldgraf
Member Author

Question: should this be in docs.2i2c.org/about ?

I made a few more edits to this content, and I find myself naturally wanting to look for it in the "About the service" section of our 2i2c service documentation (somewhere in here: https://docs.2i2c.org/en/latest/about/overview.html). I feel like this section describes the human infrastructure and strategy behind the service, and is a hybrid of documentation for our team (to know what to follow) and other communities (to know what to expect).

Do others agree that this content would make more sense inside docs.2i2c.org ?

If so, I am curious what you think about the other content in the "Managed Hubs Service" section in our Team Compass. I kind-of feel like this could also live in docs.2i2c.org since much of it is 2i2c-facing as well. If that were the case, we could just have pointers in our team compass to those docs.

Curious if others have thoughts

@choldgraf
Member Author

Updates to this PR

I spoke with @damianavila in particular (and others in passing) about the idea of moving these docs to docs.2i2c.org, and in general the consensus seemed to be that we should keep "2i2c-specific" docs in our team compass to avoid cluttering up the service docs at docs.2i2c.org.

So, the latest commit adds a few more updates and reorganizations to try and lean into this separation of duties a bit more. It does a few main things:

  • Removes some sections and definitions here and cross-links to docs.2i2c.org instead, to treat those docs as the source of truth for most things
  • Adds some more glossary entries that we can use here and elsewhere
  • Tries to reframe our relationship with communities as partnerships rather than customer relationships
  • Adds some extra references to our incident response and support models

Member

@yuvipanda yuvipanda left a comment

I love the incident commander stuff, and we can iterate as we move forward.

I'm very confused by 'Collaborative' replacing 'Managed' service though.

@@ -39,15 +39,15 @@ As a start, we wish to launch a **Managed JupyterHubs** service, and will begin
 - 2i2c manages JupyterHubs for at least two institutions.
 - 2i2c manages more lightweight, community-specific JupyterHubs for several smaller groups in research and education.
 - 2i2c manages a "generic" JupyterHub that is not tied to any single institution or group.
-- 2i2c has a beta-level business model for the first iteration of our Managed JupyterHub service.
+- 2i2c has a beta-level sustainability model for the first iteration of our Collaborative JupyterHub Service.
Member

IMO Managed is a more straightforward and broadly accepted term for what we do, while Collaborative is generally overloaded to the point of being a buzzword. I'd suggest we keep calling it 'Managed'

@yuvipanda
Member

I'm happy to get this merged without the s/managed/collaborative/. We can perhaps discuss that in a different PR?

@choldgraf
Member Author

I opened up the issue below to discuss terminology etc, so we can merge this one in:

@choldgraf choldgraf changed the title from "Add incident commander role to support process and define more explicit steps" to "Add incident commander role + more steps to support process" on Jun 27, 2022
@choldgraf choldgraf merged commit 47a9799 into main Jun 27, 2022
@choldgraf choldgraf deleted the service-process branch June 27, 2022 14:10
@damianavila damianavila mentioned this pull request Jul 11, 2022
6 tasks
5 participants