Define some first-line and second-line support processes #1068

choldgraf · 2022-03-08T19:56:07Z

Background and proposal

There are often cases where our support process is under-documented. For example, a few questions that people weren't sure how to answer:

How should we prioritize support requests?
What should we do if a request is not immediately "closeable"? Or if it requires ongoing follow-up work?
How can we communicate our inability to fulfill a request?
What kind of communication should we use throughout the support process?

We should document some rough guidelines for these common questions, and also provide references to documentation about how to carry out first-line support.

Implementation guide and constraints

Another way to approach this is to ask "what are some common support situations, and what should we do in each situation?" We can draw from our experiences thus far to agree on some team practices to follow.

Updates and ongoing work

Add items below as we learn more

Do some research into support processes at other orgs (see refs below)
Define separate support and operations teams #1155
Define support ticket urgency levels and practices for how to deal with them #1154

Refs

Distributed communication and incident response write-up
Example support process from Sarah
Example support process from Chris
Wikimedia clinic duty
We have a semi-related issue to this here: Define escalation practices when there are hub outages #1118

choldgraf · 2022-03-08T19:56:31Z

cc @sgibson91 who I believe has some notes about support processes at other organizations that we could use for comparison.

sgibson91 · 2022-03-09T16:51:48Z

Notes on Managed Service Provider (MSP) first-line support protocols

I had a chat with my partner, who has done first-line support at a couple of companies, and these are my notes from that discussion.

Note: This will be from an industry perspective and will not be entirely applicable to us. We should pick and choose what we think would work for us.

Responding to tickets

We should set up Freshdesk to send an auto-response to all incoming tickets confirming that we have received the ticket and issuing a reference number (if it does not already do this?). (Sarah is not sure which emails get sent to clients when certain buttons are clicked, and this fuels some of her anxiety.)
First response from the Support Steward should be within a given timeframe of the ticket coming in, corrected for active support hours (i.e., what timezone the person providing support is in vs. the support requester). An example that would be entirely unrealistic for us: First response from the Support Steward should be within half an hour of the ticket coming in.
Escalating to third parties (e.g., GCP or Azure...)
- Support Steward would provide a message to the client indicating they've escalated the ticket to a third party and are waiting on a response
- Ticket state is "On Hold"

Prioritisation of tickets

A priority queue with all integers set to maximum is just a queue...

P5: Service request - This is an "I would like..."-style ticket
P4: "I have a problem and it affects only me"
P3: More than 5 people or a whole department can't do the same thing.* (Or request comes from a VIP)
P2: Issue that affects multiple departments but not the whole company. Perhaps more than one service is down.
P1: No one can work. Everything is on fire.

Note: If there is a temporary work-around for a P3 problem, you can apply this fix and downgrade the ticket to a P4 while working on the real fix. Often this is when tickets get pulled out into projects with proper project management as this can affect SLAs agreed with the client. This project work is then billed back to the client.
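As a rough illustration of how the levels above could be applied, here is a minimal Python sketch. The `Ticket` fields and the `classify` function are hypothetical names invented for this example, and the order of the checks (for instance, a VIP request counting as P3 even when it is a service request) is one possible reading of the list, not an agreed rule.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    P1 = 1  # No one can work
    P2 = 2  # Multiple departments affected, but not the whole company
    P3 = 3  # More than 5 people, a whole department, or a VIP
    P4 = 4  # Affects only the requester
    P5 = 5  # Service request ("I would like...")


@dataclass
class Ticket:
    # All fields are hypothetical; a real ticket system will differ.
    people_affected: int = 1
    whole_department: bool = False
    departments_affected: int = 1
    whole_company_down: bool = False
    from_vip: bool = False
    is_service_request: bool = False
    has_workaround: bool = False


def classify(ticket: Ticket) -> Priority:
    """Map a ticket onto the P1-P5 levels described above."""
    if ticket.whole_company_down:
        return Priority.P1
    if ticket.departments_affected > 1:
        return Priority.P2
    if ticket.people_affected > 5 or ticket.whole_department or ticket.from_vip:
        # A P3 with a temporary work-around applied can be downgraded to P4.
        return Priority.P4 if ticket.has_workaround else Priority.P3
    if ticket.is_service_request:
        return Priority.P5
    return Priority.P4
```

For example, `classify(Ticket(people_affected=6, has_workaround=True))` gives `Priority.P4`, matching the downgrade described in the note.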

SLAs regarding timeframes for carrying out support-related tasks

These are the timeframes we agree with a client for resolving problems. They should be defined in number of working hours within our active support hours bracket (I'm cheating in the example below).

This is another set of example timeframes that would be unrealistic for us to adopt:

P5: Resolve within 5 working days unless it needs to be a project
P4: Resolve within 3 working days
P3: Resolve within 1 working day
P2: Maximum 4 working hours to resolution
P1: ~1-2 working hours to resolution
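To make "working hours within our active support hours bracket" concrete, here is a small sketch of turning those example timeframes into a due date. The 09:00-17:00, Monday-to-Friday bracket, the 8-hour working day, and the names `SLA_WORKING_HOURS` and `add_working_hours` are all assumptions made for illustration; P1 uses the upper bound of the "~1-2 hours" range. Datetimes are naive, so timezone handling is left out.

```python
from datetime import datetime, time, timedelta

WORK_START = time(9, 0)   # assumed start of active support hours
WORK_END = time(17, 0)    # assumed end of active support hours
HOURS_PER_DAY = 8         # assumed length of a working day

# Example timeframes from the list above, expressed in working hours.
SLA_WORKING_HOURS = {
    "P5": 5 * HOURS_PER_DAY,
    "P4": 3 * HOURS_PER_DAY,
    "P3": 1 * HOURS_PER_DAY,
    "P2": 4,
    "P1": 2,
}


def add_working_hours(start: datetime, hours: float) -> datetime:
    """Return the moment `hours` working hours after `start`."""
    remaining = timedelta(hours=hours)
    current = start
    while True:
        day_start = datetime.combine(current.date(), WORK_START)
        day_end = datetime.combine(current.date(), WORK_END)
        if current.weekday() >= 5 or current >= day_end:
            # Weekend or after hours: move to the start of the next day.
            next_day = current.date() + timedelta(days=1)
            current = datetime.combine(next_day, WORK_START)
            continue
        if current < day_start:
            current = day_start
        available = day_end - current
        if remaining <= available:
            return current + remaining
        # Use up the rest of today and carry the remainder forward.
        remaining -= available
        next_day = current.date() + timedelta(days=1)
        current = datetime.combine(next_day, WORK_START)
```

Under these assumptions, a P2 ticket opened on a Friday at 16:00 would be due the following Monday at 12:00: one working hour on Friday plus three on Monday.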

Communications during P2/P1 events

P2: Provide updates every 2 hours
P1: Provide updates every hour
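The cadence above could be sketched as a schedule of update times between the start of an event and its resolution. The names `UPDATE_INTERVAL` and `update_times` are hypothetical, and the sketch assumes updates are counted from the moment the event starts.

```python
from datetime import datetime, timedelta

# Update cadence from the list above.
UPDATE_INTERVAL = {
    "P1": timedelta(hours=1),
    "P2": timedelta(hours=2),
}


def update_times(started: datetime, resolved: datetime, priority: str) -> list:
    """List the times at which a status update is due before resolution."""
    interval = UPDATE_INTERVAL[priority]
    times = []
    due = started + interval
    while due < resolved:
        times.append(due)
        due += interval
    return times
```

Each entry is a point at which an update should go out, even if the message is still "we're investigating".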

Have a comms template for these events. From what was described to me, this is very similar to the top half of our Hub Incident issue template. Includes a timeline of the event, the symptom reported, etc. These comms should still go out regularly, even if the update is still "we're investigating".

What channels do these comms go out on? Mailing lists, forum posts, twitter?

After the event is resolved, compile and release an autopsy report. This is very similar to the second half of our Hub Incident template. It covers what went wrong, how it went wrong, and what steps would prevent it happening again, or minimise disruption if prevention isn't possible.

Support Tiers

Pay as you Go
Tier 1
- "break & fix" (something is broken, we fix it)
- Charge £X per Y hours per month
Tier 2
- break & fix
- provide some recommendations for improvement
- Charge £X+dX per Y+dY hours per month
Tier 3
- Unlimited support hours per month
- Extra services such as 24/7 monitoring, automatic upgrades

This is potentially something we could fit into our alpha service pricing matrix

Guidelines for what is in scope of our support model

We need to define these (maybe on a per client basis?) and be strict about it.

The "maximum capitalist company" thing to do might be:

If a problem is found to be a client's fault
- We offer guidance to help them fix it
- If they really want us to do the work, fine but we have a flat hourly rate of £X

I'm not 100% convinced that's the right thing for 2i2c to do, just presenting it here as one aspect I learned about.

One guideline I do think we should implement is that clients should give us at least two weeks' notice of when a requested low-priority change needs to be ready, e.g., if we are managing their image and they request a package. We should avoid the situation where a request for a package comes in and, 2 days later, the client says they really need it, while in the meantime we've been battling build/push timeouts or whatever. If a low-priority request comes in with less than two weeks' notice, then we make no promises to have it ready by the time the client needs it.
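The two-week guideline is simple enough to express as a check. This is only a sketch; `MIN_NOTICE` and `can_commit_to_deadline` are hypothetical names, and it assumes notice is measured in calendar days from the date of the request.

```python
from datetime import date, timedelta

MIN_NOTICE = timedelta(weeks=2)  # minimum notice for low-priority changes


def can_commit_to_deadline(requested_on: date, needed_by: date) -> bool:
    """True if a low-priority request gives at least two weeks' notice."""
    return needed_by - requested_on >= MIN_NOTICE
```

A request made on 1 March and needed by 3 March would return `False`, meaning we would make no promise about the deadline.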

Disadvantage

(And a pretty big one IMO) This style of MSP support requires time-tracking in order to bill the client for, e.g., overtime on support hours, projects generated from support tickets, etc.

choldgraf · 2022-03-10T00:22:25Z

Some info from a friend at a tech company

I also had a conversation about this with a friend at a tech company that runs internal and external software services (not mentioning the company's or friend's name for privacy purposes). Here's a short breakdown of their process:

Roles they use

An engineering team broadly understands the infrastructure behind the service. They do a combination of development and operations.
An operations role is a member of this team that will attempt to resolve all operational issues first. This rotates weekly through the team.
A support person is a dedicated role that is always held by the same (non-engineer) person. This person communicates with the customers and forwards information to the Operations Role on the engineering team for resolution.
A project manager largely oversees timelines and planning around development efforts, but may decide to change deadlines if enough support issues pop up that the team doesn't have time to meet them.
A team manager is more like a "line manager", they help with the team processes but focus more of their efforts on making sure team members are supported, on a good career path, etc.

When a support request comes in

Here's what happens:

The support person interacts with the person that made the request - they respond and try to identify what's going on.
If an action needs to be taken, they contact the operations member for that week, who tries to resolve the issue.
If they cannot resolve the issue, they speak to the Project Manager, who identifies another member of the team to help with the issue.
If they cannot find another person to assign to the issue, it goes into a list of open operations issues with no owner.
Each week, they have a triage meeting led by the support person. The outcome of that meeting is that every operations / support issue must have an owner.
In this meeting, the Project Manager may help estimate the capacity of team members and suggest a person to take on an issue if none of the engineers want to pick it up. Apparently there are often "awkward silences" where they wait for somebody to volunteer to do something 😆
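The hand-offs described above can be sketched as a small routing function. This is an illustration only: `find_owner` and its parameters are hypothetical, and a return value of `None` stands for "goes on the list of unowned operations issues until the weekly triage meeting".

```python
from typing import Optional


def find_owner(
    needs_action: bool,
    ops_member: str,
    ops_can_resolve: bool,
    pm_suggestion: Optional[str],
) -> Optional[str]:
    """Return who owns a request, or None if it waits for weekly triage."""
    if not needs_action:
        # The support person handles the conversation themselves.
        return "support person"
    if ops_can_resolve:
        # This week's operations member resolves it.
        return ops_member
    # Otherwise the Project Manager tries to find someone else on the team.
    return pm_suggestion
```

For example, `find_owner(True, "alice", False, None)` returns `None`: nobody could be assigned, so the issue must get an owner at the next triage meeting.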

How this interfaces with development

They do development in parallel with these operations tasks, and each team member has ongoing projects with deadlines associated with them.

Occasionally, there are enough operational tasks that they realize they won't hit their deadlines. When this happens, the Project and Team Managers discuss with one another and agree on a plan forward, potentially to move back the deadlines for their projects.

damianavila · 2022-03-17T13:55:40Z

Thank you both for sharing these pieces of information!
Everything described here pretty much aligns with my own previous experiences!
I think several of the pieces detailed above could be adopted, with adjustments according to our current state and our mission/vision.

choldgraf · 2022-05-11T16:32:26Z

Update

@yuvipanda had some great ideas in #1154 for steps in this direction, along with process notes shared from Sarah. I think we should make a quick push to document some of that, since the content is mostly there. We could also use this as an opportunity to update our SLA docs a little bit to make them more clear.

choldgraf · 2022-06-30T08:03:12Z

I'm going to close this one and say it was completed by merging the following PR:

Add incident commander role + more steps to support process team-compass#422

We can continue to iterate on these team processes over time!

choldgraf added the 🏷️ team-process label Mar 8, 2022

sgibson91 changed the title ~~Define some first-tier and second-tier support processes~~ Define some first-line and second-line support processes Mar 9, 2022

choldgraf mentioned this issue Mar 15, 2022

Define escalation practices when there are hub outages #1118

Open

choldgraf mentioned this issue Mar 29, 2022

Define support ticket urgency levels and practices for how to deal with them #1154

Closed

This was referenced May 11, 2022

Team process for support using FreshDesk 2i2c-org/team-compass#167

Closed

Grow our support capacity 2i2c-org/team-compass#420

Open

choldgraf self-assigned this May 11, 2022

choldgraf mentioned this issue May 16, 2022

Add incident commander role + more steps to support process 2i2c-org/team-compass#422

Merged

2 tasks

choldgraf closed this as completed Jun 30, 2022

damianavila mentioned this issue Jul 11, 2022

[blog] Quarter 2 update 2i2c-org/team-compass#452

Closed

6 tasks
