
US Digital Services Playbook Play 12

mattkwong-kpmg edited this page Mar 3, 2017 · 10 revisions

# Play 12: Use Data to Drive Decisions

At every stage of a project, we should measure how well our service is working for our users. This includes measuring how well a system performs and how people are interacting with it in real time. Our teams and agency leadership should carefully watch these metrics to find issues and identify which bug fixes and improvements should be prioritized. Along with monitoring tools, a feedback mechanism should be in place for people to report issues directly.

## Checklist

1. Monitor system-level resource utilization in real time

We are using Nagios and Azure built-in tools to monitor system-level resource utilization in real time.
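As an illustration of how such a check is wired up, a Nagios service definition for host CPU load might look like the following. The host name, template, and use of the `check_nrpe` plugin are assumptions for the sketch, not taken from this project's actual configuration:

```
define service {
    use                  generic-service         ; template from Nagios's sample configuration
    host_name            app-server-01           ; hypothetical host
    service_description  CPU Load
    check_command        check_nrpe!check_load   ; assumes the NRPE agent is installed on the host
    check_interval       1                       ; minutes between checks
}
```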

2. Monitor system performance in real time (e.g., response time, latency, throughput, and error rates)

We are using Nagios to monitor system performance in real time.

3. Ensure monitoring can measure median, 95th percentile, and 98th percentile performance

Nagios is capable of monitoring performance in this way, but it was not necessary for this prototype.
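For reference, the median/95th/98th-percentile summary this item asks for is a few lines of arithmetic over raw response times. This sketch uses the nearest-rank percentile method and made-up sample values, not Nagios's own calculation:

```python
import math
import statistics

def latency_summary(samples_ms):
    """Summarize response-time samples (milliseconds) as the checklist
    asks: median, 95th percentile, and 98th percentile."""
    ordered = sorted(samples_ms)

    def percentile(pct):
        # Nearest-rank method: the smallest sample with at least pct%
        # of observations at or below it.
        k = math.ceil(pct / 100 * len(ordered))
        return ordered[max(k - 1, 0)]

    return {
        "median": statistics.median(ordered),
        "p95": percentile(95),
        "p98": percentile(98),
    }

# Hypothetical samples; one slow outlier dominates the tail percentiles,
# which is exactly why tail percentiles matter when the median looks fine.
samples = [120, 95, 210, 180, 90, 3000, 150, 130, 110, 140]
print(latency_summary(samples))
```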

4. Create automated alerts based on this monitoring

We are using Nagios to perform automated alerts based on monitoring results. Nagios can send alerts when critical infrastructure components fail and recover, providing administrators with notice of important events via email, SMS, or custom script.
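An email alert recipient in Nagios Core is typically set up with a contact definition like the one below. The contact name and address are placeholders; `notify-service-by-email` and `notify-host-by-email` are the notification commands shipped in Nagios's sample configuration:

```
define contact {
    contact_name                   oncall-admin            ; hypothetical contact
    alias                          On-Call Administrator
    email                          oncall@agency.example   ; placeholder address
    service_notification_period    24x7
    host_notification_period       24x7
    service_notification_options   w,c,r                   ; warning, critical, recovery
    host_notification_options      d,r                     ; down, recovery
    service_notification_commands  notify-service-by-email
    host_notification_commands     notify-host-by-email
}
```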

5. Track concurrent users in real time, and monitor user behaviors in the aggregate to determine how well the service meets user needs

We have a backlog item (Issue #73) to integrate Google Analytics for this capability.

6. Publish metrics internally

Nagios provides a comprehensive monitoring dashboard for those with access, along with reports and alerts for distributing metrics to internal users as necessary.

7. Publish metrics externally

Nagios provides a comprehensive monitoring dashboard for those with access, along with reports and alerts for distributing metrics to external users as necessary.

8. Use an experimentation tool that supports multivariate testing in production

Not applicable for this prototype.

## Key Questions

  1. What are the key metrics for the service?

Key metrics include memory utilization, CPU utilization, and response latency.

2. How have these metrics performed over the life of the service?

Not applicable for this prototype.

3. Which system monitoring tools are in place?

We are using Nagios and Azure built-in tools.

4. What is the targeted average response time for your service? What percent of requests take more than 1 second, 2 seconds, 4 seconds, and 8 seconds?

The majority of requests take less than 2 seconds.

5. What is the average response time and percentile breakdown (percent of requests taking more than 1s, 2s, 4s, and 8s) for the top 10 transactions?

See the Nagios dashboard for the per-transaction breakdown.
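The threshold breakdown this question asks for is simple to compute from raw response times. A minimal Python sketch, with made-up sample values:

```python
def slow_request_breakdown(samples_s, thresholds=(1, 2, 4, 8)):
    """Percent of requests slower than each threshold (in seconds),
    matching the 1s/2s/4s/8s buckets the key question asks about."""
    total = len(samples_s)
    return {
        t: round(100 * sum(1 for s in samples_s if s > t) / total, 1)
        for t in thresholds
    }

# Hypothetical response times in seconds.
samples = [0.4, 0.9, 1.2, 1.8, 2.5, 0.7, 5.0, 0.3, 9.5, 1.1]
print(slow_request_breakdown(samples))
```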

6. What is the volume of each of your service’s top 10 transactions? What is the percentage of transactions started vs. completed?

Not applicable for this prototype.

7. What is your service’s monthly uptime target?

99%

8. What is your service’s monthly uptime percentage, including scheduled maintenance? Excluding scheduled maintenance?

Not applicable for this prototype.
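For context on the 99% target above, the arithmetic relating a monthly uptime target to its allowed downtime budget can be sketched as follows (a 30-day month is assumed):

```python
def monthly_uptime_pct(downtime_minutes, days_in_month=30):
    """Uptime percentage for a month with the given total downtime."""
    total = days_in_month * 24 * 60
    return 100 * (total - downtime_minutes) / total

def allowed_downtime_minutes(target_pct, days_in_month=30):
    """Downtime budget (minutes) implied by an uptime target."""
    total = days_in_month * 24 * 60
    return total * (100 - target_pct) / 100

# A 99% monthly target allows 432 minutes (7.2 hours) of downtime
# in a 30-day month.
print(allowed_downtime_minutes(99))  # → 432.0
```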

9. How does your team receive automated alerts when incidents occur?

We set up automatic email notifications in Nagios.

10. How does your team respond to incidents? What is your post-mortem process?

Not applicable for this prototype.

11. Which tools are in place to measure user behavior?

Not applicable for this prototype.

12. What tools or technologies are used for A/B testing?

Not applicable for this prototype.

13. How do you measure customer satisfaction?

Not applicable for this prototype.

# US Digital Services Playbook

  1. Play 1 Understand what people need
  2. Play 2 Address the whole experience, from start to finish
  3. Play 3 Make it simple and intuitive
  4. Play 4 Build the service using agile and iterative practices
  5. Play 5 Structure budgets and contracts to support delivery
  6. Play 6 Assign one leader and hold that person accountable
  7. Play 7 Bring in experienced teams
  8. Play 8 Choose a modern technology stack
  9. Play 9 Deploy in a flexible hosting environment
  10. Play 10 Automate testing and deployments
  11. Play 11 Manage security and privacy through reusable processes
  12. Play 12 Use data to drive decisions
  13. Play 13 Default to open