Feature: PCP Integration

Garrett LeSage edited this page Apr 11, 2018 · 20 revisions

This is a follow-up from the PCP research.

Stories

Troubleshooting a problematic process

Charlie Brown is a PHP developer at Blockhead Industries, an all-in-one web company that helps businesses set up and maintain sites. Some customers run custom PHP code and others use WordPress. In addition, all customers have access to several tools, such as phpMySQL and phpfm.

Most of the time, things are fine.

Twice a week, Charlie is "on call" and needs to fix customer issues. Today, a customer who runs a restaurant is trying to upload this week's menu as a PDF... but the website isn't working. Charlie needs to figure out what's wrong.

Charlie signs into Cockpit on the public-facing upload server. He begins to debug:

  • Did someone change something recently? Charlie checks the audit logs from session recording and discovers nothing was changed.
  • Is the service running? Charlie looks at services, filters for "Apache" and sees HTTPD is running.
  • Is it the firewall? Charlie looks at networking, sees the firewall is running, clicks to view the firewall rules, and checks that HTTP (port 80) and HTTPS (port 443) are allowed and open.
  • Charlie happens to look at the system stats and sees that CPU load is quite high. "What's causing that?" he wonders.

Looking at the CPU graph for the past 6 hours, he sees a spike from 4 hours ago, traces it to a single process, and finds that phpfm hung when an NFS mount went down.

Charlie remounts the NFS share, restarts Apache, and contacts the customer.

Incidental investigation

Lucy van Pelt is a system admin who has been using Cockpit for the past several months at work. She signed in to Cockpit to create a new account for a co-worker.

After signing in, she notices the Samba service has been using a lot of CPU for the past 6 hours.

Using Cockpit, she's able to restart the offending Samba service and continue creating her co-worker's account.

Design

Goals

  1. Report overall health based on metrics & heuristics
  2. Visualize resources
  3. Show resource hogs
  4. Provide high-level review of recent historical metrics
  5. Install & enable metrics collection (with PCP)
  6. Simple "notifications" of common issues based on predefined rulesets (in-page, while visiting the machine in Cockpit)
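Goals 1 and 6 could be sketched roughly as follows. This is only an illustration of the idea of predefined rulesets, not a proposed implementation: the metric names (`cpu_percent`, `mem_available_percent`, `disk_used_percent`), the thresholds, and the `Rule` type are all hypothetical and are not taken from PCP or Cockpit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """A hypothetical predefined rule: a name, a message, and a check."""
    name: str
    message: str
    predicate: Callable[[Metrics], bool]


# Illustrative ruleset; names and thresholds are assumptions.
RULES: List[Rule] = [
    Rule("high-cpu", "CPU usage is above 90%",
         lambda m: m.get("cpu_percent", 0.0) > 90.0),
    Rule("low-memory", "Less than 10% of memory is available",
         lambda m: m.get("mem_available_percent", 100.0) < 10.0),
    Rule("disk-full", "Disk usage is above 95%",
         lambda m: m.get("disk_used_percent", 0.0) > 95.0),
]


def evaluate(metrics: Metrics, rules: List[Rule] = RULES) -> List[str]:
    """Return the message of every rule that matches the metrics snapshot."""
    return [rule.message for rule in rules if rule.predicate(metrics)]


def overall_health(metrics: Metrics, rules: List[Rule] = RULES) -> str:
    """Collapse the rule results into a single health label."""
    return "warning" if evaluate(metrics, rules) else "ok"
```

A missing metric falls back to a value that does not trigger its rule, so an incomplete snapshot reports "ok" rather than raising a false alarm. Whether that is the right default is a design question this sketch does not settle.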

Possible future goals

  1. Notifications sent via email or SMS (using existing tooling; nothing custom in Cockpit)
  2. Notifications via browser push mechanism

Non-goals

  1. Customizable dashboard
  2. Detailed metrics view

Mockups

[Work in Progress]

Challenges

  • Having a bunch of graphs may be useful for gauging overall system performance, but it's difficult to correlate various aspects of the system and tie that to system processes.
  • It's also usually difficult to see how CPU, RAM, and IO usage relate to one another.
  • Scrolling horizontally is problematic.
  • Navigating charts by shrinking the timeline or by paginating is also problematic.
  • Most charts do not have a way to show top offenders, which is what many administrators want to see and quickly understand.
  • Many admins build customized views to figure things out, as a workaround for how poor system-level graphs are.
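One way to surface the "top offenders" mentioned above is to rank processes by their average CPU usage over a time window. The sketch below assumes per-process samples are already available as `(timestamp, process name, CPU percent)` tuples. That sample format and the `top_offenders` function are hypothetical and do not reflect how PCP or Cockpit store or expose data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, process name, CPU percent): a hypothetical format.
Sample = Tuple[float, str, float]


def top_offenders(samples: Iterable[Sample], start: float, end: float,
                  limit: int = 3) -> List[Tuple[str, float]]:
    """Rank processes by average CPU percent within [start, end]."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, name, cpu in samples:
        if start <= timestamp <= end:
            totals[name] += cpu
            counts[name] += 1
    # Average over the samples in which each process actually appeared.
    averages = [(name, totals[name] / counts[name]) for name in totals]
    # Highest average first; ties are broken alphabetically by name.
    averages.sort(key=lambda item: (-item[1], item[0]))
    return averages[:limit]
```

Note that the average only covers samples in which a process appears, so a short-lived process with one very high sample can outrank a steady one. A real design might weight by duration instead.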

Mockups

[Coming soon]

Prior art

Android

Memory android-memory-apps android-memory-apps-dropdown android-memory-apps-single
Process android-process-details android-process-overview
Network android-network-usage

GNOME

GNOME System Monitor

Processes gnome-system-monitor

Usage

Tasks gnome-usage-problematic-task gnome-usage-tasks
RAM gnome-usage-memory

Windows

Resource Monitor windows-resource-monitor windows-resource-monitor-ram
System Information (Net & IO) windows-system-information-network windows-system-information
Task Manager (Services & Details) windows-task-manager-services windows-task-manager-details
Task Manager (Apps) windows-task-manager-app-history
Task Manager (Perf) windows-task-manager-performance

OS X / macOS

[Coming soon]
