Feature: PCP Integration

Intro

PCP, short for "Performance Co-Pilot", is a system performance and analysis framework. It collects performance-related statistics from multiple hosts and operating systems, including popular Linux distributions, UNIX variants, Windows, and macOS.

Through its plugin architecture, PCP collects not only host statistics (disk, network, memory) but also statistics from Apache, MySQL, the Java VM, KVM, and other services.

PCP is used for both live and historical data.

Use cases

Use PCP for viewing data in Cockpit

Without PCP, Cockpit only displays data collected while it is running. When PCP is installed, statistics can also be pulled from the time before anyone signed in to Cockpit.

This is already implemented to some extent.

However, there are a few issues currently:

It is not obvious whether Cockpit is showing charts backed by an installed and active PCP or its standard charts
It is not obvious that installing PCP will improve the charts in Cockpit

Time

Graphs in Cockpit extend to the past 5 minutes (this works with and without PCP installed).

PCP would also let us look for specific historical events. We might want to consider checking the past 24 hours for problematic activity.

Examples:

Low available memory
Active swap
High load average
High disk IO
Network-related issues (latency, outages, etc.)

The timeframe for these warnings should probably be somewhere between the past 24 hours and the past week.
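
As a rough illustration of what "checking the past 24 hours for problems" could mean, the sketch below groups consecutive out-of-range samples into events. It is only a sketch: the `Sample` and `Event` types, the `find_events` name, and the thresholds are all hypothetical, and the samples are assumed to have already been read from wherever the historical data is stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value); hypothetical shape


@dataclass
class Event:
    start: datetime
    end: datetime
    peak: float  # most extreme value seen during the event


def find_events(
    samples: List[Sample],
    threshold: float,
    now: datetime,
    lookback: timedelta = timedelta(hours=24),
    above: bool = True,
) -> List[Event]:
    """Group consecutive samples that breach `threshold` into events.

    With above=True a breach is value > threshold (load, disk IO);
    with above=False it is value < threshold (available memory).
    """
    cutoff = now - lookback
    events: List[Event] = []
    current: Optional[Event] = None
    for ts, value in sorted(samples):
        if ts < cutoff or ts > now:
            continue  # outside the window we were asked to review
        breached = value > threshold if above else value < threshold
        if breached:
            if current is None:
                current = Event(start=ts, end=ts, peak=value)
            else:
                current.end = ts
                pick = max if above else min
                current.peak = pick(current.peak, value)
        elif current is not None:
            events.append(current)
            current = None
    if current is not None:
        events.append(current)  # event still ongoing at the end of the window
    return events
```

Widening `lookback` to `timedelta(days=7)` would cover the one-week end of the range; which thresholds count as "exceptional" is a design decision this sketch leaves open.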

PCP dashboard

View simple PCP-based stats from the current machine. We're not going to go down the ultra-configurable route like Grafana. (If people want that, Grafana exists and can be used in parallel.)

This would probably replace the separate CPU/Memory/Storage views.

Install PCP for usage with another tool

PCP is useful not just for Cockpit, but for other tools.

Should PCP be installable via the "Applications" section as well as through the upcoming PackageKit lib?

Combined statistics from multiple machines that use PCP

Simply combining all the data from multiple machines gets noisy. It should be possible to show exceptional events from various servers here as well, similar to the host-specific view.

New Concepts

In addition to modifying our charts, we might want to consider:

Review past 24 hours (week too?) in a sped-up playback
Show exceptional data (spikes and the times of spikes)
Instead of customization, have different modes of charts in tabs? (Example: Flip between representations of CPUs.)

What can you do with PCP?

Performance Co-Pilot can do a lot. It's highly modular and configurable.

We'll want to carefully pick some of these (seemingly common) tasks without going too overboard.

Live metrics

Display enabled performance metrics
- with short descriptions
- give detailed information about each performance metric and current values
Monitor metrics on a host
- disk write operations per partition
- CPU load
- memory usage
- disk write operations
Show process creation rate and unavailable versus available memory
Monitor metrics from multiple hosts
Compare metrics to help understand what happens on a system at a given time
- example: swap happens, IO spikes, CPU load increases, network traffic goes down
Display running processes in a given time window
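
Items such as "process creation rate" or "disk write operations" are usually charted as rates, while the underlying values are often cumulative counters. The sketch below shows one way to turn counter samples into per-second rates. It assumes the metric is a counter that only grows between samples; the function name and the sample shape are hypothetical.

```python
from typing import List, Tuple


def counter_to_rate(samples: List[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Convert (timestamp_seconds, counter_value) pairs into (timestamp, rate) pairs.

    Each rate is the counter increase divided by the elapsed seconds
    between two neighbouring samples.
    """
    rates: List[Tuple[float, float]] = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            # Skip out-of-order samples and counter resets (for example after a reboot).
            continue
        rates.append((t1, (c1 - c0) / dt))
    return rates
```

Skipping an interval after a reset leaves a gap in the chart rather than drawing a misleading negative spike, which seems like the safer default for an overview graph.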

Historical metrics

Everything that live metrics has, plus:

View host over a given time period (with the correct timezone)
- Translate timezone to another timezone (ex: viewing a server in Europe in US/Eastern time)
Adjust "zoom" level
- graphs could indicate 10 minutes, 1 hour, 12 hours, 1 day, 1 week, 1 month
Replay history in a given time range
- sped up (real-time would probably be too slow in most cases)
- scrubbing (moving back and forth in a timeline like an audio editor)
Show average/mean, min, max values of performance (CPU load, memory usage, disk IO, etc.) over a given timeframe
Summarize differences between different timeframes
- example: easily determine "Monday mornings have heavier load than Sunday evenings"
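
The min/max/mean summary and the timeframe comparison above could be sketched as follows. This is an illustration under assumptions, not a proposal for the actual implementation: `summarize`, `compare`, and the `(timestamp, value)` sample shape are hypothetical, and the samples are assumed to be already loaded from the historical data.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value); hypothetical shape


def summarize(samples: List[Sample], start: datetime, end: datetime) -> Dict[str, float]:
    """Return min, max and mean of the values with start <= timestamp < end."""
    values = [value for ts, value in samples if start <= ts < end]
    if not values:
        raise ValueError("no samples in the requested time range")
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def compare(first: Dict[str, float], second: Dict[str, float]) -> Dict[str, float]:
    """Return second minus first for every statistic (positive means second is higher)."""
    return {key: second[key] - first[key] for key in first}
```

For the "Monday morning versus Sunday evening" case, `summarize` would be called once per time range and the two results passed to `compare`; a positive `mean` difference would then indicate heavier load on Monday.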

Filtered views (applies to live and historical)

Process-specific view filters
- process ID
- process name
Network interface(s) filters
Container-specific views
- cgroup accounting
  - disk IO: IOPs/bytes, service / wait time, aggregate / per device
  - CPU accounting: per-cgroup processor usage, aggregate CPU usage
  - memory: mapped anon pages, page cache, writeback, swap, active/inactive
- namespaces: show content & processes that differ inside versus outside of a container
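
To make the process-specific filters a little more concrete, here is a minimal sketch of filtering per-process samples by ID or name. The `ProcessSample` type, its fields, and `filter_processes` are hypothetical and only stand in for whatever per-process data the charts end up using.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ProcessSample:
    pid: int
    name: str
    cpu_percent: float


def filter_processes(
    samples: Iterable[ProcessSample],
    pid: Optional[int] = None,
    name: Optional[str] = None,
) -> List[ProcessSample]:
    """Keep samples matching every filter that was given; no filters keeps everything."""
    result: List[ProcessSample] = []
    for sample in samples:
        if pid is not None and sample.pid != pid:
            continue
        if name is not None and sample.name != name:
            continue
        result.append(sample)
    return result
```

The same filter would apply unchanged to live and historical samples, which matches the intent that filtered views work in both modes.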

Recommendations

Avoid using the name "PCP" in the interface; the abbreviation is problematic because it is also the common name for phencyclidine (aka "angel dust")
Rework host summary page to de-emphasize charts and highlight essential machine information first
- Use the rest of available space for overview charts
Make it obvious when PCP isn't available (by default) versus when it is available
- Also make it easy to install PCP when it isn't available
Main graphs should have a quick overview of recent data
Secondary page should have more in-depth information