- monitoring applications, services and resources: use a tool that will get you 90% there
- record performance-related metrics appropriate for the service: e.g. db transactions, slow queries, i/o latency, http request throughput, service latency, etc
- Identify metrics that matter for your workload and record them.
- Identify the target, measurement approach, and priority to build alarms and notifications to proactively address performance-related issues.
- analyze metrics when events/incidents occur
- use monitoring dashboards or reports to understand and diagnose the impact.
- write use cases for your architecture, include performance requirements and incident responses.
- establish KPIs to measure workload performance
- Identify the key performance indicators (KPIs) that indicate whether the workload is performing as intended.
- Document the performance experience of customers, and use these requirements to establish your KPIs
- use monitoring to generate alarm-based notifications
- using your KPIs, a monitoring system should automatically alert when measurements are outside of the baseline.
- review metrics at regular intervals
- review the metrics collected to identify which metrics were key in addressing issues
- Also ask which additional metrics would help to identify, address, or prevent issues.
- monitor and alarm proactively
- Use KPIs, combined with monitoring and alerting systems, to proactively address performance-related issues.
- automated alerts when thresholds are breached
- deploy monitoring agents to constantly monitor resource performance
- There is no substitute for measuring the performance of your full application
- i.e. rarely is it useful to measure in isolation
- key domains
- customer experience
- performance over time: system, costs, etc
- trends
- troubleshooting and remediation: identification, isolation and resolution, root cause analysis
- cost allocation
- learning and improvement: detecting and preventing problems
- When there are issues, you should be alerted immediately, either through:
- on-screen displays
- Text and emails automatically generated by the network monitoring solution
- every alert should contain
- when a problem occurred and which threshold is being approached/breached
- information to identify the source, device, resource, etc
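The alert fields listed above could be captured in a small record type. This is a minimal sketch only: the `Alert` class, its field names, and the sample values are hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record; field names are illustrative."""
    occurred_at: datetime  # when the problem occurred
    metric: str            # which measurement triggered the alert
    value: float           # observed value
    threshold: float       # threshold being approached or breached
    source: str            # device, resource, or service identifier

    @property
    def breached(self) -> bool:
        # True once the observed value has passed the threshold
        return self.value > self.threshold

    def summary(self) -> str:
        state = "breached" if self.breached else "approaching"
        return (f"[{self.occurred_at.isoformat()}] {self.source}: "
                f"{self.metric}={self.value} ({state} threshold {self.threshold})")


# Sample values are made up for illustration.
alert = Alert(
    occurred_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    metric="cpu_percent",
    value=93.0,
    threshold=90.0,
    source="web-01",
)
print(alert.summary())
```

Keeping the timestamp, threshold, and source in one record means every notification channel (on-screen, text, email) can render the same information.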
- the process of reviewing deployed resources and identifying opportunities to eliminate or downsize without compromising capacity or other requirements
- involves continually analyzing resource performance and usage needs and patterns, then turning off idle resources, removing unused capacity, and right-sizing resources that are over-provisioned or poorly matched to the workload.
- directly impacts performance and costs
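As a rough illustration of the review described above, resources can be bucketed by their average and peak utilization. This is a sketch under stated assumptions: the `classify` function, the cut-off percentages, and the fleet data are all hypothetical, and real right-sizing would also consider memory, I/O, and business requirements.

```python
def classify(avg_cpu, peak_cpu, idle_peak=5.0, low_avg=20.0, low_peak=40.0):
    """Bucket a resource by CPU utilization (percent).

    The cut-off values are illustrative assumptions, not recommendations.
    """
    if peak_cpu < idle_peak:
        return "idle"              # candidate for shutdown
    if avg_cpu < low_avg and peak_cpu < low_peak:
        return "over-provisioned"  # candidate for downsizing
    return "ok"


# Hypothetical (average, peak) CPU utilization per resource.
fleet = {
    "batch-01": (1.0, 3.0),
    "api-01": (12.0, 35.0),
    "db-01": (55.0, 90.0),
}
report = {name: classify(avg, peak) for name, (avg, peak) in fleet.items()}
print(report)
```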
- Noisy Neighbor effect: in multitenant systems with shared resources, the activity of one tenant can negatively impact another tenant's share of resources
- all about metrics, logging and tracing
- processes must be in place to capture logs and other useful artifacts
- captured logs and artifacts must be stored in a durable, searchable location
- alerts and automation
- The act of collecting, analyzing, and using data to make decisions or answer questions about your IT resources and systems
- monitoring tools: collects data generated by systems
- metrics: a datapoint consisting of a name and value
- dimensions: qualities that describe the context of a metric, consisting of a name and value
- statistics: metric data aggregated over a period of time (e.g. average, sum, minimum, maximum)
- logs: collect and aggregate files from resources and filter out actionable insights from background noise
- tracing: follow the path of a request as it passes through different services
- investigate how apps and their underlying services are performing
- important for troubleshooting the root cause of performance issues and errors
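The metric, dimension, and statistic terms above can be modeled with a few lines of code. A minimal sketch, assuming a hypothetical `Metric` class and `summarize` helper; the metric name and dimension keys are invented for illustration.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Metric:
    """A datapoint: a name and a value, plus dimensions giving context."""
    name: str
    value: float
    dimensions: dict = field(default_factory=dict)  # e.g. {"service": "checkout"}


def summarize(datapoints):
    """Aggregate datapoints collected over a period into statistics."""
    values = [m.value for m in datapoints]
    return {"min": min(values), "max": max(values),
            "avg": mean(values), "count": len(values)}


# Three hypothetical latency samples sharing the same dimensions.
dims = {"service": "checkout", "region": "eu"}
samples = [
    Metric("http_latency_ms", 120.0, dims),
    Metric("http_latency_ms", 80.0, dims),
    Metric("http_latency_ms", 100.0, dims),
]
stats = summarize(samples)
print(stats)
```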
- monitor the availability, uptime, operation, and performance of complex networks
- reduce the mean time to repair and recover and solve real-time network performance issues.
- key tasks:
- Tracking and analyzing network components and the connections between them
- Surveilling different data layers, network endpoints, and links
- Monitoring the health and performance of network interfaces for faults, which helps to diagnose, optimize, and manage network resources
- provide historical data and establish a baseline
- important for forensic analysis to identify the root cause after incidents.
- Data in the form of tables, charts, graphs, dashboards, and reports.
- for monitoring devices and network components
- identify performance metrics to be monitored
- determine the monitoring interval:
- the frequency at which network devices are polled to identify performance and availability status
- choosing the right protocols for devices & network components
- SNMP: simple network management protocol
- HTTP: hypertext transfer protocol
- TCP: transmission control protocol
- IP: internet protocol
- ICMP: internet control message protocol
- WMI: windows management instrumentation
- set proactive thresholds
- alert, alert, alert!
- always need to be compared against historical baseline of network performance
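One common way to compare a reading against a historical baseline is to alert when it exceeds the historical mean by some number of standard deviations. The sketch below assumes that approach; the function names, the multiplier `k=3.0`, and the sample latencies are illustrative assumptions rather than a prescribed method.

```python
from statistics import mean, stdev


def baseline_threshold(history, k=3.0):
    """Return mean + k standard deviations of the historical samples."""
    return mean(history) + k * stdev(history)


def should_alert(current, history, k=3.0):
    """True when the current reading is above the baseline-derived threshold."""
    return current > baseline_threshold(history, k)


# Hypothetical round-trip latencies (ms) from past polling intervals.
history_ms = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0]

print(should_alert(24.0, history_ms))  # within normal variation
print(should_alert(40.0, history_ms))  # well above the baseline
```

A smaller `k` alerts earlier but produces more false positives; the right value depends on how noisy the metric is.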
- ping: sends ICMP echo packets to test reachability and round-trip time; some service providers have ping disabled by default, or it can't be enabled at all
- traceroute: sends successive probe packets with increasing TTL values to display the path to the destination and the response time of each hop.
- Speedtest: useful in evaluating the performance of your internet access.
- packet analyzers: aka packet sniffers; logs each packet it intercepts, decodes the packet, and presents the values of the various fields within the packet for examination.
- benching tools: measure throughput and bandwidth;
- iperf/iperf3: tools for active measurements of the maximum achievable bandwidth on IP networks
- supports tuning of various parameters related to timing, buffers, and protocols (TCP, UDP, SCTP with IPv4 and IPv6)
- extrahop: monitoring solution for security, network performance, and the cloud
- netperf: a CLI tool similar to iPerf that measures throughput and benchmarking speeds.
- bandwidth capacity: the maximum data transmission rate possible on a network
- measures the theoretical limit of data transfer
- For optimal network operations, you want to get as close to your maximum bandwidth as possible without reaching critical levels
- indicates that your network is sending as much data as it can within a period of time, but isn’t being overloaded.
- throughput: measures your network’s actual data transmission rate
- reported in units such as megabytes or gigabytes per second of data that is successfully delivered.
- a high-bandwidth connection with low throughput is an indicator of an underlying problem
- latency: delay between requesting data and when that data is finished being delivered.
- Consistent delays or odd spikes in delay time indicate a major performance issue
- packet loss: examines how many data packets are dropped during data transmissions on your network
- The more data packets that are lost, the longer it takes for a data request to be fulfilled
- TCP detects when packets are dropped and takes steps to ensure that the data can still be transmitted;
- retransmission: when packets are lost, the network needs to retransmit them to complete a data request.
- retransmission rate lets your enterprise know how often packets are being dropped, which is an indication of congestion on your network.
- analyze retransmission delay (or the time it takes for a dropped packet to be retransmitted) to understand how long it takes your network to recover from packet loss.
- availability: i.e. uptime, the percentage of time the network is available.
- can never guarantee 100 percent availability, but you want to be aware of any downtime that happens on your network that you weren’t expecting
- connectivity: whether the connections between the nodes on your network are working properly
- jitter: a variation in delay or disruption that occurs while data packets travel across the network.
- congestion: occurs when network devices are unable to send the equivalent amount of traffic they receive.
- response times: measures the time it takes for a server to respond to a data request with application data
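Several of the metrics above reduce to simple arithmetic on counters and samples. The following sketch shows one way to compute them; the function names and sample numbers are hypothetical, and the jitter calculation is a simplified mean of consecutive latency differences rather than the smoothed estimator some protocols define.

```python
def utilization_percent(throughput_mbps, bandwidth_mbps):
    """Actual throughput as a share of bandwidth capacity."""
    return 100 * throughput_mbps / bandwidth_mbps


def packet_loss_percent(sent, received):
    """Share of sent packets that never arrived."""
    return 100 * (sent - received) / sent


def retransmission_rate_percent(retransmitted, sent):
    """Share of sent packets that had to be retransmitted."""
    return 100 * retransmitted / sent


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)


def availability_percent(downtime_s, period_s):
    """Share of the period during which the network was up."""
    return 100 * (period_s - downtime_s) / period_s


# Made-up sample numbers for illustration.
print(utilization_percent(400, 1000))            # 400 Mbps on a 1 Gbps link
print(packet_loss_percent(1000, 990))            # 10 of 1000 packets lost
print(jitter_ms([20, 22, 19, 25]))               # latency samples in ms
print(availability_percent(1800, 30 * 24 * 3600))  # 30 min down in 30 days
```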
- when a system polls registered devices at established intervals to verify if they respond
- this is the simplest form of network monitoring
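Polling as described above can be sketched as a loop that runs a check against every registered device at a fixed interval. This is an illustrative sketch: `poll`, `tcp_check`, `fake_check`, and the device names are hypothetical, and a TCP connection attempt is only one possible check (ICMP or SNMP are common alternatives).

```python
import socket
import time
from typing import Callable, Dict, Iterable, List


def tcp_check(host: str, port: int = 80, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def poll(devices: Iterable[str], check: Callable[[str], bool],
         interval_s: float, rounds: int) -> List[Dict[str, bool]]:
    """Check every device `rounds` times, sleeping `interval_s` between rounds."""
    devices = list(devices)
    history = []
    for i in range(rounds):
        history.append({device: check(device) for device in devices})
        if i < rounds - 1:
            time.sleep(interval_s)
    return history


# Offline demonstration with a stub check; swap in tcp_check for real devices.
def fake_check(device: str) -> bool:
    return device != "switch-02"


results = poll(["router-01", "switch-02"], fake_check, interval_s=0.0, rounds=2)
print(results[-1])
```

Passing the check in as a function keeps the polling loop independent of the protocol used to reach each device.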
- monitor the users, applications, and other services on your network to see if any are draining the network
- the extent to which a system can be monitored
- you observe a system through metrics, logs and traces
- considerations
- storage costs
- data overload
- ensuring your system is outputting the correct data
- collects detailed information to help determine possible reasons for poor service performance
- historical data provides performance trends and helps with root cause analysis for any issues that may occur.
- When measuring performance:
- measure the time it takes for a service to complete an operation
- understand the units of measurement involved
- how to use these measurements to calculate performance.
- understand the actual performance and the optimal performance of your workload before you attempt to optimize it.
- establish a baseline measurement for each segment of a resource
- observe performance over at least a two-week period (ideally, over a one-month period) to capture the workload and business peaks.
- simulate real user traffic
- actual: is what you see day to day
- optimal: the absolute best performance you can get based on the combined components that you are using
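A baseline can be summarized from observed samples with percentiles, then compared against the optimal figure. The sketch below assumes a nearest-rank percentile and invented sample data; `percentile`, `baseline`, and `gap_to_optimal` are hypothetical helper names.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def baseline(samples):
    """Summarize actual (day-to-day) performance from observed samples."""
    return {"p50": percentile(samples, 50),
            "p95": percentile(samples, 95),
            "max": max(samples)}


def gap_to_optimal(actual, optimal):
    """How far actual performance is from the best achievable value."""
    return actual - optimal


# Hypothetical operation latencies (ms) collected over the observation period.
latencies_ms = list(range(1, 21))
summary = baseline(latencies_ms)
print(summary)
print(gap_to_optimal(summary["p50"], optimal=4))
```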
- latency: aka delay; amount of time between making a request to the storage system and receiving the response.
- How the storage is connected to the compute system
- Input/output operations per second: IOPS; a statistical storage measurement of the number of input/output (I/O) operations that can be performed per second
- used to measure how many operations of a given workload type and operation size can occur per second
- I/O size is generally measured in KiB, and the underlying drive technology determines the maximum amount of data that a volume type counts as a single I/O.
- SSD vs HDD
- ssd: deliver consistent performance whether an I/O operation is random or sequential.
- handle small or random I/O operations more efficiently than HDD volumes.
- hdd: deliver optimal performance only when I/O operations are large and sequential.
- throughput: how much data (in MiB/s) you can read or write per second when working with large sequential data files.
- Large files, such as video files, must be read from beginning to end.
- operations are measured in megabytes per second (MB/s).
- block size: certain workloads benefit from a smaller or larger block size
- file systems support non-default block sizes that you can specify when formatting the disk.
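IOPS, I/O size, and throughput are related: throughput equals IOPS multiplied by the size of each operation. The sketch below works through that arithmetic; the function names and sample figures are illustrative assumptions, not limits of any particular volume type.

```python
def throughput_mib_s(iops, io_size_kib):
    """Throughput in MiB/s = IOPS * I/O size in KiB / 1024."""
    return iops * io_size_kib / 1024


def iops_needed(target_mib_s, io_size_kib):
    """IOPS required to sustain a target throughput at a given I/O size."""
    return target_mib_s * 1024 / io_size_kib


# Small random I/O: many operations, modest throughput.
print(throughput_mib_s(3000, 16))   # 3000 IOPS at 16 KiB each
# Large sequential I/O: far fewer operations for the same data rate.
print(iops_needed(125, 256))        # 125 MiB/s at 256 KiB each
```

This is why the same volume can look IOPS-bound for small random operations and throughput-bound for large sequential ones.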
- it's all about reducing latency: anything that lengthens the time to get data to the user
- packet loss
- jitter: Variations in latency or time delay between packets
- bandwidth constraints
- inefficient protocol use
- physical distance
- When reducing latency, consider
- physical distance between two nodes
- quality of routes
- request origin location in relation to data
- average packet delay under network cost constraints
- memory resources
- traffic patterns and available node resources
- monitoring and troubleshooting of performance issues within applications
- permits applications to evaluate the network API communication from the application perspective
- a set of activities that enables organizations to measure, optimize and plan costs as they grow their adoption of cloud services
- outcomes: CFM goals
- reduce unit costs as you scale
- reinvest wasteful spend and increase business agility
- improve financial predictability
- establish cost aware behaviors and culture
- FinOps: cross functional finance and technology team
- gauge your organization's maturity in the following domains
- maturity level: 1 novice -> 5 expert
- only the expert level is listed for each domain
- who is responsible for driving CFM across the entire organization
- identify an owner (SME) or establish a cross-functional team with finance, technology and others
- define common goals and targets
- expert level
- active owner with goals graded against targets
- consistent executive sponsorship
- programmatic CFM activities
- establish strong partnerships between finance and tech
- financial stakeholders become more cloud & technology savvy
- understand how technology is being used to drive growth
- technology stakeholders become more finance savvy
- participating in finance activities like spend reviews, forecasting exercises and budget discussions
- initiate short, periodic meetings between all stakeholders
- weekly meetings, weekly email summarizing activities and latest metrics, etc
- spread awareness and publicize CFM activities across the organization
- Share knowledge across members of the CFM team
- expert level
- formal partnership between finance and technology
- the finance org utilizes each cloud provider's financial toolset from an operational perspective
- they can pull latest metrics and stats relevant to finance activities in real time
- tech org is aware of and participates in finance organization activities
- tech supports finance initiatives
- finance and/or tech orgs educate external audiences
- the CFM team can present their activities in detail to relevant external stakeholders
- mechanisms to allocate consumption of cloud services back to the actual consumer of the cloud
- identify the most important dimensions for your specific business
- e.g. categorizing based on teams, lines of business, cost center, etc
- define a strategy for accounts and tags
- at least separate accounts by env (prod, dev, etc)
- publish, evangelize and govern resource tagging across accounts
- allocate shared costs
- e.g. containers, clusters, storage, etc can all be allocated based on resource consumption
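Allocating a shared cost by resource consumption amounts to a proportional split. A minimal sketch, assuming a hypothetical `allocate_shared_cost` helper and made-up team names and usage figures (the usage unit could be CPU-hours, GiB stored, or similar).

```python
def allocate_shared_cost(total_cost, usage_by_consumer):
    """Split a shared cost in proportion to each consumer's measured usage."""
    total_usage = sum(usage_by_consumer.values())
    if total_usage == 0:
        raise ValueError("no usage recorded; cannot allocate proportionally")
    return {consumer: total_cost * usage / total_usage
            for consumer, usage in usage_by_consumer.items()}


# Hypothetical monthly cluster bill and per-team CPU-hours.
shares = allocate_shared_cost(1000.0, {"team-a": 60, "team-b": 30, "team-c": 10})
print(shares)
```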
- expert level
- account segmentation strategy and tagging policy exist and are enforced
- full account and tag governance
- cost allocations for discounts and shared resources
- there are always cross-cutting concerns that span accounts but still need to be allocated effectively
- special tag handling
- visibility into cloud spend
- define KPIs based on unit costs of cloud resources (e.g. X per hour, Y per transaction)
- set KPI goals and targets, and automate alarms to detect anomalies
- enable stakeholders: provide reports to finance and other organization teams in a digestible manner
- some people need details, other people need summaries
- provide training to enable usage of reporting tools
- can trace spend back to resource usage across different dimensions
- tools and processes in place to view and analyze spend
- regularly report to stakeholder groups on cloud activities and how they relate to KPIs
- data is used to influence technology group activities
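A unit-cost KPI such as cost per transaction, with a simple check against its target, might look like the sketch below. The helper names, the 10% tolerance, and the figures are assumptions made for illustration.

```python
def unit_cost(total_cost, units):
    """Cost per unit of business output, e.g. per transaction or per hour."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost / units


def exceeds_target(actual, target, tolerance_percent=10.0):
    """True when the KPI is above target by more than the tolerance."""
    return actual > target * (1 + tolerance_percent / 100)


# Hypothetical month: 1200.00 spent serving 400,000 transactions.
cost_per_txn = unit_cost(1200.0, 400_000)
print(cost_per_txn)
print(exceeds_target(cost_per_txn, target=0.0025))  # should this raise an alarm?
```

Tracking cost per unit rather than total spend keeps the KPI meaningful as usage grows.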
- expert level
- custom dashboards utilizing each cloud provider's APIs to generate reports specifically for cost and usage metrics
- stakeholders actively use the dashboards
- efficiency KPIs drive decision making related to cloud adoption and consumption activities
- reduce existing costs and avoid unnecessary costs through upfront workload design and embedding cost optimization in all operational processes
- effective consumption planning
- avoid on-demand prices by utilizing purchase plans, commitment based discounts and other provider cost models
- right-size your resources: you must know your workload
- use automation to clean up waste: shutdown unused/idle resources
- prioritize high ROI investments: architect for cost optimization
- design fault tolerant systems and use cloud services that scale horizontally
- expert level
- commitment-based discounts, purchase plans and other non-on-demand cost models are utilized for prod, dev and other workloads
- proactive optimization through design, architecture and resource selection
- not only identifying what you can do, but actively implementing strategies to reduce costs across the organization
- you must bake cost optimization into the design phase of the tech organization
- continuous and increased level of automated optimization
- mechanisms for forecasting future costs in order to improve the business and financial predictability
- use trend-based forecasting for consistent usage
- use driver based forecasting to capture business changes
- track actuals vs forecast and understand variance drivers
- establish a periodic variance mitigation process and publish results to key stakeholders
- root cause analysis
- named stakeholders
- publish results to the org
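Trend-based forecasting and variance tracking can be illustrated with a straight-line fit over past spend. This is a sketch under assumptions: the least-squares line is one simple trend model among many, and the function names and monthly figures are hypothetical.

```python
def linear_trend_forecast(history, periods_ahead=1):
    """Extrapolate a least-squares straight line fitted to past values."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in enumerate(history))
             / sum((x - x_mean) ** 2 for x in range(n)))
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + periods_ahead)


def variance_percent(actual, forecast):
    """Positive when actual spend came in above the forecast."""
    return 100 * (actual - forecast) / forecast


# Hypothetical monthly spend with steady growth.
monthly_cost = [100.0, 110.0, 120.0, 130.0]
forecast = linear_trend_forecast(monthly_cost)
print(forecast)
print(variance_percent(actual=150.0, forecast=forecast))
```

A large variance is the trigger for the root cause analysis step: the trend model only captures consistent usage, so business changes need driver-based inputs.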
- expert level
- trend and driver based forecasting
- recurring detailed variance analysis
- standard operating procedure or well defined playbook for mitigating variances
- high forecasting accuracy
- see: measurement and accountability
- save: cost optimization
- plan: planning and forecasting
- run: financial operations
- activities that establish cost visibility to ensure transparency and accountability for spend
- activities that ensure your organization pays only for resources it needs
- design and service selection
- activities that allow your organization to better understand costs associated with future cloud workloads
- variable cloud usage
- value articulation
- activities that enable your organization to scale CFM
- partnership between finance and technology organizations
- investing in the right things
- celebrating, rewarding and promoting good practices