analytics

links

noisy neighbor antipattern

best practices

monitoring applications, services and resources: use a tool that will get you 90% there
record performance-related metrics appropriate for the service: e.g. db transactions, slow queries, i/o latency, http request throughput, service latency, etc
- Identify metrics that matter for your workload and record them.
- Identify the target, measurement approach, and priority to build alarms and notifications to proactively address performance-related issues.
analyze metrics when events/incidents occur
- use monitoring dashboards or reports to understand and diagnose the impact.
- write use cases for your architecture, include performance requirements and incident responses.
establish KPIs to measure workload performance
- Identify the key performance indicators (KPIs) that indicate whether the workload is performing as intended.
- Document the performance experience of customers, and use these requirements to establish your KPIs
use monitoring to generate alarm-based notifications
- using your KPIs, a monitoring system should automatically alert when measurements are outside of the baseline.
review metrics at regular intervals
- review the metrics collected to identify which metrics were key in addressing issues
- Also ask which additional metrics would help to identify, address, or prevent issues.
monitor and alarm proactively
- Use KPIs, combined with monitoring and alerting systems, to proactively address performance-related issues.
- automated alerts when thresholds are breached
deploy monitoring agents to constantly monitor resource performance
There is no substitute for measuring the performance of your full application
- i.e. rarely is it useful to measure in isolation
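
A minimal sketch of the "alert when measurements are outside of the baseline" practice above. The function name `outside_baseline`, the sample latencies, and the 3-standard-deviation rule are assumptions for illustration; a real monitoring system will have its own baseline model.

```python
from statistics import mean, stdev


def outside_baseline(history, value, k=3.0):
    """Return True when value is more than k standard deviations from the baseline mean.

    history: past measurements that define the baseline (needs at least 2 samples)
    k: tolerance in standard deviations (assumed default, tune per KPI)
    """
    baseline = mean(history)
    spread = stdev(history)
    return abs(value - baseline) > k * spread


# hypothetical service latency samples in milliseconds
latency_ms = [100, 102, 98, 101, 99, 100]
print(outside_baseline(latency_ms, 101))  # within the baseline
print(outside_baseline(latency_ms, 180))  # well outside the baseline
```

The baseline is only as good as the history behind it, so the history should cover normal workload peaks.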

basics

key domains
- customer experience
- performance over time: system, costs, etc
- trends
- troubleshooting and remediation: identification, isolation and resolution, root cause analysis
- cost allocation
- learning and improvement: detecting and preventing problems

Automated Alerts

When there are issues, you should be alerted immediately, either through:
- on-screen displays
- Text and emails automatically generated by the network monitoring solution
every alert should contain
- when a problem occurred and which threshold is being approached/breached
- information to identify the source, device, resource, etc
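
One possible shape for an alert that carries the fields listed above (when, which threshold, and what source). The class name `Alert` and its field names are hypothetical, not taken from any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Alert:
    occurred_at: datetime  # when the problem occurred
    source: str            # device/resource that produced the measurement
    metric: str            # what was measured
    value: float           # the measurement itself
    threshold: float       # the threshold being approached or breached

    @property
    def breached(self) -> bool:
        return self.value >= self.threshold

    def message(self) -> str:
        state = "breached" if self.breached else "approaching"
        return (
            f"[{self.occurred_at.isoformat()}] {self.source}: "
            f"{self.metric}={self.value} {state} threshold {self.threshold}"
        )


alert = Alert(datetime(2024, 1, 1, tzinfo=timezone.utc), "db-01", "cpu_percent", 95.0, 90.0)
print(alert.message())
```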

Right sizing resources

the process of reviewing deployed resources and identifying opportunities to eliminate or downsize without compromising capacity or other requirements
involves continually analyzing resource performance and usage needs and patterns, then turning off idle resources, removing unused capacity, and right-sizing resources that are over-provisioned or poorly matched to the workload.
directly impacts performance and costs
Noisy Neighbor effect: in multitenant systems with shared resources, the activity of one tenant can negatively impact another tenant's share of resources
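
A rough sketch of the right-sizing review described above, based on peak CPU utilization only. The cutoffs (5% and 40%) and the function name are assumptions; a real review would also look at memory, I/O, and network usage over a representative period.

```python
def rightsizing_action(cpu_percent_samples, idle_below=5.0, oversized_below=40.0):
    """Suggest an action for one resource from its observed CPU utilization samples."""
    peak = max(cpu_percent_samples)
    if peak < idle_below:
        return "turn off"   # idle resource
    if peak < oversized_below:
        return "downsize"   # over-provisioned for the workload
    return "keep"           # capacity roughly matches demand


print(rightsizing_action([1.0, 2.5, 0.8]))
print(rightsizing_action([12.0, 30.0, 22.0]))
print(rightsizing_action([55.0, 81.0, 64.0]))
```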

monitoring

all about metrics, logging and tracing
- processes must be in place to capture logs and other useful artifacts
- captured logs and artifacts must be stored in a durable, searchable location
- alerts and automation
The act of collecting, analyzing, and using data to make decisions or answer questions about your IT resources and systems
monitoring tools: collects data generated by systems
metrics: a datapoint consisting of a name and value
dimensions: qualities that describe the context of a metric, consisting of a name and value
statistics: metric data aggregated over a period of time (e.g. average, minimum, maximum)
logs: collect and aggregate files from resources and filter out actionable insights from background noise
tracing: follow the path of a request as it passes through different services
- investigate how apps and their underlying services are performing
- important for troubleshooting the root cause of performance issues and errors
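
To make the metric / dimension / statistic vocabulary concrete, here is a small sketch. `MetricPoint` and `summarize` are hypothetical names, and the statistics chosen (count, min, max, avg) are just common examples.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class MetricPoint:
    name: str                                       # metric name
    value: float                                    # metric value
    dimensions: dict = field(default_factory=dict)  # name/value pairs describing context


def summarize(points, name, **dimensions):
    """Aggregate all points of one metric that match the given dimensions."""
    values = [
        p.value
        for p in points
        if p.name == name
        and all(p.dimensions.get(k) == v for k, v in dimensions.items())
    ]
    if not values:
        return {"count": 0}
    return {"count": len(values), "min": min(values), "max": max(values), "avg": mean(values)}


points = [
    MetricPoint("cpu_percent", 40.0, {"host": "web-1"}),
    MetricPoint("cpu_percent", 60.0, {"host": "web-1"}),
    MetricPoint("cpu_percent", 90.0, {"host": "web-2"}),
]
print(summarize(points, "cpu_percent", host="web-1"))
```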

Network Monitoring

monitor the availability, uptime, operation, and performance of complex networks
reduce the mean time to repair (MTTR) and resolve network performance issues in real time.
key tasks:
- Tracking and analyzing network components and the connections between them
- Surveilling different data layers, network endpoints, and links
- monitoring the health and performance of network interfaces for faults helps to diagnose, optimize, and manage network resources
  - provide historical data and establish a baseline
  - important for forensic analysis to identify the root cause after incidents.
- Data in the form of tables, charts, graphs, dashboards, and reports.

general process

for monitoring devices and network components
identify performance metrics to be monitored
determine the monitoring interval:
- the frequency at which network devices are polled to identify performance and availability status
choosing the right protocols for devices & network components
- SNMP: simple network management protocol
- HTTP: hypertext transfer protocol
- TCP: transmission control protocol
- IP: internet protocol
- ICMP: internet control message protocol
- WMI: windows management instrumentation
set proactive thresholds
alert, alert, alert!

diagnostic tools

always need to be compared against historical baseline of network performance
ping: some service providers have ping (ICMP echo packets) disabled by default, or it can't be enabled at all
traceroute: sends successive probe packets with increasing TTL values to display the path to the destination and the response time of each hop.
Speedtest: useful in evaluating the performance of your internet access.
packet analyzers: aka packet sniffers; logs each packet it intercepts, decodes the packet, and presents the values of the various fields within the packet for examination.
benching tools: measure throughput and bandwidth;
- iperf/iperf3: tools for active measurements of the maximum achievable bandwidth on IP networks
  - supports tuning of various parameters related to timing, buffers, and protocols (TCP, UDP, SCTP with IPv4 and IPv6)
- extrahop: monitoring solution for security, network performance, and the cloud
- netperf: a CLI tool similar to iPerf that measures throughput and benchmarking speeds.

common networking metrics

bandwidth capacity: the maximum data transmission rate possible on a network
- measures the theoretical limit of data transfer
- for optimal network operations, you want to get as close to your maximum bandwidth as possible without reaching critical levels
- running close to (but below) the maximum indicates that your network is sending as much data as it can within a period of time, but isn't being overloaded.
throughput: measures your network’s actual data transmission rate
- measured in units such as megabytes or gigabytes per second of data that is successfully sent.
- a high bandwidth connection with low throughput is an indicator of an underlying problem
latency: delay between requesting data and when that data is finished being delivered.
- Consistent delays or odd spikes in delay time indicate a major performance issue
packet loss: examines how many data packets are dropped during data transmissions on your network
- The more data packets that are lost, the longer it takes for a data request to be fulfilled
- TCP detects when packets are dropped and retransmits them so the data can still be delivered
retransmission: when packets are lost, the network needs to retransmit them to complete a data request.
- retransmission rate lets your enterprise know how often packets are being dropped, which is an indication of congestion on your network.
- analyze retransmission delay (or the time it takes for a dropped packet to be retransmitted) to understand how long it takes your network to recover from packet loss.
availability: i.e. uptime, the percentage of time the network is available.
- can never guarantee 100 percent availability, but you want to be aware of any downtime that happens on your network that you weren’t expecting
connectivity: whether the connections between the nodes on your network are working properly
- jitter: a variation in delay or disruption that occurs while data packets travel across the network.
- congestion: occurs when network devices are unable to send the equivalent amount of traffic they receive.
response times: measures the time it takes for a server to respond to a data request with application data
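
Several of the metrics above reduce to simple arithmetic once you have raw counters. The sketch below uses hypothetical function names, and jitter is approximated as the mean absolute difference between consecutive latency samples, which is one simple definition among several.

```python
def packet_loss_percent(sent, received):
    """Share of packets that were dropped in transit."""
    return 100.0 * (sent - received) / sent


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs)


def availability_percent(uptime_seconds, total_seconds):
    """Uptime as a percentage of the observation window."""
    return 100.0 * uptime_seconds / total_seconds


def bandwidth_utilization_percent(throughput_mbps, bandwidth_mbps):
    """Actual transfer rate relative to the theoretical maximum."""
    return 100.0 * throughput_mbps / bandwidth_mbps


print(packet_loss_percent(sent=1000, received=990))
print(jitter_ms([20, 22, 21, 25]))
print(availability_percent(uptime_seconds=2_589_408, total_seconds=2_592_000))
print(bandwidth_utilization_percent(throughput_mbps=420, bandwidth_mbps=1000))
```

A low utilization figure on a high bandwidth link is the "high bandwidth, low throughput" warning sign mentioned above.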

network fault monitoring

when a system polls registered devices at established intervals to verify if they respond
this is the simplest form of network monitoring
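
A minimal sketch of one polling round. `poll_devices` and `tcp_probe` are hypothetical helpers; the probe here is a plain TCP connect, which is only one of the protocol options listed earlier (SNMP, ICMP, etc. would need their own probes).

```python
import socket


def tcp_probe(address, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    with socket.create_connection(address, timeout=timeout):
        return True


def poll_devices(devices, probe=tcp_probe):
    """Run one polling round and return the devices that did not respond."""
    down = []
    for device in devices:
        try:
            responded = probe(device)
        except OSError:  # refused, unreachable, timed out, ...
            responded = False
        if not responded:
            down.append(device)
    return down


# example with a fake probe so the sketch runs without a network
registered = [("10.0.0.1", 22), ("10.0.0.2", 22)]
print(poll_devices(registered, probe=lambda device: device[0] != "10.0.0.2"))
```

In practice this would be called once per monitoring interval, with the returned list feeding the alerting step.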

network capacity monitoring

monitor the users, applications, and other services on your network to see if any are draining the network

observability

the extent to which a system can be monitored
you observe a system through metrics, logs and traces
considerations
- storage costs
- data overload
- ensuring your system is outputting the correct data

Performance Monitoring

collects detailed information to help determine possible reasons for poor service performance
historical data provides performance trends and helps with root cause analysis for any issues that may occur.
When measuring performance:
- measure the time it takes for a service to complete an operation
- understand the units of measurement involved
- how to use these measurements to calculate performance.

benchmarking

understand the actual performance and the optimal performance of your workload before you attempt to optimize it.
- establish a baseline measurement for each segment of a resource
  - observe performance over at least a two-week period (ideally, over a one-month period) to capture the workload and business peaks.
- simulate real user traffic
actual: is what you see day to day
optimal: the absolute best performance you can get based on the combined components that you are using

storage performance

latency: aka delay; amount of time between making a request to the storage system and receiving the response.
- How the storage is connected to the compute system
Input/output operations per second: IOPS; a statistical storage measurement of the number of input/output (I/O) operations that can be performed per second
- used to measure how many operations of a given workload type and operation size can occur per second
- I/O size is generally measured in KiB, and the underlying drive technology determines the maximum amount of data that a volume type counts as a single I/O.
  - SSD vs HDD
    - ssd: deliver consistent performance whether an I/O operation is random or sequential.
      - handle small or random I/O operations more efficiently than HDD volumes.
    - hdd: deliver optimal performance only when I/O operations are large and sequential.
throughput: how much data (in MiB/s) you can read/write per second when reading large sequential data files.
- Large files, such as video files, must be read from beginning to end.
- reported in megabytes per second (MB/s) or mebibytes per second (MiB/s) depending on the tool; 1 MiB = 1.048576 MB
block size: certain workloads benefit from a smaller or larger block size
- file systems support non-default block sizes that you can specify when formatting the disk.
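
IOPS, I/O size and throughput are tied together: throughput is roughly IOPS multiplied by the I/O size. A quick sketch of that arithmetic, with hypothetical function names and example numbers:

```python
def throughput_mib_per_s(iops, io_size_kib):
    """Throughput implied by an IOPS figure at a given I/O size."""
    return iops * io_size_kib / 1024  # 1024 KiB per MiB


def iops_needed(target_mib_per_s, io_size_kib):
    """IOPS required to sustain a target throughput at a given I/O size."""
    return target_mib_per_s * 1024 / io_size_kib


print(throughput_mib_per_s(iops=3000, io_size_kib=16))        # small I/O
print(iops_needed(target_mib_per_s=250, io_size_kib=256))     # large sequential I/O
```

This is why small random workloads tend to be limited by IOPS while large sequential workloads tend to be limited by throughput.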

network performance

it's all about reducing latency, i.e. anything that lengthens the time to get data to the user
- packet loss
- jitter: Variations in latency or time delay between packets
- bandwidth constraints
- inefficient protocol use
- physical distance
When reducing latency, consider
- physical distance between two nodes
- quality of routes
- request origin location in relation to data
- average packet delay under network cost constraints
- memory resources
- traffic patterns and available node resources

Application Network Performance Optimization

monitoring and troubleshooting of performance issues within applications
lets you evaluate network and API communication from the application's perspective

Cost Management

Cloud Financial Management (CFM)

set of activities that enable organizations to measure, optimize and plan costs as they grow their adoption of cloud services
outcomes: CFM goals
- reduce unit costs as you scale
- reinvest wasteful spend and increase business agility
- improve financial predictability
- establish cost aware behaviors and culture
FinOps: cross functional finance and technology team

capabilities

gauge your organization's maturity in the following domains
maturity level: 1 novice -> 5 expert
- only the expert level is listed for each domain

ownership and accountability

who is responsible for driving CFM across the entire organization
- identify an owner (SME) or establish a cross-functional team with finance, technology and others
- define common goals and targets
expert level
- active owner with goals graded against targets
- consistent executive sponsorship
- programmatic CFM activities

finance and tech cooperation

establish strong partnerships between finance and tech
- financial stakeholders become more cloud & technology savvy
  - understand how technology is being used to drive growth
- technology stakeholders become more finance savvy
  - participating in finance activities like spend reviews, forecasting exercises and budget discussions
- initiate short, periodic meetings between all stakeholders
  - weekly meetings, weekly email summarizing activities and latest metrics, etc
- spread awareness and publicize CFM activities across the organization
- Share knowledge across members of the CFM team
expert level
- formal partnership between finance and technology
- finance org utilizes each cloud provider's financial toolset from an operational perspective
  - they can pull latest metrics and stats relevant to finance activities in real time
- tech org is aware of and participates in finance organization activities
  - tech supports finance initiatives
- finance and/or tech orgs educate external audiences
  - the CFM team can present their activities in detail to relevant external stakeholders

cost allocation

mechanisms to allocate consumption of cloud services back to the actual consumer of the cloud
- identify the most important dimensions for your specific business
  - e.g. categorizing based on teams, lines of business, cost center, etc
- define a strategy for accounts and tags
  - at least separate accounts by env (prod, dev, etc)
  - publish, evangelize and govern resource tagging across accounts
- allocate shared costs
  - e.g. containers, clusters, storage, etc can all be allocated based on resource consumption
expert level
- account segmentation strategy and tagging policy exists and enforced
- full account and tag governance
- cost allocations for discounts and shared resources
  - there are always cross-cutting concerns that span accounts, but still need to be allocated effectively
- special tag handling

cost visibility

visibility into cloud spend
- define KPIs based on unit costs of cloud resources (e.g. X per hour, Y per transaction)
  - set KPI goals and targets, and automate alarms to detect anomalies
- enable stakeholders: provide reports to finance and organization teams in a digestible manner
  - some people need details, other people need summaries
  - provide training to enable usage of reporting tools
- can trace spend back to resource usage across different dimensions
- tools and processes in place to view and analyze spend
- regularly report on cloud activities and how they relate to KPIs to stakeholder groups
- data is used to influence technology group activities
expert level
- custom dashboards utilizing each cloud provider's APIs to generate reports specifically for cost and usage metrics
- stakeholders actively use the dashboards
- efficiency KPIs drive decision making related to cloud adoption and consumption activities
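
A unit-cost KPI (e.g. cost per transaction) with a simple alarm check could look like the sketch below. The names, the 10% tolerance and the figures are assumptions, not a provider API.

```python
def unit_cost(total_cost, units):
    """Cost per unit of business output, e.g. per transaction or per hour."""
    return total_cost / units


def kpi_breached(actual, target, tolerance=0.10):
    """True when the actual unit cost exceeds the target by more than the tolerance."""
    return actual > target * (1 + tolerance)


cost_per_txn = unit_cost(total_cost=1200.0, units=400_000)
print(cost_per_txn)
print(kpi_breached(cost_per_txn, target=0.0025))
```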

cost optimization

reduce existing costs and avoid unnecessary costs through upfront workload design and embedding cost optimization in all operational processes
- effective consumption planning
  - avoid on-demand prices by utilizing purchase plans, commitment based discounts and other provider cost models
- right-size your resources: you must know your workload
- use automation to clean up waste: shutdown unused/idle resources
- prioritize high ROI investments: architect for cost optimization
- design fault tolerant systems and use cloud services that scale horizontally
expert level
- commitment based discounts, purchase plans and other non-on-demand cost models are utilized for prod, dev and other workloads
- proactive optimization through design, architecture and resource selection
  - not only identifying what you can do, but actively implementing strategies to reduce costs across the organization
  - you must bake cost optimization into the design phase of the tech organization
- continuous and increased level of automated optimization

forecasting

mechanisms for forecasting future costs in order to improve the business and financial predictability
- use trend based forecasting for consistent usage
- use driver based forecasting to capture business changes
- track actuals vs forecast and understand variance drivers
- establish a periodic variance mitigation process and publish results to key stakeholders
  - root cause analysis
  - named stakeholders
  - publish results to the org
expert level
- trend and driver based forecasting
- recurring detailed variance analysis
- standard operating procedure or well defined playbook for mitigating variances
- high forecasting accuracy
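
A sketch of trend based forecasting plus actuals-vs-forecast variance. It fits a straight line (ordinary least squares) to past monthly costs, which is an assumed simplification; driver based forecasting would instead model business inputs. Function names and figures are hypothetical.

```python
def trend_forecast(monthly_costs, months_ahead=1):
    """Extrapolate a least-squares straight line fitted to past monthly costs."""
    n = len(monthly_costs)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(monthly_costs) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, monthly_costs)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + months_ahead)


def variance_percent(actual, forecast):
    """How far the actual spend landed from the forecast, as a percentage."""
    return 100.0 * (actual - forecast) / forecast


history = [100.0, 110.0, 120.0, 130.0]
forecast = trend_forecast(history)
print(forecast)
print(variance_percent(actual=147.0, forecast=forecast))
```

A variance outside the agreed tolerance is what would trigger the mitigation process (root cause analysis, named stakeholders, published results).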

4 pillars of CFM

see: measurement and accountability
save: cost optimization
plan: planning and forecasting
run: financial operations

Measurement and Accountability (see)

activities that establish cost visibility to ensure transparency and accountability for spend

account and tagging strategy

cost reporting and monitoring processes

cost showback/chargeback

efficiency/value KPIs

Cost Optimization (save)

activities that ensure your organization pays only for resources it needs

cost aware architecture

design and service selection

match capacity with demand

purchase model selection

identifying waste resources

Planning and Forecasting (plan)

activities that allow your organization to better understand costs associated with future cloud workloads

budgeting & forecasting

POC based cost estimation

business case

strategic fit

Financial Operations (run)

secure executive sponsorship

Finance + Tech cooperation

People, Governance, Tools

Accomplishments