analytics

links

best practices

  • monitoring applications, services and resources: use a tool that will get you 90% of the way there
  • record performance-related metrics appropriate for the service: e.g. db transactions, slow queries, i/o latency, http request throughput, service latency, etc
    • Identify metrics that matter for your workload and record them.
    • Identify the target, measurement approach, and priority to build alarms and notifications to proactively address performance-related issues.
  • analyze metrics when events/incidents occur
    • use monitoring dashboards or reports to understand and diagnose the impact.
    • write use cases for your architecture, including performance requirements and incident responses.
  • establish KPIs to measure workload performance
    • Identify the key performance indicators (KPIs) that indicate whether the workload is performing as intended.
    • Document the performance experience of customers, and use these requirements to establish your KPIs
  • use monitoring to generate alarm-based notifications
    • using your KPIs, a monitoring system should automatically alert when measurements fall outside the baseline (see the sketch after this list).
  • review metrics at regular intervals
    • review the metrics collected to identify which metrics were key in addressing issues
    • Also ask which additional metrics would help to identify, address, or prevent issues.
  • monitor and alarm proactively
    • Use KPIs, combined with monitoring and alerting systems, to proactively address performance-related issues.
    • automated alerts when thresholds are breached
  • deploy monitoring agents to constantly monitor resource performance
  • There is no substitute for measuring the performance of your full application
    • i.e. rarely is it useful to measure in isolation
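
A minimal sketch of the baseline-driven alerting practice above, assuming a hypothetical `send_alert` hook and in-memory samples; a real system would delegate this to a monitoring service:

```python
import statistics

def send_alert(message):
    # hypothetical hook: wire this to email/SMS/paging in a real system
    print("ALERT:", message)

def check_against_baseline(name, history, latest, n_sigmas=3):
    """Alert when the latest measurement falls outside the baseline,
    where the baseline is mean +/- n_sigmas stdevs of prior samples."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if abs(latest - mean) > n_sigmas * stdev:
        send_alert(f"{name}: {latest} outside baseline "
                   f"({mean:.1f} +/- {n_sigmas}*{stdev:.1f})")

# e.g. prior p99 request latency samples (ms) vs a new reading
check_against_baseline("p99_latency_ms", [120, 115, 130, 125, 118], 210)
```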

basics

  • key domains
    • customer experience
    • performance over time: system, costs, etc
    • trends
    • troubleshooting and remediation: identification, isolation and resolution, root cause analysis
    • cost allocation
    • learning and improvement: detecting and preventing problems

Automated Alerts

  • When there are issues, you should be alerted immediately, either through:
    • on-screen displays
    • Text and emails automatically generated by the network monitoring solution
  • every alert should contain (see the sketch below):
    • when the problem occurred and which threshold is being approached/breached
    • information to identify the source: device, resource, etc
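
As a sketch, an alert payload carrying the fields above might look like this (field names are illustrative, not any particular tool's schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alert:
    metric: str          # which measurement triggered the alert
    value: float         # observed value
    threshold: float     # the threshold being approached/breached
    source: str          # device/resource/service that produced it
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

alert = Alert(metric="cpu_utilization_pct", value=97.0,
              threshold=90.0, source="db-host-01")
```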

Right sizing resources

  • the process of reviewing deployed resources and identifying opportunities to eliminate or downsize without compromising capacity or other requirements
  • involves continually analyzing resource performance and usage needs and patterns, then turning off idle resources, removing unused capacity, and right-sizing resources that are over-provisioned or poorly matched to the workload.
  • directly impacts performance and costs
  • Noisy Neighbor effect: in multitenant systems with shared resources, the activity of one tenant can negatively impact another tenant's share of resources

monitoring

  • all about metrics, logging and tracing
    • processes must be in place to capture logs and other useful artifacts
    • captured logs and artifacts must be stored in a durable, searchable location
    • alerts and automation
  • The act of collecting, analyzing, and using data to make decisions or answer questions about your IT resources and systems
  • monitoring tools: collects data generated by systems
  • metrics: a datapoint consisting of a name and value
  • dimensions: qualities that describe the context of a metric, consisting of a name and value
  • statistics: metrics monitored over time
  • logs: collect and aggregate log files from resources and filter out actionable insights from background noise
  • tracing: follow the path of a request as it passes through different services
    • investigate how apps and their underlying services are performing
    • important for troubleshooting the root cause of performance issues and errors
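
A sketch of the vocabulary above: a metric is a named value, dimensions add context, and statistics aggregate metrics over time (all names and numbers here are illustrative):

```python
# metric: a datapoint consisting of a name and value
metric = {"name": "http_request_latency_ms", "value": 42.0}

# dimensions: name/value pairs describing the metric's context
metric["dimensions"] = {"service": "checkout", "region": "us-east-1"}

# statistics: metrics monitored over time (aggregated here as an average)
samples = [42.0, 38.5, 51.2, 47.0]
avg_latency = sum(samples) / len(samples)
print(f"avg over window: {avg_latency:.1f} ms")
```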

Network Monitoring

  • monitor the availability, uptime, operation, and performance of complex networks
  • reduce the mean time to repair and recover and solve real-time network performance issues.
  • key tasks:
    • Tracking and analyzing network components and the connections between them
    • Surveilling different data layers, network endpoints, and links
    • monitoring the health and performance of network interfaces and their faults helps diagnose, optimize, and manage network resources
      • provide historical data and establish a baseline
      • important for forensic analysis to identify the root cause after incidents.
    • Presenting data in the form of tables, charts, graphs, dashboards, and reports.

general process

  • for monitoring devices and network components
  • identify performance metrics to be monitored
  • determine the monitoring interval (see the polling sketch after this list):
    • the frequency at which network devices are polled to identify performance and availability status
  • choosing the right protocols for devices & network components
    • SNMP: simple network management protocol
    • HTTP: hypertext transfer protocol
    • TCP: transmission control protocol
    • IP: internet protocol
    • ICMP: internet control message protocol
    • WMI: windows management instrumentation
  • set proactive thresholds
  • alert, alert, alert!
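
A minimal sketch of the process above, using ICMP reachability via the system `ping` command (Linux flags shown; the host list and interval are illustrative):

```python
import subprocess
import time

HOSTS = ["192.0.2.10", "192.0.2.20"]   # devices to poll (illustrative)
INTERVAL_S = 60                         # monitoring interval

def is_reachable(host, timeout_s=2):
    """One ICMP echo; -c = packet count, -W = timeout (Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        capture_output=True)
    return result.returncode == 0

while True:  # runs until interrupted; a real poller would be a daemon
    for host in HOSTS:
        if not is_reachable(host):
            print(f"ALERT: {host} did not respond to ICMP echo")
    time.sleep(INTERVAL_S)
```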

diagnostic tools

  • results always need to be compared against a historical baseline of network performance
  • ping: some service providers have ping (ICMP echo packets) disabled by default, or it can't be enabled at all
  • traceroute: uses successive echo packets to display the path to the destination and the response time of each hop.
  • Speedtest: useful in evaluating the performance of your internet access.
  • packet analyzers: aka packet sniffers; logs each packet it intercepts, decodes the packet, and presents the values of the various fields within the packet for examination.
  • benching tools: measure throughput and bandwidth (see the sketch after this list)
    • iperf/iperf3: tools for active measurements of the maximum achievable bandwidth on IP networks
      • supports tuning of various parameters related to timing, buffers, and protocols (TCP, UDP, SCTP with IPv4 and IPv6)
    • extrahop: monitoring solution for security, network performance, and the cloud
    • netperf: a CLI tool similar to iPerf that measures throughput and benchmarking speeds.
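
As an example of feeding a benching tool into a baseline comparison, iperf3's JSON output mode (`-J`) can be parsed programmatically; the server address, baseline figure, and JSON field path below are assumptions for the sketch:

```python
import json
import subprocess

def measure_bandwidth_mbps(server):
    """Run a TCP bandwidth test against an iperf3 server; -J emits JSON."""
    out = subprocess.run(["iperf3", "-c", server, "-J"],
                         capture_output=True, text=True, check=True)
    report = json.loads(out.stdout)
    # field path assumed from iperf3's JSON report for TCP tests
    return report["end"]["sum_received"]["bits_per_second"] / 1e6

baseline_mbps = 940.0  # illustrative historical baseline
measured = measure_bandwidth_mbps("iperf.example.net")
if measured < 0.8 * baseline_mbps:
    print(f"throughput degraded: {measured:.0f} Mbit/s "
          f"vs baseline {baseline_mbps:.0f} Mbit/s")
```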

common networking metrics

  • bandwidth capacity: the maximum data transmission rate possible on a network
    • measures the theoretical limit of data transfer
    • For optimal network operations, you want to get as close to your maximum bandwidth as possible without reaching critical levels
    • this indicates that your network is sending as much data as it can within a period of time without being overloaded
  • throughput: measures your network’s actual data transmission rate
    • measured in units such as megabytes or gigabytes per second of data packets successfully sent
    • a high-bandwidth connection with low throughput is an indicator of an underlying problem
  • latency: delay between requesting data and when that data is finished being delivered.
    • Consistent delays or odd spikes in delay time indicate a major performance issue
  • packet loss: examines how many data packets are dropped during data transmissions on your network
    • The more data packets that are lost, the longer it takes for a data request to be fulfilled
    • A network’s TCP stack detects when packets are dropped and takes steps to ensure that data packets can still be transmitted
  • retransmission: when packets are lost, the network needs to retransmit them to complete a data request
    • retransmission rate lets your enterprise know how often packets are being dropped, which is an indication of congestion on your network.
    • analyze retransmission delay (or the time it takes for a dropped packet to be retransmitted) to understand how long it takes your network to recover from packet loss.
  • availability: i.e. uptime, the percentage of time the network is available.
    • you can never guarantee 100 percent availability, but you want to be aware of any unexpected downtime on your network
  • connectivity: whether the connections between the nodes on your network are working properly
    • jitter: a variation in delay or disruption that occurs while data packets travel across the network.
    • congestion: occurs when network devices are unable to forward as much traffic as they receive.
  • response times: measures the time it takes for a server to respond to a data request with application data
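
Several of the metrics above are simple ratios over interface counters; a sketch with illustrative numbers:

```python
# counters sampled over a 60-second window (illustrative values)
window_s = 60
bytes_received = 450_000_000
packets_sent, packets_lost, packets_retransmitted = 1_000_000, 1_200, 900

throughput_mbps = bytes_received * 8 / window_s / 1e6  # actual rate
loss_pct = 100 * packets_lost / packets_sent           # packet loss
retx_pct = 100 * packets_retransmitted / packets_sent  # retransmission rate

# jitter: mean variation in delay between consecutive packets
delays_ms = [20.1, 19.8, 25.3, 20.0]
jitter_ms = sum(abs(a - b) for a, b in zip(delays_ms, delays_ms[1:])) \
            / (len(delays_ms) - 1)

print(f"{throughput_mbps:.0f} Mbit/s, loss {loss_pct:.2f}%, "
      f"retx {retx_pct:.2f}%, jitter {jitter_ms:.1f} ms")
```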

network fault monitoring

  • when a system polls registered devices at established intervals to verify that they respond
  • this is the simplest form of network monitoring

network capacity monitoring

  • monitor the users, applications, and other services on your network to see if any are draining the network

observability

  • the extent to which a system’s internal state can be understood from the data it outputs
  • you observe a system through metrics, logs and traces
  • considerations
    • storage costs
    • data overload
    • ensuring your system is outputting the correct data

Performance Monitoring

  • collects detailed information to help determine possible reasons for poor service performance
  • historical data provides performance trends and helps with root cause analysis for any issues that may occur.
  • When measuring performance (see the timing sketch after this list):
    • measure the time it takes for a service to complete an operation
    • understand the units of measurement involved
    • how to use these measurements to calculate performance.
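
A sketch of the steps above: time an operation, keep the unit explicit, and derive percentile statistics from the samples (the workload being timed is a stand-in):

```python
import statistics
import time

def timed(op, *args):
    """Measure how long a service operation takes, in milliseconds."""
    start = time.perf_counter()
    result = op(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000  # unit: ms
    return result, elapsed_ms

# collect samples, then compute performance statistics
samples_ms = [timed(sum, range(100_000))[1] for _ in range(100)]
q = statistics.quantiles(samples_ms, n=100)
print(f"p50={q[49]:.3f} ms  p99={q[98]:.3f} ms")
```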

benchmarking

  • understand the actual performance and the optimal performance of your workload before you attempt to optimize it.
    • establish a baseline measurement for each segment of a resource
      • observe performance over at least a two-week period (ideally, over a one-month period) to capture the workload and business peaks.
    • simulate real user traffic
  • actual: is what you see day to day
  • optimal: the absolute best performance you can get based on the combined components that you are using

storage performance

  • latency: aka delay; amount of time between making a request to the storage system and receiving the response.
    • How the storage is connected to the compute system
  • Input/output operations per second: IOPS; a statistical storage measurement of the number of input/output (I/O) operations that can be performed per second
    • measures how many I/O operations of a given type and size can occur per second for a workload
    • I/O size is generally measured in KiB, and the underlying drive technology determines the maximum amount of data that a volume type counts as a single I/O.
      • SSD vs HDD
        • ssd: deliver consistent performance whether an I/O operation is random or sequential.
          • handle small or random I/O operations more efficiently than HDD volumes.
        • hdd: deliver optimal performance only when I/O operations are large and sequential.
  • throughput: how much data (in MiB/s) you can read/write per second when reading large sequential data files.
    • Large files, such as video files, must be read from beginning to end.
    • operations are measured in mebibytes per second (MiB/s).
  • block size: certain workloads benefit from a smaller or larger block size
    • file systems support non-default block sizes that you can specify when formatting the disk.
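
The metrics above are related: throughput is roughly IOPS times I/O size, so the same IOPS budget looks very different under small random vs large sequential I/O. A worked sketch with illustrative numbers:

```python
def throughput_mib_s(iops, io_size_kib):
    """Throughput (MiB/s) = IOPS x I/O size; 1 MiB = 1024 KiB."""
    return iops * io_size_kib / 1024

# same 16,000 IOPS budget, different block sizes
print(throughput_mib_s(16_000, 4))    # 62.5 MiB/s  (small random I/O)
print(throughput_mib_s(16_000, 256))  # 4000.0 MiB/s (large sequential I/O)
```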

network performance

  • it's all about reducing latency: anything that lengthens the time to get data to the user
    • packet loss
    • jitter: Variations in latency or time delay between packets
    • bandwidth constraints
    • inefficient protocol use
    • physical distance
  • When reducing latency, consider
    • physical distance between two nodes
    • quality of routes
    • request origin location in relation to data
    • average packet delay under network cost constraints
    • memory resources
    • traffic patterns and available node resources
Application Network Performance Optimization
  • monitoring and troubleshooting of performance issues within applications
  • permits applications to evaluate the network API communication from the application perspective

Cost Management

Cloud Financial Management (CFM)

  • a set of activities that enables organizations to measure, optimize and plan costs as they grow their adoption of cloud services
  • outcomes: CFM goals
    • reduce unit costs as you scale
    • reinvest wasteful spend and increase business agility
    • improve financial predictability
    • establish cost aware behaviors and culture
  • FinOps: cross functional finance and technology team

capabilities

  • gauge your organization's maturity in the following domains
  • maturity level: 1 novice -> 5 expert
    • only the expert level is listed below
ownership and accountability
  • who is responsible for driving CFM across the entire organization
    • identify an owner (SME) or establish a cross-functional team with finance, technology and others
    • define common goals and targets
  • expert level
    • active owner with goals graded against targets
    • consistent executive sponsorship
    • programmatic CFM activities
finance and tech cooperation
  • establish strong partnerships between finance and tech
    • financial stakeholders become more cloud & technology savvy
      • understand how technology is being used to drive growth
    • technology stakeholders become more finance savvy
      • participating in finance activities like spend reviews, forecasting exercises and budget discussions
    • initiate short, periodic meetings between all stakeholders
      • weekly meetings, weekly email summarizing activities and latest metrics, etc
    • spread awareness and publicize CFM activities across the organization
    • Share knowledge across members of the CFM team
  • expert level
    • formal partnership between finance and technology
    • finance org utilizes each cloud provider's financial toolset from an operational perspective
      • they can pull latest metrics and stats relevant to finance activities in real time
    • tech org is aware of and participates in finance organization activities
      • tech supports finance initiatives
    • finance and/or tech orgs educate external audiences
      • the CFM team can present their activities in detail to relevant external stakeholders
cost allocation
  • mechanisms to allocate consumption of cloud services back to the actual consumer of the cloud
    • identify the most important dimensions for your specific business
      • e.g. categorizing based on teams, lines of business, cost center, etc
    • define a strategy for accounts and tags
      • at least separate accounts by env (prod, dev, etc)
      • publish, evangelize and govern resource tagging across accounts
    • allocate shared costs
      • e.g. containers, clusters, storage, etc can all be allocated based on resource consumption
  • expert level
    • account segmentation strategy and tagging policy exists and enforced
    • full account and tag governance
    • cost allocations for discounts and shared resources
      • there are always cross-cutting concerns that span accounts but still need to be allocated effectively
    • special tag handling
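
A sketch of allocating a shared cost (e.g. a cluster bill) back to teams in proportion to measured resource consumption; team names and numbers are illustrative:

```python
shared_cost = 12_000.0  # monthly shared-cluster bill to allocate

# measured consumption per team, e.g. CPU-hours on the shared cluster
usage = {"checkout": 5_400, "search": 2_700, "ml-batch": 900}

total = sum(usage.values())
allocation = {team: shared_cost * hours / total
              for team, hours in usage.items()}
print(allocation)  # {'checkout': 7200.0, 'search': 3600.0, 'ml-batch': 1200.0}
```
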
cost visibility
  • visibility into cloud spend
    • define KPIs based on unit costs of cloud resources (e.g. X per hour, Y per transaction)
      • set KPI goals and targets, and automate alarms to detect anomalies
    • enable stakeholders across finance and technology teams: provide reports in a digestible manner
      • some people need details, other people need summaries
      • provide training to enable usage of reporting tools
    • can trace spend back to resource usage across different dimensions
    • tools and processes in place to view and analyze spend
    • regularly report on cloud activities and how they relate to KPIs to stakeholder groups
    • data is used to influence technology group activities
  • expert level
    • custom dashboards utilizing each cloud provider's APIs to generate reports specifically for cost and usage metrics
    • stakeholders actively use the dashboards
    • efficiency KPIs drive decision making related to cloud adoption and consumption activities
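
A sketch of a unit-cost KPI as described above (cost per transaction) with a simple target-based alarm; all figures are illustrative:

```python
monthly_spend = 84_000.0   # cloud spend attributed to the workload
transactions = 42_000_000  # business output over the same period

cost_per_txn = monthly_spend / transactions  # unit-cost KPI
target = 0.0015                              # KPI target ($ per transaction)

if cost_per_txn > target:
    print(f"KPI breach: ${cost_per_txn:.4f}/txn "
          f"vs target ${target:.4f}/txn")
```
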
cost optimization
  • reduce existing costs and avoid unnecessary costs through upfront workload design and embedding cost optimization in all operational processes
    • effective consumption planning
      • avoid on-demand prices by utilizing purchase plans, commitment based discounts and other provider cost models
    • right-size your resources: you must know your workload
    • use automation to clean up waste: shutdown unused/idle resources
    • prioritize high ROI investments: architect for cost optimization
    • design fault tolerant systems and use cloud services that scale horizontally
  • expert level
    • commitment-based discounts, purchase plans and other non-on-demand cost models are utilized for prod, dev and other workloads
    • proactive optimization through design, architecture and resource selection
      • not only identifying what you can do, but actively implementing strategies to reduce costs across the organization
      • you must bake cost optimization into the design phase of the tech organization
    • continuous and increased level of automated optimization
forecasting
  • mechanisms for forecasting future costs in order to improve the business and financial predictability
    • use trend-based forecasting for consistent usage (see the sketch below)
    • use driver-based forecasting to capture business changes
    • track actuals vs forecast and understand variance drivers
    • establish a periodic variance mitigation process and publish results to key stakeholders
      • root cause analysis
      • named stakeholders
      • publish results to the org
  • expert level
    • trend and driver based forecasting
    • recurring detailed variance analysis
    • standard operating procedure or well defined playbook for mitigating variances
    • high forecasting accuracy
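
A sketch of trend-based forecasting for consistent usage: fit a line to past monthly spend and compare actuals against it to surface variance (numbers illustrative; `statistics.linear_regression` needs Python 3.10+):

```python
import statistics

months = [1, 2, 3, 4, 5, 6]
spend = [10_200, 10_800, 11_100, 11_900, 12_300, 12_800]  # actuals

# simple linear trend: spend ~= slope * month + intercept
slope, intercept = statistics.linear_regression(months, spend)

forecast_m7 = slope * 7 + intercept
actual_m7 = 14_900.0
variance_pct = 100 * (actual_m7 - forecast_m7) / forecast_m7
print(f"forecast {forecast_m7:.0f}, actual {actual_m7:.0f}, "
      f"variance {variance_pct:+.1f}%")
```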

4 pillars of CFM

  • see: measurement and accountability
  • save: cost optimization
  • plan: planning and forecasting
  • run: financial operations
Measurement and Accountability (see)
  • activities that establish cost visibility to ensure transparency and accountability for spend
  • account and tagging strategy
  • cost reporting and monitoring processes
  • cost show/chargeback
  • efficiency/value KPIs
Cost Optimization (save)
  • activities that ensure your organization pays only for the resources it needs
  • cost-aware architecture
    • design and service selection
  • match capacity with demand
  • purchase model selection
  • identifying waste resources
Planning and Forecasting (plan)
  • activities that allow your organization to better understand costs associated with future cloud workloads
  • budgeting & forecasting
    • variable cloud usage
  • POC-based cost estimation
  • business case
    • value articulation
  • strategic fit
Financial Operations (run)
  • activities that enable your organization to scale CFM
  • secure executive sponsorship
  • Finance + Tech cooperation
    • partnership between finance and technology organizations
  • People, Governance, Tools
    • investing in the right things
  • Accomplishments
    • celebrating, rewarding and promoting good practices