
Notes

kubernetes architecture

reference_kube_arch

chaos experiments

As an industry, we are quick to adopt practices that increase the flexibility of development and the velocity of deployment. But how much confidence do we have in the complex systems we put into production? Chaos engineering is the discipline of experimenting on a distributed system in order to gain confidence in the system's capability to withstand turbulent conditions in production. To address this uncertainty, we will run a set of experiments to uncover systemic weaknesses.

In this experiment, we will deploy two instances of the data pipeline: a control group and an experimental group. We will also define a 'steady state' that indicates normal behavior for the pipeline. We will then run a set of tests on the experimental group that try to disrupt that steady state; the system withstanding them gives us more confidence in it.
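A minimal sketch of what measuring that steady state could look like, assuming the Flask app exposes a health endpoint behind the load balancer (the hostname and `/health` path below are placeholders, not part of the actual deployment):

```sh
# Hypothetical steady-state probe: sample the experimental group once per second
# for a minute, recording HTTP status and total response time.
# EXPERIMENTAL_LB_HOSTNAME and /health are assumed placeholders.
for i in $(seq 1 60); do
  curl -o /dev/null -s -w "%{http_code} %{time_total}\n" \
    "http://EXPERIMENTAL_LB_HOSTNAME/health"
  sleep 1
done
```

Comparing these numbers against the control group gives a concrete baseline to judge each experiment against.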

Experiment 1: Resource exhaustion of containers

hypothesis:

  • increased latency in incoming requests
  • load balancer routes traffic away from availability zone 2
  • receive alert message
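One way to run the resource exhaustion experiment is to schedule a stress workload alongside the target containers. The pod below is a rough sketch assuming the `polinux/stress` image (the one used in the Kubernetes docs for memory-stress examples); the name and sizes are illustrative, not the values used in the actual experiment.

```yaml
# Hypothetical memory-hog pod used to exhaust resources on a node.
# Image, name, and sizes are assumptions for illustration.
apiVersion: v1
kind: Pod
metadata:
  name: memory-hog
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    # one worker repeatedly allocating and touching ~1 GiB of memory
    args: ["--vm", "1", "--vm-bytes", "1G", "--vm-hang", "1"]
```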

Experiment 2: Kill Stateful Replica Pod

scenario: the master Postgres pod is killed

hypothesis:

  • brief unavailability of data for x amount of time
  • a replica should get promoted (slave to master)
  • a new clone should kick off and the system should recover
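A simple way to run this by hand is to delete the master pod and watch the StatefulSet recover. The commands below are a sketch; the pod name `postgres-0`, the `app=postgres` label, and the namespace are assumptions based on typical StatefulSet naming, not necessarily the names used in this deployment.

```sh
# Kill the master Postgres pod (StatefulSet pods are usually named <set>-0, <set>-1, ...).
# Pod name, label, and namespace are assumed for illustration.
kubectl delete pod postgres-0 --namespace default

# Watch the StatefulSet re-create the pod and the replicas converge.
kubectl get pods --namespace default -l app=postgres --watch
```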

questions they may ask

Why run stateful applications like Postgres on Kubernetes?

What type of load balancer am I using?

  • Amazon EKS supports the Network Load Balancer and the Classic Load Balancer through the Kubernetes service of type LoadBalancer
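For reference, a minimal sketch of a Service of type LoadBalancer on EKS; the `service.beta.kubernetes.io/aws-load-balancer-type: nlb` annotation requests a Network Load Balancer instead of the default Classic Load Balancer (the service name, selector, and ports are placeholders):

```yaml
# Sketch of a LoadBalancer Service on EKS requesting an NLB.
# Name, selector, and ports are placeholder values.
apiVersion: v1
kind: Service
metadata:
  name: flask-app
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  selector:
    app: flask-app
  ports:
  - port: 80
    targetPort: 5000
```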

What other types of chaos testing can I apply to my deployment?

Generic Chaos on Kubernetes resources:

  • container kill
  • pod kill
  • network delay
  • network loss
  • cpu hog
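As a sketch of how the network delay and loss cases could be injected manually (outside a dedicated chaos tool), `tc netem` inside the target container can shape its traffic. This assumes the container image ships the `tc` binary and the pod runs with the NET_ADMIN capability, neither of which is guaranteed here; `TARGET_POD` is a placeholder.

```sh
# Add 200ms of delay and 5% packet loss on the pod's eth0 interface.
# Assumes tc is present in the image and the pod has NET_ADMIN.
kubectl exec -it TARGET_POD -- tc qdisc add dev eth0 root netem delay 200ms loss 5%

# Remove the disruption when the experiment is done.
kubectl exec -it TARGET_POD -- tc qdisc del dev eth0 root netem
```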

What does multi-tenancy mean?

reference_kube_enterprise_multitenancy

The users of the cluster are divided into three different roles, depending on their privilege:

  • Cluster administrator: This role is for administrators of the entire cluster, who manage all tenants. Cluster administrators can create, read, update, and delete any policy object. They can create namespaces and assign them to namespace administrators.
  • Namespace administrator: This role is for administrators of specific, single tenants. A namespace administrator can manage the users in their namespace.
  • Developer: Members of this role can create, read, update, and delete namespaced non-policy objects like Pods, Jobs, and Ingresses. Developers only have these privileges in the namespaces they have access to.
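In Kubernetes these roles map naturally onto RBAC. Below is a rough sketch of the developer role scoped to one tenant namespace; the namespace, names, and user are placeholders, and the resource list just illustrates "namespaced non-policy objects".

```yaml
# Sketch of an RBAC Role and RoleBinding for the developer role in one tenant namespace.
# Namespace, names, and subject are placeholder values.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: tenant-a
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: tenant-a
subjects:
- kind: User
  name: dev-user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
```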

Why am I running the Spark cluster in one availability zone?

  • performance?
  • latency?

How would I improve my infrastructure?

  • knowing what I know now, how would I change the components (spark, postgres, flask) of my deployment?
  • LINK - apache-spark-alternatives

How does Spark on Kubernetes compare with YARN and Mesos?

What metrics to collect for Chaos Engineering?

  • Infrastructure Monitoring Metrics
    • Resource: CPU, IO, Disk & Memory
    • State: Shutdown, Processes, Clock Time
    • Network: DNS, Latency, Packet Loss
  • Alerting and On-Call Metrics
    • Total alert counts by service per week
    • Time to resolution for alerts per service
    • Noisy alerts by service per week (self-resolving)
    • Top 20 most frequent alerts per week for each service.
  • High Severity Incident (SEV) Metrics
    • Total count of incidents per week by SEV level
    • Total count of SEVs per week by service
    • MTTD, MTTR and MTBF for SEVs by service
  • Application Metrics
    • Events
    • Stack traces
    • Context
    • Breadcrumbs

What challenges did you run into containerizing the application?

  • refactoring the existing data pipeline code, including the Flask application code and the PySpark code

Any challenges with the Postgres StatefulSet?

  • see the engineering challenge in the README

What did you use to do chaos experiments? How many pods were you running at one time?

  • I used kube-monkey to terminate pods
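kube-monkey only targets workloads that opt in via labels. The deployment below is a sketch of what opting in might look like using kube-monkey's standard labels; the identifier, mtbf, and kill values are illustrative placeholders rather than the settings used in this project.

```yaml
# Sketch: opting a deployment in to kube-monkey through its labels.
# Names and label values are illustrative placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
  labels:
    kube-monkey/enabled: enabled
    kube-monkey/identifier: flask-app
    kube-monkey/mtbf: "2"        # expect an attack roughly every 2 days
    kube-monkey/kill-mode: fixed
    kube-monkey/kill-value: "1"  # kill one pod per attack
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
        kube-monkey/enabled: enabled
        kube-monkey/identifier: flask-app
    spec:
      containers:
      - name: flask-app
        image: flask-app:latest
```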

In terms of blast radius, were there any unexpected incidents?

  • self heal?

Did you fine-tune or reconfigure anything?

  • autoscaling?

What about autoscaling? How do you scale up your existing infrastructure? And how do you check that autoscaling also scales down correctly?

  • The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
  • HPA will increase and decrease the number of replicas (via the deployment) to maintain an average CPU utilization across all Pods of 50%
# example of how to autoscale based on cpu
# using the flask app with a min of 3, a max of 9, and a 50% target cpu utilization
kubectl autoscale deployment scale-app --cpu-percent=50 --min=3 --max=9
  • Here is an example for a StatefulSet (able to autoscale since Kubernetes 1.9):
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: YOUR_HPA_NAME
spec:
  maxReplicas: 3
  minReplicas: 1
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: YOUR_STATEFUL_SET_NAME
  targetCPUUtilizationPercentage: 80
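To verify that autoscaling scales back down, one option is to watch the HPA after load subsides; replicas should return to minReplicas once utilization stays below the target (the controller waits a few minutes by default before scaling down). These are generic kubectl commands, with the HPA name as a placeholder.

```sh
# Watch current vs. target CPU utilization and the replica count over time.
kubectl get hpa YOUR_HPA_NAME --watch

# Inspect recent scaling events and conditions.
kubectl describe hpa YOUR_HPA_NAME
```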

What about pod resource limits?
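Requests and limits are set per container in the pod spec: requests drive scheduling (and the HPA's utilization math), while limits cap what the container may actually consume. A minimal sketch with placeholder values:

```yaml
# Sketch of per-container resource requests and limits (placeholder values).
apiVersion: v1
kind: Pod
metadata:
  name: flask-app
spec:
  containers:
  - name: flask-app
    image: flask-app:latest
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
```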

things to keep in mind for presentation

  • make sure to mention that I am working on top of someone else's project

things to think about

  • become very valuable by mastering open source tools that help companies avoid vendor lock-in
  • give kube more than it can handle via oversubscription (review best practices)

references