Milestone 2023.5: Refresh Testground's EKS support #1529

Open
laurentsenta opened this issue Nov 23, 2022 · 2 comments
Labels
starmaps https://www.starmaps.app/

Comments

@laurentsenta
Contributor

laurentsenta commented Nov 23, 2022

eta: 2024-06

description:
Testground can simulate small networks in CI, but it covers more use cases when it lives in a larger cluster. When we run Testground in Kubernetes, we can support whole organizations through the Testground As A Service product.

Using a managed service (Amazon's Elastic Kubernetes Service) means our maintenance costs are lower, and the team can focus on improvements.


Deliverables

  • An up-to-date and reliable Kubernetes runner that runs on Amazon's EKS.

Tasks:

@laurentsenta
Contributor Author

laurentsenta commented Dec 7, 2022

A note: we're going to remove the k8s tests (using kind) as dead code. They will remain available in git history, and we can port them to EKS if it makes sense (I believe updating the docker tests so they can also be submitted to an EKS test cluster might be more useful).

(see also #1515)

@laurentsenta
Contributor Author

laurentsenta commented Dec 9, 2022

(triage session with @sysrex and @Bidon15, I'll re-phrase and create issues on Monday)

  • cluster autoscaler:

    • (almost) fixed on Celestia side
    • cost efficiency + automation is required
    • kind of works:
      • delays execution a bit (warm-up)
      • issue w/Infra node group
        • the networking is deployed in a "fixed" node group.
    • change the deployment or have the cluster autoscaler balance the node group.
    • rework:
      • node groups or autoscaler service (so that the network grows with the executors)
  • Shell Scripts Improvements

    • when big plans are executed, we run out of IPs (each pod is assigned IPs)
    • the script is designed to run in a single availability zone
    • use all three AZs (3x IPs)
    • support for different instance types is required
      • use spot instances (cheaper)
    • heavily modified:
  • Modified version of the script:

    • using 3 az + cluster auto-scaler + several types of instances + spot requests (balancing between spot + on demand)
    • AWS terraform blueprints scripts
      • modified dashboards
    • 2 structures
      • networks + subnets
      • nodegroups, etc.
    • helm charts
    • argo CD
    • (too much work to apply)
  • Deploying custom branches

  • Authentication

  • Config Map for the testground daemon

    • test task timeout tweaks
    • having ways to reconfigure the daemon
  • on scale up

    • the cluster might delete a test pod
    • network init hangs (because 1 of x pods is missing and never signals readiness)
  • networking issue (network flakiness)

    • 2 (stable) / 5 / 6 / 7 pods out of 1000 -
    • error no route to host

(ipv6 is not an issue right now)
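
To put a number on the IP exhaustion point from the Shell Scripts Improvements notes, here is a minimal sketch that counts assignable addresses for one subnet versus three. The CIDR ranges and the one-subnet-per-AZ layout are assumptions made for illustration, not the actual cluster configuration. It also assumes pods draw their addresses directly from the VPC subnets, and it ignores the addresses consumed by nodes and other resources. The only AWS-specific fact used is that every subnet has 5 reserved addresses.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(subnet_cidrs):
    """Return the total number of assignable addresses across the given subnets."""
    total = 0
    for cidr in subnet_cidrs:
        network = ipaddress.ip_network(cidr)
        total += max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)
    return total


# Hypothetical layout: one /20 subnet per availability zone.
single_az = ["10.0.0.0/20"]
three_az = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]

print(usable_ips(single_az))  # 4091
print(usable_ips(three_az))   # 12273
```

With equally sized subnets the capacity scales linearly with the number of AZs, which matches the "3x IPs" estimate in the notes.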

  • Stability Dashboard & Flakiness

    • cross reference with another cluster (like GCP)
    • => stopped on price issue
    • how to troubleshoot networking:
      • intuition "against" weave
      • devkit cluster with raspberry
      • use influxdb to detect crashes
      • => maybe we're launching pods before they are ready
      • => post hooks to the nodes
    • stress testing also: (8k nodes)
    • not running tests in CI right now, maybe Q2
  • EKS & Sync Service Stability

    • when we run a lot of tests, sync service might get overwhelmed
    • context canceled, EOF, etc.
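
One possible client-side mitigation for transient sync service errors (context canceled, EOF) is to retry with capped exponential backoff and jitter, so that many test instances do not hammer the service at the same moment. The sketch below is generic and does not use the Testground SDK. `TransientSyncError`, `call_with_backoff`, and `flaky_signal` are hypothetical names introduced for illustration, and the retry limits are arbitrary.

```python
import random
import time


class TransientSyncError(Exception):
    """Hypothetical stand-in for errors such as 'context canceled' or EOF."""


def call_with_backoff(operation, max_attempts=5, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on TransientSyncError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientSyncError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


# Usage: an operation that fails twice, then succeeds on the third call.
attempts = {"count": 0}


def flaky_signal():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientSyncError("EOF")
    return "ok"


# sleep is stubbed out so the example runs instantly.
result = call_with_backoff(flaky_signal, sleep=lambda _: None)
print(result, attempts["count"])
```

Retries only help with short spikes. If the sync service is saturated for the whole run, the load itself has to be reduced or the service scaled.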

Note: for Celestia, killer feature of Testground is having large scale networks.

@laurentsenta laurentsenta changed the title Milestone 2023.2: Refresh Testground's EKS support Milestone 2023.5: Refresh Testground's EKS support Jan 4, 2023
Projects
Status: Backlog
Development

No branches or pull requests

2 participants