Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POC][K8S][INFRA] Enable resource limited Spark on K8S integration tests in Github Action #35830

Closed
wants to merge 2 commits into from

Conversation

Yikun
Copy link
Member

@Yikun Yikun commented Mar 13, 2022

What changes were proposed in this pull request?

Enable K8S integration tests in Github Action.
Note that this PR didn't add Volcano test due to limited resource of github action.

For current implementations, we enable 2 style k8s ci:

image

Why are the changes needed?

This will also improve the efficiency of K8S development and guarantee the quality of spark on K8S in some level.

  • Why setting driver 0.5 cpu, executor 0.2 cpu?
    Github-hosted runner hardware limited: 2U7G
    IT Job available CPU = 2U - 0.85U(K8S deploy) = 1.15U
    There are 1.15 cpu left after k8s installation, to meet the requirement of K8S tests (one driver + max to 3 executors).

  • Time cost info:

    • 14 mins to compile related code.
    • 3 mins to build docker images.
    • 20-30 mins to test
    • Total: about 30-40 mins
  • Stablity:
    I also do a stablity validation based on scheduled job:
    https://github.com/spark-volcano/spark/actions/workflows/k8s_integration_test.yml
    All test passed always passed at each half hours since enable.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI passed

@Yikun
Copy link
Member Author

Yikun commented Mar 13, 2022

cc @dongjoon-hyun @HyukjinKwon @holdenk @yaooqinn @attilapiros

WDYT? Do you think this is a work that can go on?

@Yikun
Copy link
Member Author

Yikun commented Mar 13, 2022

[info] Run completed in 22 minutes, 38 seconds.
[info] Total number of tests run: 23
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 19, failed 4, canceled 0, ignored 0, pending 0
[info] *** 4 TESTS FAILED ***

As expected.

@@ -191,6 +191,8 @@ class KubernetesSuite extends SparkFunSuite
.set("spark.kubernetes.driver.pod.name", driverPodName)
.set("spark.kubernetes.driver.label.spark-app-locator", appLocator)
.set("spark.kubernetes.executor.label.spark-app-locator", appLocator)
.set("spark.kubernetes.executor.request.cores", "0.5")
.set("spark.kubernetes.driver.request.cores", "0.5")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is only for tests, I'd like not to recommend this way. The driver has many threads and the flakiness is increasing when the driver CPU are capped.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making a PR. Although I understand your intention, I'm a little negative for this approach for now.

@Yikun
Copy link
Member Author

Yikun commented Mar 13, 2022

 - Run SparkRemoteFileTest using a remote data file *** FAILED ***
[info] Run completed in 19 minutes, 18 seconds.
[info] Total number of tests run: 23
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 22, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***

https://github.com/Yikun/spark/runs/5525624508?check_suite_focus=true

With driver 0.5 CPU and executors 0.2 CPU only SparkRemoteFileTest failed.

Kubernetes is using 0.85 CPU 0.25G memeory, so only 1.15 CPU left, we can give 0.5 for driver, 0.2 for others executor (3 is max limited).

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Mar 13, 2022

@Yikun . Fine tuning is just to aim to launch the tests, not for guaranteeing the stability. The minimum requirement of CI is stability. As you see, we are already hitting many flakiness issues in GitHub Action due to the memory issues and the committers are manually mitigating them by re-triggering . Unless we can get a bigger VM machine, this addition will add another complexity and instability to Apache Spark GitHub Action.

Kubernetes is using 0.85 CPU 0.25G memeory, so only 1.15 CPU left, we can give 0.5 for driver, 0.2 for others executor (3 is max limited).

@Yikun
Copy link
Member Author

Yikun commented Mar 13, 2022

@dongjoon-hyun Thanks for your comments, this is only an initial exploring and share to enable K8S github action IT, I also think it is very important to ensure stability if we want to get this merged.

@dongjoon-hyun
Copy link
Member

If ASF allows the minikube job, we may revisit this and your Volcano test resource reducing PR at Apache Spark 3.4 timeline with a longer monitoring period.

@Yikun
Copy link
Member Author

Yikun commented Mar 18, 2022

I think recent k8s ci breaks have proved that this CI is useful, it helps found 2 [1] [2] master bugs within 3 days, and also help some on regression test.

@dongjoon-hyun I also contact with ASF, minikube is allowed in Apache projects, see: INFRA-23000.

I also update the PR description to show demo

@Yikun Yikun closed this Mar 18, 2022
@Yikun Yikun reopened this Mar 18, 2022
@Yikun Yikun changed the title [WIP][K8S][INFRA] Enable K8S integration tests in Github Action [SPARK-38597][K8S][INFRA] Enable K8S integration tests in Github Action Mar 18, 2022
@Yikun Yikun marked this pull request as ready for review March 18, 2022 14:37
@Yikun
Copy link
Member Author

Yikun commented Mar 18, 2022

cc @dongjoon-hyun @HyukjinKwon @holdenk @martin-g

Would you mind take a look? I think now it's ready for review.

@Yikun
Copy link
Member Author

Yikun commented Apr 2, 2022

@dongjoon-hyun Sure, next 3 days is day off in China, but I will try to some time address this.

btw, other concerns are

  1. should we enable this in every pr ci?
  2. should we keep IT in a separate file (it's for easy to enable workflow dispatch, but if we enable the every pr ci I guess this might no need)

@dongjoon-hyun
Copy link
Member

Sure. This isn't urgent. Take your time.

We didn't decide this main PR about new CI. We will discuss new spinoff PRs first.

@Yikun
Copy link
Member Author

Yikun commented Apr 7, 2022

Split two PRs as independent ones.

  • Support spark.kubernetes.test.(driver|executor)RequestCores
  • Support spark.kubernetes.test.minioRequestCores Set minio cpu to 250m (0.25) in K8s IT

also enable minio test with 250m cpu, that means all kubernetes tests (execpt volcano batch suite) are running in current PR.

dongjoon-hyun pushed a commit that referenced this pull request Apr 7, 2022
…ver|executor)RequestCores`

### What changes were proposed in this pull request?
This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores.

### Why are the changes needed?
In some cases (such as resource limited case), we want to set specifc `RequestCores`.
See also: #35830 (review)

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
IT passed

Closes #36087 from Yikun/SPARK-38802.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Apr 8, 2022
…s IT

### What changes were proposed in this pull request?
This PR aims to set minio request cpu to `250m` (0.25).
- This value also recommand in [link](https://docs.gitlab.com/charts/charts/minio/#installation-command-line-options).
- There are [no cpu request limitation](https://github.com/minio/minio/blob/a3e317773a2b90a433136e1ff2a8394bc5017c75/helm/minio/values.yaml#L251) on current minio.

### Why are the changes needed?
In some cases (such as resource limited case), we reduce request cpu of minio.
See also: #35830 (review)

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
IT passsed

Closes #36096 from Yikun/minioRequestCores.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@Yikun
Copy link
Member Author

Yikun commented Apr 16, 2022

Here is the complete spark on k8s test on Github Action and all test passed (also include volcano IT):
https://github.com/Yikun/spark/runs/6046699187?check_suite_focus=true

It depends two patches:

@dongjoon-hyun
Copy link
Member

Sorry for being late revisiting. Since we start to discuss Apache Spark 3.4 planning, shall we restart this for Apache Spark 3.4, @Yikun ?

@dongjoon-hyun
Copy link
Member

Also, cc @HyukjinKwon , too

@Yikun
Copy link
Member Author

Yikun commented Jul 21, 2022

@dongjoon-hyun Thanks for reminder! It's definate YES for me. I will revisit it soon to make it work with latest CI infra.

Given that we already have the docker image as part of the spark release, there is no doubt that it will not only help spark on k8s and also the docker image stable.

@Yikun
Copy link
Member Author

Yikun commented Jul 21, 2022

To avoid the redundant of historical information, I submitted a new PR: #37244

@Yikun Yikun changed the title [SPARK-38597][K8S][INFRA] Enable resource limited Spark on K8S integration tests in Github Action [POC][K8S][INFRA] Enable resource limited Spark on K8S integration tests in Github Action Jul 21, 2022
Yikun added a commit to Yikun/spark that referenced this pull request Sep 20, 2022
…ver|executor)RequestCores`

### What changes were proposed in this pull request?
This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores.

### Why are the changes needed?
In some cases (such as resource limited case), we want to set specifc `RequestCores`.
See also: apache#35830 (review)

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
IT passed

Closes apache#36087 from Yikun/SPARK-38802.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Sep 20, 2022
…ver|executor)RequestCores`

### What changes were proposed in this pull request?
This patch adds support for `spark.kubernetes.test.(driver|executor)RequestCores`, this help devs set specific cores info for (driver|executor)RequestCores.

### Why are the changes needed?
In some cases (such as resource limited case), we want to set specifc `RequestCores`.
See also: #35830 (review)

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
IT passed

Closes #36087 from Yikun/SPARK-38802.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8396382)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun pushed a commit that referenced this pull request Sep 20, 2022
…s IT

### What changes were proposed in this pull request?
This PR aims to set minio request cpu to `250m` (0.25).
- This value also recommand in [link](https://docs.gitlab.com/charts/charts/minio/#installation-command-line-options).
- There are [no cpu request limitation](https://github.com/minio/minio/blob/a3e317773a2b90a433136e1ff2a8394bc5017c75/helm/minio/values.yaml#L251) on current minio.

### Why are the changes needed?
In some cases (such as resource limited case), we reduce request cpu of minio.
See also: #35830 (review)

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
IT passsed

Closes #36096 from Yikun/minioRequestCores.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5ea2b38)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@dongjoon-hyun dongjoon-hyun removed their assignment Apr 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants