
[BUG] Yunikorn taskgroups memory minResources is not including spark.executor.pyspark.memory #2176

Closed

tcassaert opened this issue Sep 17, 2024 · 2 comments · Fixed by #2178

Comments

@tcassaert (Contributor)

Description

When an application sets spark.executor.pyspark.memory in sparkConf, that memory is added to the executor pod's resource request, but it is not included in the yunikorn.apache.org/task-groups annotation. As a result, the executor gets stuck in Pending, with Yunikorn reporting that the pod's request is larger than what the placeholder reserved.

  • ✋ I have searched the open/closed issues and my issue is not listed.

Reproduction Code [Required]

Steps to reproduce the behavior:

Submit the following SparkApplication:

---
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi-python
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: spark:3.5.2
  imagePullPolicy: IfNotPresent
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: 3.5.2
  sparkConf:
    spark.executor.pyspark.memory: "2000"
  driver:
    annotations:
      yunikorn.apache.org/schedulingPolicyParameters: 'gangSchedulingStyle=Hard'
    cores: 1
    memory: 4096m
    memoryOverhead: 1200m
    serviceAccount: spark-operator-spark
  executor:
    instances: 2
    cores: 1
    memory: 4096m
    memoryOverhead: 1200m
  batchScheduler: yunikorn
  batchSchedulerOptions:
    queue: root.default

Expected behavior

The executors' task group should include spark.executor.pyspark.memory in its minResource.memory.

Actual behavior

The task-group looks like this:

yunikorn.apache.org/task-groups: '[{"name":"spark-driver","minMember":1,"minResource":{"cpu":"1","memory":"5296Mi"}},{"name":"spark-executor","minMember":2,"minResource":{"cpu":"1","memory":"5296Mi"}}]'

Here minResource.memory is the sum of memory and memoryOverhead.
The executor pods, however, request:

    resources:
      limits:
        memory: 7296Mi
      requests:
        cpu: "1"
        memory: 7296Mi

Here resources.requests.memory is the sum of memory, memoryOverhead, and spark.executor.pyspark.memory.
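
For the manifest above that means 4096 + 1200 = 5296Mi reserved in the annotation, versus 4096 + 1200 + 2000 = 7296Mi actually requested (the unitless spark.executor.pyspark.memory value is interpreted by Spark as MiB).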

This results in pods that never get scheduled.
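
For illustration, a minimal Go sketch (hypothetical names, not the operator's actual code) of the per-executor memory the task-group minResource would need to reserve so that it matches the pod request:

package main

import "fmt"

// executorMinMemoryMiB returns the memory the spark-executor task group should
// reserve per executor: memory + memoryOverhead + spark.executor.pyspark.memory.
// The last term is the one this issue reports as missing from the annotation.
func executorMinMemoryMiB(memoryMiB, overheadMiB, pysparkMiB int64) int64 {
	return memoryMiB + overheadMiB + pysparkMiB
}

func main() {
	// Values from the reproduction manifest: memory=4096m, memoryOverhead=1200m,
	// spark.executor.pyspark.memory=2000 (unitless, read by Spark as MiB).
	fmt.Printf("%dMi\n", executorMinMemoryMiB(4096, 1200, 2000)) // prints "7296Mi"
}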

Terminal Output Screenshot(s)

Pod events of an unschedulable executor:

Events:
  Type    Reason          Age    From      Message
  ----    ------          ----   ----      -------
  Normal  Scheduling      7m22s  yunikorn  default/pythonpi-ce8d8891ff1c8bd7-exec-1 is queued and waiting for allocation
  Normal  GangScheduling  7m22s  yunikorn  Pod belongs to the taskGroup spark-executor, it will be scheduled as a gang member

Example log in Yunikorn:

2024-09-17T08:32:29.119Z        WARN    core.scheduler.application      objects/application.go:1130     releasing placeholder: real allocation is larger than placeholder     {"requested resource": "map[memory:7650410496 pods:1 vcore:1000]", "placeholderID": "b978035d-e27c-4e2e-b3bf-4cd5f10b6fdb-0", "placeholder resource": "map[memory:5553258496 pods:1 vcore:1000]"}
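
These numbers match the figures above: the requested 7650410496 bytes are 7296Mi, while the placeholder's 5553258496 bytes are 5296Mi, a gap of exactly the 2000MiB of PySpark memory.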

Environment & Versions

Spark Operator App version: v2.0.0-rc.0
Helm Chart Version: v2.0.0-rc.0
Kubernetes Version: 1.25.7
Apache Spark version: 3.5.2

@ChenYi015 (Contributor)

@tcassaert Thanks for reporting the bug.
@jacobsalway Could you take a look at this issue? I think we missed the spark.executor.pyspark.memory conf when calculating the memory needed for the yunikorn task group.

@jacobsalway (Member) commented Sep 18, 2024

I can replicate the issue locally on Kind with the provided instructions, thanks. Looks like this file in apache/spark contains the logic you've flagged. I'll put together a fix for this.
