The task-topology plugin cannot work with the tasks with empty resource request #2941

Closed
loheagn opened this issue Jun 27, 2023 · 7 comments · Fixed by #2955
Labels: kind/bug, lifecycle/stale

Comments

@loheagn

loheagn commented Jun 27, 2023

What happened:

Assume we have a Kubernetes cluster with two nodes and a simple job with a task-topology annotation:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/task-topology-anti-affinity: "nginx"
  name: example-job-1
spec:
  minAvailable: 5
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 2
      name: nginx
      template:
        spec:
          containers:
            - image: nginx
              name: nginx-main
          restartPolicy: OnFailure
    - replicas: 3
      name: mysql
      template:
        spec:
          containers:
            - env:
                - name: MYSQL_ROOT_PASSWORD
                  value: "123456"
              image: mysql
              name: mysql-main
          restartPolicy: OnFailure

And the scheduler config file:

actions: "enqueue, backfill"
tiers:
  - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: task-topology

The job creates two kinds of tasks, and we want the two nginx tasks to be scheduled onto different nodes by means of the task-topology anti-affinity annotation.

But most of the time, the two nginx tasks are scheduled onto the same node:

image

If we add resource requests to the job spec, everything works as expected:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  annotations:
    volcano.sh/task-topology-anti-affinity: "nginx"
  name: example-job-1
spec:
  minAvailable: 5
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  tasks:
    - replicas: 2
      name: nginx
      template:
        spec:
          containers:
            - image: nginx
              name: nginx-main
              resources:
                requests:
                  cpu: "0.1"
          restartPolicy: OnFailure
    - replicas: 3
      name: mysql
      template:
        spec:
          containers:
            - env:
                - name: MYSQL_ROOT_PASSWORD
                  value: "123456"
              image: mysql
              name: mysql-main
              resources:
                requests:
                  cpu: "0.3"
          restartPolicy: OnFailure
image

What you expected to happen:

The task-topology plugin should work regardless of whether the tasks have resource requests.

How to reproduce it (as minimally and precisely as possible):

As described above.

Anything else we need to know?:

After reading the code, I found that the allocate action skips a task if its resource request is empty:

// Skip BestEffort task in 'allocate' action.
if task.Resreq.IsEmpty() {
    klog.V(4).Infof("Task <%v/%v> is BestEffort task, skip it.",
        task.Namespace, task.Name)
    continue
}

As a result, the TaskOrderFn and NodeOrderFn in topology.go are never executed for these tasks, so they may be scheduled onto the wrong nodes.
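
To make the effect concrete, here is a minimal, self-contained Go sketch. It is not Volcano code: `Task`, `MilliCPU`, and `allocate` are hypothetical stand-ins I made up for illustration, and the callback only plays the role of a node-ordering hook. It shows that a task with an empty request never reaches the hook when the loop skips it first.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for a scheduler task (not Volcano's api.TaskInfo).
type Task struct {
	Name     string
	MilliCPU int64 // 0 models an empty resource request (BestEffort)
}

// allocate mimics the skip shown in the excerpt above: tasks with an empty
// request are skipped before the ordering hook is ever called for them.
// It returns the names of the skipped tasks.
func allocate(tasks []Task, nodeOrderFn func(Task)) []string {
	var skipped []string
	for _, t := range tasks {
		if t.MilliCPU == 0 {
			skipped = append(skipped, t.Name)
			continue
		}
		nodeOrderFn(t)
	}
	return skipped
}

func main() {
	var scored []string
	tasks := []Task{
		{Name: "nginx-0"},
		{Name: "nginx-1"},
		{Name: "mysql-0", MilliCPU: 300},
	}
	skipped := allocate(tasks, func(t Task) { scored = append(scored, t.Name) })
	fmt.Println("scored:", scored, "skipped:", skipped)
}
```

In this toy run only `mysql-0` reaches the hook; both nginx tasks are skipped, which matches the behaviour described above for tasks without resource requests.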

Environment:

  • Volcano Version: v1.7.0
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.2", GitCommit:"7f6f68fdabc4df88cfea2dcf9a19b2b830f1e647", GitTreeState:"clean", BuildDate:"2023-05-17T14:20:07Z", GoVersion:"go1.20.4", Compiler:"gc", Platform:"darwin/amd64"}
    Kustomize Version: v5.0.1
    Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:33:12Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: minikube using hyperkit driver on macOS(Intel)
  • Kernel (e.g. uname -a): Linux volcano-demo 5.10.57 #1 SMP Mon Apr 3 23:35:10 UTC 2023 x86_64 GNU/Linux
loheagn added the kind/bug label Jun 27, 2023
@loheagn
Author

loheagn commented Jun 27, 2023

Again, I'm not sure whether this is a bug or a feature. If it is designed this way, please let me know. Thanks :)

@lowang-bh
Member

BestEffort pods are scheduled in the backfill action. Currently, backfill has no prioritization step.
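
A rough Go sketch of why that matters for anti-affinity. This is an illustration under my own assumptions, not the actual backfill code and not the change in any PR: `pickFirst` and `pickAntiAffinity` are hypothetical helpers, and every node is assumed to pass predicates.

```go
package main

import "fmt"

// pickFirst models a pass without prioritization: the first candidate node
// wins, regardless of what is already running there.
func pickFirst(nodes []string) string {
	return nodes[0]
}

// pickAntiAffinity models a simple scoring step: prefer the node hosting the
// fewest peers of the same task. placed maps node name to peer count.
func pickAntiAffinity(nodes []string, placed map[string]int) string {
	best := nodes[0]
	for _, n := range nodes[1:] {
		if placed[n] < placed[best] {
			best = n
		}
	}
	return best
}

func main() {
	nodes := []string{"node-a", "node-b"}

	// Without scoring, both replicas get the same node.
	fmt.Println(pickFirst(nodes), pickFirst(nodes))

	// With scoring, the second replica avoids the node used by the first.
	placed := map[string]int{}
	first := pickAntiAffinity(nodes, placed)
	placed[first]++
	second := pickAntiAffinity(nodes, placed)
	fmt.Println(first, second)
}
```

Without a scoring step both replicas end up on `node-a`; with the peer-count score the second replica goes to `node-b`.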

@loheagn
Author

loheagn commented Jun 27, 2023

So the task-topology plugin cannot work with BestEffort tasks for now? Will this be supported in the future?

@lowang-bh
Member

/assign

@lowang-bh
Member

@loheagn please try PR #2955. You need to run make images and update volcano-scheduler with the resulting image.

@loheagn
Author

loheagn commented Jul 8, 2023

Thanks. I'll give it a try.

By the way, could you help trigger the CI for #2940?

@stale

stale bot commented Oct 15, 2023

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Oct 15, 2023