
Reclaim Enhancement: The logic of judging whether a pod can be evicted and judging whether a queue is overused in proportion is inconsistent, which may lead to ping-pong during reclaim action #561

Closed
sivanzcw opened this issue Nov 26, 2019 · 1 comment
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@sivanzcw
Contributor

/kind feature

  • cluster status

    serial   node name   resource
    1        node1       4c8g
    2        node2       4c8g

  • queue status

    serial   queue name   weight
    1        default      1
    2        queue1       100000
  • enable reclaim action

  • create job1 with 7 pods, each pod requesting 1 CPU and 1Gi of memory

root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# cat default.yaml 
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-default
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: default
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 2
    name: server
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "server"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 4
    name: scheduler
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "scheduler"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  • wait for the pods of job1 to be running
root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl create -f default.yaml 
job.batch.volcano.sh/mxnet-job-default created
root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl get pod 
NAME                            READY   STATUS      RESTARTS   AGE
mxnet-job-default-scheduler-0   1/1     Running     0          28s
mxnet-job-default-scheduler-1   1/1     Running     0          28s
mxnet-job-default-scheduler-2   1/1     Running     0          28s
mxnet-job-default-scheduler-3   1/1     Running     0          28s
mxnet-job-default-server-0      1/1     Running     0          28s
mxnet-job-default-server-1      1/1     Running     0          28s
mxnet-job-default-worker-0      1/1     Running     0          28s
  • create job2 with 7 pods, each pod requesting 1 CPU and 1Gi of memory
root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# cat queue1.yaml 
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mxnet-job-queue1
spec:
  minAvailable: 1
  schedulerName: volcano
  priorityClassName: zjh-higher
  queue: queue1
  policies:
  - event: PodEvicted
    action: RestartJob
  - event: PodFailed
    action: RestartJob
  plugins:
    svc: []
  tasks:
  - replicas: 1
    name: worker
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          args:
          - --kv-store=dist_sync
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "worker"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 2
    name: server
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "server"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure
  - replicas: 4
    name: scheduler
    template:
      spec:
        imagePullSecrets:
        - name: default-secret
        containers:
        - image: volcanosh/mxnet-train-mnist-cpu:v1
          imagePullPolicy: IfNotPresent
          name: mxnet
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
            requests:
              cpu: "1"
              memory: "1Gi"
          env:
          - name: DMLC_PS_ROOT_PORT
            value: "9000"
          - name: DMLC_PS_ROOT_URI
            value: mxnet-job-scheduler-0.mxnet-job
          - name: DMLC_NUM_SERVER
            value: "2"
          - name: DMLC_NUM_WORKER
            value: "2"
          - name: DMLC_ROLE
            value: "scheduler"
          - name: DMLC_USE_KUBERNETES
            value: "1"
        restartPolicy: OnFailure

root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl create -f queue1.yaml 
job.batch.volcano.sh/mxnet-job-queue1 created
  • expected that six pods of job1 would be evicted; actually, all pods of job1 were evicted

  • in the next scheduling loop, the queue of job1 was not considered overused, so its pods were traversed for dispatching; eventually one of them was scheduled successfully, but that pod was evicted again during the next reclaim action

  • pods of job1 were evicted

root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl get pod 
NAME                            READY   STATUS        RESTARTS   AGE
mxnet-job-default-scheduler-0   1/1     Terminating   0          57s
mxnet-job-default-scheduler-1   1/1     Terminating   0          57s
mxnet-job-default-scheduler-2   1/1     Terminating   0          57s
mxnet-job-default-scheduler-3   1/1     Terminating   0          57s
mxnet-job-default-server-0      1/1     Terminating   0          57s
mxnet-job-default-server-1      1/1     Terminating   0          57s
mxnet-job-default-worker-0      1/1     Terminating   0          57s
mxnet-job-queue1-scheduler-0    0/1     Pending       0          13s
mxnet-job-queue1-scheduler-1    0/1     Pending       0          13s
mxnet-job-queue1-scheduler-2    0/1     Pending       0          13s
mxnet-job-queue1-scheduler-3    0/1     Pending       0          13s
mxnet-job-queue1-server-0       0/1     Pending       0          13s
mxnet-job-queue1-server-1       0/1     Pending       0          13s
mxnet-job-queue1-worker-0       0/1     Pending       0          13s
  • ping-pong between allocate and reclaim
root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl get pod 
NAME                            READY   STATUS              RESTARTS   AGE
cube-0                          1/1     Running             0          20d
cube-1                          1/1     Running             0          20d
cube-transfer                   0/1     Pending             0          20d
file-server                     1/1     Running             1          41d
mxnet-job-default-scheduler-1   0/1     Pending             0          12s
mxnet-job-default-scheduler-2   0/1     ContainerCreating   0          12s
mxnet-job-default-server-0      0/1     Terminating         0          79s
mxnet-job-default-server-1      0/1     Terminating         0          79s
mxnet-job-queue1-scheduler-0    0/1     Pending             0          49s
mxnet-job-queue1-scheduler-1    0/1     ContainerCreating   0          49s
mxnet-job-queue1-scheduler-2    0/1     ContainerCreating   0          49s
mxnet-job-queue1-scheduler-3    1/1     Running             0          49s
mxnet-job-queue1-server-0       0/1     Pending             0          49s
mxnet-job-queue1-server-1       1/1     Running             0          49s
mxnet-job-queue1-worker-0       0/1     Pending             0          49s
pi-f6nq4                        0/1     Completed           0          21d
root@c-rlnrdybm-muamumvq-2:~/reclaim/case-pipeline# kubectl get pod 
NAME                            READY   STATUS        RESTARTS   AGE
cube-0                          1/1     Running       0          20d
cube-1                          1/1     Running       0          20d
cube-transfer                   0/1     Pending       0          20d
file-server                     1/1     Running       1          41d
mxnet-job-default-scheduler-0   0/1     Pending       0          19s
mxnet-job-default-scheduler-1   0/1     Pending       0          19s
mxnet-job-default-scheduler-2   1/1     Terminating   0          19s
mxnet-job-default-server-0      0/1     Pending       0          19s
mxnet-job-queue1-scheduler-0    1/1     Running       0          56s
mxnet-job-queue1-scheduler-1    1/1     Running       0          56s
mxnet-job-queue1-scheduler-2    1/1     Running       0          56s
mxnet-job-queue1-scheduler-3    1/1     Running       0          56s
mxnet-job-queue1-server-0       1/1     Running       0          56s
mxnet-job-queue1-server-1       1/1     Running       0          56s
mxnet-job-queue1-worker-0       0/1     Pending       0          56s
pi-f6nq4                        0/1     Completed     0          21d
  • Finally, the pods of job1 were all Pending, and the pods of job2 were all Running
root@c-rlnrdybm-muamumvq-2:~# kubectl get pod 
NAME                            READY   STATUS      RESTARTS   AGE
mxnet-job-default-scheduler-0   0/1     Pending     0          86m
mxnet-job-default-scheduler-1   0/1     Pending     0          87m
mxnet-job-default-scheduler-2   0/1     Pending     0          86m
mxnet-job-default-scheduler-3   0/1     Pending     0          86m
mxnet-job-default-server-0      0/1     Pending     0          86m
mxnet-job-default-server-1      0/1     Pending     0          86m
mxnet-job-default-worker-0      0/1     Pending     0          87m
mxnet-job-queue1-scheduler-0    1/1     Running     0          87m
mxnet-job-queue1-scheduler-1    1/1     Running     0          87m
mxnet-job-queue1-scheduler-2    1/1     Running     0          87m
mxnet-job-queue1-scheduler-3    1/1     Running     0          87m
mxnet-job-queue1-server-0       1/1     Running     0          87m
mxnet-job-queue1-server-1       1/1     Running     0          87m
mxnet-job-queue1-worker-0       1/1     Running     0          87m
  • When the proportion plugin judges whether a pod can be evicted, it uses the LessEqual function to compare the queue's deserved resources against its allocated resources. Because the weight of the default queue is so small, its deserved resources are tiny, about <cpu: 0.08, memory: 159045.22> (consistent with its 1/100001 share of the total queue weight). When only one pod is left in the default queue, the allocated resources of the queue are <cpu: 1000, memory: 1073741824>; if that pod is evicted, the allocated resources of the default queue drop to <cpu: 0, memory: 0>. Then <cpu: 0.08, memory: 159045>.LessEqual(<cpu: 0, memory: 0>) returns true, so the pod is added to the victims to be evicted (see the worked sketch after the LessEqual snippet below).
	ssn.AddReclaimableFn(pp.Name(), func(reclaimer *api.TaskInfo, reclaimees []*api.TaskInfo) []*api.TaskInfo {
		var victims []*api.TaskInfo
		allocations := map[api.QueueID]*api.Resource{}

		for _, reclaimee := range reclaimees {
			job := ssn.Jobs[reclaimee.Job]
			attr := pp.queueOpts[job.Queue]

			if _, found := allocations[job.Queue]; !found {
				allocations[job.Queue] = attr.allocated.Clone()
			}
			allocated := allocations[job.Queue]
			if allocated.Less(reclaimee.Resreq) {
				glog.V(3).Infof("Failed to allocate resource for Task <%s/%s> in Queue <%s>, not enough resource.",
					reclaimee.Namespace, reclaimee.Name, job.Queue)
				continue
			}

			allocated.Sub(reclaimee.Resreq)
			if attr.deserved.LessEqual(allocated) {
				victims = append(victims, reclaimee)
			}
		}

		return victims
	})
// LessEqual checks whether a resource is less than other resource
func (r *Resource) LessEqual(rr *Resource) bool {
	lessEqualFunc := func(l, r, diff float64) bool {
		if l < r || math.Abs(l-r) < diff {
			return true
		}
		return false
	}

	if !lessEqualFunc(r.MilliCPU, rr.MilliCPU, minMilliCPU) {
		return false
	}
	if !lessEqualFunc(r.Memory, rr.Memory, minMemory) {
		return false
	}

	if r.ScalarResources == nil {
		return true
	}

	for rName, rQuant := range r.ScalarResources {
		if rQuant <= minMilliScalarResources {
			continue
		}
		if rr.ScalarResources == nil {
			return false
		}

		rrQuant := rr.ScalarResources[rName]
		if !lessEqualFunc(rQuant, rrQuant, minMilliScalarResources) {
			return false
		}
	}

	return true
}
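
To make the numbers above concrete, here is a minimal, self-contained Go sketch of the same comparison. The Resource type is a stripped-down stand-in for api.Resource, and the tolerance values chosen for minMilliCPU and minMemory (10 milli-CPU, 10 MiB) are assumptions for illustration only, not figures taken from this issue:

package main

import (
	"fmt"
	"math"
)

// Assumed tolerances for this sketch; only the names come from the snippet above.
const (
	minMilliCPU = 10.0
	minMemory   = 10 * 1024 * 1024.0
)

// Resource is a stripped-down stand-in for api.Resource (CPU and memory only).
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// LessEqual mirrors the comparison quoted above: a dimension counts as
// "less or equal" when it is smaller than, or within the tolerance of, the other side.
func (r Resource) LessEqual(rr Resource) bool {
	le := func(l, rv, diff float64) bool {
		return l < rv || math.Abs(l-rv) < diff
	}
	return le(r.MilliCPU, rr.MilliCPU, minMilliCPU) && le(r.Memory, rr.Memory, minMemory)
}

func main() {
	// Deserved share of the default queue as reported above (weight 1 of 100001).
	deserved := Resource{MilliCPU: 0.08, Memory: 159045.22}

	// Allocation left in the queue after its last 1 CPU / 1Gi pod is removed.
	allocatedAfterEviction := Resource{MilliCPU: 0, Memory: 0}

	// Reclaim side: deserved.LessEqual(allocated) is true because both
	// dimensions fall within the tolerances, so even the queue's last pod
	// is appended to the victims and evicted.
	fmt.Println("last pod reclaimable:", deserved.LessEqual(allocatedAfterEviction)) // true
}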
  • During the allocate action, when judging whether a queue is overused, the logic is as below:
overused := !attr.allocated.LessEqual(attr.deserved)

Since the allocated resources of the default queue are <cpu: 0, memory: 0> and its deserved resources are <cpu: 0.08, memory: 159045>, <cpu: 0, memory: 0>.LessEqual(<cpu: 0.08, memory: 159045>) is true, so overused is false and the pods under the default queue are traversed for dispatching again, restarting the cycle (see the continuation of the sketch below).
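
Continuing the sketch above on the same queue state, a hypothetical extension of its main() shows the allocate-side check and why the two decisions contradict each other:

	// Appended to main() in the sketch above: the allocate-side check on the
	// exact same queue state.
	overused := !allocatedAfterEviction.LessEqual(deserved)
	fmt.Println("queue overused:", overused) // false

	// With such a tiny deserved share, both checks pass at once:
	//   reclaim  : deserved.LessEqual(allocated) == true  -> evict the last pod
	//   allocate : allocated.LessEqual(deserved) == true  -> queue not overused, dispatch again
	// which is exactly the allocate/reclaim ping-pong observed above.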

@volcano-sh-bot volcano-sh-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 26, 2019
@sivanzcw sivanzcw changed the title Reclaim Enhancement: The logic of juding whether a pod can be evicted and juding whether a queue is overused in proportion is inconsistent, which may lead to ping-pong during reclaim action Reclaim Enhancement: The logic of judging whether a pod can be evicted and judging whether a queue is overused in proportion is inconsistent, which may lead to ping-pong during reclaim action Nov 26, 2019
@k82cn
Member

k82cn commented Dec 19, 2019

fixed by #599

@k82cn k82cn closed this as completed Dec 19, 2019