Bug: Cannot re-apply the same job while the old pods are Terminating #2284

Closed
kongjibai opened this issue Jun 8, 2022 · 7 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kongjibai

What happened:
I applied a job with 1 master pod (CPU task) and 2 worker pods (GPU task) using kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml, and kubectl get pod showed the pods. Once the job was running, the pod status was Running. I then deleted the job with kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml, and the pod status changed to Terminating. Before all the pods were completely deleted, I re-applied the same job; the old pods were gone after a few moments, but the new pods never started. Even after deleting the job and re-applying it again, the pods still could not start.

The output of kubectl describe -f hvd-job-torch-fairmot-2gpu.yaml is as follows:

Name:         lm-horovod-job-2gpu
Namespace:    default
Labels:       volcano.sh/job-type=Horovod
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2022-06-08T03:14:55Z
  Generation:          1
  Managed Fields:
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:volcano.sh/job-type:
      f:spec:
        .:
        f:minAvailable:
        f:plugins:
          .:
          f:ssh:
          f:svc:
        f:policies:
        f:schedulerName:
        f:tasks:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-06-08T03:14:55Z
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:controlledResources:
          .:
          f:plugin-ssh:
          f:plugin-svc:
        f:minAvailable:
        f:retryCount:
        f:state:
          .:
          f:lastTransitionTime:
          f:phase:
        f:taskStatusCount:
          .:
          f:master:
            .:
            f:phase:
              .:
              f:Unknown:
          f:worker:
            .:
            f:phase:
              .:
              f:Unknown:
        f:unknown:
        f:version:
    Manager:         vc-controller-manager
    Operation:       Update
    Time:            2022-06-08T03:15:19Z
  Resource Version:  6177786
  UID:               710a642c-9e98-4d8b-a9c3-55ac2740bc18
Spec:
  Max Retry:      3
  Min Available:  2
  Plugins:
    Ssh:
    Svc:
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           default
  Scheduler Name:  volcano
  Tasks:
    Depends On:
      Name:
        worker
    Max Retry:      3
    Min Available:  1
    Name:           master
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            /bin/sh
            -c
            WORKER_HOST='cat /home/etc-volcano/volcano/worker-2gpu.host | tr "\n" ","';
			mkdir -p /var/run/sshd; /usr/sbin/sshd;
			cd /home/FairMOT-master/src;
			dir;
			mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed direct -np 2 python train.py mot --exp_id crowdhuman_dla34_vc_hvd_2gpu --cuda True --batch_size 16 --load_model '../models/ctdet_coco_dla_2x.pth' --num_epochs 60 --lr_step '50' --data_cfg '../src/lib/cfg/crowdhuman.json';
          Image:  vc-hvd-fairmot:v1.7
          Name:   master
          Ports:
            Container Port:  22
            Name:            job-port
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     500m
              Memory:  1Gi
            Requests:
              Cpu:     500m
              Memory:  1Gi
          Volume Mounts:
            Mount Path:  /home
            Name:        vc-hvd-home
            Mount Path:  /dev/shm
            Name:        dshm
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
        Volumes:
          Name:  vc-hvd-home
          Persistent Volume Claim:
            Claim Name:  nfs-pvc02
          Empty Dir:
            Medium:      Memory
            Size Limit:  5Gi
          Name:          dshm
    Max Retry:           3
    Min Available:       2
    Name:                worker
    Replicas:            2
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            /bin/sh
            -c
            mkdir -p /var/run/sshd; /usr/sbin/sshd -D;

          Image:  vc-hvd-fairmot:v1.7
          Name:   worker
          Ports:
            Container Port:  22
            Name:            job-port
            Protocol:        TCP
          Resources:
            Limits:
              nvidia.com/gpu:  1
          Volume Mounts:
            Mount Path:  /home
            Name:        vc-hvd-home
            Mount Path:  /dev/shm
            Name:        dshm
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
        Volumes:
          Name:  vc-hvd-home
          Persistent Volume Claim:
            Claim Name:  nfs-pvc02
          Empty Dir:
            Medium:      Memory
            Size Limit:  5Gi
          Name:          dshm
Status:
  Conditions:
    Last Transition Time:  2022-06-08T03:14:57Z
    Status:                Pending
    Last Transition Time:  2022-06-08T03:15:20Z
    Status:                Restarting
    Last Transition Time:  2022-06-08T03:15:20Z
    Status:                Pending
  Controlled Resources:
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   2
  Retry Count:     1
  State:
    Last Transition Time:  2022-06-08T03:15:20Z
    Phase:                 Pending
  Task Status Count:
    Master:
      Phase:
        Unknown:  1
    Worker:
      Phase:
        Unknown:  2
  Unknown:        3
  Version:        2
Events:
  Type     Reason           Age                   From                   Message
  ----     ------           ----                  ----                   -------
  Normal   ExecuteAction    2m4s                  vc-controller-manager  Start to execute action RestartJob
  Warning  PodGroupPending  2m2s (x4 over 2m26s)  vc-controller-manager  PodGroup default:lm-horovod-job-2gpu unschedule,reason: 2/3 tasks in gang unschedulable: pod group is not ready, 2 minAvailable, 3 Releasing
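To dig further into why the new pods stay Pending, the PodGroup named in the event above can be inspected directly (a minimal sketch; the PodGroup name lm-horovod-job-2gpu and the default namespace are taken from the PodGroupPending message, and this assumes the Volcano CRDs are installed):

kubectl describe podgroup lm-horovod-job-2gpu -n default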

What you expected to happen:
The same job can be re-applied while the old pods are still Terminating, and kubectl get pods shows the new pods starting normally.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the job with kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml; the job runs normally and the pod status is Running.
  2. Delete the job with kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml; the pod status becomes Terminating.
  3. Before all the pods are fully deleted, re-apply the same job as before.
  4. The old pods are removed after a few moments, but the new pods never start (see the command sketch below).
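Put together as a minimal command sequence (a sketch based on the steps above; step 3 has to run while the old pods are still Terminating, so issue the second apply immediately after the delete):

kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml      # step 1: wait until the pods are Running
kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml     # step 2: pods switch to Terminating
kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml      # step 3: re-apply while old pods still exist
kubectl get pod                                       # step 4: old pods disappear, new pods never start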

Anything else we need to know?:

  1. This bug may not reproduce on every attempt.
  2. Note reproduction step 3: the same job must be re-applied before all the old pods are deleted.

Environment:

  • Volcano Version: 1.5.1
  • Kubernetes version (use kubectl version): 1.21.3
  • Cloud provider or hardware configuration: server machines with 4 Nvidia V100
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.6.1810 (Core)
  • Kernel (e.g. uname -a): 3.10.0-957.5.1.el7.x86_64
  • Install tools: kubectl apply -f volcano-development.yaml
  • Others:
@kongjibai kongjibai added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2022
@hwdef
Member

hwdef commented Jun 8, 2022

Yes, this is a bug related to dependsOn.
There are two temporary workarounds:

  1. Do not reuse the same vcjob name (see the sketch below).
  2. Restart the volcano-controller.
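A minimal sketch of the first workaround, assuming the job name lm-horovod-job-2gpu (taken from the describe output above) appears only in metadata.name of the YAML; the -retry suffix is arbitrary:

sed 's/name: lm-horovod-job-2gpu/name: lm-horovod-job-2gpu-retry/' hvd-job-torch-fairmot-2gpu.yaml | kubectl apply -f -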

@hwdef
Member

hwdef commented Jun 8, 2022

/assign @hwdef

@hwdef
Member

hwdef commented Jun 8, 2022

same issue #2130

@kongjibai
Author

Yes, this is a bug related to dependsOn. There are two temporary workarounds:

  1. Do not reuse the same vcjob name.
  2. Restart the volcano-controller.

Oh, thx. Can you tell me how to restart the volcano-controller?

@hwdef
Member

hwdef commented Jun 9, 2022

kubectl delete po -n volcano-system -l app=volcano-controller
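That deletes the controller pod; in the default install the controller runs as a Deployment, so it is recreated automatically. To confirm it came back up (same label selector as above):

kubectl get po -n volcano-system -l app=volcano-controller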

@hwdef
Member

hwdef commented Jul 21, 2022

/close

@volcano-sh-bot
Contributor

@hwdef: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
