Bug: Cannot re-apply the same job while the old pods are Terminating #2284

Closed
kongjibai opened this issue Jun 8, 2022 · 7 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@kongjibai

What happened:
I applied a job with 1 master pod (CPU task) and 2 worker pods (GPU task) using kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml, and kubectl get pod showed the pods. Once the job was running, the pod status was Running. I then deleted the job with kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml, and the pod status changed to Terminating. Before all the pods were completely deleted, I re-applied the same job; the old pods were gone after a few moments, but the new pods never started. Even after deleting the job and re-applying it again, the pods still could not start.

The output of kubectl describe -f hvd-job-torch-fairmot-2gpu.yaml is as follows:

Name:         lm-horovod-job-2gpu
Namespace:    default
Labels:       volcano.sh/job-type=Horovod
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2022-06-08T03:14:55Z
  Generation:          1
  Managed Fields:
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:volcano.sh/job-type:
      f:spec:
        .:
        f:minAvailable:
        f:plugins:
          .:
          f:ssh:
          f:svc:
        f:policies:
        f:schedulerName:
        f:tasks:
    Manager:      kubectl-client-side-apply
    Operation:    Update
    Time:         2022-06-08T03:14:55Z
    API Version:  batch.volcano.sh/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:controlledResources:
          .:
          f:plugin-ssh:
          f:plugin-svc:
        f:minAvailable:
        f:retryCount:
        f:state:
          .:
          f:lastTransitionTime:
          f:phase:
        f:taskStatusCount:
          .:
          f:master:
            .:
            f:phase:
              .:
              f:Unknown:
          f:worker:
            .:
            f:phase:
              .:
              f:Unknown:
        f:unknown:
        f:version:
    Manager:         vc-controller-manager
    Operation:       Update
    Time:            2022-06-08T03:15:19Z
  Resource Version:  6177786
  UID:               710a642c-9e98-4d8b-a9c3-55ac2740bc18
Spec:
  Max Retry:      3
  Min Available:  2
  Plugins:
    Ssh:
    Svc:
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           default
  Scheduler Name:  volcano
  Tasks:
    Depends On:
      Name:
        worker
    Max Retry:      3
    Min Available:  1
    Name:           master
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  1
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            /bin/sh
            -c
            WORKER_HOST='cat /home/etc-volcano/volcano/worker-2gpu.host | tr "\n" ","';
			mkdir -p /var/run/sshd; /usr/sbin/sshd;
			cd /home/FairMOT-master/src;
			dir;
			mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed direct -np 2 python train.py mot --exp_id crowdhuman_dla34_vc_hvd_2gpu --cuda True --batch_size 16 --load_model '../models/ctdet_coco_dla_2x.pth' --num_epochs 60 --lr_step '50' --data_cfg '../src/lib/cfg/crowdhuman.json';
          Image:  vc-hvd-fairmot:v1.7
          Name:   master
          Ports:
            Container Port:  22
            Name:            job-port
            Protocol:        TCP
          Resources:
            Limits:
              Cpu:     500m
              Memory:  1Gi
            Requests:
              Cpu:     500m
              Memory:  1Gi
          Volume Mounts:
            Mount Path:  /home
            Name:        vc-hvd-home
            Mount Path:  /dev/shm
            Name:        dshm
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
        Volumes:
          Name:  vc-hvd-home
          Persistent Volume Claim:
            Claim Name:  nfs-pvc02
          Empty Dir:
            Medium:      Memory
            Size Limit:  5Gi
          Name:          dshm
    Max Retry:           3
    Min Available:       2
    Name:                worker
    Replicas:            2
    Template:
      Metadata:
      Spec:
        Containers:
          Command:
            /bin/sh
            -c
            mkdir -p /var/run/sshd; /usr/sbin/sshd -D;

          Image:  vc-hvd-fairmot:v1.7
          Name:   worker
          Ports:
            Container Port:  22
            Name:            job-port
            Protocol:        TCP
          Resources:
            Limits:
              nvidia.com/gpu:  1
          Volume Mounts:
            Mount Path:  /home
            Name:        vc-hvd-home
            Mount Path:  /dev/shm
            Name:        dshm
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
        Volumes:
          Name:  vc-hvd-home
          Persistent Volume Claim:
            Claim Name:  nfs-pvc02
          Empty Dir:
            Medium:      Memory
            Size Limit:  5Gi
          Name:          dshm
Status:
  Conditions:
    Last Transition Time:  2022-06-08T03:14:57Z
    Status:                Pending
    Last Transition Time:  2022-06-08T03:15:20Z
    Status:                Restarting
    Last Transition Time:  2022-06-08T03:15:20Z
    Status:                Pending
  Controlled Resources:
    Plugin - Ssh:  ssh
    Plugin - Svc:  svc
  Min Available:   2
  Retry Count:     1
  State:
    Last Transition Time:  2022-06-08T03:15:20Z
    Phase:                 Pending
  Task Status Count:
    Master:
      Phase:
        Unknown:  1
    Worker:
      Phase:
        Unknown:  2
  Unknown:        3
  Version:        2
Events:
  Type     Reason           Age                   From                   Message
  ----     ------           ----                  ----                   -------
  Normal   ExecuteAction    2m4s                  vc-controller-manager  Start to execute action RestartJob
  Warning  PodGroupPending  2m2s (x4 over 2m26s)  vc-controller-manager  PodGroup default:lm-horovod-job-2gpu unschedule,reason: 2/3 tasks in gang unschedulable: pod group is not ready, 2 minAvailable, 3 Releasing
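To dig further into why the new pods stay Pending, the PodGroup named in the event above can be inspected directly (a minimal sketch; the PodGroup name lm-horovod-job-2gpu and the default namespace are taken from the PodGroupPending message, and this assumes the Volcano CRDs are installed):

kubectl describe podgroup lm-horovod-job-2gpu -n default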

What you expected to happen:
The same job can be re-applied while the old pods are still Terminating, and kubectl get pods shows the new pods starting normally.

How to reproduce it (as minimally and precisely as possible):

  1. Apply the job with kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml; the job runs normally and the pod status is Running.
  2. Delete the job with kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml; the pod status becomes Terminating.
  3. Before all the pods are fully deleted, re-apply the same job as before.
  4. The old pods are removed after a few moments, but the new pods never start (see the command sketch below).
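Put together as a minimal command sequence (a sketch based on the steps above; step 3 has to run while the old pods are still Terminating, so issue the second apply immediately after the delete):

kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml      # step 1: wait until the pods are Running
kubectl delete -f hvd-job-torch-fairmot-2gpu.yaml     # step 2: pods switch to Terminating
kubectl apply -f hvd-job-torch-fairmot-2gpu.yaml      # step 3: re-apply while old pods still exist
kubectl get pod                                       # step 4: old pods disappear, new pods never start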

Anything else we need to know?:

  1. This bug may not reproduce on every attempt.
  2. Note reproduction step 3: the same job must be re-applied before all the old pods are deleted.

Environment:

  • Volcano Version: 1.5.1
  • Kubernetes version (use kubectl version): 1.21.3
  • Cloud provider or hardware configuration: server machines with 4 Nvidia V100
  • OS (e.g. from /etc/os-release): CentOS Linux release 7.6.1810 (Core)
  • Kernel (e.g. uname -a): 3.10.0-957.5.1.el7.x86_64
  • Install tools: kubectl apply -f volcano-development.yaml
  • Others:
@kongjibai kongjibai added the kind/bug Categorizes issue or PR as related to a bug. label Jun 8, 2022
@hwdef
Member

hwdef commented Jun 8, 2022

Yes, this is a bug related to dependsOn.
There are two temporary workarounds:

  1. Do not reuse the same vcjob name (see the sketch below).
  2. Restart the volcano-controller.
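A minimal sketch of the first workaround, assuming the job name lm-horovod-job-2gpu (taken from the describe output above) appears only in metadata.name of the YAML; the -retry suffix is arbitrary:

sed 's/name: lm-horovod-job-2gpu/name: lm-horovod-job-2gpu-retry/' hvd-job-torch-fairmot-2gpu.yaml | kubectl apply -f -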

@hwdef
Member

hwdef commented Jun 8, 2022

/assign @hwdef

@hwdef
Member

hwdef commented Jun 8, 2022

same issue #2130

@kongjibai
Author

Yes, this is a bug related to dependsOn. There are two temporary workarounds:

  1. Do not reuse the same vcjob name.
  2. Restart the volcano-controller.

Oh, thx. Can you tell me how to restart the volcano-controller?

@hwdef
Member

hwdef commented Jun 9, 2022

kubectl delete po -n volcano-system -l app=volcano-controller
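That deletes the controller pod; in the default install the controller runs as a Deployment, so it is recreated automatically. To confirm it came back up (same label selector as above):

kubectl get po -n volcano-system -l app=volcano-controller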

@hwdef
Member

hwdef commented Jul 21, 2022

/close

@volcano-sh-bot
Contributor

@hwdef: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
