
when pod CrashLoopBackOff, the job status should be failed #436

Closed

davidstack opened this issue Sep 5, 2019 · 9 comments
Labels: kind/bug, lifecycle/stale

davidstack commented Sep 5, 2019

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I created the tf-sample job and got this pod status:

tensorflow-benchmark-ps-0       0/1     CrashLoopBackOff   5          3m46s
tensorflow-benchmark-worker-0   1/1     Running            0          3m46s
tensorflow-benchmark-worker-1   1/1     Running            0          3m46s

but the job status is still Running:

[root@node1 tf-sample]# kubectl describe jobs.batch.volcano.sh tensorflow-benchmark
Name:         tensorflow-benchmark
Namespace:    default
Labels:       volcano.sh/job-type=Tensorflow
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2019-09-05T03:24:41Z
  Generation:          1
  Resource Version:    17317553
  Self Link:           /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/tensorflow-benchmark
  UID:                 b3189fdd-cf8c-11e9-84e3-6c92bf8b7a92
Spec:
  Min Available:  3
  Plugins:
    Env:
    Svc:
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           default
  Scheduler Name:  volcano
  Tasks:
    Name:      ps
    Replicas:  1
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks1.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=ps --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     1000m
              Memory:  2048Mi
            Requests:
              Cpu:      1000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
    Name:                worker
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  2
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=worker --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     2000m
              Memory:  4096Mi
            Requests:
              Cpu:      2000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Svc:  svc
  Min Available:   3
  Running:         3
  State:
    Last Transition Time:  2019-09-05T03:24:44Z
    Phase:                 Running
Events:                    <none>

What you expected to happen:

When one task fails, the job should be in Failed status.
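
For reference, here is a minimal sketch of how that expectation could be written with the job-level Policies field that already appears in the spec above (Event/Action pairs such as PodEvicted/RestartJob and TaskCompleted/CompleteJob). The PodFailed event and AbortJob action used below are assumptions about the batch.volcano.sh/v1alpha1 API and should be checked against the actual CRD; note also that, per the discussion below, such a policy may not fire while the pod stays in phase Running during CrashLoopBackOff:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  minAvailable: 3
  schedulerName: volcano
  maxRetry: 2
  policies:
  # Assumed event/action names; only PodEvicted/RestartJob and
  # TaskCompleted/CompleteJob appear in the describe output above.
  - event: PodFailed
    action: AbortJob
  # tasks: the ps/worker task templates from the spec above go here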
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
volcano-sh-bot added the kind/bug label Sep 5, 2019
@hzxuzhonghu (Collaborator) commented:

Fair point. In this case we should set a fixed number of retries and mark the job as failed to prevent useless running.

@hzxuzhonghu (Collaborator) commented:

@davidstack Can you try with #412?

davidstack commented Sep 5, 2019

@hzxuzhonghu After setting maxRetry, it has the same problem.

 apiVersion: batch.volcano.sh/v1alpha1
 kind: Job
 metadata:
   name: tensorflow-benchmark
   labels:
     "volcano.sh/job-type": "Tensorflow"
 spec:
   minAvailable: 3
   schedulerName: volcano
   maxRetry: 2
   plugins:
     env: []
     svc: []

k82cn commented Sep 5, 2019

/assign @hzxuzhonghu

@hzxuzhonghu (Collaborator) commented:

OK, let me investigate.

@hzxuzhonghu (Collaborator) commented:

@davidstack It is because when a container is in CrashLoopBackOff, the pod phase is still Running. The restartPolicy of the pod is Always by default.
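
For illustration only, a minimal sketch of a task template where a crashing container would surface as pod phase Failed instead of CrashLoopBackOff with phase Running. It assumes standard Kubernetes restartPolicy semantics (the spec in this issue sets restartPolicy: OnFailure on the task templates, under which the kubelet restarts the container in place and the pod phase stays Running); this is not the fix being proposed here, just a way to see the difference:

tasks:
- name: ps
  replicas: 1
  template:
    spec:
      # With Never, the kubelet does not restart the crashed container,
      # so the pod phase goes to Failed instead of staying Running.
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: volcanosh/example-tf:0.0.1
        workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks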

@hzxuzhonghu (Collaborator) commented:

To handle this scenario, Volcano should prevent endless and meaningless retries.

stale bot commented Aug 18, 2020

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Aug 18, 2020
stale bot commented Oct 17, 2020

Closing for now, as there was no activity for the last 60 days after it was marked as stale. Let us know if you need this to be reopened! 🤗

stale bot closed this as completed Oct 17, 2020