
when pod CrashLoopBackOff, the job status should be failed #436

Closed

davidstack opened this issue Sep 5, 2019 · 9 comments
Labels: kind/bug, lifecycle/stale

davidstack commented Sep 5, 2019

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
I created the tf-sample job and got this pod status:

tensorflow-benchmark-ps-0       0/1     CrashLoopBackOff   5          3m46s
tensorflow-benchmark-worker-0   1/1     Running            0          3m46s
tensorflow-benchmark-worker-1   1/1     Running            0          3m46s

but the job status is still Running:

[root@node1 tf-sample]# kubectl describe jobs.batch.volcano.sh tensorflow-benchmark
Name:         tensorflow-benchmark
Namespace:    default
Labels:       volcano.sh/job-type=Tensorflow
Annotations:  <none>
API Version:  batch.volcano.sh/v1alpha1
Kind:         Job
Metadata:
  Creation Timestamp:  2019-09-05T03:24:41Z
  Generation:          1
  Resource Version:    17317553
  Self Link:           /apis/batch.volcano.sh/v1alpha1/namespaces/default/jobs/tensorflow-benchmark
  UID:                 b3189fdd-cf8c-11e9-84e3-6c92bf8b7a92
Spec:
  Min Available:  3
  Plugins:
    Env:
    Svc:
  Policies:
    Action:        RestartJob
    Event:         PodEvicted
  Queue:           default
  Scheduler Name:  volcano
  Tasks:
    Name:      ps
    Replicas:  1
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks1.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=ps --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     1000m
              Memory:  2048Mi
            Requests:
              Cpu:      1000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
    Name:                worker
    Policies:
      Action:  CompleteJob
      Event:   TaskCompleted
    Replicas:  2
    Template:
      Spec:
        Containers:
          Command:
            sh
            -c
            PS_HOST=`cat /etc/volcano/ps.host | sed 's/$/&:2222/g' | tr "\n" ","`;
WORKER_HOST=`cat /etc/volcano/worker.host | sed 's/$/&:2222/g' | tr "\n" ","`;
python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=1 --local_parameter_device=cpu --device=cpu --data_format=NHWC --job_name=worker --task_index=${VK_TASK_INDEX} --ps_hosts=${PS_HOST} --worker_hosts=${WORKER_HOST}

          Image:  volcanosh/example-tf:0.0.1
          Name:   tensorflow
          Ports:
            Container Port:  2222
            Name:            tfjob-port
          Resources:
            Limits:
              Cpu:     2000m
              Memory:  4096Mi
            Requests:
              Cpu:      2000m
              Memory:   2048Mi
          Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
        Image Pull Secrets:
          Name:          default-secret
        Restart Policy:  OnFailure
Status:
  Controlled Resources:
    Plugin - Env:  env
    Plugin - Svc:  svc
  Min Available:   3
  Running:         3
  State:
    Last Transition Time:  2019-09-05T03:24:44Z
    Phase:                 Running
Events:                    <none>

What you expected to happen:

When one task fails, the job should be in Failed status.
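
For reference, here is a minimal sketch of how that expectation could be written with the job-level Policies field that already appears in the spec above (Event/Action pairs such as PodEvicted/RestartJob and TaskCompleted/CompleteJob). The PodFailed event and AbortJob action used below are assumptions about the batch.volcano.sh/v1alpha1 API and should be checked against the actual CRD; note also that, per the discussion below, such a policy may not fire while the pod stays in phase Running during CrashLoopBackOff:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tensorflow-benchmark
spec:
  minAvailable: 3
  schedulerName: volcano
  maxRetry: 2
  policies:
  # Assumed event/action names; only PodEvicted/RestartJob and
  # TaskCompleted/CompleteJob appear in the describe output above.
  - event: PodFailed
    action: AbortJob
  # tasks: the ps/worker task templates from the spec above go here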
How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Volcano Version:
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
volcano-sh-bot added the kind/bug label Sep 5, 2019
@hzxuzhonghu (Collaborator) commented:

Fair point. In this case we should set a fixed number of retries and mark the job as failed to prevent useless running.

@hzxuzhonghu (Collaborator) commented:

@davidstack Can you try with #412?

davidstack commented Sep 5, 2019

@hzxuzhonghu After setting maxRetry, it has the same problem.

 apiVersion: batch.volcano.sh/v1alpha1
 kind: Job
 metadata:
   name: tensorflow-benchmark
   labels:
     "volcano.sh/job-type": "Tensorflow"
 spec:
   minAvailable: 3
   schedulerName: volcano
   maxRetry: 2
   plugins:
     env: []
     svc: []

k82cn commented Sep 5, 2019

/assign @hzxuzhonghu

@hzxuzhonghu (Collaborator) commented:

OK, let me investigate.

@hzxuzhonghu (Collaborator) commented:

@davidstack It is because when a container is in CrashLoopBackOff, the pod phase is still Running. The restartPolicy of the pod is Always by default.
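
For illustration only, a minimal sketch of a task template where a crashing container would surface as pod phase Failed instead of CrashLoopBackOff with phase Running. It assumes standard Kubernetes restartPolicy semantics (the spec in this issue sets restartPolicy: OnFailure on the task templates, under which the kubelet restarts the container in place and the pod phase stays Running); this is not the fix being proposed here, just a way to see the difference:

tasks:
- name: ps
  replicas: 1
  template:
    spec:
      # With Never, the kubelet does not restart the crashed container,
      # so the pod phase goes to Failed instead of staying Running.
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: volcanosh/example-tf:0.0.1
        workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks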

@hzxuzhonghu (Collaborator) commented:

To handle this scenario, Volcano should prevent endless and meaningless retries.

stale bot commented Aug 18, 2020

Hello 👋 Looks like there was no activity on this issue for the last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for another 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale bot added the lifecycle/stale label Aug 18, 2020
stale bot commented Oct 17, 2020

Closing for now, as there was no activity for the last 60 days after it was marked as stale. Let us know if you need this to be reopened! 🤗

stale bot closed this as completed Oct 17, 2020