volcano-scheduler start failed #628

jianxingzhe · 2019-12-19T09:59:12Z

Is this a BUG REPORT or FEATURE REQUEST?:

Uncomment only one, leave it on its own line:

/kind bug

What happened:
When deploying Volcano with installer/volcano-deployment.yaml, the volcano-scheduler failed to start. The logs are as follows:

I1219 09:41:11.334295       1 session.go:135] Open Session b0d84829-2243-11ea-9587-a67222a5a7fa with <1> Job and <1> Queues
I1219 09:41:11.334540       1 enqueue.go:55] Enter Enqueue ...
I1219 09:41:11.334549       1 enqueue.go:70] Added Queue <default> for Job <nzk/nzkcluster>
I1219 09:41:11.334566       1 panic.go:679] Leaving Enqueue ...
I1219 09:41:11.334601       1 session.go:154] Close Session b0d84829-2243-11ea-9587-a67222a5a7fa
E1219 09:41:11.334655       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:679
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:199
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/signal_unix.go:394
/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/actions/enqueue/enqueue.go:78
/home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:84
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/asm_amd64.s:1357
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x138 pc=0x11c583a]

goroutine 201 [running]:
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x132f580, 0x217a760)
        /home/travis/.gimme/versions/go1.13.5.linux.amd64/src/runtime/panic.go:679 +0x1b2
volcano.sh/volcano/pkg/scheduler/actions/enqueue.(*enqueueAction).Execute(0xc00016c098, 0xc000b26140)
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/actions/enqueue/enqueue.go:78 +0x32a
volcano.sh/volcano/pkg/scheduler.(*Scheduler).runOnce(0xc00055aa80)
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:84 +0x294
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0003c13a0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5e
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0003c13a0, 0x3b9aca00, 0x0, 0x1, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xf8
volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc0003c13a0, 0x3b9aca00, 0x0)
        /home/travis/gopath/src/volcano.sh/volcano/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by volcano.sh/volcano/pkg/scheduler.(*Scheduler).Run
        /home/travis/gopath/src/volcano.sh/volcano/pkg/scheduler/scheduler.go:68 +0xd4

root@paas-operator-0:~# kubectl get Job -n nzk
No resources found.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Volcano Version:
volcanosh/vc-scheduler:latest
Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:12:15Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:04:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):

PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Kernel (e.g. uname -a):
Install tools:
Others:


hzxuzhonghu · 2019-12-19T11:19:10Z

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.

/kind bug

/area scheduler
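The diagnosis above (the Enqueue action dereferencing a job whose PodGroup was never populated) can be illustrated with a minimal sketch. The `PodGroup` and `JobInfo` types and the `minMember` helper below are simplified, hypothetical stand-ins, not Volcano's actual definitions; the sketch only shows the general pattern of guarding a possibly nil pointer before reading through it.

```go
package main

import "fmt"

// PodGroup and JobInfo are hypothetical, stripped-down stand-ins for the
// scheduler's types; only the fields needed for this illustration exist.
type PodGroup struct {
	MinMember int32
}

type JobInfo struct {
	Namespace string
	Name      string
	PodGroup  *PodGroup // nil when no PodGroup was ever attached to the job
}

// minMember returns the job's MinMember and reports whether a PodGroup was
// present. Reading job.PodGroup.MinMember without this check would panic
// with a nil pointer dereference when PodGroup is nil.
func minMember(job *JobInfo) (int32, bool) {
	if job == nil || job.PodGroup == nil {
		return 0, false
	}
	return job.PodGroup.MinMember, true
}

func main() {
	jobs := []*JobInfo{
		{Namespace: "nzk", Name: "nzkcluster"}, // no PodGroup, as in the report
		{Namespace: "default", Name: "with-group", PodGroup: &PodGroup{MinMember: 3}},
	}
	for _, job := range jobs {
		if n, ok := minMember(job); ok {
			fmt.Printf("%s/%s: minMember=%d\n", job.Namespace, job.Name, n)
		} else {
			fmt.Printf("%s/%s: no PodGroup, skipping\n", job.Namespace, job.Name)
		}
	}
}
```

A guard like this only avoids the crash; it does not explain why a job without a PodGroup reached the action in the first place.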

volcano-sh-bot · 2019-12-19T11:19:16Z

@hzxuzhonghu: The label(s) area/scheduler cannot be applied. These labels are supported: ``

In response to this:

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.

/kind bug

/area scheduler


k82cn · 2019-12-19T11:45:14Z

/area scheduling
/priority important-soon

k82cn · 2019-12-24T06:15:48Z

@jianxingzhe , what kind of workload did you submit to volcano? could you share the yaml file?

k82cn · 2019-12-24T06:19:39Z

It seems that when Enqueue action executes, the job.PodGroup is still not populated. This is a bug.

@hzxuzhonghu , in Snapshot of cache.go, we do not return jobs without a PodGroup. Is there any place where we set it back to nil?

hzxuzhonghu · 2019-12-24T09:50:14Z

in Snapshot of cache.go, we did not return job without PodGroup

Can you link the lines?

hzxuzhonghu · 2019-12-26T02:21:01Z

BTW, I do not think we should leave a critical bug until 0.4, which is 3 months away.

jianxingzhe · 2019-12-27T02:07:49Z

@jianxingzhe , what kind of workload did you submit to volcano? could you share the yaml file?

I submitted nothing to Volcano. When I deployed Volcano in my cluster, the scheduler panicked.

hzxuzhonghu · 2019-12-27T02:46:21Z

What's this <nzk/nzkcluster>?

jianxingzhe · 2019-12-27T02:57:59Z

What's this <nzk/nzkcluster>?

This is my CRD resource, which was already in my cluster before I deployed Volcano.

root@paas-operator-0:~# kubectl get nzkcluster -n nzk
NAME         READY   NODEPORT   AGE
nzkcluster   0       31685      16d

root@paas-operator-0:~# kubectl get Job -n nzk
No resources found.

hzxuzhonghu · 2019-12-27T06:16:49Z

I thought nzk/nzkcluster would be a pod with its scheduler name set to volcano; otherwise it should not be watched by Volcano.

hzxuzhonghu · 2019-12-27T06:17:31Z

Try this: kubectl get vcjob -n nzk

jianxingzhe · 2019-12-29T03:04:27Z

try this kubectl get vcjob -n nzk

I'm sure there is no pod with its scheduler name set to volcano. When I delete the nzk namespace, the Volcano scheduler starts successfully.

root@paas-operator-0:~# kubectl get pod -n nzk
NAME                            READY   STATUS             RESTARTS   AGE
nzk-operator-6d68d5756b-6bfd8   1/1     Running            0          18d
nzk-operator-6d68d5756b-7mgns   0/1     Evicted            0          18d
nzk-operator-6d68d5756b-7qmzp   0/1     CrashLoopBackOff   3323       18d
nzk-operator-6d68d5756b-lc6wz   0/1     Evicted            0          17d
nzk-operator-6d68d5756b-t6bqs   0/1     CrashLoopBackOff   2780       17d
nzkcluster-0                    0/1     CrashLoopBackOff   2787       17d

root@paas-operator-0:~# kubectl get pod -n nzk nzkcluster-0 -o yaml | grep scheduler
  schedulerName: default-scheduler
root@paas-operator-0:~# kubectl get nzkcluster -n nzk -o yaml | grep scheduler
root@paas-operator-0:~# kubectl get vcjob -n nzk
No resources found.

jianxingzhe · 2019-12-29T05:37:12Z

In another cluster, I get the same error. I added some logging to the Volcano scheduler:

        for _, value := range sc.Jobs {
                // If no scheduling spec, does not handle it.
                if value.PodGroup == nil && value.PDB == nil {
                        klog.V(4).Infof("The scheduling spec of Job <%v:%s/%s> is nil, ignore it.",
                                value.UID, value.Namespace, value.Name)

                        continue
                }

                if _, found := snapshot.Queues[value.Queue]; !found {
                        klog.V(3).Infof("The Queue <%v> of Job <%v/%v> does not exist, ignore it.",
                                value.Queue, value.Namespace, value.Name)
                        continue
                }

                klog.V(3).Infof("add Job: %v", value) // added for debugging: print the JobInfo

                wg.Add(1)
                go cloneJob(value)
        }

I found that the Volcano scheduler adds some jobs automatically, and these jobs cannot be found in my cluster. I cannot find the code where these jobs are added to the scheduler cache. @hzxuzhonghu

I1229 04:59:21.663117       1 shared_informer.go:123] caches populated
I1229 04:59:21.663152       1 scheduler.go:72] Start scheduling ...
I1229 04:59:21.663940       1 cache.go:781] add Job: Job (d60d44fe-29f7-11ea-8524-246e9627db94): namespace default (default), name nzk-chaos, minAvailable 0, podGroup <nil>
I1229 04:59:21.664005       1 cache.go:781] add Job: Job (cbdb9912-1c94-11ea-bf89-246e9627db94): namespace panther (default), name demo1-es-default, minAvailable 1, podGroup <nil>
I1229 04:59:21.664026       1 cache.go:781] add Job: Job (1730e308-2791-11ea-8524-246e9627db94): namespace hw-elasticsearch-new9 (default), name panther-sample-02-es-default, minAvailable 7, podGroup <nil>
I1229 04:59:21.664043       1 cache.go:781] add Job: Job (87592197-2855-11ea-8524-246e9627db94): namespace nes-elasticsearch (default), name panther-sample-es-default, minAvailable 15, podGroup <nil>
I1229 04:59:21.664058       1 cache.go:781] add Job: Job (e8d6d77c-209c-11ea-8524-246e9627db94): namespace default (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664075       1 cache.go:781] add Job: Job (f5c5ce78-1662-11ea-bf89-246e9627db94): namespace zookeeper (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664099       1 cache.go:788] There are <6> Jobs, <1> Queues and <4> Nodes in total for scheduling.
I1229 04:59:21.664123       1 session.go:135] Open Session fa06dd0a-29f7-11ea-bf37-ce2c7f6a1947 with <6> Job and <1> Queues
I1229 04:59:21.664411       1 proportion.go:67] The total resource is <cpu 180000.00, memory 1343110746112.00, hugepages-1Gi 0.00, hugepages-2Mi 8589934592000.00>
I1229 04:59:21.664451       1 proportion.go:71] Considering Job <panther/demo1-es-default>.
I1229 04:59:21.664467       1 proportion.go:85] Added Queue <default> attributes.
I1229 04:59:21.664479       1 proportion.go:71] Considering Job <hw-elasticsearch-new9/panther-sample-02-es-default>.
I1229 04:59:21.664489       1 proportion.go:71] Considering Job <nes-elasticsearch/panther-sample-es-default>.
I1229 04:59:21.664497       1 proportion.go:71] Considering Job <default/example>.
I1229 04:59:21.664505       1 proportion.go:71] Considering Job <zookeeper/example>.
I1229 04:59:21.664514       1 proportion.go:71] Considering Job <default/nzk-chaos>.
I1229 04:59:21.664529       1 proportion.go:127] Considering Queue <default>: weight <1>, total weight <1>.
I1229 04:59:21.664550       1 proportion.go:144] The attributes of queue <default> in proportion: deserved <cpu 180000.00, memory 1343110746112.00, hugepages-2Mi 8589934592000.00, hugepages-1Gi 0.00>, allocate <cpu 0.00, memory 0.00>, request <cpu 0.00, memory 0.00>, share <0.00>
I1229 04:59:21.664698       1 proportion.go:154] Exiting when remaining is empty:  <cpu 0.00, memory 0.00, hugepages-2Mi 0.00, hugepages-1Gi 0.00>
I1229 04:59:21.664895       1 binpack.go:158] Enter binpack plugin ...
I1229 04:59:21.664907       1 binpack.go:177] resources [] record in weight but not found on any node
I1229 04:59:21.664920       1 binpack.go:161] Leaving binpack plugin. binpack.weight[1], binpack.cpu[1], binpack.memory[1], no extend resources. ...
I1229 04:59:21.664947       1 enqueue.go:55] Enter Enqueue ...
I1229 04:59:21.664960       1 enqueue.go:70] Added Queue <default> for Job <default/example>
I1229 04:59:21.664989       1 panic.go:522] Leaving Enqueue ...
I1229 04:59:21.665066       1 panic.go:522] End scheduling ...
E1229 04:59:21.665228       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)

root@test3:~# kubectl get vcjob -n default
No resources found.
root@test3:~# kubectl get vcjob -n panther
No resources found.
root@test3:~# kubectl get vcjob -n hw-elasticsearch-new9
No resources found.
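The filter quoted in this comment skips a job only when both PodGroup and PDB are nil. Since the logged jobs show `podGroup <nil>` yet still passed the filter, their PDB field must have been non-nil. The sketch below reproduces that condition with hypothetical, simplified types (`PodGroup`, `PDB`, `JobInfo` and both helper functions are assumptions for illustration, not Volcano's code) and contrasts it with a stricter check that requires a PodGroup.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the scheduler's types.
type PodGroup struct{}

type PDB struct {
	MinAvailable int32
}

type JobInfo struct {
	Name     string
	PodGroup *PodGroup
	PDB      *PDB
}

// passesQuotedFilter mirrors the quoted condition: a job is skipped only
// when BOTH PodGroup and PDB are nil.
func passesQuotedFilter(j *JobInfo) bool {
	return !(j.PodGroup == nil && j.PDB == nil)
}

// passesStrictFilter is an illustrative alternative that admits a job only
// when it has a PodGroup.
func passesStrictFilter(j *JobInfo) bool {
	return j.PodGroup != nil
}

func main() {
	jobs := []*JobInfo{
		{Name: "pdb-only", PDB: &PDB{MinAvailable: 7}},
		{Name: "podgroup-only", PodGroup: &PodGroup{}},
		{Name: "neither"},
	}
	for _, j := range jobs {
		fmt.Printf("%-14s quoted=%v strict=%v\n",
			j.Name, passesQuotedFilter(j), passesStrictFilter(j))
	}
}
```

Under the quoted condition, a job backed only by a PDB reaches the scheduling actions with a nil PodGroup, which would be consistent with the issue later being closed by "Remove pdb support #654".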

hzxuzhonghu · 2019-12-30T02:32:35Z

How did you deploy volcano? And which version do you use?

jianxingzhe · 2019-12-30T02:59:24Z

how did you deploy volcano? And which version do you use?

@hzxuzhonghu I deployed Volcano with this YAML file:
https://github.com/volcano-sh/volcano/blob/master/installer/volcano-development.yaml

hzxuzhonghu · 2019-12-30T03:12:12Z

On your own k8s?

jianxingzhe · 2019-12-30T05:42:56Z

On your own k8s?

yes, on our dev k8s cluster

Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:12:15Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.12", GitCommit:"a8b52209ee172232b6db7a6e0ce2adc77458829f", GitTreeState:"clean", BuildDate:"2019-10-15T12:04:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

hzxuzhonghu · 2019-12-30T06:22:44Z

I1229 04:59:21.664005       1 cache.go:781] add Job: Job (cbdb9912-1c94-11ea-bf89-246e9627db94): namespace panther (default), name demo1-es-default, minAvailable 1, podGroup <nil>
I1229 04:59:21.664026       1 cache.go:781] add Job: Job (1730e308-2791-11ea-8524-246e9627db94): namespace hw-elasticsearch-new9 (default), name panther-sample-02-es-default, minAvailable 7, podGroup <nil>
I1229 04:59:21.664043       1 cache.go:781] add Job: Job (87592197-2855-11ea-8524-246e9627db94): namespace nes-elasticsearch (default), name panther-sample-es-default, minAvailable 15, podGroup <nil>
I1229 04:59:21.664058       1 cache.go:781] add Job: Job (e8d6d77c-209c-11ea-8524-246e9627db94): namespace default (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664075       1 cache.go:781] add Job: Job (f5c5ce78-1662-11ea-bf89-246e9627db94): namespace zookeeper (default), name example, minAvailable 0, podGroup <nil>
I1229 04:59:21.664099       1 cache.go:788] The

I am curious what these jobs are.

volcano-sh-bot added the kind/bug label (Categorizes issue or PR as related to a bug.) Dec 19, 2019

volcano-sh-bot added the area/scheduling and priority/important-soon (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) labels Dec 19, 2019

k82cn added this to the v0.4 milestone Dec 20, 2019

hzxuzhonghu mentioned this issue Dec 30, 2019

Remove pdb support #654

Merged

volcano-sh-bot closed this as completed in #654 Dec 30, 2019
