scheduler: refine scheduler error message #1373
h := &ha{
	kubeCli: kubeCli,
	cli: cli,
	recorder: recorder,
The Event will be emitted in the scheduler package, not here.
if component == label.PDLabelVal {
maxPodsPerNode := 0

if component == label.PDLabelVal {
This if ... else ... block calculates the maxPodsPerNode variable, which depends only on the replicas variable, so I moved this code out of the for ... range nodeMap block.
We also want to use maxPodsPerNode in the error message.
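The refactoring described in this comment can be sketched roughly as follows. This is a minimal sketch, not the operator's code: `maxPodsFor` and `filterNodes` are hypothetical names, the limit formula is only an illustrative placeholder, and the node bookkeeping is reduced to a plain map. The point is that the limit depends on `replicas` alone, so it can be computed once before the loop and is then available for the error message.

```go
package main

import (
	"fmt"
	"sort"
)

// maxPodsFor is a hypothetical stand-in for the if/else block: the limit
// depends only on replicas. The formula is illustrative, not the operator's.
func maxPodsFor(replicas int) int {
	limit := (replicas+1)/2 - 1
	if limit < 1 {
		limit = 1
	}
	return limit
}

// filterNodes keeps the nodes that hold fewer pods than the limit.
// podsOnNode maps a node name to the number of pods already placed there.
func filterNodes(podsOnNode map[string]int, replicas int) ([]string, error) {
	// Computed once, outside the loop over nodes.
	maxPodsPerNode := maxPodsFor(replicas)

	var ok []string
	for node, n := range podsOnNode {
		if n < maxPodsPerNode {
			ok = append(ok, node)
		}
	}
	sort.Strings(ok) // map iteration order is random; sort for stable output

	if len(ok) == 0 {
		// The limit is in scope here, so the message can report it.
		return nil, fmt.Errorf("no node has fewer than %d pod(s) (replicas: %d)", maxPodsPerNode, replicas)
	}
	return ok, nil
}
```

Because the limit is a local of the whole function rather than of the loop body, the failure path can tell the user which threshold was applied.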
p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
} else {
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
Return an error here to let the scheduler package emit an Event. @cofyc
@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
 glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
 kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
 if err != nil {
-return nil, err
+s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())
Emit an Event if the error is not nil.
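A minimal sketch of this pattern, assuming simplified types: the `recorder` and `predicate` interfaces below are hypothetical stand-ins that take a pod name as a string (the real recorder in the diff takes the pod object), and `memRecorder` and `alwaysFail` exist only to make the example self-contained.

```go
package main

import "errors"

// recorder is a hypothetical, simplified stand-in for an event recorder;
// the one in the diff takes a Kubernetes object rather than a pod name.
type recorder interface {
	Event(pod, eventType, reason, message string)
}

// memRecorder stores events in memory so the sketch can be inspected.
type memRecorder struct{ events []string }

func (m *memRecorder) Event(pod, eventType, reason, message string) {
	m.events = append(m.events, eventType+" "+reason+" "+pod+": "+message)
}

// predicate mirrors the two methods used in the diff: Name and Filter.
type predicate interface {
	Name() string
	Filter(pod string, nodes []string) ([]string, error)
}

// runPredicate calls one predicate and, if it fails, records a Warning
// event whose reason is the predicate name and whose message is the error.
func runPredicate(r recorder, p predicate, pod string, nodes []string) ([]string, error) {
	out, err := p.Filter(pod, nodes)
	if err != nil {
		r.Event(pod, "Warning", p.Name(), err.Error())
	}
	return out, err
}

// alwaysFail is a toy predicate that rejects every node.
type alwaysFail struct{}

func (alwaysFail) Name() string { return "AlwaysFail" }

func (alwaysFail) Filter(pod string, nodes []string) ([]string, error) {
	return nil, errors.New("no node fits " + pod)
}
```

Keeping the event emission in the caller means each predicate only has to return a descriptive error, which is the division of labor the earlier comments describe.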
LGTM
Co-Authored-By: onlymellb <luolibin@pingcap.com>
pkg/scheduler/scheduler.go
Outdated
@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
 glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
 kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
 if err != nil {
-return nil, err
+s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())
+return &schedulerapiv1.ExtenderFilterResult{
When an error occurs, we need to check whether kubeNodes is empty. If it is empty, the function should return the error as usual. If it is not empty, scheduling should continue.
fixed @onlymellb PTAL
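One reading of the reviewer's suggestion, as a hedged sketch: an error from a predicate only aborts the chain when no candidate nodes remain, and is otherwise tolerated. The names `filterFunc`, `runAll`, and `dropFirst` are hypothetical, and nodes are reduced to plain strings; the actual fix in the PR may differ.

```go
package main

import "errors"

// filterFunc is a hypothetical, simplified predicate: it narrows the node
// list and may return an error alongside the nodes that are still usable.
type filterFunc func(nodes []string) ([]string, error)

// runAll applies each predicate in turn. An error only aborts the chain
// when no candidate nodes remain; otherwise it is collected and the
// remaining nodes are passed on to the next predicate.
func runAll(nodes []string, preds []filterFunc) ([]string, []error, error) {
	var tolerated []error
	for _, p := range preds {
		out, err := p(nodes)
		if err != nil {
			if len(out) == 0 {
				return nil, tolerated, err
			}
			tolerated = append(tolerated, err)
		}
		nodes = out
	}
	return nodes, tolerated, nil
}

// dropFirst is a toy predicate: it removes the first node and reports an
// error about it, but still returns the rest.
func dropFirst(nodes []string) ([]string, error) {
	if len(nodes) == 0 {
		return nil, errors.New("no nodes left")
	}
	return nodes[1:], errors.New("cannot run on " + nodes[0])
}
```

With two candidate nodes, `dropFirst` leaves one node and its error is only collected; with a single candidate, the same error ends the chain.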
p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
} else {
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
-return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
+return nodes, fmt.Errorf("cannot run %s/%s on its previous node %q", ns, podName, nodeName)
LGTM
}
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
According to the log message, there is no node for scheduling, but the code that follows returns the nodes and a nil error.
This happens when we can't find the previous node of this TiDB member, e.g. when the tidb pod is new, so we can run it on any node.
Yes, I got it. But we should log this at warning level and also emit the event to users.
After the change in this PR, the error returned here will be propagated to users. @weekface can you help change the log level to warn and return the log message as an error too?
if len(kubeNodes) == 0 {
	// do not return error to k8s: https://github.com/pingcap/tidb-operator/issues/1353
	return nil, nil
}
Is nil semantically equal to a filter result with empty nodes? If it is, these three lines seem unnecessary because, at the end of this function, it will return a filter result with an empty node list.
> Is nil semantically equal to a filter result with empty nodes?
Yes.
> If it is, these three lines seem unnecessary because, at the end of this function, it will return a filter result with an empty node list.
This return is in a for range loop. There is no need to step to the next predicate when kubeNodes is empty.
What about changing return nil, nil to break?
if len(kubeNodes) == 0 {
break
}
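To illustrate the difference this suggestion makes, here is a minimal sketch under simplifying assumptions: `filterResult`, `filterFunc`, `filter`, and `rejectAll` are hypothetical names, and nodes are plain strings. With `break`, the loop stops as soon as no nodes remain and control falls through to the single return at the end, so the caller gets a non-nil result with an empty node list instead of `nil, nil`.

```go
package main

// filterResult is a hypothetical, trimmed-down stand-in for the extender's
// filter result; only the node list matters here.
type filterResult struct {
	Nodes []string
}

// filterFunc is a simplified predicate that narrows the node list.
type filterFunc func(nodes []string) []string

// filter runs the predicates in order and stops early once no nodes remain.
// It also reports how many predicates actually ran, purely for illustration.
func filter(nodes []string, preds []filterFunc) (*filterResult, int) {
	ran := 0
	for _, p := range preds {
		nodes = p(nodes)
		ran++
		if len(nodes) == 0 {
			break // later predicates have nothing to work on
		}
	}
	// Single exit point: the result is never nil, even with no nodes.
	return &filterResult{Nodes: nodes}, ran
}

// rejectAll is a toy predicate that filters out every node.
func rejectAll(nodes []string) []string { return nil }
```

Given two `rejectAll` predicates, only the first one runs, and the function still returns through its normal path.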
I think it's good
fixed PTAL
LGTM
LGTM
/run-e2e-tests
LGTM
/merge
Your auto merge job has been accepted, waiting for 1385, 1383
/run-all-tests
/run-all-tests
/merge
/run-all-tests
What problem does this PR solve?
This PR fixes #1353 and refines the tidb-scheduler error messages.
Events
Before this PR:
Now:
What is changed and how does it work?
Check List
Tests
Code changes
Side effects
Related changes
Does this PR introduce a user-facing change?: