Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: refine scheduler error message #1373

Merged
merged 11 commits into from
Dec 23, 2019

Conversation

weekface
Copy link
Contributor

What problem does this PR solve?

This PR fixes: #1353 , refines the tidb-scheduler error messages.

Before this PR:

image

Now:

image
image

What is changed and how does it work?

Check List

Tests

  • Unit test

Code changes

  • Has Go code change
  • Has documents change

Side effects

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Does this PR introduce a user-facing change?:

scheduler: refine scheduler error message

@weekface weekface requested review from cofyc and tennix December 19, 2019 08:40
@weekface weekface added the status/PTAL PR needs to be reviewed label Dec 19, 2019
h := &ha{
kubeCli: kubeCli,
cli: cli,
recorder: recorder,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will emit Event in scheduler package, not here

if component == label.PDLabelVal {
maxPodsPerNode := 0

if component == label.PDLabelVal {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this if ... else ... block calculate the maxPodsPerNode var, it is only related to replicas var. so i move these codes out of the for ... range nodeMap block.

and also we want to use the maxPodsPerNode in error message.

@cofyc

p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
} else {
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return an error here to let scheduler package emit an Event. @cofyc

@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
if err != nil {
return nil, err
s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emit an Event if the error is not nil.

cofyc
cofyc previously approved these changes Dec 19, 2019
Copy link
Contributor

@cofyc cofyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weekface weekface requested a review from onlymellb December 19, 2019 09:22
Co-Authored-By: onlymellb <luolibin@pingcap.com>
@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
if err != nil {
return nil, err
s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())
return &schedulerapiv1.ExtenderFilterResult{
Copy link
Contributor

@onlymellb onlymellb Dec 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an error occurs, we need to determine whether kubeNodes is empty. If it is empty, it should return an error normally. If it is not empty, it should continue to schedule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed @onlymellb PTAL

p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
} else {
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
return nodes, fmt.Errorf("cannot run %s/%s on its previous node %q", ns, podName, nodeName)

cofyc
cofyc previously approved these changes Dec 19, 2019
Copy link
Contributor

@cofyc cofyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the log message, there is no node for scheduling, but the following returns the nodes and the error is nil.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this happens when we can't find the previous node of this TiDB member. e.g. the tidb pod is new, so we can run it on any node

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I got it. But we should use warning level log and also emit this event to users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the change in this PR, the error returned here will be propagated to users. @weekface can you help change the log level to warn and return the log message as an error too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments addressed, @tennix @cofyc PTAL

if len(kubeNodes) == 0 {
// do not return error to k8s: https://github.com/pingcap/tidb-operator/issues/1353
return nil, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nil semantically equal to filter result with empty nodes? If it is, these three lines seem unnecessary because, at the end of this function, it will return filter results with an empty node list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nil semantically equal to filter result with empty nodes?

Yes.

If it is, these three lines seem unnecessary because, at the end of this function, it will return filter results with an empty node list.

This return is in a for range loop. No need to step the next predicate when the kubeNodes is empty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about to change return nil, nil to break?

if len(kubeNodes) == 0 {
	break
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed PTAL

cofyc
cofyc previously approved these changes Dec 23, 2019
Copy link
Contributor

@cofyc cofyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@cofyc cofyc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@weekface
Copy link
Contributor Author

/run-e2e-tests

Copy link
Member

@tennix tennix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tennix
Copy link
Member

tennix commented Dec 23, 2019

/merge

@sre-bot
Copy link
Contributor

sre-bot commented Dec 23, 2019

Your auto merge job has been accepted, waiting for 1385, 1383

@sre-bot
Copy link
Contributor

sre-bot commented Dec 23, 2019

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Dec 23, 2019

/run-all-tests

@tennix
Copy link
Member

tennix commented Dec 23, 2019

/merge

@sre-bot
Copy link
Contributor

sre-bot commented Dec 23, 2019

/run-all-tests

@sre-bot sre-bot merged commit 72f4453 into pingcap:master Dec 23, 2019
weekface added a commit to weekface/tidb-operator that referenced this pull request Dec 24, 2019
weekface added a commit to weekface/tidb-operator that referenced this pull request Dec 25, 2019
@weekface weekface deleted the refine-scheduler branch January 3, 2020 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status/can-merge status/PTAL PR needs to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scheduler: instead of return an error, return empty nodes when no node can be scheduled to
5 participants