
failover: emit events when pd failover #1466

Merged
merged 11 commits into pingcap:master on Jan 8, 2020

Conversation


@weekface weekface commented Jan 2, 2020

What problem does this PR solve?

fixes: #1465

This PR emits three k8s Events, PDMemberUnhealthy, PDMemberMarkedAsFailure, and PDMemberDeleted, to indicate the PD failover procedure when the user runs kubectl describe tc:

[screenshot of the kubectl describe tc events output]

There is one remaining problem: there are many not-very-useful SuccessfulUpdate events. I will open another PR to remove them. @cofyc @aylei PTAL.

What is changed and how does it work?

Check List

Tests

  • Unit test

Code changes

  • Has Go code change

Side effects

Related changes

  • Need to cherry-pick to the release branch

Does this PR introduce a user-facing change?:

Emit events when PD failover

@@ -76,11 +76,11 @@ func NewController(
tikvFailoverPeriod time.Duration,
tidbFailoverPeriod time.Duration,
) *Controller {
-eventBroadcaster := record.NewBroadcaster()
 eventBroadcaster.StartLogging(glog.Infof)
+eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10})
weekface (Contributor Author):

The default QPS is too low: 1/300.

Contributor:

10 events per object per second seems too large; it might generate a lot of load on the API server, or leave us unable to do other CRUD requests because the total throttling limit is exceeded. Most burstable tokens (25) would be consumed by unnecessary events. We can increase the burst and QPS together, but not by too much at first.

weekface (Contributor Author) commented Jan 3, 2020:

Yeah, 10 is too large.

I will set them to proper values after removing those events.

Contributor:

what's the current rate of events?

weekface (Contributor Author) commented Jan 6, 2020:

By "rate of events", do you mean the default values of QPS and BurstSize?

	// by default, allow a source to send 25 events about an object
	// but control the refill rate to 1 new event every 5 minutes
	// this helps control the long-tail of events for things that are always
	// unhealthy
	defaultSpamBurst = 25
	defaultSpamQPS   = 1. / 300.

Contributor:

Could you create an issue about setting the rate to proper values?

weekface (Contributor Author):

Opened an issue: #1494

-eventBroadcaster := record.NewBroadcaster()
-eventBroadcaster.StartLogging(glog.Infof)
+eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10})
+eventBroadcaster.StartLogging(glog.V(4).Infof)
weekface (Contributor Author):

glog.Info is too noisy.

Contributor:

Events are often the most important messages that end users care about; I'd prefer to have them in the logs by default. How about V(2)?
If the logging of events is too noisy, we should reduce the rate of events we report to the API server.
Another reason is that an event can be dropped; in that case, we can still find the full events in the logs.

weekface (Contributor Author):

reasonable

 eventBroadcaster.StartRecordingToSink(&eventv1.EventSinkImpl{
 	Interface: eventv1.New(kubeCli.CoreV1().RESTClient()).Events("")})
-recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidbcluster"})
+recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidb-controller-manager"})
weekface (Contributor Author):

Like k8s does.

@weekface weekface marked this pull request as ready for review January 3, 2020 09:16
onlymellb previously approved these changes Jan 6, 2020
@onlymellb left a comment:

LGTM

@aylei (Contributor) left a comment:

The rest LGTM

Comment on lines +85 to +86
pf.recorder.Eventf(tc, apiv1.EventTypeWarning, "PDMemberUnhealthy",
"%s(%s) is unhealthy", podName, pdMember.ID)
aylei (Contributor) commented Jan 6, 2020:

IMHO, this is better considered a status rather than an event; recording it will be too noisy (despite the flow control, the controller will emit an event on each round of the control loop whenever there is an unhealthy PD member).

weekface (Contributor Author):

There is already a .status.pd[].health attribute.

Emitting it as an event is good for kubectl describe.

Contributor:

Yes, it seems it is already in the describe result:

    Members:
      Yutengqiu - Demo - Pd - 0:
        Client URL:            http://yutengqiu-demo-pd-0.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    12697782363740270066
        Last Transition Time:  2020-01-06T01:48:58Z
        Name:                  yutengqiu-demo-pd-0
      Yutengqiu - Demo - Pd - 3:
        Client URL:            http://yutengqiu-demo-pd-3.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    10833084519111696661
        Last Transition Time:  2020-01-06T05:37:58Z
        Name:                  yutengqiu-demo-pd-3
      Yutengqiu - Demo - Pd - 4:
        Client URL:            http://yutengqiu-demo-pd-4.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    10563190389194377650
        Last Transition Time:  2020-01-06T05:44:25Z
        Name:                  yutengqiu-demo-pd-4
      Yutengqiu - Demo - Pd - 6:
        Client URL:            http://yutengqiu-demo-pd-6.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    6735927804110166558
        Last Transition Time:  2020-01-06T05:32:10Z
        Name:                  yutengqiu-demo-pd-6

weekface (Contributor Author):

Emitting an event is more user-friendly; it shows the progress of failover.

weekface (Contributor Author):

Or should we change PDMemberUnhealthy to a more proper name?

Contributor:

The progress of failover makes sense; it is okay for me to emit events based on the status of PD when we cannot actually capture the "PD turning from healthy to unhealthy" event.

The naming issue is trivial; I think the current name is OK.

Contributor:

is it possible to report the event on state change?

weekface (Contributor Author):

Good suggestion. We can do this in the syncTidbClusterStatus method, not the failover method; an issue has been opened: #1495

aylei
aylei previously approved these changes Jan 6, 2020
@aylei left a comment:

LGTM

@cofyc (Contributor) left a comment:

LGTM

@onlymellb (Contributor) left a comment:

LGTM

@onlymellb

/merge


sre-bot commented Jan 6, 2020

Your auto merge job has been accepted, waiting for 1486, 1484, 1493


sre-bot commented Jan 6, 2020

/run-all-tests


sre-bot commented Jan 6, 2020

@weekface merge failed.


weekface commented Jan 7, 2020

/run-all-tests


aylei commented Jan 8, 2020

/merge


sre-bot commented Jan 8, 2020

Your auto merge job has been accepted, waiting for 1486


sre-bot commented Jan 8, 2020

/run-all-tests

@sre-bot sre-bot merged commit 984b597 into pingcap:master Jan 8, 2020

sre-bot commented Jan 8, 2020

cherry pick to release-1.1 in PR #1507

cofyc pushed a commit that referenced this pull request Feb 5, 2020
* emit event when pd failover

* fix gofmt

* fix gofmt

* fix gofmt

* address comment

* address comment

* fix CI issue
yahonda pushed a commit that referenced this pull request Dec 27, 2021

Successfully merging this pull request may close these issues.

Emit k8s events when pd failover
5 participants