
failover: emit events when pd failover #1466

Merged
merged 11 commits into pingcap:master on Jan 8, 2020

Conversation


@weekface weekface commented Jan 2, 2020

What problem does this PR solve?

fixes: #1465

This PR emits three k8s Events, PDMemberUnhealthy, PDMemberMarkedAsFailure, and PDMemberDeleted, to indicate the PD failover procedure when the user runs kubectl describe tc:

[screenshot of the kubectl describe tc events output]

There is one remaining problem: there are many not-very-useful SuccessfulUpdate events. I will open another PR to remove them. @cofyc @aylei PTAL.

What is changed and how does it work?

Check List

Tests

  • Unit test

Code changes

  • Has Go code change

Side effects

Related changes

  • Need to cherry-pick to the release branch

Does this PR introduce a user-facing change?:

Emit events when PD failover

@@ -76,11 +76,11 @@ func NewController(
tikvFailoverPeriod time.Duration,
tidbFailoverPeriod time.Duration,
) *Controller {
-eventBroadcaster := record.NewBroadcaster()
 eventBroadcaster.StartLogging(glog.Infof)
+eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10})
weekface (Contributor Author):

The default QPS is too low: 1/300.

Contributor:

10 events per object per second seems too large; it might generate a lot of load on the API server, or leave us unable to do other CRUD requests because the total throttling limit is exceeded. Most burstable tokens (25) would be consumed by unnecessary events. We can increase the burst and QPS together, but not by too much at first.

weekface (Contributor Author) commented Jan 3, 2020:

Yeah, 10 is too large.

I will set them to proper values after removing those events.

Contributor:

what's the current rate of events?

weekface (Contributor Author) commented Jan 6, 2020:

By "rate of events", do you mean the default values of QPS and BurstSize?

	// by default, allow a source to send 25 events about an object
	// but control the refill rate to 1 new event every 5 minutes
	// this helps control the long-tail of events for things that are always
	// unhealthy
	defaultSpamBurst = 25
	defaultSpamQPS   = 1. / 300.

Contributor:

Could you create an issue about setting the rate to proper values?

weekface (Contributor Author):

Opened an issue: #1494

-eventBroadcaster := record.NewBroadcaster()
-eventBroadcaster.StartLogging(glog.Infof)
+eventBroadcaster := record.NewBroadcasterWithCorrelatorOptions(record.CorrelatorOptions{QPS: 10})
+eventBroadcaster.StartLogging(glog.V(4).Infof)
weekface (Contributor Author):

glog.Info is too noisy.

Contributor:

Events are often the most important messages that end users care about; I'd prefer to have them in the logs by default. How about V(2)?
If the logging of events is too noisy, we should reduce the rate of events we report to the API server.
Another reason is that an event can be dropped; in that case, we can still find the full events in the logs.

weekface (Contributor Author):

reasonable

 eventBroadcaster.StartRecordingToSink(&eventv1.EventSinkImpl{
 	Interface: eventv1.New(kubeCli.CoreV1().RESTClient()).Events("")})
-recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidbcluster"})
+recorder := eventBroadcaster.NewRecorder(v1alpha1.Scheme, corev1.EventSource{Component: "tidb-controller-manager"})
weekface (Contributor Author):

Like k8s does.

@weekface weekface marked this pull request as ready for review January 3, 2020 09:16
onlymellb previously approved these changes Jan 6, 2020
@onlymellb left a comment:

LGTM

@aylei (Contributor) left a comment:

The rest LGTM

Comment on lines +85 to +86
pf.recorder.Eventf(tc, apiv1.EventTypeWarning, "PDMemberUnhealthy",
"%s(%s) is unhealthy", podName, pdMember.ID)
aylei (Contributor) commented Jan 6, 2020:

IMHO, this is better considered a status rather than an event; recording it will be too noisy (despite the flow control, the controller will emit an event on each round of the control loop whenever there is an unhealthy PD member).

weekface (Contributor Author):

There is already a .status.pd[].health attribute.

Emitting it as an event is good for kubectl describe.

Contributor:

Yes, it seems it is already in the describe result:

    Members:
      Yutengqiu - Demo - Pd - 0:
        Client URL:            http://yutengqiu-demo-pd-0.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    12697782363740270066
        Last Transition Time:  2020-01-06T01:48:58Z
        Name:                  yutengqiu-demo-pd-0
      Yutengqiu - Demo - Pd - 3:
        Client URL:            http://yutengqiu-demo-pd-3.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    10833084519111696661
        Last Transition Time:  2020-01-06T05:37:58Z
        Name:                  yutengqiu-demo-pd-3
      Yutengqiu - Demo - Pd - 4:
        Client URL:            http://yutengqiu-demo-pd-4.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    10563190389194377650
        Last Transition Time:  2020-01-06T05:44:25Z
        Name:                  yutengqiu-demo-pd-4
      Yutengqiu - Demo - Pd - 6:
        Client URL:            http://yutengqiu-demo-pd-6.yutengqiu-demo-pd-peer.yutengqiu.svc:2379
        Health:                true
        Id:                    6735927804110166558
        Last Transition Time:  2020-01-06T05:32:10Z
        Name:                  yutengqiu-demo-pd-6

weekface (Contributor Author):

Emitting an event is more user-friendly; it shows the progress of failover.

weekface (Contributor Author):

Or should we change PDMemberUnhealthy to a more proper name?

Contributor:

The progress of failover makes sense; it is okay for me to emit events based on the status of PD when we cannot actually capture the "PD turning from healthy to unhealthy" event.

The naming issue is trivial; I think the current name is OK.

Contributor:

is it possible to report the event on state change?

weekface (Contributor Author):

Good suggestion. We can do this in the syncTidbClusterStatus method, not the failover method; an issue has been opened: #1495

aylei
aylei previously approved these changes Jan 6, 2020
@aylei left a comment:

LGTM

@cofyc (Contributor) left a comment:

LGTM

@onlymellb (Contributor) left a comment:

LGTM

@onlymellb

/merge


sre-bot commented Jan 6, 2020

Your auto merge job has been accepted, waiting for 1486, 1484, 1493


sre-bot commented Jan 6, 2020

/run-all-tests


sre-bot commented Jan 6, 2020

@weekface merge failed.


weekface commented Jan 7, 2020

/run-all-tests


aylei commented Jan 8, 2020

/merge


sre-bot commented Jan 8, 2020

Your auto merge job has been accepted, waiting for 1486


sre-bot commented Jan 8, 2020

/run-all-tests

@sre-bot sre-bot merged commit 984b597 into pingcap:master Jan 8, 2020

sre-bot commented Jan 8, 2020

cherry pick to release-1.1 in PR #1507

cofyc pushed a commit that referenced this pull request Feb 5, 2020
* emit event when pd failover

* fix gofmt

* fix gofmt

* fix gofmt

* address comment

* address comment

* fix CI issue
yahonda pushed a commit that referenced this pull request Dec 27, 2021

Successfully merging this pull request may close these issues.

Emit k8s events when pd failover
5 participants