
Random performance issue on etcd 3.4 #11884

Closed
Sh4d1 opened this issue May 12, 2020 · 77 comments

Sh4d1 commented May 12, 2020

Hello,

We are running a 5-node etcd cluster:

$ etcdctl endpoint status 
etcd01, 507905ef22a349ce, 3.4.7, 3.1 GB, true, false, 785, 17694912855, 17694912855, 
etcd02, 96622104eaa8652d, 3.4.7, 3.1 GB, false, false, 785, 17694912881, 17694912880, 
etcd03, e91fce12ee84c080, 3.4.7, 3.1 GB, false, false, 785, 17694912903, 17694912903, 
etcd04, 400fc14411f50272, 3.4.7, 3.1 GB, false, false, 785, 17694912989, 17694912985, 
etcd05, 87c46f0b178dc777, 3.4.7, 3.1 GB, false, false, 785, 17694913043, 17694913028, 

And we're seeing some weird performance issues, e.g.:

# etcdctl endpoint health
etcd01 is healthy: successfully committed proposal: took = 12.462058ms
etcd03 is healthy: successfully committed proposal: took = 18.826686ms
etcd02 is healthy: successfully committed proposal: took = 19.418745ms
etcd04 is healthy: successfully committed proposal: took = 24.314474ms
etcd05 is healthy: successfully committed proposal: took = 244.761598ms

# etcdctl endpoint health
etcd01 is healthy: successfully committed proposal: took = 13.505405ms
etcd03 is healthy: successfully committed proposal: took = 21.905048ms
etcd04 is healthy: successfully committed proposal: took = 22.569332ms
etcd02 is healthy: successfully committed proposal: took = 23.10597ms
etcd05 is healthy: successfully committed proposal: took = 24.182998ms

# etcdctl endpoint health
etcd05 is healthy: successfully committed proposal: took = 24.854541ms
etcd01 is healthy: successfully committed proposal: took = 86.045049ms
etcd03 is healthy: successfully committed proposal: took = 171.771975ms
etcd04 is healthy: successfully committed proposal: took = 576.218846ms
etcd02 is healthy: successfully committed proposal: took = 1.06666032s

Not sure how to debug it, it looks pretty random. Feel free to ask for more info!


xiang90 commented May 12, 2020

It looks like disk I/O spikes.


Sh4d1 commented May 13, 2020

That's also what I'm thinking; I'm also seeing periodically slow read-only range requests. Though I can't see any problems when looking at my metrics 🤔 @xiang90 any ideas on where to look?


Sh4d1 commented May 13, 2020

Commit durations look fine:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 168386
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 576993
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 757708
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 782904
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 787763
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 788436
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 788584
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 788953
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 789118
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 789148
etcd_disk_backend_commit_duration_seconds_sum 1461.2600709299697
etcd_disk_backend_commit_duration_seconds_count 789148

fsync looks fine as well

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 9.397643e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 9.4252083e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 9.4467886e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 9.4633426e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 9.4712732e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 9.4718483e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 9.4718972e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 9.4719722e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 9.4719978e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 9.4719985e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_sum 18283.75915433406
etcd_disk_wal_fsync_duration_seconds_count 9.4719986e+07

rtt seems ok too:

etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0004"} 2
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0008"} 1813
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0016"} 2275
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0032"} 2307
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0064"} 2311
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0128"} 2312
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0256"} 4610
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0512"} 4631
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.1024"} 4634
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.2048"} 4635
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.4096"} 4635
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.8192"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="1.6384"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="3.2768"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="+Inf"} 4636
etcd_network_peer_round_trip_time_seconds_sum{To="400fc14411f50272"} 46.492993348000034
etcd_network_peer_round_trip_time_seconds_count{To="400fc14411f50272"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0008"} 1742
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0016"} 2228
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0032"} 2276
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0064"} 2284
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0128"} 2291
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0256"} 4495
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0512"} 4554
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.1024"} 4581
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.2048"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.4096"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.8192"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="1.6384"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="3.2768"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="+Inf"} 4594
etcd_network_peer_round_trip_time_seconds_sum{To="87c46f0b178dc777"} 49.258597341999945
etcd_network_peer_round_trip_time_seconds_count{To="87c46f0b178dc777"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0004"} 175
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0008"} 1974
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0016"} 2246
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0032"} 2277
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0064"} 2285
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0128"} 2288
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0256"} 4533
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0512"} 4568
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.1024"} 4582
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.2048"} 4586
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.4096"} 4587
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.8192"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="1.6384"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="3.2768"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="+Inf"} 4588
etcd_network_peer_round_trip_time_seconds_sum{To="96622104eaa8652d"} 44.96468762199984
etcd_network_peer_round_trip_time_seconds_count{To="96622104eaa8652d"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0008"} 1719
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0016"} 2238
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0032"} 2271
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0064"} 2278
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0128"} 2280
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0256"} 4529
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0512"} 4558
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.1024"} 4567
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.2048"} 4573
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.4096"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.8192"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="1.6384"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="3.2768"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="+Inf"} 4575
etcd_network_peer_round_trip_time_seconds_sum{To="e91fce12ee84c080"} 46.92858499100008
etcd_network_peer_round_trip_time_seconds_count{To="e91fce12ee84c080"} 4575


Sh4d1 commented May 13, 2020

We are also seeing some /health errors: QGET failed etcdserver: request timed out (status code 503)


xiang90 commented May 13, 2020

Was there any leader election happening? Any warnings/errors in the logs?


Sh4d1 commented May 13, 2020

No leader election, and I see the latency spikes every 5-10s, on one node, for maybe 1-2s. No errors, but I think the slow read-only range requests happen at around the same time.


xiang90 commented May 13, 2020

but I think the slow read-only range requests happen at around the same time

If there is an expensive range request, it will affect latency, but etcd should log something when that happens.


Sh4d1 commented May 13, 2020

It only logs that the request took longer than 100ms, and I think the requests are not that big 🤔 I'll paste some examples tomorrow!


xiang90 commented May 13, 2020

It only logs that the request took longer than 100ms, and I think the requests are not that big 🤔 I'll paste some examples tomorrow!

Probably that's what is going on.


Sh4d1 commented May 13, 2020

@xiang90 I'm also seeing some high p99 gRPC latency (above 2s) on, for instance, Range, LeaseGrant, and Txn. I'm not sure which is causing which 🤔

tangcong commented:

Can you enable the trace feature (set --logger=zap)? You can specify --metrics=extensive to get latency for every gRPC method. Can you provide all the etcd metrics info and the trace log? @Sh4d1
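
For reference, a minimal sketch of what that looks like on the server command line (all other flags omitted; the histogram name below assumes etcd's standard gRPC instrumentation):

# enable the zap logger (needed for the trace logs) and extensive metrics
etcd --logger=zap --metrics=extensive ...existing flags...
# with --metrics=extensive, per-method latency histograms such as
#   grpc_server_handling_seconds_bucket{grpc_method="Range",...}
# appear on the /metrics endpoint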


Sh4d1 commented May 14, 2020

@tangcong here are the metrics for one of the nodes: https://gist.github.com/Sh4d1/bbfba21f3176865cf69c4533d5e39286
Do you want them for the 4 other nodes too?

Can't set the logger right now, I'll do it when I can

tangcong commented:

It seems that the number of connections is very large; how is the etcd server load?

process_open_fds 76494
go_goroutines 224164
etcd_server_health_failures 1900
etcd_server_heartbeat_send_failures_total 904


Sh4d1 commented May 14, 2020

Yes, there are around ~1000 k8s API servers connected to the cluster.
The CPU load is okay (4.49, 5.18, 5.81) and the CPU usage of etcd is between 200% and 600% (on 12 cores).


tangcong commented May 14, 2020

Do you also use etcd v2? What is the current cluster read and write QPS for etcd v2 and v3? In k8s, listing pods and events is expensive; in my opinion, supporting 1000 k8s clusters with one etcd cluster is very risky.

etcd_debugging_store_reads_total{action="get"} 505432
etcd_debugging_store_reads_total{action="getRecursive"} 4


Sh4d1 commented May 14, 2020

Nope, it's only v3 for all clusters. How can I get the QPS? 🤔


tangcong commented May 14, 2020
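
One rough way to estimate QPS from etcd's Prometheus metrics (a sketch; it assumes the standard gRPC metrics etcd exposes and a locally reachable metrics endpoint):

# overall request rate, as a Prometheus query:
#   sum(rate(grpc_server_handled_total{job="etcd"}[1m]))
# broken down per gRPC method:
#   sum(rate(grpc_server_handled_total{job="etcd"}[1m])) by (grpc_method)
# or a quick point-in-time look at the raw counters:
curl -s http://127.0.0.1:2379/metrics | grep '^grpc_server_handled_total'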


Sh4d1 commented May 14, 2020

@tangcong around 7k

tangcong commented:

I guess performance may have reached a bottleneck. I look forward to your etcd server trace log; it can give us detailed info.


Sh4d1 commented May 14, 2020

I have a bunch of these

May 14 11:29:06 etcd05 etcd[11500]: {"level":"warn","ts":"2020-05-14T11:29:06.092+0200","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"476.300053ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager\" ","response":"range_response_count:1 size:365"}

May 14 11:29:06 etcd05 etcd[11500]: {"level":"info","ts":"2020-05-14T11:29:06.092+0200","caller":"traceutil/trace.go:145","msg":"trace[203066865] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager; range_end:; response_count:1; response_revision:16398215814; }","duration":"476.321764ms","start":"2020-05-14T11:29:05.616+0200","end":"2020-05-14T11:29:06.092+0200","steps":["trace[203066865] 'agreement among raft nodes before linearized reading'  (duration: 475.660494ms)"]}


Sh4d1 commented May 14, 2020

Other long ones

May 14 11:35:51 etcd[11500]: {"level":"info","ts":"2020-05-14T11:35:51.795+0200","caller":"traceutil/trace.go:145","msg":"trace[233128838] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/configmaps/kube-system/ingress-controller-leader-nginx; range_end:; response_count:1; response_revision:16398946939; }","duration":"1.318159848s","start":"2020-05-14T11:35:50.477+0200","end":"2020-05-14T11:35:51.795+0200","steps":["trace[233128838] 'get authentication metadata'  (duration: 1.317348866s)"]}

@tangcong

tangcong commented:

Other long ones

May 14 11:35:51 etcd[11500]: {"level":"info","ts":"2020-05-14T11:35:51.795+0200","caller":"traceutil/trace.go:145","msg":"trace[233128838] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/configmaps/kube-system/ingress-controller-leader-nginx; range_end:; response_count:1; response_revision:16398946939; }","duration":"1.318159848s","start":"2020-05-14T11:35:50.477+0200","end":"2020-05-14T11:35:51.795+0200","steps":["trace[233128838] 'get authentication metadata'  (duration: 1.317348866s)"]}

@tangcong

func (as *authStore) isOpPermitted(userName string, revision uint64, key, rangeEnd []byte, permTyp authpb.Permission_Type) error {
	// TODO(mitake): this function would be costly so we need a caching mechanism
	if !as.IsAuthEnabled() {
		return nil
	}

	// only gets rev == 0 when passed AuthInfo{}; no user given
	if revision == 0 {
		return ErrUserEmpty
	}
	rev := as.Revision()
	if revision < rev {
		if as.lg != nil {
			as.lg.Warn("request auth revision is less than current node auth revision",
				zap.Uint64("current node auth revision", rev),
				zap.Uint64("request auth revision", revision),
				zap.ByteString("request key", key),
				zap.Error(ErrAuthOldRevision))
		} else {
			plog.Warningf("request auth revision is less than current node auth revision,"+
				"current node auth revision is %d,"+
				"request auth revision is %d,"+
				"request key is %s, "+
				"err is %v", rev, revision, key, ErrAuthOldRevision)
		}
		return ErrAuthOldRevision
	}

	tx := as.be.BatchTx()
	tx.Lock()
	defer tx.Unlock()

	user := getUser(as.lg, tx, userName)
	if user == nil {
		if as.lg != nil {
			as.lg.Warn("cannot find a user for permission check", zap.String("user-name", userName))
		} else {
			plog.Errorf("invalid user name %s for permission checking", userName)
		}
		return ErrPermissionDenied
	}

	// root role should have permission on all ranges
	if hasRootRole(user) {
		return nil
	}

	if as.isRangeOpPermitted(tx, userName, key, rangeEnd, permTyp) {
		return nil
	}

	return ErrPermissionDenied
}

Do you have auth enabled? I suspect that tx.Lock() is expensive, and I will try to optimize it (e.g., with a cache or an rwmutex).


Sh4d1 commented May 14, 2020

Yep, we use TLS auth (which is pretty expensive indeed 😅).

tangcong commented:

I would prefer to use a caching mechanism to improve isOpPermitted performance. Do you have any suggestions? Thanks.
@xiang90 @mitake


Sh4d1 commented May 14, 2020

@tangcong is this called on each request?


tangcong commented May 14, 2020 via email


Sh4d1 commented May 14, 2020

Not easily, no :(
But I'm not sure why it would only happen for ~1s every 5-10s 🤔

tangcong commented:

It may be related to the snapshot? It will increase the load. I see from your metrics file that your etcd cluster has v2 keys. How many v2 keys are there?


Sh4d1 commented May 14, 2020

That's what I thought, but it didn't exactly match the pattern (though it may be true).
Do I have to restart etcd with --enable-v2=true? 🤔


Sh4d1 commented May 14, 2020

And I think we don't have v2 keys though 🤔
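
One way to sanity-check that without restarting etcd is to watch the v2-store counters already present in the metrics dump (a sketch; the endpoint is an assumption, and note that etcd also uses the v2 store internally for membership data):

# these counters track v2 store operations
curl -s http://127.0.0.1:2379/metrics | grep -E '^etcd_debugging_store_(reads|writes)_total'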


Sh4d1 commented Jul 2, 2020

@tangcong any news on this?


retinio commented Jul 30, 2020

I have a similar problem.
Yesterday, the disk sync delay occurred on only one node (the etcd leader, going from 2ms to ~16-37ms).
(screenshot: etcd-1)

Today, it happens on a random node.
(screenshot: etcd-5)

Logs from the node at the time of the long disk sync:
(screenshot: etcd-6)
$ master-01
Jul 30 08:03:59 master-01 etcd[15005]: store.index: compact 37973635
Jul 30 08:03:59 master-01 etcd[15005]: finished scheduled compaction at 37973635 (took 46.989429ms)
Jul 30 08:08:59 master-01 etcd[15005]: store.index: compact 37975488
Jul 30 08:08:59 master-01 etcd[15005]: finished scheduled compaction at 37975488 (took 47.586964ms)
Jul 30 08:13:59 master-01 etcd[15005]: store.index: compact 37977341
Jul 30 08:13:59 master-01 etcd[15005]: finished scheduled compaction at 37977341 (took 51.217841ms)
Jul 30 08:18:59 master-01 etcd[15005]: store.index: compact 37979192
Jul 30 08:18:59 master-01 etcd[15005]: finished scheduled compaction at 37979192 (took 48.051738ms)
Jul 30 08:23:59 master-01 etcd[15005]: store.index: compact 37981047
Jul 30 08:23:59 master-01 etcd[15005]: finished scheduled compaction at 37981047 (took 46.45166ms)
Jul 30 08:28:59 master-01 etcd[15005]: store.index: compact 37982901
Jul 30 08:28:59 master-01 etcd[15005]: finished scheduled compaction at 37982901 (took 47.246082ms)

$ master-02
Jul 30 08:03:59 master-02 etcd[13317]: store.index: compact 37973635
Jul 30 08:03:59 master-02 etcd[13317]: finished scheduled compaction at 37973635 (took 46.375912ms)
Jul 30 08:08:59 master-02 etcd[13317]: store.index: compact 37975488
Jul 30 08:08:59 master-02 etcd[13317]: finished scheduled compaction at 37975488 (took 45.726045ms)
Jul 30 08:13:59 master-02 etcd[13317]: store.index: compact 37977341
Jul 30 08:13:59 master-02 etcd[13317]: finished scheduled compaction at 37977341 (took 45.845867ms)
Jul 30 08:18:59 master-02 etcd[13317]: store.index: compact 37979192
Jul 30 08:18:59 master-02 etcd[13317]: finished scheduled compaction at 37979192 (took 46.424525ms)
Jul 30 08:23:59 master-02 etcd[13317]: store.index: compact 37981047
Jul 30 08:23:59 master-02 etcd[13317]: finished scheduled compaction at 37981047 (took 45.964387ms)
Jul 30 08:28:59 master-02 etcd[13317]: store.index: compact 37982901
Jul 30 08:28:59 master-02 etcd[13317]: finished scheduled compaction at 37982901 (took 46.377512ms)
Jul 30 08:33:59 master-02 etcd[13317]: store.index: compact 37984757
Jul 30 08:33:59 master-02 etcd[13317]: finished scheduled compaction at 37984757 (took 46.59736ms)

$ master-03
Jul 30 08:03:59 master-03 etcd[17249]: store.index: compact 37973635
Jul 30 08:03:59 master-03 etcd[17249]: finished scheduled compaction at 37973635 (took 45.501156ms)
Jul 30 08:08:59 master-03 etcd[17249]: store.index: compact 37975488
Jul 30 08:08:59 master-03 etcd[17249]: finished scheduled compaction at 37975488 (took 45.430944ms)
Jul 30 08:13:59 master-03 etcd[17249]: store.index: compact 37977341
Jul 30 08:13:59 master-03 etcd[17249]: finished scheduled compaction at 37977341 (took 46.691081ms)
Jul 30 08:18:59 master-03 etcd[17249]: store.index: compact 37979192
Jul 30 08:18:59 master-03 etcd[17249]: finished scheduled compaction at 37979192 (took 45.734024ms)
Jul 30 08:23:59 master-03 etcd[17249]: store.index: compact 37981047
Jul 30 08:23:59 master-03 etcd[17249]: finished scheduled compaction at 37981047 (took 47.319429ms)
Jul 30 08:28:59 master-03 etcd[17249]: store.index: compact 37982901
Jul 30 08:28:59 master-03 etcd[17249]: finished scheduled compaction at 37982901 (took 45.726165ms)
Jul 30 08:33:59 master-03 etcd[17249]: store.index: compact 37984757
Jul 30 08:33:59 master-03 etcd[17249]: finished scheduled compaction at 37984757 (took 46.440745ms)

etcd (version 3.4.7) uses a separate SSD disk.

$ etcdctl endpoint health
https://172.16.192.22:2379 is healthy: successfully committed proposal: took = 4.109557ms
https://172.16.192.23:2379 is healthy: successfully committed proposal: took = 4.625822ms
https://172.16.192.24:2379 is healthy: successfully committed proposal: took = 5.467934ms

$ etcdctl endpoint status
https://172.16.192.22:2379, 84888d87b4780bbe, 3.4.7, 46 MB, true, false, 842, 42078843, 42078843,
https://172.16.192.23:2379, 74d87de777c36d9b, 3.4.7, 46 MB, false, false, 842, 42078843, 42078843,
https://172.16.192.24:2379, d00e45ffe6a40a1b, 3.4.7, 46 MB, false, false, 842, 42078843, 42078843,

fio
fsync/fdatasync/sync_file_range:
sync (usec): min=303, max=5445, avg=396.53, stdev=154.94
sync percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 322], 10.00th=[ 326], 20.00th=[ 334],
| 30.00th=[ 338], 40.00th=[ 347], 50.00th=[ 359], 60.00th=[ 367],
| 70.00th=[ 383], 80.00th=[ 408], 90.00th=[ 486], 95.00th=[ 709],
| 99.00th=[ 766], 99.50th=[ 799], 99.90th=[ 2474], 99.95th=[ 3458],
| 99.99th=[ 5473]

I don't understand why this is happening. Should I be worried about this?


PierreZ commented Aug 12, 2020

Hi 👋

We are also experiencing performance issues on most of our (loaded) etcd clusters. We have a setup very similar to @Sh4d1's:

  • 3 nodes running 3.4.9
  • hundreds of active k8s API servers connected to it
  • ~1.5k range/s and 500 txn/s
  • load average: 2.83 3.49 3.50 out of 16 cores
  • disk commit duration is fine thanks to expensive nvme disks
  • auth disabled
  • metrics are also fine except etcd_server_slow_apply_total (~70/s):
    • no heartbeat fail
    • no etcd_server_slow_read_indexes_total or etcd_server_read_indexes_failed_total
    • no network issues
  • nothing in the logs, except this one which correlates with etcd_server_slow_apply_total:
{
  "level": "info",
  "ts": "2020-08-12T08:24:56.181Z",
  "caller": "traceutil/trace.go:145",
  "msg": "trace[677217921] range",
  "detail": "{range_begin:/...redacted...; range_end:; response_count:1; response_revision:2725080604; }",
  "duration": "1.553047811s",
  "start": "2020-08-12T08:24:54.628Z",
  "end": "2020-08-12T08:24:56.181Z",
  "steps": [
    "trace[677217921] 'agreement among raft nodes before linearized reading'  (duration: 1.534322015s)"
  ]
}

Duration varies between 150ms and 2s.

@tangcong did you have time to dig into it? I'm considering adding more tracing. I was thinking about broadcasting the trace through the readwaitc channel in order to trace things inside the linearizableReadLoop. This would let us see the timing of s.r.ReadIndex vs s.applyWait.Wait(rs.Index). What are your thoughts on this?

tangcong commented:

Yes, we have to add more tracing to linearized reads. Can you do it? I cannot reproduce it. Thanks. @PierreZ


PierreZ commented Aug 12, 2020

Thanks for the quick reply @tangcong, working on it 🚀


PierreZ commented Aug 13, 2020

Yesterday I made a patch (green in CI) based on the master branch, which seems to be different from release-3.4. I will cherry-pick it onto our etcd version to test it under high QPS before submitting the PR.

I will keep you posted 😃

tangcong commented:

Good. Looking forward to your analysis results. @PierreZ


PierreZ commented Aug 21, 2020

I didn't have time to deploy the patch because of KubeCon, I'll do it once I'm back from holidays!


PierreZ commented Sep 22, 2020

@tangcong I reworked my patch and I will push it towards master soon, but here's an extract of some traces for linearized reading (taken from a follower):

{
  "level": "info",
  "ts": "2020-09-22T12:54:01.021Z",
  "caller": "traceutil/trace.go:152",
  "msg": "trace[2138445431] linearizableReadLoop",
  "detail": "",
  "duration": "855.447896ms",
  "start": "2020-09-22T12:54:00.166Z",
  "end": "2020-09-22T12:54:01.021Z",
  "steps": [
    "trace[2138445431] read index received  (duration: 824.408µs)",
    "trace[2138445431] applied index is now lower than readState.Index  (duration: 854.622058ms)"
  ]
}

More here.

Seems like we are spending most of our time waiting for applied indexes.

FYI, this patch has been tested on an SSD-based cluster and an NVMe-based cluster; same behavior.
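
A quick way to watch the related counters while a slowdown is happening (a sketch; the endpoint is an assumption, and the metric names are the ones already discussed above):

watch -n 5 "curl -s http://127.0.0.1:2379/metrics | grep -E '^etcd_server_(slow_apply_total|slow_read_indexes_total|read_indexes_failed_total)'"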

tangcong commented:

Thanks @PierreZ. When etcd spends most of its time waiting for the applied index, is there a slow-apply warning in the log? What is the network delay between the nodes?


PierreZ commented Sep 25, 2020

Yes, we have a high number of slow applies; this is the max for several clusters with NVMe disks:
(screenshot)

Network seems fine (max(rate(etcd_network_peer_round_trip_time_seconds_sum))):
(screenshot)

I will be happy to investigate and find the bottlenecks here. @tangcong, do you see places where we could add some tracing? I was digging into the read path, and I haven't (yet!) read the code for the write path.

tangcong commented:

Good. etcd is best suited to workloads that read more and write less. If write requests are more frequent, read performance (linearized reads) may degrade more severely.


PierreZ commented Sep 28, 2020

Good. etcd is best suited to workloads that read more and write less. If write requests are more frequent, read performance (linearized reads) may degrade more severely.

Do you think ~600 txn/s and ~1k range/s is the limit for a 3-member NVMe cluster? Do you see a way to improve the throughput?

jadencodes commented:

Found this thread today after a lot of digging into our own problem. We aren't using Txns, just plain deletes (with return prev_value) and ranges/puts. After lots of tweaking and messing around, we still couldn't figure out why everything looks fine for hours/days (at most 3 of 16 cores used, 1 GiB of RAM, plenty of IOPS) and then the node health check starts failing for no apparent reason.

We have 3 etcd nodes with the following specs:

  • 16 cores
  • 32 gigs ram
  • 6k iops io1 drive

We are doing at most 200 deletes per second and 300 reads per second, and we are rarely near those rates when we see this issue. It is a single connection/client so far, with a single lease and around 150k keys on that lease. We plan to expand the number of leases/keys, but not until we can resolve this.

Our cluster is running the latest 3.4.13 in a Kubernetes StatefulSet (these are not the Kubernetes control plane's etcd nodes). We have debug logging turned on, and the scenario appears to be as follows:

  1. Slower and slower reads/deletes (standard is 20-40ms) over the course of 1-5 minutes
  2. Pod health check fails probe (etcdctl endpoint health)
  3. Pod restarts (as expected)
  4. Cluster is fine again

After increasing the lease TTL to 10 seconds for our use case, the cluster is able to recover before the TTL expires, so it appears to the consumer that everything is fine. However, it feels like either our consumer or etcd is doing something fundamentally wrong, because it really looks steady for a while and then falls over for no apparent reason.
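
For illustration, the etcdctl equivalent of that lease pattern (a sketch; the key name is hypothetical and the default local endpoint is assumed):

# grant a 10s lease, attach a key to it, and keep it alive from the client side
LEASE_ID=$(etcdctl lease grant 10 | awk '{print $2}')   # output: lease <id> granted with TTL(10s)
etcdctl put --lease="$LEASE_ID" /workers/worker-1 alive
etcdctl lease keep-alive "$LEASE_ID"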

Someone above mentioned slow applies being a symptom; after checking ours, it aligns exactly with the couple of minutes before the pod restarts, and it is the only indication of trouble coming.
(screenshot)

Let me know if there is a better place to post this or if there is any more info I can provide!


PierreZ commented Sep 29, 2020

Someone above mentioned slow applies being a symptom; after checking ours, it aligns exactly with the couple of minutes before the pod restarts, and it is the only indication of trouble coming.

Yes, etcd_server_slow_apply_total is a very good metric to follow to see etcd's health and performance.

  3. Pod restarts (as expected)

Is it the leader?

  4. Cluster is fine again

Can you check whether the slow applies happen only on the failing pod or on the others too? We are running a defrag every 20 minutes on one node of the cluster; as you are deleting a lot of keys, that might help.
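
For reference, the defrag itself is just the command below (a sketch; the endpoint is a placeholder, the 20-minute scheduling is up to you, and defragmentation blocks the member it runs on, so target one member at a time):

etcdctl --endpoints=https://etcd-1:2379 defrag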

In the meantime, I will go through the write path's code and see if I can add some tracing to find bottlenecks, while waiting for @tangcong's answer to my question.

jadencodes commented:

@PierreZ Thanks for the response! It appears that it is not the leader that is restarting, but the pod that the client is actually connected to (it only connects to one of the pods, not all members of the cluster). The slow applies show the same thing: sometimes a slight increase on the leader, but generally just on the pod the client is connected to. I'll investigate implementing defrag. Thank you.


PierreZ commented Oct 1, 2020

I made some progress, thanks to @andyliuliming! I went through #11783, where he identified a perf regression. Thanks a lot for pointing me in a direction 👍

So I dug between v3.3.22 and v3.4.0 and ran the tests to find which commit introduced the regression. I found this commit by @jingyih, but I may be wrong about the benchmark.

To reproduce:

git checkout a73fb85c0cb5b5e112d600a83f01318b6d2748c2
./build && goreman -f ./Procfile.v2 start

# in another terminal
# first putting some values and then range
benchmark put --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 --key-size 8 --val-size 256 --total 100000 --sequential-keys && benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678  --total 100000

range result

Summary:
  Total:	3.7788 secs.
  Slowest:	0.1397 secs.
  Fastest:	0.0007 secs.
  Average:	0.0375 secs.
  Stddev:	0.0160 secs.
  Requests/sec:	26463.6391

Response time histogram:
  0.0007 [1]	|
  0.0146 [5547]	|∎∎∎∎∎
  0.0285 [23182]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0424 [38186]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0563 [22220]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0702 [7614]	|∎∎∎∎∎∎∎
  0.0841 [2246]	|∎∎
  0.0980 [598]	|
  0.1119 [147]	|
  0.1258 [142]	|
  0.1397 [117]	|

Latency distribution:
  10% in 0.0185 secs.
  25% in 0.0270 secs.
  50% in 0.0361 secs.
  75% in 0.0461 secs.
  90% in 0.0573 secs.
  95% in 0.0650 secs.
  99% in 0.0842 secs.
  99.9% in 0.1280 secs.

If you run the same benchmark with the commit just before (88922b0), we get the following for range requests:

Summary:
  Total:	1.5111 secs.
  Slowest:	0.1128 secs.
  Fastest:	0.0004 secs.
  Average:	0.0114 secs.
  Stddev:	0.0064 secs.
  Requests/sec:	66176.6583

Response time histogram:
  0.0004 [1]	|
  0.0116 [60772]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0229 [34807]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0341 [3146]	|∎∎
  0.0453 [945]	|
  0.0566 [172]	|
  0.0678 [131]	|
  0.0791 [11]	|
  0.0903 [10]	|
  0.1016 [1]	|
  0.1128 [4]	|

Latency distribution:
  10% in 0.0052 secs.
  25% in 0.0073 secs.
  50% in 0.0103 secs.
  75% in 0.0139 secs.
  90% in 0.0180 secs.
  95% in 0.0222 secs.
  99% in 0.0369 secs.
  99.9% in 0.0588 secs.

I decided to try without putting values first, like this:

./benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678  --total 100000

on a73fb85:

Summary:
  Total:	2.3489 secs.
  Slowest:	0.0729 secs.
  Fastest:	0.0007 secs.
  Average:	0.0229 secs.
  Stddev:	0.0095 secs.
  Requests/sec:	42573.3523

Response time histogram:
  0.0007 [1]	|
  0.0079 [4743]	|∎∎∎∎∎∎
  0.0152 [15456]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0224 [30808]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0296 [28497]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0368 [12464]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0440 [5399]	|∎∎∎∎∎∎∎
  0.0513 [1698]	|∎∎
  0.0585 [647]	|
  0.0657 [260]	|
  0.0729 [27]	|

Latency distribution:
  10% in 0.0112 secs.
  25% in 0.0166 secs.
  50% in 0.0222 secs.
  75% in 0.0282 secs.
  90% in 0.0351 secs.
  95% in 0.0398 secs.
  99% in 0.0507 secs.
  99.9% in 0.0623 secs.

on 88922b0:

Summary:
  Total:	1.4633 secs.
  Slowest:	0.0835 secs.
  Fastest:	0.0005 secs.
  Average:	0.0116 secs.
  Stddev:	0.0070 secs.
  Requests/sec:	68336.5878

Response time histogram:
  0.0005 [1]	|
  0.0088 [39525]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0171 [45452]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0254 [10590]	|∎∎∎∎∎∎∎∎∎
  0.0337 [2719]	|∎∎
  0.0420 [1071]	|
  0.0503 [283]	|
  0.0586 [216]	|
  0.0669 [114]	|
  0.0752 [24]	|
  0.0835 [5]	|

Latency distribution:
  10% in 0.0051 secs.
  25% in 0.0071 secs.
  50% in 0.0101 secs.
  75% in 0.0140 secs.
  90% in 0.0196 secs.
  95% in 0.0244 secs.
  99% in 0.0381 secs.
  99.9% in 0.0618 secs.

Something is definitely going on, but I'm not sure about my benchmarking method; can someone confirm it? Today I'll dive into the commit, and I will keep you posted.


stale bot commented Dec 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 30, 2020
@stale stale bot closed this as completed Jan 20, 2021
@spzala spzala reopened this Jan 25, 2021
@stale stale bot removed the stale label Jan 25, 2021

PierreZ commented Jan 25, 2021

Thanks for reopening, as we are still stuck on this issue.

I talked with @jingyih about my previous question; he told me that the patch was needed to improve long reads. He also suggested trying to tune the backend batch limit/interval, but performance was worse afterwards.
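
For reference, the knobs in question correspond, to the best of my knowledge, to these server flags (flag names are from recent etcd versions and may differ in 3.4; the values shown are just the defaults, for illustration, not a recommendation):

# how long / how many writes the backend buffers before committing a batch to boltdb
etcd --backend-batch-interval=100ms --backend-batch-limit=10000 ...existing flags...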

As a temporary fix, we are adding more clusters to spread customers' load.


stale bot commented Apr 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 25, 2021
@stale stale bot closed this as completed May 16, 2021
xtso520ok commented:

I have a bunch of these

May 14 11:29:06 etcd05 etcd[11500]: {"level":"warn","ts":"2020-05-14T11:29:06.092+0200","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"476.300053ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager\" ","response":"range_response_count:1 size:365"}

May 14 11:29:06 etcd05 etcd[11500]: {"level":"info","ts":"2020-05-14T11:29:06.092+0200","caller":"traceutil/trace.go:145","msg":"trace[203066865] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager; range_end:; response_count:1; response_revision:16398215814; }","duration":"476.321764ms","start":"2020-05-14T11:29:05.616+0200","end":"2020-05-14T11:29:06.092+0200","steps":["trace[203066865] 'agreement among raft nodes before linearized reading'  (duration: 475.660494ms)"]}

and solution
