
Random performance issue on etcd 3.4 #11884

Closed
Sh4d1 opened this issue May 12, 2020 · 77 comments

Sh4d1 commented May 12, 2020

Hello,

We are running a 5-node etcd cluster:

$ etcdctl endpoint status 
etcd01, 507905ef22a349ce, 3.4.7, 3.1 GB, true, false, 785, 17694912855, 17694912855, 
etcd02, 96622104eaa8652d, 3.4.7, 3.1 GB, false, false, 785, 17694912881, 17694912880, 
etcd03, e91fce12ee84c080, 3.4.7, 3.1 GB, false, false, 785, 17694912903, 17694912903, 
etcd04, 400fc14411f50272, 3.4.7, 3.1 GB, false, false, 785, 17694912989, 17694912985, 
etcd05, 87c46f0b178dc777, 3.4.7, 3.1 GB, false, false, 785, 17694913043, 17694913028, 

And we're seeing some weird performance issues, e.g.:

# etcdctl endpoint health
etcd01 is healthy: successfully committed proposal: took = 12.462058ms
etcd03 is healthy: successfully committed proposal: took = 18.826686ms
etcd02 is healthy: successfully committed proposal: took = 19.418745ms
etcd04 is healthy: successfully committed proposal: took = 24.314474ms
etcd05 is healthy: successfully committed proposal: took = 244.761598ms

# etcdctl endpoint health
etcd01 is healthy: successfully committed proposal: took = 13.505405ms
etcd03 is healthy: successfully committed proposal: took = 21.905048ms
etcd04 is healthy: successfully committed proposal: took = 22.569332ms
etcd02 is healthy: successfully committed proposal: took = 23.10597ms
etcd05 is healthy: successfully committed proposal: took = 24.182998ms

# etcdctl endpoint health
etcd05 is healthy: successfully committed proposal: took = 24.854541ms
etcd01 is healthy: successfully committed proposal: took = 86.045049ms
etcd03 is healthy: successfully committed proposal: took = 171.771975ms
etcd04 is healthy: successfully committed proposal: took = 576.218846ms
etcd02 is healthy: successfully committed proposal: took = 1.06666032s

Not sure how to debug it, it looks pretty random. Feel free to ask for more info!


xiang90 commented May 12, 2020

It looks like disk I/O spikes.


Sh4d1 commented May 13, 2020

That's also what I'm thinking; I'm also seeing periodically slow read-only range requests. Though I can't see any problems when looking at my metrics 🤔 @xiang90 any ideas on where to look?


Sh4d1 commented May 13, 2020

Commit durations look fine:

etcd_disk_backend_commit_duration_seconds_bucket{le="0.001"} 168386
etcd_disk_backend_commit_duration_seconds_bucket{le="0.002"} 576993
etcd_disk_backend_commit_duration_seconds_bucket{le="0.004"} 757708
etcd_disk_backend_commit_duration_seconds_bucket{le="0.008"} 782904
etcd_disk_backend_commit_duration_seconds_bucket{le="0.016"} 787763
etcd_disk_backend_commit_duration_seconds_bucket{le="0.032"} 788436
etcd_disk_backend_commit_duration_seconds_bucket{le="0.064"} 788584
etcd_disk_backend_commit_duration_seconds_bucket{le="0.128"} 788953
etcd_disk_backend_commit_duration_seconds_bucket{le="0.256"} 789118
etcd_disk_backend_commit_duration_seconds_bucket{le="0.512"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="1.024"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="2.048"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="4.096"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="8.192"} 789148
etcd_disk_backend_commit_duration_seconds_bucket{le="+Inf"} 789148
etcd_disk_backend_commit_duration_seconds_sum 1461.2600709299697
etcd_disk_backend_commit_duration_seconds_count 789148

fsync looks fine as well

etcd_disk_wal_fsync_duration_seconds_bucket{le="0.001"} 9.397643e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.002"} 9.4252083e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.004"} 9.4467886e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.008"} 9.4633426e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.016"} 9.4712732e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.032"} 9.4718483e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.064"} 9.4718972e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.128"} 9.4719722e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.256"} 9.4719978e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="0.512"} 9.4719985e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="1.024"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="2.048"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="4.096"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="8.192"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_bucket{le="+Inf"} 9.4719986e+07
etcd_disk_wal_fsync_duration_seconds_sum 18283.75915433406
etcd_disk_wal_fsync_duration_seconds_count 9.4719986e+07

rtt seems ok too:

etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0004"} 2
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0008"} 1813
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0016"} 2275
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0032"} 2307
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0064"} 2311
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0128"} 2312
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0256"} 4610
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.0512"} 4631
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.1024"} 4634
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.2048"} 4635
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.4096"} 4635
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="0.8192"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="1.6384"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="3.2768"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="400fc14411f50272",le="+Inf"} 4636
etcd_network_peer_round_trip_time_seconds_sum{To="400fc14411f50272"} 46.492993348000034
etcd_network_peer_round_trip_time_seconds_count{To="400fc14411f50272"} 4636
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0008"} 1742
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0016"} 2228
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0032"} 2276
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0064"} 2284
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0128"} 2291
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0256"} 4495
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.0512"} 4554
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.1024"} 4581
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.2048"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.4096"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="0.8192"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="1.6384"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="3.2768"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="87c46f0b178dc777",le="+Inf"} 4594
etcd_network_peer_round_trip_time_seconds_sum{To="87c46f0b178dc777"} 49.258597341999945
etcd_network_peer_round_trip_time_seconds_count{To="87c46f0b178dc777"} 4594
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0004"} 175
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0008"} 1974
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0016"} 2246
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0032"} 2277
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0064"} 2285
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0128"} 2288
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0256"} 4533
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.0512"} 4568
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.1024"} 4582
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.2048"} 4586
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.4096"} 4587
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="0.8192"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="1.6384"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="3.2768"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="96622104eaa8652d",le="+Inf"} 4588
etcd_network_peer_round_trip_time_seconds_sum{To="96622104eaa8652d"} 44.96468762199984
etcd_network_peer_round_trip_time_seconds_count{To="96622104eaa8652d"} 4588
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0001"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0004"} 1
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0008"} 1719
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0016"} 2238
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0032"} 2271
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0064"} 2278
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0128"} 2280
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0256"} 4529
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.0512"} 4558
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.1024"} 4567
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.2048"} 4573
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.4096"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="0.8192"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="1.6384"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="3.2768"} 4575
etcd_network_peer_round_trip_time_seconds_bucket{To="e91fce12ee84c080",le="+Inf"} 4575
etcd_network_peer_round_trip_time_seconds_sum{To="e91fce12ee84c080"} 46.92858499100008
etcd_network_peer_round_trip_time_seconds_count{To="e91fce12ee84c080"} 4575


Sh4d1 commented May 13, 2020

We are also seeing some /health errors: QGET failed etcdserver: request timed out (status code 503)


xiang90 commented May 13, 2020

Was there any leader election happening? Any warnings/errors in the logs?


Sh4d1 commented May 13, 2020

No leader election, and I see the latency spikes every 5-10s, on one node, for maybe 1-2s. No errors, but I think the slow read-only range requests happen at around the same time.


xiang90 commented May 13, 2020

but I think the slow read-only range requests happen at around the same time

If there is an expensive range request, it will affect latency, but etcd should log something when that happens.


Sh4d1 commented May 13, 2020

It only logs that the request took longer than 100ms, and I think the requests are not that big 🤔 I'll paste some examples tomorrow!


xiang90 commented May 13, 2020

It only logs that the request took longer than 100ms, and I think the requests are not that big 🤔 I'll paste some examples tomorrow!

Probably that's what is going on.


Sh4d1 commented May 13, 2020

@xiang90 I'm also seeing some high p99 gRPC latency (above 2s) on, for instance, Range, LeaseGrant, and Txn. I'm not sure which is causing which 🤔

tangcong commented:

Can you enable the trace feature (set --logger=zap)? You can specify --metrics=extensive to get latency for every gRPC method. Can you provide all the etcd metrics info and the trace log? @Sh4d1
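
For reference, a minimal sketch of what that looks like on the server command line (all other flags omitted; the histogram name below assumes etcd's standard gRPC instrumentation):

# enable the zap logger (needed for the trace logs) and extensive metrics
etcd --logger=zap --metrics=extensive ...existing flags...
# with --metrics=extensive, per-method latency histograms such as
#   grpc_server_handling_seconds_bucket{grpc_method="Range",...}
# appear on the /metrics endpoint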


Sh4d1 commented May 14, 2020

@tangcong here are the metrics for one of the nodes: https://gist.github.com/Sh4d1/bbfba21f3176865cf69c4533d5e39286
Do you want them for the 4 other nodes too?

Can't set the logger right now, I'll do it when I can

tangcong commented:

It seems that the number of connections is very large; how is the etcd server load?

process_open_fds 76494
go_goroutines 224164
etcd_server_health_failures 1900
etcd_server_heartbeat_send_failures_total 904


Sh4d1 commented May 14, 2020

Yes, there are around ~1000 k8s API servers connected to the cluster.
The CPU load is okay (4.49, 5.18, 5.81) and the CPU usage of etcd is between 200% and 600% (on 12 cores).


tangcong commented May 14, 2020

Do you also use etcd v2? What is the current cluster read and write QPS for etcd v2 and v3? In k8s, listing pods and events is expensive; in my opinion, supporting 1000 k8s clusters with one etcd cluster is very risky.

etcd_debugging_store_reads_total{action="get"} 505432
etcd_debugging_store_reads_total{action="getRecursive"} 4


Sh4d1 commented May 14, 2020

Nope, it's only v3 for all clusters. How can I get the QPS? 🤔


tangcong commented May 14, 2020
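
One rough way to estimate QPS from etcd's Prometheus metrics (a sketch; it assumes the standard gRPC metrics etcd exposes and a locally reachable metrics endpoint):

# overall request rate, as a Prometheus query:
#   sum(rate(grpc_server_handled_total{job="etcd"}[1m]))
# broken down per gRPC method:
#   sum(rate(grpc_server_handled_total{job="etcd"}[1m])) by (grpc_method)
# or a quick point-in-time look at the raw counters:
curl -s http://127.0.0.1:2379/metrics | grep '^grpc_server_handled_total'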


Sh4d1 commented May 14, 2020

@tangcong around 7k

tangcong commented:

I guess performance may have reached a bottleneck. I look forward to your etcd server trace log; it can give us detailed info.


Sh4d1 commented May 14, 2020

I have a bunch of these

May 14 11:29:06 etcd05 etcd[11500]: {"level":"warn","ts":"2020-05-14T11:29:06.092+0200","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"476.300053ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager\" ","response":"range_response_count:1 size:365"}

May 14 11:29:06 etcd05 etcd[11500]: {"level":"info","ts":"2020-05-14T11:29:06.092+0200","caller":"traceutil/trace.go:145","msg":"trace[203066865] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager; range_end:; response_count:1; response_revision:16398215814; }","duration":"476.321764ms","start":"2020-05-14T11:29:05.616+0200","end":"2020-05-14T11:29:06.092+0200","steps":["trace[203066865] 'agreement among raft nodes before linearized reading'  (duration: 475.660494ms)"]}


Sh4d1 commented May 14, 2020

Other long ones

May 14 11:35:51 etcd[11500]: {"level":"info","ts":"2020-05-14T11:35:51.795+0200","caller":"traceutil/trace.go:145","msg":"trace[233128838] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/configmaps/kube-system/ingress-controller-leader-nginx; range_end:; response_count:1; response_revision:16398946939; }","duration":"1.318159848s","start":"2020-05-14T11:35:50.477+0200","end":"2020-05-14T11:35:51.795+0200","steps":["trace[233128838] 'get authentication metadata'  (duration: 1.317348866s)"]}

@tangcong

tangcong commented:

Other long ones

May 14 11:35:51 etcd[11500]: {"level":"info","ts":"2020-05-14T11:35:51.795+0200","caller":"traceutil/trace.go:145","msg":"trace[233128838] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/configmaps/kube-system/ingress-controller-leader-nginx; range_end:; response_count:1; response_revision:16398946939; }","duration":"1.318159848s","start":"2020-05-14T11:35:50.477+0200","end":"2020-05-14T11:35:51.795+0200","steps":["trace[233128838] 'get authentication metadata'  (duration: 1.317348866s)"]}

@tangcong

func (as *authStore) isOpPermitted(userName string, revision uint64, key, rangeEnd []byte, permTyp authpb.Permission_Type) error {
	// TODO(mitake): this function would be costly so we need a caching mechanism
	if !as.IsAuthEnabled() {
		return nil
	}

	// only gets rev == 0 when passed AuthInfo{}; no user given
	if revision == 0 {
		return ErrUserEmpty
	}
	rev := as.Revision()
	if revision < rev {
		if as.lg != nil {
			as.lg.Warn("request auth revision is less than current node auth revision",
				zap.Uint64("current node auth revision", rev),
				zap.Uint64("request auth revision", revision),
				zap.ByteString("request key", key),
				zap.Error(ErrAuthOldRevision))
		} else {
			plog.Warningf("request auth revision is less than current node auth revision,"+
				"current node auth revision is %d,"+
				"request auth revision is %d,"+
				"request key is %s, "+
				"err is %v", rev, revision, key, ErrAuthOldRevision)
		}
		return ErrAuthOldRevision
	}

	tx := as.be.BatchTx()
	tx.Lock()
	defer tx.Unlock()

	user := getUser(as.lg, tx, userName)
	if user == nil {
		if as.lg != nil {
			as.lg.Warn("cannot find a user for permission check", zap.String("user-name", userName))
		} else {
			plog.Errorf("invalid user name %s for permission checking", userName)
		}
		return ErrPermissionDenied
	}

	// root role should have permission on all ranges
	if hasRootRole(user) {
		return nil
	}

	if as.isRangeOpPermitted(tx, userName, key, rangeEnd, permTyp) {
		return nil
	}

	return ErrPermissionDenied
}

Do you have auth enabled? I suspect that tx.Lock() is expensive, and I will try to optimize it (e.g., with a cache or an rwmutex).


Sh4d1 commented May 14, 2020

Yep, we use TLS auth (which is pretty expensive indeed 😅).

tangcong commented:

I would prefer to use a caching mechanism to improve isOpPermitted performance. Do you have any suggestions? Thanks.
@xiang90 @mitake


Sh4d1 commented May 14, 2020

@tangcong is this called on each request?


tangcong commented May 14, 2020 via email


Sh4d1 commented May 14, 2020

Not easily, no :(
But I'm not sure why it would only happen for ~1s every 5-10s 🤔

tangcong commented:

It may be related to the snapshot? It will increase the load. I see from your metrics file that your etcd cluster has v2 keys. How many v2 keys are there?


Sh4d1 commented May 14, 2020

That's what I thought, but it didn't exactly match the pattern (though it may be true).
Do I have to restart etcd with --enable-v2=true? 🤔


Sh4d1 commented May 14, 2020

And I think we don't have v2 keys though 🤔
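
One way to sanity-check that without restarting etcd is to watch the v2-store counters already present in the metrics dump (a sketch; the endpoint is an assumption, and note that etcd also uses the v2 store internally for membership data):

# these counters track v2 store operations
curl -s http://127.0.0.1:2379/metrics | grep -E '^etcd_debugging_store_(reads|writes)_total'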


Sh4d1 commented Jul 2, 2020

@tangcong any news on this?


retinio commented Jul 30, 2020

I have a similar problem.
Yesterday, the disk sync delay occurred on only one node (the etcd leader, going from 2ms to ~16-37ms).
(screenshot: etcd-1)

Today, it happens on a random node.
(screenshot: etcd-5)

Logs from the node at the time of the long disk sync:
(screenshot: etcd-6)
$ master-01
Jul 30 08:03:59 master-01 etcd[15005]: store.index: compact 37973635
Jul 30 08:03:59 master-01 etcd[15005]: finished scheduled compaction at 37973635 (took 46.989429ms)
Jul 30 08:08:59 master-01 etcd[15005]: store.index: compact 37975488
Jul 30 08:08:59 master-01 etcd[15005]: finished scheduled compaction at 37975488 (took 47.586964ms)
Jul 30 08:13:59 master-01 etcd[15005]: store.index: compact 37977341
Jul 30 08:13:59 master-01 etcd[15005]: finished scheduled compaction at 37977341 (took 51.217841ms)
Jul 30 08:18:59 master-01 etcd[15005]: store.index: compact 37979192
Jul 30 08:18:59 master-01 etcd[15005]: finished scheduled compaction at 37979192 (took 48.051738ms)
Jul 30 08:23:59 master-01 etcd[15005]: store.index: compact 37981047
Jul 30 08:23:59 master-01 etcd[15005]: finished scheduled compaction at 37981047 (took 46.45166ms)
Jul 30 08:28:59 master-01 etcd[15005]: store.index: compact 37982901
Jul 30 08:28:59 master-01 etcd[15005]: finished scheduled compaction at 37982901 (took 47.246082ms)

$ master-02
Jul 30 08:03:59 master-02 etcd[13317]: store.index: compact 37973635
Jul 30 08:03:59 master-02 etcd[13317]: finished scheduled compaction at 37973635 (took 46.375912ms)
Jul 30 08:08:59 master-02 etcd[13317]: store.index: compact 37975488
Jul 30 08:08:59 master-02 etcd[13317]: finished scheduled compaction at 37975488 (took 45.726045ms)
Jul 30 08:13:59 master-02 etcd[13317]: store.index: compact 37977341
Jul 30 08:13:59 master-02 etcd[13317]: finished scheduled compaction at 37977341 (took 45.845867ms)
Jul 30 08:18:59 master-02 etcd[13317]: store.index: compact 37979192
Jul 30 08:18:59 master-02 etcd[13317]: finished scheduled compaction at 37979192 (took 46.424525ms)
Jul 30 08:23:59 master-02 etcd[13317]: store.index: compact 37981047
Jul 30 08:23:59 master-02 etcd[13317]: finished scheduled compaction at 37981047 (took 45.964387ms)
Jul 30 08:28:59 master-02 etcd[13317]: store.index: compact 37982901
Jul 30 08:28:59 master-02 etcd[13317]: finished scheduled compaction at 37982901 (took 46.377512ms)
Jul 30 08:33:59 master-02 etcd[13317]: store.index: compact 37984757
Jul 30 08:33:59 master-02 etcd[13317]: finished scheduled compaction at 37984757 (took 46.59736ms)

$ master-03
Jul 30 08:03:59 master-03 etcd[17249]: store.index: compact 37973635
Jul 30 08:03:59 master-03 etcd[17249]: finished scheduled compaction at 37973635 (took 45.501156ms)
Jul 30 08:08:59 master-03 etcd[17249]: store.index: compact 37975488
Jul 30 08:08:59 master-03 etcd[17249]: finished scheduled compaction at 37975488 (took 45.430944ms)
Jul 30 08:13:59 master-03 etcd[17249]: store.index: compact 37977341
Jul 30 08:13:59 master-03 etcd[17249]: finished scheduled compaction at 37977341 (took 46.691081ms)
Jul 30 08:18:59 master-03 etcd[17249]: store.index: compact 37979192
Jul 30 08:18:59 master-03 etcd[17249]: finished scheduled compaction at 37979192 (took 45.734024ms)
Jul 30 08:23:59 master-03 etcd[17249]: store.index: compact 37981047
Jul 30 08:23:59 master-03 etcd[17249]: finished scheduled compaction at 37981047 (took 47.319429ms)
Jul 30 08:28:59 master-03 etcd[17249]: store.index: compact 37982901
Jul 30 08:28:59 master-03 etcd[17249]: finished scheduled compaction at 37982901 (took 45.726165ms)
Jul 30 08:33:59 master-03 etcd[17249]: store.index: compact 37984757
Jul 30 08:33:59 master-03 etcd[17249]: finished scheduled compaction at 37984757 (took 46.440745ms)

etcd (version 3.4.7) uses a separate SSD disk.

$ etcdctl endpoint health
https://172.16.192.22:2379 is healthy: successfully committed proposal: took = 4.109557ms
https://172.16.192.23:2379 is healthy: successfully committed proposal: took = 4.625822ms
https://172.16.192.24:2379 is healthy: successfully committed proposal: took = 5.467934ms

$ etcdctl endpoint status
https://172.16.192.22:2379, 84888d87b4780bbe, 3.4.7, 46 MB, true, false, 842, 42078843, 42078843,
https://172.16.192.23:2379, 74d87de777c36d9b, 3.4.7, 46 MB, false, false, 842, 42078843, 42078843,
https://172.16.192.24:2379, d00e45ffe6a40a1b, 3.4.7, 46 MB, false, false, 842, 42078843, 42078843,

fio
fsync/fdatasync/sync_file_range:
sync (usec): min=303, max=5445, avg=396.53, stdev=154.94
sync percentiles (usec):
| 1.00th=[ 318], 5.00th=[ 322], 10.00th=[ 326], 20.00th=[ 334],
| 30.00th=[ 338], 40.00th=[ 347], 50.00th=[ 359], 60.00th=[ 367],
| 70.00th=[ 383], 80.00th=[ 408], 90.00th=[ 486], 95.00th=[ 709],
| 99.00th=[ 766], 99.50th=[ 799], 99.90th=[ 2474], 99.95th=[ 3458],
| 99.99th=[ 5473]

I don't understand why this is happening. Should I be worried about this?


PierreZ commented Aug 12, 2020

Hi 👋

We are also experiencing performance issues on most of our (loaded) etcd clusters. We have a setup very similar to @Sh4d1's:

  • 3 nodes running 3.4.9
  • hundreds of active k8s API servers connected to it
  • ~1.5k range/s and 500 txn/s
  • load average: 2.83 3.49 3.50 out of 16 cores
  • disk commit duration is fine thanks to expensive nvme disks
  • auth disabled
  • metrics are also fine except etcd_server_slow_apply_total (~70/s):
    • no heartbeat fail
    • no etcd_server_slow_read_indexes_total or etcd_server_read_indexes_failed_total
    • no network issues
  • nothing in the logs, except this one which correlates with etcd_server_slow_apply_total:
{
  "level": "info",
  "ts": "2020-08-12T08:24:56.181Z",
  "caller": "traceutil/trace.go:145",
  "msg": "trace[677217921] range",
  "detail": "{range_begin:/...redacted...; range_end:; response_count:1; response_revision:2725080604; }",
  "duration": "1.553047811s",
  "start": "2020-08-12T08:24:54.628Z",
  "end": "2020-08-12T08:24:56.181Z",
  "steps": [
    "trace[677217921] 'agreement among raft nodes before linearized reading'  (duration: 1.534322015s)"
  ]
}

Duration varies between 150ms and 2s.

@tangcong did you have time to dig into it? I'm considering adding more tracing. I was thinking about broadcasting the trace through the readwaitc channel in order to trace things inside the linearizableReadLoop. This would let us see the timing of s.r.ReadIndex vs s.applyWait.Wait(rs.Index). What are your thoughts on this?

tangcong commented:

Yes, we have to add more tracing to linearized reads. Can you do it? I cannot reproduce it. Thanks. @PierreZ


PierreZ commented Aug 12, 2020

Thanks for the quick reply @tangcong, working on it 🚀


PierreZ commented Aug 13, 2020

Yesterday I made a patch (green in CI) based on the master branch, which seems to be different from release-3.4. I will cherry-pick it onto our etcd version to test it under high QPS before submitting the PR.

I will keep you posted 😃

tangcong commented:

Good. Looking forward to your analysis results. @PierreZ


PierreZ commented Aug 21, 2020

I didn't have time to deploy the patch because of KubeCon, I'll do it once I'm back from holidays!


PierreZ commented Sep 22, 2020

@tangcong I reworked my patch and I will push it towards master soon, but here's an extract of some traces for linearized reading (taken from a follower):

{
  "level": "info",
  "ts": "2020-09-22T12:54:01.021Z",
  "caller": "traceutil/trace.go:152",
  "msg": "trace[2138445431] linearizableReadLoop",
  "detail": "",
  "duration": "855.447896ms",
  "start": "2020-09-22T12:54:00.166Z",
  "end": "2020-09-22T12:54:01.021Z",
  "steps": [
    "trace[2138445431] read index received  (duration: 824.408µs)",
    "trace[2138445431] applied index is now lower than readState.Index  (duration: 854.622058ms)"
  ]
}

More here.

Seems like we are spending most of our time waiting for applied indexes.

FYI, this patch has been tested on an SSD-based cluster and an NVMe-based cluster; same behavior.
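
A quick way to watch the related counters while a slowdown is happening (a sketch; the endpoint is an assumption, and the metric names are the ones already discussed above):

watch -n 5 "curl -s http://127.0.0.1:2379/metrics | grep -E '^etcd_server_(slow_apply_total|slow_read_indexes_total|read_indexes_failed_total)'"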

tangcong commented:

Thanks @PierreZ. When etcd spends most of its time waiting for the applied index, is there a slow-apply warning in the log? What is the network delay between the nodes?


PierreZ commented Sep 25, 2020

Yes, we have a high number of slow applies; this is the max for several clusters with NVMe disks:
(screenshot)

Network seems fine (max(rate(etcd_network_peer_round_trip_time_seconds_sum))):
(screenshot)

I will be happy to investigate and find the bottlenecks here. @tangcong, do you see places where we could add some tracing? I was digging into the read path, and I haven't (yet!) read the code for the write path.

tangcong commented:

Good. etcd is best suited to workloads that read more and write less. If write requests are more frequent, read performance (linearized reads) may degrade more severely.


PierreZ commented Sep 28, 2020

Good. etcd is best suited to workloads that read more and write less. If write requests are more frequent, read performance (linearized reads) may degrade more severely.

Do you think ~600 txn/s and ~1k range/s is the limit for a 3-member NVMe cluster? Do you see a way to improve the throughput?

jadencodes commented:

Found this thread today after a lot of digging into our own problem. We aren't using Txns, just plain deletes (with return prev_value) and ranges/puts. After lots of tweaking and messing around, we still couldn't figure out why everything looks fine for hours/days (at most 3 of 16 cores used, 1 GiB of RAM, plenty of IOPS) and then the node health check starts failing for no apparent reason.

We have 3 etcd nodes with the following specs:

  • 16 cores
  • 32 gigs ram
  • 6k iops io1 drive

We are doing at most 200 deletes per second and 300 reads per second, and we are rarely near those rates when we see this issue. It is a single connection/client so far, with a single lease and around 150k keys on that lease. We plan to expand the number of leases/keys, but not until we can resolve this.

Our cluster is running the latest 3.4.13 in a Kubernetes StatefulSet (these are not the Kubernetes control plane's etcd nodes). We have debug logging turned on, and the scenario appears to be as follows:

  1. Slower and slower reads/deletes (standard is 20-40ms) over the course of 1-5 minutes
  2. Pod health check fails probe (etcdctl endpoint health)
  3. Pod restarts (as expected)
  4. Cluster is fine again

After increasing the lease TTL to 10 seconds for our use case, the cluster is able to recover before the TTL expires, so it appears to the consumer that everything is fine. However, it feels like either our consumer or etcd is doing something fundamentally wrong, because it really looks steady for a while and then falls over for no apparent reason.
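
For illustration, the etcdctl equivalent of that lease pattern (a sketch; the key name is hypothetical and the default local endpoint is assumed):

# grant a 10s lease, attach a key to it, and keep it alive from the client side
LEASE_ID=$(etcdctl lease grant 10 | awk '{print $2}')   # output: lease <id> granted with TTL(10s)
etcdctl put --lease="$LEASE_ID" /workers/worker-1 alive
etcdctl lease keep-alive "$LEASE_ID"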

Someone above mentioned slow applies being a symptom; after checking ours, it aligns exactly with the couple of minutes before the pod restarts, and it is the only indication of trouble coming.
(screenshot)

Let me know if there is a better place to post this or if there is any more info I can provide!


PierreZ commented Sep 29, 2020

Someone above mentioned slow applies being a symptom; after checking ours, it aligns exactly with the couple of minutes before the pod restarts, and it is the only indication of trouble coming.

Yes, etcd_server_slow_apply_total is a very good metric to follow to see etcd's health and performance.

  3. Pod restarts (as expected)

Is it the leader?

  4. Cluster is fine again

Can you check whether the slow applies happen only on the failing pod or on the others too? We are running a defrag every 20 minutes on one node of the cluster; as you are deleting a lot of keys, that might help.
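
For reference, the defrag itself is just the command below (a sketch; the endpoint is a placeholder, the 20-minute scheduling is up to you, and defragmentation blocks the member it runs on, so target one member at a time):

etcdctl --endpoints=https://etcd-1:2379 defrag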

In the meantime, I will go through the write path's code and see if I can add some tracing to find bottlenecks, while waiting for @tangcong's answer to my question.

jadencodes commented:

@PierreZ Thanks for the response! It appears that it is not the leader that is restarting, but the pod that the client is actually connected to (it only connects to one of the pods, not all members of the cluster). The slow applies show the same thing: sometimes a slight increase on the leader, but generally just on the pod the client is connected to. I'll investigate implementing defrag. Thank you.


PierreZ commented Oct 1, 2020

I made some progress, thanks to @andyliuliming! I went through #11783, where he identified a perf regression. Thanks a lot for pointing me in a direction 👍

So I dug between v3.3.22 and v3.4.0 and ran the tests to find which commit introduced the regression. I found this commit by @jingyih, but I may be wrong about the benchmark.

To reproduce:

git checkout a73fb85c0cb5b5e112d600a83f01318b6d2748c2
./build && goreman -f ./Procfile.v2 start

# in another terminal
# first putting some values and then range
benchmark put --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 --key-size 8 --val-size 256 --total 100000 --sequential-keys && benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678  --total 100000

range result

Summary:
  Total:	3.7788 secs.
  Slowest:	0.1397 secs.
  Fastest:	0.0007 secs.
  Average:	0.0375 secs.
  Stddev:	0.0160 secs.
  Requests/sec:	26463.6391

Response time histogram:
  0.0007 [1]	|
  0.0146 [5547]	|∎∎∎∎∎
  0.0285 [23182]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0424 [38186]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0563 [22220]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0702 [7614]	|∎∎∎∎∎∎∎
  0.0841 [2246]	|∎∎
  0.0980 [598]	|
  0.1119 [147]	|
  0.1258 [142]	|
  0.1397 [117]	|

Latency distribution:
  10% in 0.0185 secs.
  25% in 0.0270 secs.
  50% in 0.0361 secs.
  75% in 0.0461 secs.
  90% in 0.0573 secs.
  95% in 0.0650 secs.
  99% in 0.0842 secs.
  99.9% in 0.1280 secs.

If you run the same benchmark with the commit just before (88922b0), we get the following for range requests:

Summary:
  Total:	1.5111 secs.
  Slowest:	0.1128 secs.
  Fastest:	0.0004 secs.
  Average:	0.0114 secs.
  Stddev:	0.0064 secs.
  Requests/sec:	66176.6583

Response time histogram:
  0.0004 [1]	|
  0.0116 [60772]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0229 [34807]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0341 [3146]	|∎∎
  0.0453 [945]	|
  0.0566 [172]	|
  0.0678 [131]	|
  0.0791 [11]	|
  0.0903 [10]	|
  0.1016 [1]	|
  0.1128 [4]	|

Latency distribution:
  10% in 0.0052 secs.
  25% in 0.0073 secs.
  50% in 0.0103 secs.
  75% in 0.0139 secs.
  90% in 0.0180 secs.
  95% in 0.0222 secs.
  99% in 0.0369 secs.
  99.9% in 0.0588 secs.

I decided to try without putting values first, like this:

./benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678  --total 100000

on a73fb85:

Summary:
  Total:	2.3489 secs.
  Slowest:	0.0729 secs.
  Fastest:	0.0007 secs.
  Average:	0.0229 secs.
  Stddev:	0.0095 secs.
  Requests/sec:	42573.3523

Response time histogram:
  0.0007 [1]	|
  0.0079 [4743]	|∎∎∎∎∎∎
  0.0152 [15456]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0224 [30808]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0296 [28497]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0368 [12464]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0440 [5399]	|∎∎∎∎∎∎∎
  0.0513 [1698]	|∎∎
  0.0585 [647]	|
  0.0657 [260]	|
  0.0729 [27]	|

Latency distribution:
  10% in 0.0112 secs.
  25% in 0.0166 secs.
  50% in 0.0222 secs.
  75% in 0.0282 secs.
  90% in 0.0351 secs.
  95% in 0.0398 secs.
  99% in 0.0507 secs.
  99.9% in 0.0623 secs.

on 88922b0:

Summary:
  Total:	1.4633 secs.
  Slowest:	0.0835 secs.
  Fastest:	0.0005 secs.
  Average:	0.0116 secs.
  Stddev:	0.0070 secs.
  Requests/sec:	68336.5878

Response time histogram:
  0.0005 [1]	|
  0.0088 [39525]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0171 [45452]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0254 [10590]	|∎∎∎∎∎∎∎∎∎
  0.0337 [2719]	|∎∎
  0.0420 [1071]	|
  0.0503 [283]	|
  0.0586 [216]	|
  0.0669 [114]	|
  0.0752 [24]	|
  0.0835 [5]	|

Latency distribution:
  10% in 0.0051 secs.
  25% in 0.0071 secs.
  50% in 0.0101 secs.
  75% in 0.0140 secs.
  90% in 0.0196 secs.
  95% in 0.0244 secs.
  99% in 0.0381 secs.
  99.9% in 0.0618 secs.

Something is definitely going on, but I'm not sure about my benchmarking method; can someone confirm it? Today I'll dive into the commit, and I will keep you posted.


stale bot commented Dec 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 30, 2020
@stale stale bot closed this as completed Jan 20, 2021
@spzala spzala reopened this Jan 25, 2021
@stale stale bot removed the stale label Jan 25, 2021

PierreZ commented Jan 25, 2021

Thanks for reopening, as we are still stuck on this issue.

I talked with @jingyih about my previous question; he told me that the patch was needed to improve long reads. He also suggested trying to tune the backend batch limit/interval, but performance was worse afterwards.
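
For reference, the knobs in question correspond, to the best of my knowledge, to these server flags (flag names are from recent etcd versions and may differ in 3.4; the values shown are just the defaults, for illustration, not a recommendation):

# how long / how many writes the backend buffers before committing a batch to boltdb
etcd --backend-batch-interval=100ms --backend-batch-limit=10000 ...existing flags...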

As a temporary fix, we are adding more clusters to spread customers' load.


stale bot commented Apr 25, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 25, 2021
@stale stale bot closed this as completed May 16, 2021
xtso520ok commented:

I have a bunch of these

May 14 11:29:06 etcd05 etcd[11500]: {"level":"warn","ts":"2020-05-14T11:29:06.092+0200","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"476.300053ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager\" ","response":"range_response_count:1 size:365"}

May 14 11:29:06 etcd05 etcd[11500]: {"level":"info","ts":"2020-05-14T11:29:06.092+0200","caller":"traceutil/trace.go:145","msg":"trace[203066865] range","detail":"{range_begin:/70c88f65-ffe4-489a-b256-f7266b2dd48e/api/leases/kube-system/kube-controller-manager; range_end:; response_count:1; response_revision:16398215814; }","duration":"476.321764ms","start":"2020-05-14T11:29:05.616+0200","end":"2020-05-14T11:29:06.092+0200","steps":["trace[203066865] 'agreement among raft nodes before linearized reading'  (duration: 475.660494ms)"]}

and solution
