Random performance issue on etcd 3.4 #11884
Comments
It looks like disk I/O spikes.
This is also what I'm thinking; I also have periodically slow read-only range requests, though I can't see any problems when looking at my metrics 🤔 @xiang90 any ideas on where to look?
Commit durations look fine.
fsync looks fine as well.
RTT seems OK too:
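For anyone following along, those panels typically come from etcd's own Prometheus histograms; a quick way to eyeball the raw numbers without a dashboard is to scrape the metrics endpoint directly. A minimal sketch, assuming a plain-HTTP client URL on 127.0.0.1:2379 (add the usual TLS flags otherwise):

```sh
# Pull the disk and network histograms discussed above
# (backend commit, WAL fsync, peer round-trip time) straight from /metrics.
curl -s http://127.0.0.1:2379/metrics | \
  grep -E 'etcd_disk_backend_commit_duration_seconds|etcd_disk_wal_fsync_duration_seconds|etcd_network_peer_round_trip_time_seconds'
```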
We are also seeing some:
Was there any leader election happening? Any warnings/errors in the log?
No leader election, and I see the spikes in latency every 5-10s on 1 node, for maybe 1-2s. No errors, but I think the read-only range requests happen at around the same time.
If there is an expensive range request, it will affect latency, but etcd should log something when that happens.
It only logs that the request took longer than 100ms, and I think the requests are not that big 🤔 I'll paste some examples tomorrow!
Probably that was what was going on.
@xiang90 I'm also seeing some high p99 gRPC latency (above 2s) on, for instance, Range, LeaseGrant and Txn. I'm not sure which is causing which 🤔
Can you enable the trace feature (set --logger=zap)? You can specify --metrics=extensive to get latency for every gRPC method. Can you provide all etcd metrics info and the trace log? @Sh4d1
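For anyone else hitting this, a minimal sketch of the flags mentioned above; the name, data dir and URLs are placeholders for your own configuration:

```sh
# Structured zap logging plus per-gRPC-method latency histograms (etcd 3.4 flags).
etcd --name infra0 \
  --data-dir /var/lib/etcd \
  --logger=zap \
  --log-level=info \
  --metrics=extensive \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379
```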
@tangcong here are the metrics for one of the nodes: https://gist.github.com/Sh4d1/bbfba21f3176865cf69c4533d5e39286. Can't set the logger right now, I'll do it when I can.
It seems that the number of connections is very large. How about the etcd server load?
Yes, there are around ~1000 k8s API servers connected to the cluster.
Do you also use etcd v2? What is the current cluster read and write QPS for etcd v2 and v3? In k8s, listing pods and events is expensive; in my opinion, supporting 1000 k8s clusters with one etcd cluster is very risky.
Nope, it's only v3 for all clusters. How can I get the QPS? 🤔
@tangcong around 7k
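For reference, etcd exposes per-method gRPC counters, so one rough way to estimate QPS is to diff those counters over a time window; the Prometheus query in the comment is the more usual approach if the cluster is already scraped. A sketch, assuming a plain-HTTP client URL:

```sh
# grpc_server_handled_total is a counter, so QPS is its increase over a window.
# With Prometheus: sum(rate(grpc_server_handled_total{grpc_type="unary"}[1m]))
curl -s http://127.0.0.1:2379/metrics | grep '^grpc_server_handled_total' > /tmp/qps_t0
sleep 60
curl -s http://127.0.0.1:2379/metrics | grep '^grpc_server_handled_total' > /tmp/qps_t1
# The per-second rate for each method is (t1 - t0) / 60.
diff /tmp/qps_t0 /tmp/qps_t1
```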
I guess performance may have reached a bottleneck. I look forward to your etcd server trace log; it can give us detailed info.
I have a bunch of these:
Other long ones:
Do you enable auth? I guess that tx.Lock() is expensive, and I will try to optimize it (such as caching, or using an RWMutex).
Yep, we use TLS auth (which is pretty expensive indeed 😅).
@tangcong is this called on each request?
Yes. If there are no safety issues, can you disable auth to verify it in your test environment?
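One way to check whether auth-store locking is actually hot, without disabling auth, is to grab a CPU profile from the server. A sketch, assuming etcd was started with --enable-pprof (which exposes /debug/pprof on the client URL) and a plain-HTTP endpoint:

```sh
# 30-second CPU profile from the running etcd server; in the resulting
# profile, look for time spent under the auth store and backend read-tx locking.
go tool pprof 'http://127.0.0.1:2379/debug/pprof/profile?seconds=30'
```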
Not easily, no :(
It may be related to the snapshot? It will increase the load. I see from your metrics file that your etcd cluster has v2 keys. How many v2 keys?
That's what I thought, but it didn't exactly match the pattern (though it may be true).
And I think we don't have v2 keys though 🤔
@tangcong any news on this?
I have a similar problem. Today it is on a random node. Logs from the nodes at the time of the long disk sync: $ master-01, $ master-02, $ master-03. etcd (version 3.4.7) uses a separate SSD disk. Output of $ etcdctl endpoint health and $ etcdctl endpoint status:
I don't understand why this is happening. Should I be worried about this?
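For completeness, the health and status checks above are usually run with explicit endpoints and, on TLS clusters, client certificates. A sketch; the hostnames and certificate paths are placeholders:

```sh
# Per-member health and status (db size, leader, raft index) in table form.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://master-01:2379,https://master-02:2379,https://master-03:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  endpoint health

ETCDCTL_API=3 etcdctl \
  --endpoints=https://master-01:2379,https://master-02:2379,https://master-03:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  endpoint status -w table
```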
Hi 👋 We are also experiencing some performance issues on most of our (loaded) etcd clusters. We have a setup very similar to @Sh4d1's:
{
"level": "info",
"ts": "2020-08-12T08:24:56.181Z",
"caller": "traceutil/trace.go:145",
"msg": "trace[677217921] range",
"detail": "{range_begin:/...redacted...; range_end:; response_count:1; response_revision:2725080604; }",
"duration": "1.553047811s",
"start": "2020-08-12T08:24:54.628Z",
"end": "2020-08-12T08:24:56.181Z",
"steps": [
"trace[677217921] 'agreement among raft nodes before linearized reading' (duration: 1.534322015s)"
]
}

Duration seems to move between 150ms and 2s. @tangcong did you have time to dig into it? I'm considering adding more tracing. I was thinking about broadcasting the trace through the
Yes, we have to add more tracing to linearized reads. Can you do it? I cannot reproduce it. Thanks. @PierreZ
Thanks for the quick reply @tangcong, working on it 🚀
Yesterday I made a patch (green in CI) based on the master branch, which seems to be different compared to release-3.4. I will cherry-pick it onto our etcd version to test it under high QPS before submitting the PR. I will keep you posted 😃
Good. Looking forward to your analysis results. @PierreZ
I didn't have time to deploy the patch because of KubeCon; I'll do it once I'm back from holidays!
@tangcong I reworked my patch and I will push it to master soon, but here's an extract of some traces for linearized reading (taken from a follower):

{
"level": "info",
"ts": "2020-09-22T12:54:01.021Z",
"caller": "traceutil/trace.go:152",
"msg": "trace[2138445431] linearizableReadLoop",
"detail": "",
"duration": "855.447896ms",
"start": "2020-09-22T12:54:00.166Z",
"end": "2020-09-22T12:54:01.021Z",
"steps": [
"trace[2138445431] read index received (duration: 824.408µs)",
"trace[2138445431] applied index is now lower than readState.Index (duration: 854.622058ms)"
]
}

More here. Seems like we are spending most of our time waiting for applied indexes. FYI, this patch has been tested on an SSD-based cluster and an NVMe-based cluster; same behavior.
Thanks @PierreZ. When etcd spends most of its time waiting for the applied index, is there a slow-apply warning in the log? What is the network delay between nodes?
Yes, we have a high number of slow applies; this is the max for several clusters with NVMe disks:
Network seems fine (max(rate(
I will be happy to investigate to find the bottlenecks here. @tangcong do you see places where we could add some tracing? As I was digging into
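As a side note, etcd exports counters that correspond directly to these symptoms, which makes it easier to correlate slow applies with slow read-indexes across members. A sketch, assuming a plain-HTTP client URL (add TLS flags as needed):

```sh
# Counters that increment on slow applies and slow linearizable read-indexes,
# plus the number of pending proposals; compare their growth across members.
curl -s http://127.0.0.1:2379/metrics | \
  grep -E 'etcd_server_slow_apply_total|etcd_server_slow_read_indexes_total|etcd_server_proposals_pending'
```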
Good. etcd is suited to workloads that read more and write less. If write requests are more frequent, read performance (linearized reads) may degrade more severely.
Do you think ~600 Txn / 1k Range is the limit on a 3-member NVMe cluster? Do you see a way to improve the throughput?
Yes,
Is it the leader?
Can you check if there is
In the meantime, I will go through the write path's code and see if I can add some tracing to find bottlenecks while waiting for @tangcong's answer to my question.
@PierreZ Thanks for the response! It appears that it is not the leader that is restarting but the pod that the client is actually connected to (it only connects to one of the pods, not all members of the cluster). The slow apply is the same thing: a slight increase on the leader sometimes, but generally just on the pod the client is connected to. I'll investigate implementing defrag. Thank you.
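In case it helps, a minimal sketch of defragmentation with etcdctl; note that defrag blocks the member while it runs, so it is usually done one endpoint at a time (the endpoint and TLS flags are placeholders):

```sh
# Check the db size first, then defragment a single member at a time.
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0:2379 endpoint status -w table
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0:2379 defrag
```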
I made some progress, thanks to @andyliuliming! I went through #11783, where he identified a perf regression. Thanks a lot for pointing me in the right direction 👍 So I dug between v3.3.22 and v3.4.0 and ran the tests to find which commit introduced the regression. I found this commit by @jingyih, but I may be wrong about the benchmark.

To reproduce:

git checkout a73fb85c0cb5b5e112d600a83f01318b6d2748c2
./build && goreman -f ./Procfile.v2 start
# in another terminal
# first putting some values and then range
benchmark put --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 --key-size 8 --val-size 256 --total 100000 --sequential-keys && benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678 --total 100000

Range result:
If you try the same bench with the commit just before (88922b0), we obtain for range requests:
I decided to try without putting values first, like this:

./benchmark --endpoints http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 --clients 1000 --conns 100 range 12345678123456781234567812345678 --total 100000

on a73fb85:
on 88922b0:
Something is definitely going on, but I'm not sure about my benchmarking method; can someone confirm it? Today I'll dive into the commit, and I will keep you posted.
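One possible way to automate this kind of regression hunt is git bisect with a small wrapper script that rebuilds, restarts the test cluster, runs the benchmark above, and fails when the latency crosses a chosen threshold. The check-perf.sh script below is hypothetical and only sketches the idea; the bad/good commits would be the ones identified above:

```sh
# Mark a known-bad and a known-good commit, then let git walk the range in between.
git bisect start <bad-commit> <good-commit>
# check-perf.sh (hypothetical): ./build, restart the goreman cluster, run the
# benchmark, exit 0 if the range p99 is under the threshold, exit 1 otherwise.
git bisect run ./check-perf.sh
git bisect reset
```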
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Thanks for reopening, as we are still stuck on this issue. I talked with @jingyih about my previous question; he told me that the patch was needed to improve long reads. He also suggested trying to tune the batch limit/interval, but performance was worse afterwards. As a temporary fix, we are adding more clusters to spread customers' load.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Hello,
We are running a 5-node etcd cluster:
And we're having some weird performance issues, e.g.:
Not sure how to debug it; it looks pretty random. Feel free to ask for more info!