[ETCD] auto-compaction on kube-aws 0.9.7? #1061
After a month of running kubernetes, I tried to create a namespace and the API returned a message: Error from server: etcdserver: mvcc: database space exceeded. Not sure if this is related to history compaction?
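For anyone hitting this: the etcd maintenance docs describe the recovery from a space-quota (NOSPACE) alarm as compact, then defrag, then disarm the alarm. A rough sketch, assuming `etcdctl` with the v3 API (`ETCDCTL_API=3`); it only prints the commands, and the revision number is a placeholder:

```shell
# Sketch of the usual NOSPACE recovery sequence from the etcd
# maintenance guide. It only prints the etcdctl v3 commands
# (ETCDCTL_API=3); 12345 is a placeholder revision you would take
# from `etcdctl endpoint status`.
recovery_plan() {
  rev="$1"
  echo "etcdctl endpoint status --write-out=json  # find the current revision"
  echo "etcdctl compact $rev                      # drop history before that revision"
  echo "etcdctl defrag                            # release the freed pages"
  echo "etcdctl alarm disarm                      # clear NOSPACE so writes resume"
}
recovery_plan 12345
```

Run the printed commands one at a time against your cluster once you have the real revision.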
I've never experienced it myself, but probably you need to run a compaction manually? As far as I remember, we had not added the flag to enable the auto-compaction feature.
Noted @mumoshu. I was surprised, I have 2.5GB of etcd data. It really needs to be compacted. Will this be added in future releases of kube-aws?
Looking at the journald logs of the etcd cluster nodes I just created with kube-aws 0.9.9, it seems like scheduled auto-compaction occurs every 5 minutes.
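For reference, auto-compaction is configured through etcd's own flags or environment variables. A minimal sketch of what that configuration looks like, assuming etcd v3.3+ (the flag names are real etcd options; the 1-hour retention is just an example value, not what kube-aws ships):

```shell
# Example etcd environment (e.g. in a systemd drop-in): periodically
# compact away key history older than one hour. Equivalent flags:
#   --auto-compaction-mode=periodic --auto-compaction-retention=1h
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h
```

Note that compaction only reclaims logical history; the backing database file does not shrink until you defragment.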
@whereisaaron I see, is there an auto-defrag? By defragmenting ETCD I was able to reduce and optimize it, from 2.5GB down to just 400+MB. Still not sure why ETCD grows so huge and then shrinks.
The documentation you mentioned indicates that defragmentation blocks data access and so effectively takes the etcd node offline. For a large defragmentation it sounds like you would need to do it carefully, one node at a time, lest you lose quorum and the whole etcd cluster. A couple of ideas:
@mumoshu would either of those strategies make sense for an auto-defrag?
I think etcdctl may deal with that concern, although the docs are not particularly clear. Based on running it, it does appear the defrag occurs on one node at a time when I pass the full endpoint list in. I've also run it on single-node etcd dev clusters and that was fine too, although I've not done that many times.
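If you want tighter control than handing the full list to a single etcdctl invocation, one possible sketch is to drive it per endpoint yourself. The endpoint URLs below are hypothetical; by default the function only prints what it would run, and `RUN=1` makes it execute:

```shell
# Sketch: defragment members strictly one at a time, pausing between
# them so the cluster can settle. Endpoints are hypothetical examples;
# set RUN=1 to actually execute, otherwise it just prints the plan.
defrag_one_by_one() {
  for ep in "$@"; do
    if [ "${RUN:-0}" = "1" ]; then
      etcdctl --endpoints="$ep" defrag
    else
      echo "would run: etcdctl --endpoints=$ep defrag"
    fi
    sleep "${PAUSE:-0}"   # e.g. PAUSE=30 to wait between members
  done
}
defrag_one_by_one https://etcd0:2379 https://etcd1:2379 https://etcd2:2379
```

Checking cluster health (e.g. `etcdctl endpoint health`) between iterations would be the conservative variant discussed above.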
Right, I've been defragmenting each node one at a time. It is very important to defragment ETCD, as fragmentation can cause issues not only in ETCD itself but also when backing up to S3. Kindly correct me if I'm wrong here: I think kube-aws has an etcdadm-save. I think it backs up the ETCD data, saves the backup somewhere on the root volume, then pushes it to S3. It sometimes fails for me because the location where it dumps the data has too little space.
Yeah, it does seem to do them in turn. It is not done automatically because it can create potentially significant latency. If we were to schedule it, I'd be tempted to be a little more conservative and check that all is well before moving on to each node.
Or e.g. maybe use the
Some recent relevant discussion about the 'stop the world' behaviour of defragmentation. I was going to ask how we detect how fragmented the database is, but apparently there is no current mechanism (other than defragging and seeing what happens). Also, it appears that even defragmenting one node at a time can result in errors for k8s operations. You might need to actually remove an etcd node, delete its database, then re-add it, to get an error-free defragmentation, according to this:
Usually, based on my experience, I just check whether my ETCD goes above 1GB, because running 4,000 containers in a single cluster takes about 500~700MB. Based on what I have read before, the optimal size of ETCD should not be greater than 2GB. But yeah, maybe we could have that kind of check.
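A size check like that could be sketched as below. The 1GiB threshold mirrors the rule of thumb above, and the helper and sample byte counts are illustrative; real sizes would come from the `dbSize` field of `etcdctl endpoint status`:

```shell
# Sketch: flag members whose on-disk DB size exceeds a threshold.
# Real sizes would come from something like:
#   etcdctl endpoint status --write-out=json   # dbSize field, in bytes
THRESHOLD=$((1024 * 1024 * 1024))   # 1 GiB, per the rule of thumb above
check_db_size() {
  if [ "$1" -gt "$THRESHOLD" ]; then
    echo "defrag recommended"
  else
    echo "ok"
  fi
}
check_db_size 524288000    # ~500 MB
check_db_size 2684354560   # ~2.5 GB
```

A cron job (or a sidecar) could run such a check and alert, leaving the actual defrag as a deliberate, one-node-at-a-time operation.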
Our team experienced a similar problem recently. I created this PR: #1427 to address some of the issues.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Would like to ask if there is a history compaction on ETCD? Or any related ETCD maintenance?
https://github.com/coreos/etcd/blob/master/Documentation/op-guide/maintenance.md