It's possible to get "wal: max entry size limit exceeded" with recommended values #14025

Closed
biosvs opened this issue May 9, 2022 · 22 comments

Comments

@biosvs
Contributor

biosvs commented May 9, 2022

What happened?

I ran a clean etcd node as follows:
etcd --max-request-bytes=10485760

Then, through the Go client, I put a value consisting of 10*1024*1024-27 bytes to key k1.

Then I stopped the server and tried to start it again, but it failed with the error wal: max entry size limit exceeded (https://github.com/etcd-io/etcd/blob/main/server/storage/wal/decoder.go#L88).

What did you expect to happen?

  1. It should not be possible to write more data than etcd can subsequently read.
  2. It should be obvious how many bytes of data (key + value) I'm able to write. I set --max-request-bytes to 10MB, but in fact I was able to pass 25 bytes less than that (2 bytes for the key, 10MB-27 bytes for the value). Maybe internal overhead should not be part of the validation?
  3. It should not be possible to set --max-request-bytes to a value so high that etcd will accept writes it cannot read back afterwards, OR the WAL limit should be configurable.

How can we reproduce it (as minimally and precisely as possible)?

Run a clean etcd instance:
etcd --max-request-bytes=10485760

Run this Go code:

package main

import (
	"context"
	"go.etcd.io/etcd/clientv3"
)

func main() {
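	// 10MiB minus 27 bytes: with the 2-byte key plus the marshaling overhead,
	// the request comes out right at --max-request-bytes (10485760).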
	length := 10*1024*1024 - 27
	b := make([]byte, length)
	for i := 0; i < length; i++ {
		b[i] = 'a'
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"http://127.0.0.1:2379"},
		MaxCallSendMsgSize: length + 1024,
	})

	if err != nil {
		panic(err)
	}

	_, err = cli.Put(context.Background(), "k1", string(b))
	if err != nil {
		panic(err)
	}
}

Anything else we need to know?

Assuming you run the code above, you will not be able to restart the server.
But you can run etcdctl snapshot save while the server is still running, delete all the WAL files, then start the server and still get the saved value back.

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.18.1
Go OS/Arch: darwin/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

--max-request-bytes=10485760

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+---------+-----------------------+-----------------------+------------+
|        ID        | STATUS  |  NAME   |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+---------+-----------------------+-----------------------+------------+
| 8e9e05c52164694d | started | default | http://localhost:2380 | http://localhost:2379 |      false |
+------------------+---------+---------+-----------------------+-----------------------+------------+

$ etcdctl --endpoints=<member list> endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://localhost:2379 | 8e9e05c52164694d |   3.5.4 |   25 kB |      true |      false |         2 |          4 |                  4 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

{"level":"fatal","ts":"2022-05-09T13:19:04.630+0300","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/etcdmain/main.go:40\nmain.main\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/main.go:32\nruntime.main\n\t/usr/local/Cellar/go/1.18.1/libexec/src/runtime/proc.go:250"}
@biosvs biosvs added the type/bug label May 9, 2022
@ahrtr
Member

ahrtr commented May 10, 2022

There are two reasons:

  1. Firstly, when marshaling the data, there is indeed some overhead. In your example, len(key) + len(value) = 10*1024*1024-27 + 2, but the final marshaled len(data) is actually 10485760.
  2. Secondly, when etcd saves the data into WAL files, there is additional overhead (24 bytes). See raft.pb.go#L268-L270. So the decoded length is 10485760 + 24 = 10485784.

In summary, the flag --max-request-bytes limits the maximum size of a request, but it isn't exact; there are a few dozen bytes of overhead. That should be OK.
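If it helps, here is a rough way to see the marshaling overhead for yourself. This is only a sketch: it uses the gogo-generated Size() method on etcdserverpb.PutRequest, and the real request is further wrapped (internal raft request header, WAL record framing) before it reaches the WAL, so the numbers it prints show only part of the overhead.

package main

import (
	"fmt"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

func main() {
	// Same sizes as in the report: a 2-byte key and a 10MiB-27 value.
	req := &pb.PutRequest{
		Key:   []byte("k1"),
		Value: make([]byte, 10*1024*1024-27),
	}
	// Size() returns the protobuf-encoded length of the PutRequest alone;
	// the difference from the raw payload is the field tags and length varints.
	fmt.Println("payload bytes:  ", len(req.Key)+len(req.Value))
	fmt.Println("marshaled bytes:", req.Size())
}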

Does this cause any issue for you? Did you see a real issue in a production environment, or did you just test it intentionally?

@algobardo

algobardo commented May 10, 2022

@ahrtr I think there are several concerning issues revealed by this bug report:

  1. An etcd instance configured with the suggested values can be brought by an adversarial client to silently rot its WAL, to the point that nodes are unable to restart because they cannot read the entry back from the WAL. Solution: if there is a limit on the WAL entry size, make it a hard limit on the write path as well, so that the rotting is neither possible nor silent.

  2. The WAL entry limit of 10MB is not documented anywhere, and in turn translates into a limit of 10MB per transaction. Solution: document this limit.

  3. From my understanding, the limit in the WAL decoder was introduced in 2020 in an unrelated diff (wal: check out of range slice in "ReadAll", entry limits in "decodeRecord" #11793 (comment)), and it is not technically necessary. If we want to impose such a limit anyway, I would make it configurable... unless I'm misunderstanding and there is an actual technical limitation. Solution: make the limit configurable.

I'm pretty sure that once you start enforcing this limit on the write path (1), a lot of production installations will start seeing transaction failures which are currently hidden. (2) and (3) are the solutions to the problems exposed by (1).
Another solution to (1)-(3) is to remove the limit altogether.

@biosvs
Contributor Author

biosvs commented May 10, 2022

@ahrtr, thanks for the quick response.

Let me rephrase to better clarify my concern.

You're allowed to set --max-request-bytes to any value, but doing so can leave the node unable to restart:

  1. Set --max-request-bytes to 20MB
  2. Put a key with a ~19MB value
  3. Try to restart the node, and see that you cannot, because the WAL contains a ~19MB entry.

This happens because of the hardcoded limit in the WAL implementation.

Maybe this hardcoded value could be made configurable?

@ahrtr
Member

ahrtr commented May 10, 2022

Thanks for all the feedback, which basically makes sense to me. I agree we should never bring down the etcd server.

Proposed actions:

  1. Make the WAL entry limit configurable, defaulting to 10 * requestLimit;
  2. or remove the WAL entry limit altogether?
  3. Update the documentation; at the very least, clearly explain the relevant fields in help.go.

cc @ptabor @serathius @spzala @xiang90 @gyuho for opinions.

@biosvs
Contributor Author

biosvs commented May 20, 2022

@ahrtr Do you have any updates? We can work on a PR from our side, but we need to know the preferred solution.

@algobardo

Let me try a bold move and get the ball rolling on proposal 2.
Do you see any problem with #14057?

@xiang90
Contributor

xiang90 commented May 21, 2022

etcd is designed to serve small-sized metadata, and all writes happen sequentially in etcd. What is your use case for storing large values in etcd? Would something like S3 be a better fit for you?

@algobardo

We are using etcd for that purpose, and S3 would not be a good fit for us. The limit we are encountering is not on the key size but on the transaction size, since the 10MB limit applies to the whole transaction.

If you consider the (configurable) default limit of 128 operations per transaction, you can easily reach the 10MB limit with a transaction involving 128 keys of 70KB, which IMHO is not too far off from the typical etcd use case.
In our case, we are doing rare transactions of 10,000 keys, which I agree are a bit of a stretch and might stall other writes, but we are willing to pay for that since they are rare.

Notice that both the transaction size and the max request size are configurable, so I find it odd to allow that configurability but then hit a hardcoded limit on the WAL record read path.

@ahrtr
Member

ahrtr commented May 21, 2022

Comments:

  1. No matter what the use case is, we should never bring down the etcd server, so we should fix it from that perspective. pull/14057 overall looks good to me.
  2. From a real-use-case perspective, the WAL entry should have a limit regardless of maxRequestSize and maxTxnOps. But the current value (10MB) might not be a good one; we should set a bigger limit (hardcoded as well, for simplicity? e.g. 64MB) and add verification. If users still run into this issue in some extreme use cases, they should use other solutions instead of etcd. Of course, we should get this clearly documented.

@ahrtr
Member

ahrtr commented May 21, 2022

Since the WAL file size limit is 64MB, the simplest solution could be to just use SegmentSizeBytes directly as each WAL entry's limit?

@ptabor
Contributor

ptabor commented May 22, 2022

I like the idea of capping the limits around SegmentSizeBytes.

  1. Refuse configs where --max-request-bytes > SegmentSizeBytes/4 = 16MB.

  2. Move the decoding safety check to a comparison against SegmentSizeBytes (rough sketch below).
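
A minimal Go sketch of those two checks, just to make the proposal concrete. This is not etcd's actual code; the function names and the hardcoded constant are illustrative only (etcd exposes the default segment size as wal.SegmentSizeBytes).

package walcheck

import (
	"errors"
	"fmt"
)

// Illustrative stand-in for wal.SegmentSizeBytes (64MB by default).
const segmentSizeBytes = 64 * 1000 * 1000

// validateRequestLimit refuses configurations whose request limit exceeds a
// quarter of the WAL segment size (proposal 1 above).
func validateRequestLimit(maxRequestBytes uint64) error {
	if maxRequestBytes > segmentSizeBytes/4 {
		return fmt.Errorf("--max-request-bytes %d exceeds maximum allowed %d",
			maxRequestBytes, segmentSizeBytes/4)
	}
	return nil
}

// checkRecordSize is the relaxed decoder-side safety check (proposal 2 above):
// bound a decoded WAL record by the segment size instead of a fixed 10MB.
func checkRecordSize(recBytes int64) error {
	if recBytes >= segmentSizeBytes {
		return errors.New("wal: max entry size limit exceeded")
	}
	return nil
}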

@algobardo

algobardo commented May 22, 2022

  1. refuse configs where --max-request-bytes > SegmentSizeBytes/4=16MB

To my understanding, SegmentSizeBytes is only the pre-allocated size of a WAL file; it doesn't mean that the actual WAL size is <= SegmentSizeBytes. I'm not that familiar with the codebase, but as far as I can tell the WAL is written to support entries of any size. I feel that the limits should be enforced in the upper layer (the etcd server) rather than in the wal package itself.

Move the decoding safety check to comparison against SegmentSizeBytes.

Maybe I'm missing the point, but I don't understand why we only talk about decoding. If the WAL has to enforce some limit (segment size or entry size), then it has to do so on the write path as well, or we risk running into the same situation again where the server cannot start because it cannot read what it has written to the WAL. From what I have understood so far, however, I don't see a reason to put such limits in the wal package itself, given that the etcd server is already protected against large transactions through the maxRequestSize and maxTxnOps options.

@ahrtr
Member

ahrtr commented May 22, 2022

Copying one of my previous comments:

From a real use case perspective, the WAL entry should have a limitation regardless of the maxRequestSize and maxTxnOps.

Also read 14057#discussion_r878844991.

Please update the PR per ptabor's comment above.

@biosvs
Contributor Author

biosvs commented May 23, 2022

Guys, I understand that you're the owners and don't have to explain anything. But maybe you could find a couple of minutes to explain your thinking?

  1. You're about to introduce a backward-compatibility-breaking change without any hesitation. Maybe it's worth first asking people, at least on the Google Group?
  2. You think that it's okay to increase the WAL entry size limit to 64MB, but that the input request should not be more than 16MB. Where did you get these numbers? What does "real use case" mean? Is it possible for a 16MB request to be stored as a 64MB entry, and where did this ratio come from?

I still can't understand your concerns about making these parameters configurable. You've mentioned OOM, but don't those who operate a particular etcd instance know more about their environment's limitations?

@ahrtr
Member

ahrtr commented May 23, 2022

Thanks all for the feedback.

  1. You're about to introduce a backward-compatibility-breaking change without any hesitation

The proposal doesn't break anything. Note that the current WAL entry size limit is 10MB, so it is NOT possible to configure a bigger value for --max-request-bytes based on the current implementation. The new limit (16MB) proposed above is much bigger than 10MB, so it will not break any existing runtime environment.

Also keep in mind that it's just a proposal, and it's open for discussion.

To my understanding, SegmentSizeBytes is only the pre-allocated size of a WAL file; it doesn't mean that the actual WAL size is <= SegmentSizeBytes.

those who operate a particular etcd instance know more about their environment's limitations

Can you elaborate on your use case? @algobardo mentioned "we are doing rare transactions of 10,000 keys"; do you mean 10,000 keys in one transaction?

@biosvs
Contributor Author

biosvs commented May 24, 2022

The proposal doesn't break anything

Currently it's possible to set any request size, and the WAL will also accept entries of any size. I bet most people don't restart their etcd nodes very often, because in general it's just not needed, as etcd works remarkably stably. We found the issue under discussion purely by chance.

That means a lot of people could be operating clusters with 10+ MB entries in their WAL files, and most of the time everything will be perfect. Until they decide to restart a node :)

If we introduce a hard limit for --max-request-bytes, an unpredictable number of operators/developers may face difficulties during a version upgrade (and God bless them if they silently change this limit without double-checking how much they're actually writing).

Can you elaborate on your use case? @algobardo mentioned "we are doing rare transactions of 10,000 keys"; do you mean 10,000 keys in one transaction?

Yes, in one transaction. It's not part of our daily routine with etcd, of course.
In general, we have an etcd cluster that contains settings for different objects. Each object's config in turn consists of a number of different keys (this allows making gradual changes to multiple objects with smaller transactions).

But occasionally we have to make a whole-"fleet" update, mostly during incident mitigation. We use a transaction because such changes should be atomic. That's how we end up with 10k keys in one transaction in some cases.

Of course we're thinking about splitting such transactions into a bunch of smaller ones, but then we would lose some important consistency and atomicity guarantees. You know, it's the first step toward writing straight to files from your code instead of using a well-tested DB :)
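
For what it's worth, here is a minimal sketch of the kind of bulk update described above, assuming a cli built as in the reproduction snippet; the key layout and count are hypothetical. A transaction like this only gets accepted if --max-txn-ops is raised accordingly, and its marshaled size is what runs into the request/WAL limits discussed in this thread.

package main

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// applyFleetUpdate atomically writes one config key per object in a single
// transaction, so either every object sees the new value or none does.
func applyFleetUpdate(ctx context.Context, cli *clientv3.Client, value string, n int) error {
	ops := make([]clientv3.Op, 0, n)
	for i := 0; i < n; i++ {
		// Hypothetical key layout: one settings key per object.
		ops = append(ops, clientv3.OpPut(fmt.Sprintf("/objects/%d/config", i), value))
	}
	// With n around 10000, the marshaled request can easily exceed 10MB,
	// which is exactly the situation described above.
	_, err := cli.Txn(ctx).Then(ops...).Commit()
	return err
}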

@ahrtr
Member

ahrtr commented May 24, 2022

Have you tested what the request size is when a transaction has 10K keys?

I think we should have limits on both the WAL entry size and the request size, because we need to make sure they are under control and well tested. As for the exact values, that's open to discussion.

Based on the current design & implementation, the WAL entry limit can be set to SegmentSizeBytes (64MB) by default, and good candidates for requestSize are 16MB, 32MB, and (64MB - some_overhead). Note that the new feature (review in progress) may also throttle big requests.

@algobardo

algobardo commented May 30, 2022

Our transaction size exceeds 10MB and is below 16MB, but a hardcoded limit of 16MB means we would be facing a hard cap on the scalability of our solution. If the limit gave us 1000% headroom, with gradual performance degradation, I would feel comfortable with it, as we would have enough time to figure out whether we need another product to support our needs; but with a 60% margin I don't feel comfortable.

The problem I see is that introducing a non-configurable limit on the WAL entry size is a breaking change. That limit was not there before #11793 (comment), and when it was introduced (undocumented) it may have passed unnoticed, as it only shows up on node restarts. We ourselves noticed it two years after our solution had gone into production.

My opinion is that we should set the WAL entry size limit to 120% of the request size: that would still impose some limit, would match the sizes people actually use in the wild, and would not be a breaking change.
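
As a concrete illustration of that proposal (names are hypothetical, this is not etcd code): with the recommended --max-request-bytes=10485760, the derived WAL entry limit would be 12582912 bytes (~12MB).

package walsize

// walEntryLimit derives the WAL entry size limit as 120% of the configured
// request limit, per the proposal above. Illustrative only.
func walEntryLimit(maxRequestBytes uint64) uint64 {
	// e.g. the recommended 10485760 (10MiB) maps to 12582912 (~12MB).
	return maxRequestBytes + maxRequestBytes/5
}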

@ahrtr
Member

ahrtr commented May 30, 2022

It's on my to-do list. I will get this sorted out and ask other maintainers and users for their opinions sometime later.

@ahrtr
Member

ahrtr commented Jun 17, 2022

This issue should already be resolved in release-3.5 and main. Please refer to 14114.

@ahrtr ahrtr closed this as completed Jun 17, 2022
@ahrtr
Member

ahrtr commented Jun 17, 2022

The fix will be included in 3.5.5 and 3.6.0.

@biosvs
Contributor Author

biosvs commented Jun 17, 2022

Thank you!
