It's possible to get "wal: max entry size limit exceeded" with recommended values #14025

Closed
biosvs opened this issue May 9, 2022 · 22 comments

Comments

@biosvs
Contributor

biosvs commented May 9, 2022

What happened?

I ran a clean etcd node as follows:
etcd --max-request-bytes=10485760

Then, through the Go client, I put a value consisting of 10*1024*1024-27 bytes to key k1.

Then I stopped the server and tried to start it again, but it failed with the error wal: max entry size limit exceeded (https://github.com/etcd-io/etcd/blob/main/server/storage/wal/decoder.go#L88).

What did you expect to happen?

  1. It should not be possible to write more data than etcd can subsequently read.
  2. It should be obvious how many bytes of data (key + value) I'm able to write. I set --max-request-bytes to 10MB, but in fact I was able to pass 25 bytes less than that (2 bytes for the key, 10MB-27 bytes for the value). Maybe internal overhead should not be part of the validation?
  3. It should not be possible to set --max-request-bytes to a value so high that etcd will accept writes it cannot read back afterwards, OR the WAL limit should be configurable.

How can we reproduce it (as minimally and precisely as possible)?

Run a clean etcd instance:
etcd --max-request-bytes=10485760

Run this Go code:

package main

import (
	"context"
	"go.etcd.io/etcd/clientv3"
)

func main() {
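	// 10MiB minus 27 bytes: with the 2-byte key plus the marshaling overhead,
	// the request comes out right at --max-request-bytes (10485760).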
	length := 10*1024*1024 - 27
	b := make([]byte, length)
	for i := 0; i < length; i++ {
		b[i] = 'a'
	}

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:          []string{"http://127.0.0.1:2379"},
		MaxCallSendMsgSize: length + 1024,
	})

	if err != nil {
		panic(err)
	}

	_, err = cli.Put(context.Background(), "k1", string(b))
	if err != nil {
		panic(err)
	}
}

Anything else we need to know?

Assuming you run the code above, you will not be able to restart the server.
But you can run etcdctl snapshot save while the server is still running, delete all the WAL files, then start the server and still get the saved value back.

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.18.1
Go OS/Arch: darwin/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

--max-request-bytes=10485760

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+---------+-----------------------+-----------------------+------------+
|        ID        | STATUS  |  NAME   |      PEER ADDRS       |     CLIENT ADDRS      | IS LEARNER |
+------------------+---------+---------+-----------------------+-----------------------+------------+
| 8e9e05c52164694d | started | default | http://localhost:2380 | http://localhost:2379 |      false |
+------------------+---------+---------+-----------------------+-----------------------+------------+

$ etcdctl --endpoints=<member list> endpoint status -w table
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://localhost:2379 | 8e9e05c52164694d |   3.5.4 |   25 kB |      true |      false |         2 |          4 |                  4 |        |
+-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

{"level":"fatal","ts":"2022-05-09T13:19:04.630+0300","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"wal: max entry size limit exceeded","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/etcdmain/main.go:40\nmain.main\n\t/private/tmp/etcd-20220424-57243-1ka6pvw/server/main.go:32\nruntime.main\n\t/usr/local/Cellar/go/1.18.1/libexec/src/runtime/proc.go:250"}
@biosvs biosvs added the type/bug label May 9, 2022
@ahrtr
Member

ahrtr commented May 10, 2022

There are two reasons:

  1. Firstly, when marshaling the data, there is indeed some overhead. In your example, len(key) + len(value) = 10*1024*1024-27 + 2, but the final marshaled len(data) is actually 10485760.
  2. Secondly, when etcd saves the data into WAL files, there is additional overhead (24 bytes). See raft.pb.go#L268-L270. So the decoded length is 10485760 + 24 = 10485784.

In summary, the flag --max-request-bytes limits the maximum size of a request, but it isn't exact; there are a few dozen bytes of overhead. That should be OK.
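If it helps, here is a rough way to see the marshaling overhead for yourself. This is only a sketch: it uses the gogo-generated Size() method on etcdserverpb.PutRequest, and the real request is further wrapped (internal raft request header, WAL record framing) before it reaches the WAL, so the numbers it prints show only part of the overhead.

package main

import (
	"fmt"

	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
)

func main() {
	// Same sizes as in the report: a 2-byte key and a 10MiB-27 value.
	req := &pb.PutRequest{
		Key:   []byte("k1"),
		Value: make([]byte, 10*1024*1024-27),
	}
	// Size() returns the protobuf-encoded length of the PutRequest alone;
	// the difference from the raw payload is the field tags and length varints.
	fmt.Println("payload bytes:  ", len(req.Key)+len(req.Value))
	fmt.Println("marshaled bytes:", req.Size())
}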

Does this cause any issue for you? Did you see a real issue in a production environment, or did you just test it intentionally?

@algobardo

algobardo commented May 10, 2022

@ahrtr I think there are several concerning issues revealed by this bug report:

  1. An etcd instance configured with the suggested values can be brought by an adversarial client to silently rot its WAL, to the point that nodes are unable to restart because they cannot read the entry back from the WAL. Solution: if there is a limit on the WAL entry size, make it a hard limit on the write path as well, so that the rotting is neither possible nor silent.

  2. The WAL entry limit of 10MB is not documented anywhere, and in turn translates into a limit of 10MB per transaction. Solution: document this limit.

  3. From my understanding, the limit in the WAL decoder was introduced in 2020 in an unrelated diff (wal: check out of range slice in "ReadAll", entry limits in "decodeRecord" #11793 (comment)), and it is not technically necessary. If we want to impose such a limit anyway, I would make it configurable... unless I'm misunderstanding and there is an actual technical limitation. Solution: make the limit configurable.

I'm pretty sure that once you start enforcing this limit on the write path (1), a lot of production installations will start seeing transaction failures which are currently hidden. (2) and (3) are the solutions to the problems exposed by (1).
Another solution to (1)-(3) is to remove the limit altogether.

@biosvs
Contributor Author

biosvs commented May 10, 2022

@ahrtr, thanks for the quick response.

Let me rephrase to better clarify my concern.

You're allowed to set --max-request-bytes to any value, but doing so can leave the node unable to restart:

  1. Set --max-request-bytes to 20MB
  2. Put a key with a ~19MB value
  3. Try to restart the node, and see that you cannot, because the WAL contains a ~19MB entry.

This happens because of the hardcoded limit in the WAL implementation.

Maybe this hardcoded value could be made configurable?

@ahrtr
Member

ahrtr commented May 10, 2022

Thanks for all the feedback, which basically makes sense to me. I agree we should never bring down the etcd server.

Proposed actions:

  1. Make the WAL entry limit configurable, defaulting to 10 * requestLimit;
  2. or remove the WAL entry limit altogether?
  3. Update the documentation; at the very least, clearly explain the relevant fields in help.go.

cc @ptabor @serathius @spzala @xiang90 @gyuho for opinions.

@biosvs
Contributor Author

biosvs commented May 20, 2022

@ahrtr Do you have any updates? We can work on a PR from our side, but we need to know the preferred solution.

@algobardo

Let me try a bold move and get the ball rolling on proposal 2.
Do you see any problem with #14057?

@xiang90
Contributor

xiang90 commented May 21, 2022

etcd is designed to serve small-sized metadata, and all writes happen sequentially in etcd. What is your use case for storing large values in etcd? Would something like S3 be a better fit for you?

@algobardo

We are using etcd for that purpose, and S3 would not be a good fit for us. The limit we are encountering is not on the key size but on the transaction size, since the 10MB limit applies to the whole transaction.

If you consider the (configurable) default limit of 128 operations per transaction, you can easily reach the 10MB limit with a transaction involving 128 keys of 70KB, which IMHO is not too far off from the typical etcd use case.
In our case, we are doing rare transactions of 10,000 keys, which I agree are a bit of a stretch and might stall other writes, but we are willing to pay for that since they are rare.

Notice that both the transaction size and the max request size are configurable, so I find it odd to allow that configurability but then hit a hardcoded limit on the WAL record read path.

@ahrtr
Member

ahrtr commented May 21, 2022

Comments:

  1. No matter what the use case is, we should never bring down the etcd server, so we should fix it from that perspective. pull/14057 overall looks good to me.
  2. From a real-use-case perspective, the WAL entry should have a limit regardless of maxRequestSize and maxTxnOps. But the current value (10MB) might not be a good one; we should set a bigger limit (hardcoded as well, for simplicity? e.g. 64MB) and add verification. If users still run into this issue in some extreme use cases, they should use other solutions instead of etcd. Of course, we should get this clearly documented.

@ahrtr
Member

ahrtr commented May 21, 2022

Since the WAL file size limit is 64MB, the simplest solution could be to just use SegmentSizeBytes directly as each WAL entry's limit?

@ptabor
Contributor

ptabor commented May 22, 2022

I like the idea of capping the limits around SegmentSizeBytes.

  1. Refuse configs where --max-request-bytes > SegmentSizeBytes/4 = 16MB.

  2. Move the decoding safety check to a comparison against SegmentSizeBytes (rough sketch below).
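
A minimal Go sketch of those two checks, just to make the proposal concrete. This is not etcd's actual code; the function names and the hardcoded constant are illustrative only (etcd exposes the default segment size as wal.SegmentSizeBytes).

package walcheck

import (
	"errors"
	"fmt"
)

// Illustrative stand-in for wal.SegmentSizeBytes (64MB by default).
const segmentSizeBytes = 64 * 1000 * 1000

// validateRequestLimit refuses configurations whose request limit exceeds a
// quarter of the WAL segment size (proposal 1 above).
func validateRequestLimit(maxRequestBytes uint64) error {
	if maxRequestBytes > segmentSizeBytes/4 {
		return fmt.Errorf("--max-request-bytes %d exceeds maximum allowed %d",
			maxRequestBytes, segmentSizeBytes/4)
	}
	return nil
}

// checkRecordSize is the relaxed decoder-side safety check (proposal 2 above):
// bound a decoded WAL record by the segment size instead of a fixed 10MB.
func checkRecordSize(recBytes int64) error {
	if recBytes >= segmentSizeBytes {
		return errors.New("wal: max entry size limit exceeded")
	}
	return nil
}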

@algobardo

algobardo commented May 22, 2022

  1. refuse configs where --max-request-bytes > SegmentSizeBytes/4=16MB

To my understanding, SegmentSizeBytes is only the pre-allocated size of a WAL file; it doesn't mean that the actual WAL size is <= SegmentSizeBytes. I'm not that familiar with the codebase, but as far as I can tell the WAL is written to support entries of any size. I feel that the limits should be enforced in the upper layer (the etcd server) rather than in the wal package itself.

Move the decoding safety check to comparison against SegmentSizeBytes.

Maybe I'm missing the point, but I don't understand why we only talk about decoding. If the WAL has to enforce some limit (segment size or entry size), then it has to do so on the write path as well, or we risk running into the same situation again where the server cannot start because it cannot read what it has written to the WAL. From what I have understood so far, however, I don't see a reason to put such limits in the wal package itself, given that the etcd server is already protected against large transactions through the maxRequestSize and maxTxnOps options.

@ahrtr
Member

ahrtr commented May 22, 2022

Copying one of my previous comments:

From a real use case perspective, the WAL entry should have a limitation regardless of the maxRequestSize and maxTxnOps.

Also read 14057#discussion_r878844991.

Please update the PR per ptabor's comment above.

@biosvs
Contributor Author

biosvs commented May 23, 2022

Guys, I understand that you're the owners and don't have to explain anything. But maybe you could find a couple of minutes to explain your thinking?

  1. You're about to introduce a backward-compatibility-breaking change without any hesitation. Maybe it's worth first asking people, at least on the Google Group?
  2. You think that it's okay to increase the WAL entry size limit to 64MB, but that the input request should not be more than 16MB. Where did you get these numbers? What does "real use case" mean? Is it possible for a 16MB request to be stored as a 64MB entry, and where did this ratio come from?

I still can't understand your concerns about making these parameters configurable. You've mentioned OOM, but don't those who operate a particular etcd instance know more about their environment's limitations?

@ahrtr
Member

ahrtr commented May 23, 2022

Thanks all for the feedback.

  1. You're about to introduce a backward-compatibility-breaking change without any hesitation

The proposal doesn't break anything. Note that the current WAL entry size limit is 10MB, so it is NOT possible to configure a bigger value for --max-request-bytes based on the current implementation. The new limit (16MB) proposed above is much bigger than 10MB, so it will not break any existing runtime environment.

Also keep in mind that it's just a proposal, and it's open for discussion.

To my understanding, SegmentSizeBytes is only the pre-allocated size of a WAL file; it doesn't mean that the actual WAL size is <= SegmentSizeBytes.

those who operate a particular etcd instance know more about their environment's limitations

Can you elaborate on your use case? @algobardo mentioned "we are doing rare transactions of 10,000 keys"; do you mean 10,000 keys in one transaction?

@biosvs
Contributor Author

biosvs commented May 24, 2022

The proposal doesn't break anything

Currently it's possible to set any request size, and the WAL will also accept entries of any size. I bet most people don't restart their etcd nodes very often, because in general it's just not needed, as etcd works remarkably stably. We found the issue under discussion purely by chance.

That means a lot of people could be operating clusters with 10+ MB entries in their WAL files, and most of the time everything will be perfect. Until they decide to restart a node :)

If we introduce a hard limit for --max-request-bytes, an unpredictable number of operators/developers may face difficulties during a version upgrade (and God bless them if they silently change this limit without double-checking how much they're actually writing).

Can you elaborate on your use case? @algobardo mentioned "we are doing rare transactions of 10,000 keys"; do you mean 10,000 keys in one transaction?

Yes, in one transaction. It's not part of our daily routine with etcd, of course.
In general, we have an etcd cluster that contains settings for different objects. Each object's config in turn consists of a number of different keys (this allows making gradual changes to multiple objects with smaller transactions).

But occasionally we have to make a whole-"fleet" update, mostly during incident mitigation. We use a transaction because such changes should be atomic. That's how we end up with 10k keys in one transaction in some cases.

Of course we're thinking about splitting such transactions into a bunch of smaller ones, but then we would lose some important consistency and atomicity guarantees. You know, it's the first step toward writing straight to files from your code instead of using a well-tested DB :)
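
For what it's worth, here is a minimal sketch of the kind of bulk update described above, assuming a cli built as in the reproduction snippet; the key layout and count are hypothetical. A transaction like this only gets accepted if --max-txn-ops is raised accordingly, and its marshaled size is what runs into the request/WAL limits discussed in this thread.

package main

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/clientv3"
)

// applyFleetUpdate atomically writes one config key per object in a single
// transaction, so either every object sees the new value or none does.
func applyFleetUpdate(ctx context.Context, cli *clientv3.Client, value string, n int) error {
	ops := make([]clientv3.Op, 0, n)
	for i := 0; i < n; i++ {
		// Hypothetical key layout: one settings key per object.
		ops = append(ops, clientv3.OpPut(fmt.Sprintf("/objects/%d/config", i), value))
	}
	// With n around 10000, the marshaled request can easily exceed 10MB,
	// which is exactly the situation described above.
	_, err := cli.Txn(ctx).Then(ops...).Commit()
	return err
}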

@ahrtr
Member

ahrtr commented May 24, 2022

Have you tested what the request size is when a transaction has 10K keys?

I think we should have limits on both the WAL entry size and the request size, because we need to make sure they are under control and well tested. As for the exact values, that's open to discussion.

Based on the current design & implementation, the WAL entry limit can be set to SegmentSizeBytes (64MB) by default, and good candidates for requestSize are 16MB, 32MB, and (64MB - some_overhead). Note that the new feature (review in progress) may also throttle big requests.

@algobardo

algobardo commented May 30, 2022

Our transaction size exceeds 10MB and is below 16MB, but a hardcoded limit of 16MB means we would be facing a hard cap on the scalability of our solution. If the limit gave us 1000% headroom, with gradual performance degradation, I would feel comfortable with it, as we would have enough time to figure out whether we need another product to support our needs; but with a 60% margin I don't feel comfortable.

The problem I see is that introducing a non-configurable limit on the WAL entry size is a breaking change. That limit was not there before #11793 (comment), and when it was introduced (undocumented) it may have passed unnoticed, as it only shows up on node restarts. We ourselves noticed it two years after our solution had gone into production.

My opinion is that we should set the WAL entry size limit to 120% of the request size: that would still impose some limit, would match the sizes people actually use in the wild, and would not be a breaking change.
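
As a concrete illustration of that proposal (names are hypothetical, this is not etcd code): with the recommended --max-request-bytes=10485760, the derived WAL entry limit would be 12582912 bytes (~12MB).

package walsize

// walEntryLimit derives the WAL entry size limit as 120% of the configured
// request limit, per the proposal above. Illustrative only.
func walEntryLimit(maxRequestBytes uint64) uint64 {
	// e.g. the recommended 10485760 (10MiB) maps to 12582912 (~12MB).
	return maxRequestBytes + maxRequestBytes/5
}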

@ahrtr
Member

ahrtr commented May 30, 2022

It's on my to-do list. I will get this sorted out and ask other maintainers and users for their opinions sometime later.

@ahrtr
Member

ahrtr commented Jun 17, 2022

This issue should already be resolved in release-3.5 and main. Please refer to 14114.

@ahrtr ahrtr closed this as completed Jun 17, 2022
@ahrtr
Member

ahrtr commented Jun 17, 2022

The fix will be included in 3.5.5 and 3.6.0.

@biosvs
Contributor Author

biosvs commented Jun 17, 2022

Thank you!
