Attached keys deleted before lease expired #14758

Closed
5o11 opened this issue Nov 15, 2022 · 17 comments
Comments

@5o11

5o11 commented Nov 15, 2022

What happened?

I use the Watch API and observe that sometimes all keys attached to a lease are deleted without any warning/error log.

When I check with lease timetolive, I expect the keys to remain, as below:

etcdctl --user root:LAtaWXwZSn --hex=false lease timetolive 199883126ff08d88 --keys

should output:

lease 199883126ff08e28 granted with TTL(300s), remaining(267s), attached keys([activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).ReplyEmail/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).StarEmail/v1.0@kin-cloud-module-gmail-v1-0-0 handler_metadata/kin.cloud/module/gmail/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).DeleteEmail/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).GetEmailsWithFilter/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).GetEmailsWithLabels/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).MarkSpamEmail/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).SendEmail/v1.0@kin-cloud-module-gmail-v1-0-0 client/kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).AddEmailLabel/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).GetAllEmails/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).TrashEmail/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).GetUnreadEmails/v1.0@kin-cloud-module-gmail-v1-0-0 activity_handler_topology/kin.cloud/module/gmail/gmail.(*Gmail).ReadEmail/v1.0@kin-cloud-module-gmail-v1-0-0])

But the actual output is:

lease 199883126ff08e28 granted with TTL(300s), remaining(231s), attached keys([])

Note that it happens randomly, and only with some leases.

What did you expect to happen?

The attached keys should remain while the lease is still alive.

How can we reproduce it (as minimally and precisely as possible)?

  • Create a lease.
  • Create keys and attach them to the above lease.
  • Keep the lease alive.
  • Use the Watch API on the above keys (a minimal sketch follows this list).
  • Wait and/or check; the keys will be deleted, but at some random time.
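
For illustration, a minimal sketch of the watch step (the endpoint and key prefix here are placeholders, not the real configuration):

package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; adjust to the real cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch everything under a prefix; a DELETE event means a key was removed,
	// e.g. because its lease expired.
	wch := cli.Watch(context.Background(), "handler_metadata/", clientv3.WithPrefix())
	for wresp := range wch {
		for _, ev := range wresp.Events {
			if ev.Type == clientv3.EventTypeDelete {
				log.Printf("key deleted: %s", ev.Kv.Key)
			}
		}
	}
}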

Anything else we need to know?

I run etcd in k8s

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

Etcd configuration (command line flags or environment variables)

ETCD_ON_K8S=yes
ETCDCTL_API=3
ETCD_LOG_LEVEL=info
ETCD_AUTH_TOKEN=jwt,priv-key=/opt/bitnami/etcd/certs/token/jwt-token.pem,sign-method=RS256,ttl=10m
ETCD_DISASTER_RECOVERY=no
ETCD_START_FROM_SNAPSHOT=no
ETCD_DATA_DIR=/bitnami/etcd/data

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
+------------------+---------+---------------------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|        ID        | STATUS  |        NAME         |                                        PEER ADDRS                                         |                                                                      CLIENT ADDRS                                                                      | IS LEARNER |
+------------------+---------+---------------------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
| cae71e07e30b1998 | started | topo-storage-etcd-0 | http://topo-storage-etcd-0.topo-storage-etcd-headless.kin-dev-deps.svc.cluster.local:2380 | http://topo-storage-etcd-0.topo-storage-etcd-headless.kin-dev-deps.svc.cluster.local:2379,http://topo-storage-etcd.kin-dev-deps.svc.cluster.local:2379 |      false |
+------------------+---------+---------------------+-------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------+------------+

$ etcdctl --endpoints=<member list> endpoint status -w table
+------------------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|                                      ENDPOINT                                      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+------------------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| topo-storage-etcd-0.topo-storage-etcd-headless.kin-dev-deps.svc.cluster.local:2380 | cae71e07e30b1998 |   3.5.4 |   15 MB |      true |      false |         4 |    1632423 |            1632423 |        |
+------------------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+




Relevant log output

No response
@5o11 5o11 added the type/bug label Nov 15, 2022
@serathius
Member

serathius commented Nov 15, 2022

Thanks for providing steps to reproduce. I will take a look today, as it might be relevant to the upcoming v3.5 patch release. cc @ahrtr

@5o11
Author

5o11 commented Nov 15, 2022

Quick command to find out which leases have no keys attached:

# skip the "found N leases" summary line, then print any lease whose attached-keys list is empty
leases="$(etcdctl --user root:password lease list | grep -v found)"
for lease in ${leases}; do etcdctl --user root:password lease timetolive "${lease}" --keys | grep -E 'attached keys\(\[\]\)'; done

@ahrtr
Member

ahrtr commented Nov 16, 2022

@serathius Are you still working on this issue?

@serathius
Member

serathius commented Nov 17, 2022

When setting the value of a key, by default the lease will also be overridden. You need to pass the --ignore-lease flag to etcdctl to retain the existing lease (implemented in #6960). This has always been the behavior of the etcd v3 API.

@5o11 I expect that you are not using --ignore-lease when writing to the key. Am I correct?
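
For illustration (a sketch, not the reporter's code; the key name and values are made up, cli is an existing *clientv3.Client, and ctx a context.Context), the difference looks like this in clientv3:

// Illustrative sketch only; errors omitted for brevity.
lease, _ := cli.Grant(ctx, 300)

// Initial write: attach the key to the lease.
cli.Put(ctx, "service/foo", "v1", clientv3.WithLease(lease.ID))

// A later plain Put resets the key's lease: the key disappears from the
// lease's attached-keys list and will no longer be deleted when it expires.
cli.Put(ctx, "service/foo", "v2")

// To update the value while keeping the currently attached lease, use
// WithIgnoreLease (etcdctl equivalent: put --ignore-lease).
cli.Put(ctx, "service/foo", "v3", clientv3.WithIgnoreLease())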

@ajityagaty
Contributor

@serathius @ahrtr - Do you folks think this could be worked on by a person who has very little context?

@serathius
Member

serathius commented Nov 17, 2022

@ajityagaty I don't think there is anything to do here. I plan to close it as working as intended after getting confirmation.

Please check out the good first issue or help wanted labels.

@ahrtr
Member

ahrtr commented Nov 18, 2022

Just to be clearer, there are two possible reasons why the keys disappear from the attached keys list:

  1. You removed the keys using etcdctl del or the clientv3 API;
  2. You updated the keys' values using etcdctl put or the clientv3 API, without setting --ignore-lease for etcdctl or WithIgnoreLease for the clientv3 API.

Please let us know which case applies to you.

If neither, then it's an issue. In that case, please provide detailed steps to reproduce it. Thanks.

@5o11
Author

5o11 commented Nov 21, 2022

@serathius @ahrtr It's neither of the above cases. I use the clientv3 API to put the key like this (error handling omitted for brevity):

c, _ := clientv3.New(clientv3.Config{
	Endpoints:   etcdConfig.GetStringSlice("endpoints"),
	DialTimeout: KEY_TTL * time.Second,
	Username:    etcdConfig.GetString("username"),
	Password:    etcdConfig.GetString("password"),
})

// Grant a 300s lease and keep it alive for the lifetime of the client.
lesser := clientv3.NewLease(c)
leaseRes, _ := lesser.Grant(context.Background(), 300)
leaseAliveResChan, _ := lesser.KeepAlive(context.Background(), leaseRes.ID)
go func() {
	for leaseAliveRes := range leaseAliveResChan {
		logger.Debug("Lease kept alive", zap.Int64("lease_id", int64(leaseAliveRes.ID)), zap.Int64("ttl", leaseAliveRes.TTL))
	}
}()

...

// PUT key, attached to the lease granted above
c.Put(
	context.Background(),
	key,
	value,
	clientv3.WithLease(leaseRes.ID),
)

...

But can you please explain why it disappears randomly, and only for some keys?

@5o11
Author

5o11 commented Nov 21, 2022

Please note that I put those keys once and they are never updated, so it's not the 2nd case.

I can't provide a reliable way to reproduce this. Since it happens so randomly, I will need more time to find a way to reproduce it reliably.

@ahrtr
Member

ahrtr commented Nov 21, 2022

I could not reproduce the issue; please provide a complete program and detailed steps to reproduce it.

Is it easy for you to reproduce this issue?

@serathius
Member

Based on the description presented so far it looks like a bug; however, we need more details to confirm it.

My current understanding:

  • Client creates a lease once.
  • Client starts keep-alive probes for the lease.
  • Client creates a key pointing to the lease.
  • No other writes to the key or lease are done. @5o11 Can you confirm?
  • At some point later, the key gets deleted but the lease still exists. That would be an error: KeepAlive should never recreate the lease, even if it expires. @5o11 Can you confirm that the lease ID is not reused?

Normally I would expect both the key and the lease to be deleted. If there are no other changes to the key and lease, the only way the key could be deleted is by the lease running out of TTL. This can easily happen if there is a network issue and keep-alives don't get delivered. However, in that situation the lease should also be deleted, as later KeepAlives should fail.

It's either a bug on the etcd server/client side, or the client sent a new LeaseGrant request after the lease was deleted. We need to confirm that this issue happens without an additional LeaseGrant request.

Could you provide:

  • The etcd client version.
  • The code you used to trigger the issue. Feel free to remove all non-etcd logic, but we need to confirm that there is no API misuse.
  • How did you notice the issue?
    • Do you have client/etcd logs from the event?
    • Was it during normal operation of etcd, or was there any disruption (network issue/etcd crash)?
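
For illustration (a sketch, not the reporter's code; cli is an existing *clientv3.Client and the standard log package is assumed), the moment the lease is lost can at least be observed on the client side, because the channel returned by KeepAlive is closed once the lease expires or keep-alives stop going through:

// Illustrative sketch only; errors omitted for brevity.
lease, _ := cli.Grant(context.Background(), 300)
ch, _ := cli.KeepAlive(context.Background(), lease.ID)

go func() {
	for resp := range ch {
		log.Printf("lease %x kept alive, ttl=%d", resp.ID, resp.TTL)
	}
	// The channel is closed when the lease expires or keep-alives can no longer
	// be delivered; KeepAlive never re-grants the lease on its own, so this is
	// the place to log/alert (and, if needed, re-grant and re-register keys).
	log.Printf("keep-alive channel closed for lease %x", lease.ID)
}()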

@5o11
Author

5o11 commented Nov 22, 2022

@ahrtr It's very hard to reproduce the issue.

@serathius Yes, you are correct.
No other writes to the key or lease are done.
The lease ID is not reused.

The etcd version is in the first comment:

$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

The above is all of the etcd-related code; it's very simple. Think of it as a kind of service registry: each client creates a lease once, keeps it alive, puts some keys on start, and never touches them again, nothing special. I can log key deletions because I use the Watch API with a key prefix.

I see nothing abnormal in the etcd logs: no crash, no disruption. Clients (pods) are created and terminated frequently, but I see no concurrency issue here because the clients work with different keys.

I have not been able to reproduce this for ~5 days. Can we keep this issue open? When it occurs again, I'll collect and post full logs of both etcd and the client as soon as I receive the keys-deleted events.

@serathius
Member

Etcd version is in first comment

I asked about the Go module version in your binary, as the example you sent didn't use etcdctl.

@serathius
Member

I expect this might be a race condition related to leases not being linearizable (#13915). It would be good to verify the correctness of that solution. Based on the difficulty of reproducing it, I expect this is a rare, timing-based race.

I think it could be reproduced by adding additional sleep failpoints in the lease logic. I would appreciate any help with a repro, as I don't think I will have time to test it.

@5o11
Author

5o11 commented Nov 23, 2022

Sorry, here are the client & API versions:

$ go version
go version go1.19.3 linux/amd64

Go module versions:

	go.etcd.io/etcd/api/v3 v3.5.5 // indirect
	go.etcd.io/etcd/client/pkg/v3 v3.5.5 // indirect
	go.etcd.io/etcd/client/v3 v3.5.5 // indirect

Never mind, I'll keep monitoring it.

@5o11
Author

5o11 commented Dec 13, 2022

After monitoring this issue for a long time, I cannot reproduce it, so I think we can close it.

@serathius
Member

Thanks for confirming!

Development

No branches or pull requests

4 participants