Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

go-routine leak while cache enabled #8715

Closed
varnson opened this issue Sep 20, 2020 · 9 comments
Closed

go-routine leak while cache enabled #8715

varnson opened this issue Sep 20, 2020 · 9 comments
Labels
theme/agent-cache Agent Cache type/bug Feature does not function as expected

Comments

@varnson
Copy link

varnson commented Sep 20, 2020

Go routine leak while cache enabled

file: agent/cache/cache.go
Step 1, func fetch() will set expire time to entry only at the first time. And will never change it. And will create a goroutine to refresh it.

func (c *Cache) fetch(tEntry typeEntry,
...........................
if newEntry.Expiry == nil || newEntry.Expiry.HeapIndex == -1 { //at the first time,the expire time was set
newEntry.Expiry = &cacheEntryExpiry{Key: key}
newEntry.Expiry.Update(tEntry.Opts.LastGetTTL)
heap.Push(c.entriesExpiryHeap, newEntry.Expiry)
}

         c.entries[key] = newEntry      **//from the second time, entry will be changed, but the expire time is old one.**
         c.entriesLock.Unlock()

................................

Step 2, when time expired, the entry will be deleted by runExpiryLoop.
func (c *Cache) runExpiryLoop() {
.................
case <-expiryCh:
c.entriesLock.Lock()

		// Entry expired! Remove it.
		delete(c.entries, entry.Key)       **//after timeout ,the entry was deleted. But infact, the entry deleted was the new entry set by the second time, not the first time.** 
		heap.Remove(c.entriesExpiryHeap, entry.HeapIndex)

		// This is subtle but important: if we race and simultaneously
		// evict and fetch a new value, then we set this to -1 to
		// have it treated as a new value so that the TTL is extended.
		entry.HeapIndex = -1

		// Set some metrics
		metrics.IncrCounter([]string{"consul", "cache", "evict_expired"}, 1)
		metrics.SetGauge([]string{"consul", "cache", "entries_count"}, float32(len(c.entries)))

		c.entriesLock.Unlock()

..................
}

Step 3, when some one call the fetch again, because the entry was deleted, so a new entry and a new go-routine will be created. Now, we have one entry and two go-routines.

@dnephin dnephin added theme/agent-cache Agent Cache type/bug Feature does not function as expected labels Sep 24, 2020
@dnephin
Copy link
Contributor

dnephin commented Jan 4, 2021

Step 1, func fetch() will set expire time to entry only at the first time. And will never change it

The expiry time is updated here every time the entry is accessed: https://github.com/hashicorp/consul/blob/master/agent/cache/cache.go#L431-L434

Step 3, when some one call the fetch again, because the entry was deleted, so a new entry and a new go-routine will be created. Now, we have one entry and two go-routines.

I think that is a problem, but I'm not sure how it's related to the expiry. There is only a single expiry goroutine for all cache entries.

The extra goroutine is the one started by fetch() to refresh the cache entry. I suspect that the fetch() goroutine needs to detect that the entry has expired and stop refreshing. I think that may be what is missing.

@varnson
Copy link
Author

varnson commented Jan 5, 2021

@dnephin I think I maybe found something strange.
Problem: entriesExpiryHeap was updated with entry.Expiry.Index() and removed with entry.Index()

Line

c.entriesExpiryHeap.Update(entry.Expiry.Index(), r.TypeEntry.Opts.LastGetTTL)

Line

c.entriesExpiryHeap.Remove(entry.Index())

It may be one reason for this issue, but there maybe any other reasons.
In fact I use Consul 1.5.3 and found this issue.
I will add more comments later.

@varnson
Copy link
Author

varnson commented Jan 5, 2021

I will try to explain the goroutine leak. (Maybe it's not right)
1, the entry was deleted after expired.

delete(c.entries, entry.Key())

2,fetch() was waiting for the result when the entry was deleted.
result, err := r.Fetch(fOpts)

3, fetch() got the result and add the entry to c.entries.
c.entries[key] = newEntry

Now though no one calls the getWithIndex() , the entry was added.

In Consul V1.5.3, entriesExpiryHeap will also be added.
In Consul V1.9, entriesExpiryHeap was not add in the fetch, but if getWithIndex() was called after expired and before r.Fetch() returns, a new entry and a new goroutine will be added.

@varnson
Copy link
Author

varnson commented Jan 5, 2021

Maybe there is a solution:
1, change the go routine in fetch() to a new function.
2, the new function create a new go routine and call it self again in the go routine, pass the entry to it.
Why:
Now, fetch create a new go routine and call the fetch again, with the key as parameter, but not the entry.
If the entry was deleted by the expired routine, fetch may get a new entry(by the key) created by a new call to getWithIndex, and will never know the old entry was expired. So the go routine need to use the same entry for ever to check if it is expired.

@varnson
Copy link
Author

varnson commented Jan 5, 2021

Another question:
Line https://github.com/hashicorp/consul/blob/master/agent/cache/cache.go#L556
entry.Fetching was set to true.
But in the fetch go routine, newEntry.Fetching is false.
https://github.com/hashicorp/consul/blob/master/agent/cache/cache.go#L615

And newEntry was added to the c.entries.
https://github.com/hashicorp/consul/blob/master/agent/cache/cache.go#L720

Since this time ,fetching is false until the next fetch set it to true.

But between this time,because fetching is false ,any call to fetch will not return the waiter and create a new go routine.
https://github.com/hashicorp/consul/blob/master/agent/cache/cache.go#L536

@dnephin
Copy link
Contributor

dnephin commented Jan 5, 2021

But between this time,because fetching is false ,any call to fetch will not return the waiter and create a new go routine.

There is a lock (c.entriesLock) that is acquired by fetch() so any other call to fetch will block, and when it unblocks Fetching will be true and receive the right waiter.

I think #8715 (comment) sounds pretty correct.

I suspect the problem is in this area:
https://github.com/hashicorp/consul/blob/d32e7f0/agent/cache/cache.go#L711-L721

At this point the entry could have been expired, but the goroutine never checks to see if the entry has been removed. So if the expire happens when while a fetch is running, then it won't properly expire, and, if another request comes in between the expiry time and when fetch returns, we'll have leaked a goroutine (as you mentioned).

I think somehow in that locked section (lines 711-721) we need to be able to detect if the previous entry was expired. I'm not sure if we have enough data stored to determine that right now. We can't check c.entries because it could be empty for other reasons, or could contain a different entry if a new request came in.

One way to handle this might be to add an Expired() bool method to ttlcache.Entry. That should allow that block of code to detect if the entry was expired and stop the refresh.

@dnephin
Copy link
Contributor

dnephin commented Jan 5, 2021

@varnson I should also mention that if you are running version 1.5.3 we have fixed some cache problems in more recent versions. #8092 is the one fix that I remember. Maybe that fixes your problem? I think there are still possible goroutine leaks, but the one we've identified should be rare because it can only happen when an entry expires, and that can only happen after the 3 day TTL.

@pierresouchay
Copy link
Contributor

@dnephin @varnson Probably unrelated, but we also have a very weird/unfrequent bug on 1.8.6 with this one: #8980 - don't know if it might be related

@dnephin
Copy link
Contributor

dnephin commented May 7, 2021

Thank you again for reporting this issue! I believe this was fixed by #9979 and #9978 , which will be released in Consul 1.10.

@dnephin dnephin closed this as completed May 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/agent-cache Agent Cache type/bug Feature does not function as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants