Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP metadata instance cache is never refreshed but also never used #32734

Closed
endorama opened this issue Aug 18, 2022 · 2 comments · Fixed by #33655
Closed

GCP metadata instance cache is never refreshed but also never used #32734

endorama opened this issue Aug 18, 2022 · 2 comments · Fixed by #33655
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.4.0 Automated backport with mergify bug Team:Cloud-Monitoring Label for the Cloud Monitoring team

Comments

@endorama
Copy link
Member

gcp.compute metricset implements a caching mechanism for GCP Instance metadata. These metadata are collected as part of instance data collection.

An investigation into how the current caching works highlighted a major flaw in it's invalidation behaviour.

TL;DR: current cache behaviour is useless and cache is not used, the reason for having a cache (high number of calls to instances.Get GCP API) has been removed and invalidating the cache is problematic (there is no way to know when data changes in GCP, with the high risk of providing stale data). Due to this the proposal is to remove the code and introduce it back if needed, as we foresee this not to be needed anymore.

we reduced a lot the number of API calls involved that justified caching and caching invalidation for this use case is very error prone; can we remove it entirely?

How instance metadata collection works

Details

Instance metadata collection is enabled when exclude_labels is set to true in metricset configuration.

When enabled, the code instantiate a gcp.MetadataService:

if !m.config.ExcludeLabels {
if gcpService, err = NewMetadataServiceForConfig(m.config, sdc.ServiceName); err != nil {
return nil, fmt.Errorf("error trying to create metadata service: %w", err)
}
}

which in the case of gcp.compute means calling this function:

func NewMetadataService(projectID, zone string, region string, opt ...option.ClientOption) (gcp.MetadataService, error) {
return &metadataCollector{
projectID: projectID,
zone: zone,
region: region,
opt: opt,
instanceCache: common.NewCache(cacheTTL, initialCacheSize),
logger: logp.NewLogger("metrics-compute"),
}, nil
}

For each collection period, the gcp.MetadataService Metadata function is called to gather instance metadata.

We can assume, as knowledge of why this cache is in place has been lost, that the reason was avoiding calling the instances.get GCP API endpoint at each collection iteration, to avoid costly API calls and network latency. This API was called for each instance to retrieve metadata for, which also poses a scaling problem with big instance number.

This behaviour has been altered in March 2022 with this PR #30154, where the single API call per instance has been improved by using instances.aggregatedList GCP API that returns batched results (up to 500 instances per result page).

So with the current implementation concerns for scaling (performance and cost wise) for this operation are greatly reduced; with 1000 instances monitored with metadata collection enabled metricbeat now performs 2 API calls instead of 1000.

The cache

Details

The cache, called instanceCache in compute.NewMetadataService is instantiated with:

instanceCache: common.NewCache(cacheTTL, initialCacheSize),

Where values are:

cacheTTL = 30 * time.Second
initialCacheSize = 13

common.Cache is our libbeat common cache implementation.

What is cached

Details

When gcp.MetadataService Metadata is called its first step is to gather current instance metadata (by instance ID), which triggers a cache refresh:

func (s *metadataCollector) instance(ctx context.Context, instanceID string) (*compute.Instance, error) {
s.refreshInstanceCache(ctx)

Iterating over pages of results cache refresh add the result of instances.aggregatedList API to instanceCache.

How is cache invalidation wrong

Issue 1 - expired elements are not cleaned up

TL;DR: once filled the cache is never really refreshed because the condition expects an empty cache, but this case does not happen.

Details

The problem starts with this lines:

// only refresh cache if it is empty
if s.instanceCache.Size() > 0 {
return
}

This check was implemented to prevent refreshing the cache at each iteration, and wait for cache to be empty before performing a cache refresh.

But in which case Size() return 0? Looking at its implementation:

// Size returns the number of elements in the cache. The number includes both
// active elements and expired elements that have not been cleaned up.
func (c *Cache) Size() int {
c.RLock()
defer c.RUnlock()
return len(c.elements)
}

we notice that Size() returns the number of all active and expired elements (that have not been yet cleaned up).

When does clean-up happens? When Cleanup() is called, something that never happens in metricbeat code.

This is the first issue: once filled the cache is never really refreshed because the condition expects an empty cache, but this case does not happen.

Issue 2 - element expiration time is extended on access

TL;DR: the expiration time for elements is changed on access, so elements would never be cleaned up (and cache size would never be 0 after the first fill).

Details

But there is a second problem: even if the cache was correctly cleaned-up, when an element is considered expired?
The get function implementing cached element retrieval may update the element expiration time, depending on an internal attribute (accessExpire).

Metricbeat code instantiate the cache with common.NewCache(...) which implemented is:

func NewCache(d time.Duration, initialSize int) *Cache {
return newCache(d, true, initialSize, nil, time.Now)
}

The second parameter is accessExpire:

func newCache(d time.Duration, accessExpire bool, initialSize int, l RemovalListener, t clock) *Cache {

which is set to true by default.

This is the second problem: the expiration time for elements is changed on access, so elements would never be cleaned up (and cache size would never be 0 after the first fill).

Issue 3 - cache expires between periods

TL;DR: all elements expire after 30 seconds and we access them every 60 seconds, but due to not cleaning them up they are not retrieved from cache and the cache is not refreshed neither.

Details

Up to this point it seems the cache is not really working as expected, but there is another wrong behaviour.

Cache timeout is initialized to cacheTTL, which is set to:

In gcp any period less that 60 seconds cannot be configured because that's the minimum valid for collection on GCP API side.

This is the third issue: all elements expire after 30 seconds and we access them every 60 seconds, but due to not cleaning them up they are not retrieved from cache and the cache is not refreshed neither.

Issue 4 - MetadataService is instantiated at each period

The cache is instantiated in compute.NewMetadataService. This function is:

This is the fourth issue: the cache instance is created for each data collection, so it always starts empty.

Conclusions

Due to these 4 issues:

  • the cache is effectively not used
  • it would be filled and never refreshed
  • expired items would never been cleaned up
  • items would never expire because their expiration time is expanded upon access
  • the cache is instantiated fresh at each collection period

the cache behaviour is wrong and the cache is never used.

@endorama endorama added bug backport-7.17 Automated backport to the 7.17 branch with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team backport-v8.4.0 Automated backport with mergify labels Aug 18, 2022
@gpop63
Copy link
Contributor

gpop63 commented Nov 7, 2022

Should the current API call be used after cache removal or revert to the old API call? Wouldn't the current API call be too expensive to use (without a cache) since it would collect data for all instances?

Current:

req := computeService.Instances.AggregatedList(s.projectID)

Previous:

instanceData, err := service.Instances.Get(s.projectID, zone, instanceID).Do()

@endorama
Copy link
Member Author

endorama commented Nov 7, 2022

The current API is optimized for querying a big list of instances, so I don't think we need to change it.

As of now is already being used without a cache, so I would first remove the cache (as is broken code) then think on ways to reduce it's impact (the issue with caching is that is quite difficult to establish when the cache should be refreshed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.4.0 Automated backport with mergify bug Team:Cloud-Monitoring Label for the Cloud Monitoring team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants