GCP metadata instance cache is never refreshed but also never used #32734
Labels: backport-7.17, backport-v8.4.0, bug, Team:Cloud-Monitoring
The `gcp.compute` metricset implements a caching mechanism for GCP instance metadata. This metadata is collected as part of instance data collection. An investigation into how the current caching works highlighted a major flaw in its invalidation behaviour.
TL;DR: the current cache behaviour is useless and the cache is never actually used. The reason for having a cache (a high number of calls to the `instances.get` GCP API) has been removed, and invalidating the cache is problematic (there is no way to know when data changes in GCP, with a high risk of serving stale data). The proposal is therefore to remove the code and reintroduce it later if needed, as we foresee it no longer being necessary. In short: we greatly reduced the number of API calls that justified caching, and cache invalidation for this use case is very error-prone; can we remove it entirely?
How instance metadata collection works
Instance metadata collection is enabled when `exclude_labels` is set to `true` in the metricset configuration. When enabled, the code instantiates a `gcp.MetadataService`:

beats/x-pack/metricbeat/module/gcp/metrics/metricset.go, Lines 215 to 219 in 7449e5c
which in the case of `gcp.compute` means calling this function:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Lines 29 to 38 in 7449e5c
For each collection period, the `gcp.MetadataService` `Metadata` function is called to gather instance metadata. Since the knowledge of why this cache was put in place has been lost, we can assume the reason was to avoid calling the `instances.get` GCP API endpoint at each collection iteration, saving costly API calls and network latency. This API was called once per instance to retrieve its metadata, which also poses a scaling problem with a large number of instances. This behaviour was changed in March 2022 by PR #30154, which replaced the single API call per instance with the `instances.aggregatedList` GCP API, which returns batched results (up to 500 instances per result page). With the current implementation, the scaling concerns (performance- and cost-wise) for this operation are greatly reduced: with 1000 instances monitored and metadata collection enabled, Metricbeat now performs 2 API calls instead of 1000.
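The reduction can be illustrated with the call-count arithmetic (a sketch; the 500-instances-per-page limit is the one stated above):

```go
package main

import "fmt"

func main() {
	const instances = 1000
	const pageSize = 500 // max instances per instances.aggregatedList result page

	// Old behaviour: one instances.get call per monitored instance.
	perInstanceCalls := instances

	// Current behaviour: one instances.aggregatedList call per result page
	// (ceiling division).
	batchedCalls := (instances + pageSize - 1) / pageSize

	fmt.Printf("instances.get: %d calls, instances.aggregatedList: %d calls\n",
		perInstanceCalls, batchedCalls)
	// prints: instances.get: 1000 calls, instances.aggregatedList: 2 calls
}
```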
The cache
The cache, called `instanceCache` in `compute.NewMetadataService`, is instantiated with:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Line 35 in 3fda469

Where the values are:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Lines 24 to 25 in 3fda469

`common.Cache` is our libbeat common cache implementation.

What is cached
When the `gcp.MetadataService` `Metadata` function is called, its first step is to gather the current instance metadata (by instance ID), which triggers a cache refresh:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Lines 144 to 145 in 7449e5c
Iterating over the pages of results, the cache refresh adds the results of the `instances.aggregatedList` API to `instanceCache`.
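A minimal sketch of that refresh step (the `page` type and cache below are stand-ins for illustration, not the real GCP client or libbeat types):

```go
package main

import "fmt"

// page is a stand-in for one page of instances.aggregatedList results.
type page struct {
	instances map[string]string // instance ID -> metadata
}

// refreshCache stores every instance from every result page in the
// cache, keyed by instance ID.
func refreshCache(instanceCache map[string]string, pages []page) {
	for _, p := range pages {
		for id, meta := range p.instances {
			instanceCache[id] = meta
		}
	}
}

func main() {
	instanceCache := map[string]string{}
	pages := []page{ // e.g. two pages of up to 500 instances each
		{instances: map[string]string{"1111": "zone=us-east1-b", "2222": "zone=us-east1-c"}},
		{instances: map[string]string{"3333": "zone=us-west1-a"}},
	}
	refreshCache(instanceCache, pages)
	fmt.Println("cached instances:", len(instanceCache)) // prints: cached instances: 3
}
```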
How is cache invalidation wrong
Issue 1 - expired elements are not cleaned up
TL;DR: once filled, the cache is never really refreshed because the refresh condition expects an empty cache, a case that never happens.
The problem starts with these lines:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Lines 173 to 176 in 7449e5c
This check was implemented to prevent refreshing the cache at each iteration, and to wait for the cache to be empty before performing a refresh.
But in which case does `Size()` return `0`? Looking at its implementation:

beats/libbeat/common/cache.go, Lines 230 to 236 in fc2212d

we notice that `Size()` returns the number of all elements, both active and expired (those that have not yet been cleaned up). When does clean-up happen? When `Cleanup()` is called, something that never happens in Metricbeat code. This is the first issue: once filled, the cache is never really refreshed because the refresh condition expects an empty cache, a case that never happens.
Issue 2 - element expiration time is extended on access
TL;DR: the expiration time of an element is extended on access, so elements would never be cleaned up (and the cache size would never be 0 after the first fill).
But there is a second problem: even if the cache were correctly cleaned up, when is an element considered expired?

The `get` function implementing cached element retrieval may update the element's expiration time, depending on an internal attribute (`accessExpire`). The Metricbeat code instantiates the cache with `common.NewCache(...)`, which is implemented as:

beats/libbeat/common/cache.go, Lines 82 to 84 in fc2212d

The second parameter is `accessExpire`:

beats/libbeat/common/cache.go, Line 102 in fc2212d

which is set to `true` by default. This is the second problem: the expiration time of an element is extended on access, so elements would never be cleaned up (and the cache size would never be 0 after the first fill).
Issue 3 - cache expires between periods
TL;DR: all elements expire after 30 seconds and we access them every 60 seconds; because they are never cleaned up, they are neither retrieved from the cache nor is the cache refreshed.
Up to this point it seems the cache is not really working as expected, but there is yet another wrong behaviour.

The cache timeout is initialized to `cacheTTL`, which is set to:

beats/x-pack/metricbeat/module/gcp/metrics/compute/metadata.go, Line 24 in 7449e5c

In `gcp`, any period of less than 60 seconds cannot be configured, because that is the minimum valid collection period on the GCP API side. This is the third issue: all elements expire after 30 seconds and we access them every 60 seconds; because they are never cleaned up, they are neither retrieved from the cache nor is the cache refreshed.
Issue 4 - MetadataService is instantiated at each period
The cache is instantiated in `compute.NewMetadataService`. This function is invoked, at every collection, through:

- gcp.NewMetadataService
- gcp.eventMapping
- gcp.Fetch

This is the fourth issue: the cache instance is re-created for each data collection, so it always starts empty.
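A sketch of what that per-period instantiation means in practice (the names below are illustrative, not the real beats types):

```go
package main

import "fmt"

// The metadata service (and its cache) is rebuilt on every Fetch, so
// whatever was cached in one collection period is gone by the next.
type metadataService struct {
	instanceCache map[string]string
}

func newMetadataService() *metadataService {
	return &metadataService{instanceCache: map[string]string{}} // always empty
}

// fetch stands in for one collection period.
func fetch() int {
	svc := newMetadataService() // a brand-new service (and cache) each time
	svc.instanceCache["instance-1"] = "metadata"
	return len(svc.instanceCache)
}

func main() {
	fetch() // first period fills its own, short-lived cache

	// The next period starts from scratch: the cache it sees is empty.
	svc := newMetadataService()
	fmt.Println("cache size at start of next period:", len(svc.instanceCache))
	// prints: cache size at start of next period: 0
}
```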
Conclusions
Due to these 4 issues, the cache behaviour is wrong and the cache is never used.