
[ACM] Optimization / caching #2220

Closed
jalvz opened this issue May 24, 2019 · 24 comments · Fixed by #2337

@jalvz
Contributor

jalvz commented May 24, 2019

Goal: reduce requests to Kibana as much as possible.
Main risk comes from RUM agent.
This task should include a small benchmark.

Requires #2095

@simitt
Contributor

simitt commented Jun 24, 2019

Initial thoughts around cache handling for ACM:

Step 1:

Implement a simple LRU cache with fixed size or a dynamically sized cache with expiring keys, configurable via apm-server.yml.
Cache granularity: service.name + service.env, as those are the currently supported query parameters.
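For illustration, a minimal sketch of such an expiring cache, keyed by service.name + service.env (package, type, and field names are assumptions, not the actual apm-server code):

```go
package agentcfg // package and type names are assumptions for this sketch

import (
	"sync"
	"time"
)

// entry holds one agent configuration fetched from Kibana plus its expiry.
type entry struct {
	settings  map[string]string
	expiresAt time.Time
}

// ConfigCache is a small expiring cache keyed by service.name + service.env.
type ConfigCache struct {
	mu  sync.RWMutex
	ttl time.Duration // cache expiration, configurable via apm-server.yml
	m   map[string]entry
}

func NewConfigCache(ttl time.Duration) *ConfigCache {
	return &ConfigCache{ttl: ttl, m: make(map[string]entry)}
}

func cacheKey(serviceName, serviceEnv string) string {
	return serviceName + "|" + serviceEnv
}

// Get returns a cached config if present and not yet expired.
func (c *ConfigCache) Get(serviceName, serviceEnv string) (map[string]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.m[cacheKey(serviceName, serviceEnv)]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.settings, true
}

// Set stores a config fetched from Kibana with the configured expiration.
func (c *ConfigCache) Set(serviceName, serviceEnv string, settings map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[cacheKey(serviceName, serviceEnv)] = entry{settings: settings, expiresAt: time.Now().Add(c.ttl)}
}
```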

Currently ACM can only be used from authenticated agents (backend agents) when a token is configured.
If APM Server is not token-secured but publicly reachable, this generally leads to the same risks as for the backend Intake API (malicious requests, no rate limiting), but adds the risk of overloading Kibana.

As long as RUM agents don't support ACM, this seems OK, as we give our users a way of safely using this endpoint by requiring a token, and the unsecured setup is not recommended.

Step 2:

Introduce a dedicated ACM endpoint for RUM agents, or use User-Agent information to differentiate RUM and backend agents. This makes it possible to enforce token usage for backend agents while applying rate limiting to RUM agent requests.

We need to provide a solution where users can enable ACM for RUM without being exposed to the risks mentioned above by having to disable the token. An ACM endpoint used by the RUM agent must be at least as secure as the Intake API endpoint for RUM. The RUM Intake endpoint does not support authentication at the moment, but rate limiting per IP is applied by default.

User-Agent information can easily be faked, but as soon as any non-RUM information is provided, the request must include the token to authenticate (if authentication is enabled).

Applying RUM rate limiting for all requests would be feasible but confusing for two reasons: the config options are nested under rum, and they are not applied to the backend Intake API.
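A rough sketch of the dedicated-endpoint variant; the paths, the requireToken middleware, and the handler wiring are illustrative assumptions, not actual apm-server code:

```go
package main

import "net/http"

// configHandler would look up the agent config (via the cache) and write it out.
var configHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// requireToken enforces the secret token for backend agents when one is configured.
func requireToken(secret string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if secret != "" && r.Header.Get("Authorization") != "Bearer "+secret {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	// Backend agents: token enforced when configured; no rate limiting.
	mux.Handle("/config/v1/agents", requireToken("secret-token", configHandler))
	// RUM agents: no token, but a per-IP rate limiter would wrap the handler here,
	// mirroring the RUM Intake endpoint.
	mux.Handle("/config/v1/rum/agents", configHandler)
	http.ListenAndServe(":8200", mux)
}
```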

Step 2b:

Implement separate caches for backend and RUM agents, to avoid dropping backend keys from the cache when the cache size can't accommodate the number of RUM queries.

This is important, as bad cache config values or malicious RUM requests could overload Kibana. If no cache information for backend agents is stored and Kibana is not reachable, backend agents could not be served. With proper caching, only static information would be returned while Kibana is unreachable, but this still improves the situation for newly started or added backend agents (getting outdated config vs. getting no config from the server).

How can we keep configuration simple for two different caches?
Use a dedicated cache config setting for backend agents. Rename rum.event_rate to rum.cache (in a backwards-compatible way) and use the same settings for rate limiting and the config cache. Maybe provide an option rum.event_rate.enabled to turn off rate limiting explicitly.
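Building on the ConfigCache sketch from Step 1 (same hypothetical package and imports), the separation could look roughly like this; setting names and TTLs are placeholders, not final apm-server config keys:

```go
// Separate caches, so a flood of unique RUM queries cannot evict backend keys.
var (
	backendCache = NewConfigCache(30 * time.Second) // sized/expired via a backend-specific setting
	rumCache     = NewConfigCache(30 * time.Second) // sized/expired via the rum.* settings
)

// lookupConfig keeps backend entries isolated from (potentially many) RUM entries.
func lookupConfig(isRUM bool, serviceName, serviceEnv string) (map[string]string, bool) {
	if isRUM {
		return rumCache.Get(serviceName, serviceEnv)
	}
	return backendCache.Get(serviceName, serviceEnv)
}
```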

@simitt
Contributor

simitt commented Jun 24, 2019

Update to the above proposal (2b): a dedicated cache size setting would make more sense, as the number of unique parallel users is a different dimension from the number of unique services using the RUM agent.

@graphaelli
Member

Update: we should not expect to support CM for RUM in this first iteration.

simitt added a commit to simitt/apm-server that referenced this issue Jun 25, 2019
simitt added a commit to simitt/apm-server that referenced this issue Jun 26, 2019
@simitt
Contributor

simitt commented Jun 26, 2019

Having a smaller poll interval from the agents to the server than the server's caching interval doesn't make any sense, as the server would serve the request from cached objects. The polling interval and cache expiration should be aligned.

Following up on some offline discussions and @jalvz's suggestion to leverage http header Cache-Control, this is the suggested flow:

  • No polling interval setting in agents
  • Cache expiration configuration in APM Server, defaults to 30s, will be applied for all services
  • The APM Server sets a header in the response to an agent's config request indicating when the agent should poll for the configuration the next time. I suggest using the Cache-Control: max-age=<seconds> header in the server response.
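A minimal sketch of the proposed response flow; the handler name is illustrative and 30s stands in for the configured default expiration:

```go
package agentcfg

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// writeConfig advertises the configured expiration via Cache-Control, so the
// agent knows when to poll for the configuration the next time.
func writeConfig(w http.ResponseWriter, settings map[string]string, cacheExpiration time.Duration) {
	w.Header().Set("Cache-Control", fmt.Sprintf("max-age=%d", int(cacheExpiration.Seconds())))
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(settings) // e.g. {"transaction_sample_rate": "0.5"}
}
```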

@elastic/apm-agent-devs please review and tick off if you agree with this solution:

| Agent | Yes | No | Comment |
| --- | --- | --- | --- |
| .NET | | | |
| Go | | | |
| Java | | | |
| Node.js | | | |
| Python | | | |
| Ruby | | | |
| RUM | | | |

@watson
Contributor

watson commented Jun 26, 2019

You probably already thought of this, but I would make sure that the Cache-Control: max-age=<seconds> header is set to the value of the cache expiration configuration, and not the actual number of seconds left until the cache technically expires.

If we did the latter, and the agent polled just 100ms before the cache expired, it would be told to poll again 100ms later. If it's just always 30s (or whatever the value is), then we don't have to deal with that edge case.

@beniwohli
Contributor

I'm not sure we can quite get rid of a default poll interval on the agent side, as we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3). But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management, but didn't find a matching configuration, so no need for a version header (cc @jalvz)
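On the agent side, that could look roughly like the following sketch (Go used for illustration only; the 5-minute fallback matches the default discussed further down, and all names are assumptions):

```go
package agent

import (
	"net/http"
	"regexp"
	"strconv"
	"time"
)

// fallbackInterval is the assumed agent-wide default used when the response
// carries no Cache-Control header (e.g. a 404 from a pre-7.3 server).
const fallbackInterval = 5 * time.Minute

var maxAgeRe = regexp.MustCompile(`max-age=(\d+)`)

// nextPollDelay derives the next config poll delay from the server response.
// A Cache-Control header on a 404 also signals that the server supports config
// management but found no matching configuration.
func nextPollDelay(resp *http.Response) time.Duration {
	if m := maxAgeRe.FindStringSubmatch(resp.Header.Get("Cache-Control")); m != nil {
		if secs, err := strconv.Atoi(m[1]); err == nil {
			return time.Duration(secs) * time.Second
		}
	}
	return fallbackInterval
}
```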

@simitt
Contributor

simitt commented Jun 26, 2019

I would make sure that the Cache-Control: max-age= header is set to the value of the cache expiration configuration

Sorry for not being more precise in the description. That's actually what I had planned, as there is no way to retrieve the TTL from the value in the cache with the underlying library we are using for caching. Also I wouldn't see value in forcing the agent to poll in irregular intervals.

@simitt
Contributor

simitt commented Jun 26, 2019

In an error case the header would be set to Cache-Control: max-age=0. The agent needs to know how to deal with that and when to fetch again.

@beniwohli
Contributor

@simitt could the server in an error scenario set the max-age to some sane retry value instead of 0? That would avoid us having to introduce another (possibly configurable) interval setting, and it would also remove the need to implement another special case on every single agent

@simitt
Contributor

simitt commented Jun 26, 2019

Yes, we can simply always return a max-age=<cache-expiration> if that helps.

@eyalkoren
Contributor

@beniwohli

we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3)

Is this about supporting APM server upgrades?

But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

Like, 5 minutes?

@beniwohli
Contributor

@beniwohli

we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3)

Is this about supporting APM server upgrades?

Yes, sorry, should have clarified that. I think we should avoid having to explain to users that they need to restart their app deployments when they upgrade to APM Server 7.3+ if we can.

But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

Like, 5 minutes?

That sounds reasonable.

@vigneshshanmugam
Member

vigneshshanmugam commented Jun 27, 2019

If we are going to leverage the Cache-Control header to determine the next poll from the agents, my suggestion would be to also include must-revalidate in the Cache-Control header to be compatible with the HTTP spec, and we can use If-Modified-Since or similar logic in future versions to understand whether the config has changed since the last fetch.

cache-control: must-revalidate, max-age=<cache-expiration> 
// agents can cache it for next <cache-expiration> seconds and must poll the apm server after that to check the config status 

This also helps the RUM agent if we decide to use the same cache expiration logic.
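For the revalidation part, a rough server-side sketch combining must-revalidate with an entity tag (simitt notes further down that Etag is used for this); the hashing scheme and function name are illustrative:

```go
package agentcfg

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// writeConfigWithEtag lets agents revalidate cheaply: if the config is
// unchanged since the last fetch, the server answers 304 Not Modified.
func writeConfigWithEtag(w http.ResponseWriter, r *http.Request, body []byte, maxAge int) {
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(body))
	w.Header().Set("Cache-Control", fmt.Sprintf("must-revalidate, max-age=%d", maxAge))
	w.Header().Set("Etag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified) // config unchanged since last fetch
		return
	}
	w.Write(body)
}
```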

@eyalkoren
Contributor

@simitt do we allow multiple APM servers with different versions?
I ask because it would make sense to say we don't support that, but we actually must be able not to break in such scenarios, since I assume an upgrade of a scaled-out layer of multiple servers will be done gradually.

@simitt
Contributor

simitt commented Jun 27, 2019

@eyalkoren I'm not aware of anything disallowing multiple servers with different versions. That needs to be handled on the agent side. Why would agents break in such scenarios? The agents need to work with >7.3 and <7.3 branches anyway, unless you release a major version that drops support, in which case I also don't see how any problem would be related to this caching handling.

@beniwohli
Contributor

beniwohli commented Jun 27, 2019

I guess one scenario with potentially confusing behaviour could be

  • hit config endpoint of a 7.3 server, it returns config and a (relatively short) max-age
  • wait for max-age seconds, hit config endpoint again, this time a 7.2 server answers with a 404 and no max-age. We fall back to long waiting period (let's say 5 minutes)
  • user changes sample rate in kibana and wonders why some app deployments take a long time to apply the configuration

But that seems like a pretty borderline scenario.

@eyalkoren
Contributor

@simitt wherever this is the case, users will see the strange behaviour of their configuration being applied at (seemingly) arbitrary times in different agents. You are right, agents shouldn't break and should handle that, but the experience will seem broken.

@beniwohli probably not the most common scenario, but I believe this is how any APM server layer composed of multiple instances will behave when upgraded.

Bottom line: not sure if we need to do anything other than being aware...

@simitt
Contributor

simitt commented Jun 27, 2019

That's a good point. We should ensure we have really good APM changelog descriptions in place for this feature. Pulling in @bmorelli25 for visibility.

@jalvz
Contributor Author

jalvz commented Jun 27, 2019

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management

@beniwohli That seems a bit brittle to me; I opened #2341 to explore options, but I guess in the meantime that will do.

@beniwohli
Contributor

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management

@beniwohli That seems a bit brittle to me; I opened #2341 to explore options, but I guess in the meantime that will do.

Indeed, e.g. for GET requests, some cache/proxy in-between might add their own Cache-Control header (although I'm not sure if caching 404 answers is good practice). I use POST requests for other reasons already, so that specific problem wouldn't be an issue. Wondering if we should make it a recommendation for backend agents to only hit the API with POST requests.

@jalvz
Contributor Author

jalvz commented Jun 27, 2019

I interpret max-age=Z informally as "after Z seconds, the data I am giving to you might be obsolete, so feel free to reach out again if that worries you", so to me it doesn't make a whole lot of sense to say that for errors.

The logic in the agents that makes more sense to me is "on error, wait 5 minutes before trying again".
No need to make that configurable IMO, but if you always want a Cache-Control header for 4xx responses then of course we can just do it.

Wondering if we should make it a recommendation for backend agents to only hit the API with POST requests

@simitt pointed out that we need a different endpoint for RUM in any case, so I'd like to propose only accepting POST.

@simitt
Contributor

simitt commented Jun 27, 2019

must-revalidate

will add it

we can use If-Modified-Since or whatever logic in the next versions to understand if the config cache has changed with last fetch.

@vigneshshanmugam we are using Etag for this

The logic in the agents that makes more sense to me is "on error, wait 5 minutes before trying again".
No need to make that configurable IMO, but if you always want a Cache-Control header for 4xx responses then of course we can just do it.

@beniwohli @jalvz can we settle on this please? I have no strong opinion on this; my initial suggestion was to set it to 0, indicating that the resource cannot be used by the agent because there was an error. Technically it is not a problem to return another value, e.g. 5m as suggested. So I'd say whatever is easier for the agents.

@axw
Member

axw commented Jun 28, 2019

I have no strong opinion on this; my initial suggestion was to set it to 0, indicating that the resource cannot be used by the agent because there was an error. Technically it is not a problem to return another value, e.g. 5m as suggested. So I'd say whatever is easier for the agents.

Agents will need to cater for old servers that don't support ACM, as well as broken servers and broken proxies. If the cache-control header is missing from the response, then I think we should fall back to the suggested 5 minutes.

I don't think we should be attaching our own additional semantic meaning to the cache-control value (i.e. no special case 0 value), since it may be interpreted not only by the agents but potentially also by proxies. You can cache a negative response (like a 404), but seeing as the agent will need to handle the other cases anyway, I think we may as well just leave the cache-control header off in this case.

@simitt
Contributor

simitt commented Jul 1, 2019

As agreed offline, a Cache-Control header with max-age=300 will be returned in error cases.
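A tiny sketch of that agreed error behaviour (the handler name is illustrative):

```go
package agentcfg

import "net/http"

// writeNotFound: even when no matching config is found (or an error occurs),
// tell the agent when to come back, so no extra agent-side retry interval is needed.
func writeNotFound(w http.ResponseWriter) {
	w.Header().Set("Cache-Control", "max-age=300")
	http.Error(w, "no matching agent configuration found", http.StatusNotFound)
}
```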

simitt added a commit that referenced this issue Jul 2, 2019
* Add cache to agent config management

closes #2220
simitt added a commit to simitt/apm-server that referenced this issue Jul 2, 2019
@simitt simitt mentioned this issue Oct 4, 2019