
[ACM] Optimization / caching #2220

Closed
jalvz opened this issue May 24, 2019 · 24 comments · Fixed by #2337

@jalvz
Contributor

jalvz commented May 24, 2019

Goal: reduce requests to Kibana as much as possible.
Main risk comes from RUM agent.
This task should include a small benchmark.

Requires #2095

@simitt
Contributor

simitt commented Jun 24, 2019

Initial thoughts around cache handling for ACM:

Step 1:

Implement a simple LRU cache with fixed size or a dynamically sized cache with expiring keys, configurable via apm-server.yml.
Cache granularity: service.name + service.env, as those are the currently supported query parameters.
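For illustration, a minimal sketch of such an expiring cache, keyed by service.name + service.env (package, type, and field names are assumptions, not the actual apm-server code):

```go
package agentcfg // package and type names are assumptions for this sketch

import (
	"sync"
	"time"
)

// entry holds one agent configuration fetched from Kibana plus its expiry.
type entry struct {
	settings  map[string]string
	expiresAt time.Time
}

// ConfigCache is a small expiring cache keyed by service.name + service.env.
type ConfigCache struct {
	mu  sync.RWMutex
	ttl time.Duration // cache expiration, configurable via apm-server.yml
	m   map[string]entry
}

func NewConfigCache(ttl time.Duration) *ConfigCache {
	return &ConfigCache{ttl: ttl, m: make(map[string]entry)}
}

func cacheKey(serviceName, serviceEnv string) string {
	return serviceName + "|" + serviceEnv
}

// Get returns a cached config if present and not yet expired.
func (c *ConfigCache) Get(serviceName, serviceEnv string) (map[string]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	e, ok := c.m[cacheKey(serviceName, serviceEnv)]
	if !ok || time.Now().After(e.expiresAt) {
		return nil, false
	}
	return e.settings, true
}

// Set stores a config fetched from Kibana with the configured expiration.
func (c *ConfigCache) Set(serviceName, serviceEnv string, settings map[string]string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[cacheKey(serviceName, serviceEnv)] = entry{settings: settings, expiresAt: time.Now().Add(c.ttl)}
}
```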

Currently ACM can only be used from authenticated agents (backend agents) when a token is configured.
If APM Server is not token-secured but publicly reachable, this generally leads to the same risks as for the backend Intake API (malicious requests, no rate limiting), but adds the risk of overloading Kibana.

As long as RUM agents don't support ACM, this seems OK, as we give our users a way of safely using this endpoint by requiring a token, and the unsecured setup is not recommended.

Step 2:

Introduce a dedicated ACM endpoint for RUM agents, or use User-Agent information to differentiate RUM and backend agents. This makes it possible to enforce token usage for backend agents while applying rate limiting to RUM agent requests.

We need to provide a solution where users can enable ACM for RUM without being exposed to the risks mentioned above by having to disable the token. An ACM endpoint used by the RUM agent must be at least as secure as the Intake API endpoint for RUM. The RUM Intake endpoint does not support authentication at the moment, but rate limiting per IP is applied by default.

User-Agent information can easily be faked, but as soon as any non-RUM information is provided, the request must include the token to authenticate (if authentication is enabled).

Applying RUM rate limiting for all requests would be feasible but confusing for two reasons: the config options are nested under rum, and they are not applied to the backend Intake API.
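A rough sketch of the dedicated-endpoint variant; the paths, the requireToken middleware, and the handler wiring are illustrative assumptions, not actual apm-server code:

```go
package main

import "net/http"

// configHandler would look up the agent config (via the cache) and write it out.
var configHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// requireToken enforces the secret token for backend agents when one is configured.
func requireToken(secret string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if secret != "" && r.Header.Get("Authorization") != "Bearer "+secret {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	// Backend agents: token enforced when configured; no rate limiting.
	mux.Handle("/config/v1/agents", requireToken("secret-token", configHandler))
	// RUM agents: no token, but a per-IP rate limiter would wrap the handler here,
	// mirroring the RUM Intake endpoint.
	mux.Handle("/config/v1/rum/agents", configHandler)
	http.ListenAndServe(":8200", mux)
}
```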

Step 2b:

Implement separate caches for backend and RUM agents, to avoid dropping backend keys from the cache when the cache size can't accommodate the number of RUM queries.

This is important, as bad cache config values or malicious RUM requests could overload Kibana. If no cache information for backend agents is stored and Kibana is not reachable, backend agents could not be served. With proper caching, only static information would be returned while Kibana is unreachable, but this still improves the situation for newly started or added backend agents (getting outdated config vs. getting no config from the server).

How can we keep configuration simple for two different caches?
Use a dedicated cache config setting for backend agents. Rename rum.event_rate to rum.cache (in a backwards-compatible way) and use the same settings for rate limiting and the config cache. Maybe provide an option rum.event_rate.enabled to turn off rate limiting explicitly.
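Building on the ConfigCache sketch from Step 1 (same hypothetical package and imports), the separation could look roughly like this; setting names and TTLs are placeholders, not final apm-server config keys:

```go
// Separate caches, so a flood of unique RUM queries cannot evict backend keys.
var (
	backendCache = NewConfigCache(30 * time.Second) // sized/expired via a backend-specific setting
	rumCache     = NewConfigCache(30 * time.Second) // sized/expired via the rum.* settings
)

// lookupConfig keeps backend entries isolated from (potentially many) RUM entries.
func lookupConfig(isRUM bool, serviceName, serviceEnv string) (map[string]string, bool) {
	if isRUM {
		return rumCache.Get(serviceName, serviceEnv)
	}
	return backendCache.Get(serviceName, serviceEnv)
}
```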

@simitt
Contributor

simitt commented Jun 24, 2019

Update to the above proposal (2b): a dedicated cache size setting would make more sense, as the number of unique parallel users is a different dimension from the number of unique services using the RUM agent.

@graphaelli
Member

Update: we should not expect to support CM for RUM in this first iteration.

simitt added a commit to simitt/apm-server that referenced this issue Jun 25, 2019
simitt added a commit to simitt/apm-server that referenced this issue Jun 26, 2019
@simitt
Contributor

simitt commented Jun 26, 2019

Having a smaller poll interval from the agents to the server than the server's caching interval doesn't make any sense, as the server would serve the request from cached objects. The polling interval and cache expiration should be aligned.

Following up on some offline discussions and @jalvz's suggestion to leverage http header Cache-Control, this is the suggested flow:

  • No polling interval setting in agents
  • Cache expiration configuration in APM Server, defaults to 30s, will be applied for all services
  • The APM Server sets a header in the response to an agent's config request indicating when the agent should poll for the configuration the next time. I suggest using the Cache-Control: max-age=<seconds> header in the server response.
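A minimal sketch of the proposed response flow; the handler name is illustrative and 30s stands in for the configured default expiration:

```go
package agentcfg

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// writeConfig advertises the configured expiration via Cache-Control, so the
// agent knows when to poll for the configuration the next time.
func writeConfig(w http.ResponseWriter, settings map[string]string, cacheExpiration time.Duration) {
	w.Header().Set("Cache-Control", fmt.Sprintf("max-age=%d", int(cacheExpiration.Seconds())))
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(settings) // e.g. {"transaction_sample_rate": "0.5"}
}
```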

@elastic/apm-agent-devs please review and tick off if you agree with this solution:

| Agent | Yes | No | Comment |
| --- | --- | --- | --- |
| .NET | | | |
| Go | | | |
| Java | | | |
| Node.js | | | |
| Python | | | |
| Ruby | | | |
| RUM | | | |

@watson
Contributor

watson commented Jun 26, 2019

You probably already thought of this, but I would make sure that the Cache-Control: max-age=<seconds> header is set to the value of the cache expiration configuration, and not the actual number of seconds left until the cache technically expires.

If we did the latter, and the agent polled just 100ms before the cache expired, it would be told to poll again 100ms later. If it's just always 30s (or whatever the value is), then we don't have to deal with that edge case.

@beniwohli
Contributor

I'm not sure we can quite get rid of a default poll interval on the agent side, as we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3). But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management, but didn't find a matching configuration, so no need for a version header (cc @jalvz)
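On the agent side, that could look roughly like the following sketch (Go used for illustration only; the 5-minute fallback matches the default discussed further down, and all names are assumptions):

```go
package agent

import (
	"net/http"
	"regexp"
	"strconv"
	"time"
)

// fallbackInterval is the assumed agent-wide default used when the response
// carries no Cache-Control header (e.g. a 404 from a pre-7.3 server).
const fallbackInterval = 5 * time.Minute

var maxAgeRe = regexp.MustCompile(`max-age=(\d+)`)

// nextPollDelay derives the next config poll delay from the server response.
// A Cache-Control header on a 404 also signals that the server supports config
// management but found no matching configuration.
func nextPollDelay(resp *http.Response) time.Duration {
	if m := maxAgeRe.FindStringSubmatch(resp.Header.Get("Cache-Control")); m != nil {
		if secs, err := strconv.Atoi(m[1]); err == nil {
			return time.Duration(secs) * time.Second
		}
	}
	return fallbackInterval
}
```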

@simitt
Contributor

simitt commented Jun 26, 2019

I would make sure that the Cache-Control: max-age= header is set to the value of the cache expiration configuration

Sorry for not being more precise in the description. That's actually what I had planned, as there is no way to retrieve the TTL from the value in the cache with the underlying library we are using for caching. Also I wouldn't see value in forcing the agent to poll in irregular intervals.

@simitt
Contributor

simitt commented Jun 26, 2019

In an error case the header would be set to Cache-Control: max-age=0. The agent needs to know how to deal with that and when to fetch again.

@beniwohli
Contributor

@simitt could the server in an error scenario set the max-age to some sane retry value instead of 0? That would avoid us having to introduce another (possibly configurable) interval setting, and it would also remove the need to implement another special case on every single agent

@simitt
Contributor

simitt commented Jun 26, 2019

Yes, we can simply always return a max-age=<cache-expiration> if that helps.

@eyalkoren
Contributor

@beniwohli

we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3)

Is this about supporting APM server upgrades?

But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

Like, 5 minutes?

@beniwohli
Contributor

@beniwohli

we need to continue polling if the server returns a 404 without a Cache-Control header (i.e. because the deployed apm-server is pre-7.3)

Is this about supporting APM server upgrades?

Yes, sorry, should have clarified that. I think we should avoid having to explain to users that they need to restart their app deployments when they upgrade to APM Server 7.3+ if we can.

But this probably doesn't need to be a config option, just a sane default that all agents agree upon, and IMO it can be quite high.

Like, 5 minutes?

That sounds reasonable.

@vigneshshanmugam
Member

vigneshshanmugam commented Jun 27, 2019

If we are going to leverage the Cache-Control header to determine the next poll from the agents, my suggestion would be to also include must-revalidate in the Cache-Control header to be compatible with the HTTP spec, and we can use If-Modified-Since or similar logic in future versions to understand whether the config has changed since the last fetch.

cache-control: must-revalidate, max-age=<cache-expiration> 
// agents can cache it for next <cache-expiration> seconds and must poll the apm server after that to check the config status 

This also helps the RUM agent if we decide to use the same cache expiration logic.
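For the revalidation part, a rough server-side sketch combining must-revalidate with an entity tag (simitt notes further down that Etag is used for this); the hashing scheme and function name are illustrative:

```go
package agentcfg

import (
	"crypto/sha256"
	"fmt"
	"net/http"
)

// writeConfigWithEtag lets agents revalidate cheaply: if the config is
// unchanged since the last fetch, the server answers 304 Not Modified.
func writeConfigWithEtag(w http.ResponseWriter, r *http.Request, body []byte, maxAge int) {
	etag := fmt.Sprintf(`"%x"`, sha256.Sum256(body))
	w.Header().Set("Cache-Control", fmt.Sprintf("must-revalidate, max-age=%d", maxAge))
	w.Header().Set("Etag", etag)
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified) // config unchanged since last fetch
		return
	}
	w.Write(body)
}
```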

@eyalkoren
Contributor

@simitt do we allow multiple APM servers with different versions?
I ask because it would make sense to say we don't support that, but we actually must be able not to break in such scenarios, since I assume an upgrade of a scaled-out layer of multiple servers will be done gradually.

@simitt
Contributor

simitt commented Jun 27, 2019

@eyalkoren I'm not aware of anything disallowing multiple servers with different versions. That needs to be handled on the agent side. Why would agents break in such scenarios? The agents need to work with >7.3 and <7.3 branches anyway, unless you release a major version that drops support, in which case I also don't see how any problem would be related to this caching handling.

@beniwohli
Contributor

beniwohli commented Jun 27, 2019

I guess one scenario with potentially confusing behaviour could be

  • hit config endpoint of a 7.3 server, it returns config and a (relatively short) max-age
  • wait for max-age seconds, hit config endpoint again, this time a 7.2 server answers with a 404 and no max-age. We fall back to long waiting period (let's say 5 minutes)
  • user changes sample rate in kibana and wonders why some app deployments take a long time to apply the configuration

But that seems like a pretty borderline scenario.

@eyalkoren
Contributor

@simitt wherever this is the case, users will see the strange behaviour of their configuration being applied at (seemingly) arbitrary times in different agents. You are right, agents shouldn't break and should handle that, but the experience will seem broken.

@beniwohli probably not the most common scenario, but I believe this is how any APM server layer composed of multiple instances will behave when upgraded.

Bottom line: not sure if we need to do anything other than being aware...

@simitt
Contributor

simitt commented Jun 27, 2019

That's a good point. We should ensure we have really good APM changelog descriptions in place for this feature. Pulling in @bmorelli25 for visibility.

@jalvz
Contributor Author

jalvz commented Jun 27, 2019

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management

@beniwohli That seems a bit brittle to me; I opened #2341 to explore options, but I guess in the meantime that will do.

@beniwohli
Contributor

The good news is that we can use the existence of a Cache-Control header in a 404 response as an indicator that the server supports config management

@beniwohli That seems a bit brittle to me; I opened #2341 to explore options, but I guess in the meantime that will do.

Indeed, e.g. for GET requests, some cache/proxy in-between might add their own Cache-Control header (although I'm not sure if caching 404 answers is good practice). I use POST requests for other reasons already, so that specific problem wouldn't be an issue. Wondering if we should make it a recommendation for backend agents to only hit the API with POST requests.

@jalvz
Contributor Author

jalvz commented Jun 27, 2019

I interpret max-age=Z informally as "after Z seconds, the data I am giving to you might be obsolete, so feel free to reach out again if that worries you", so to me it doesn't make a whole lot of sense to say that for errors.

The logic in the agents that makes more sense to me is "on error, wait 5 minutes before trying again".
No need to make that configurable IMO, but if you always want a Cache-Control header for 4xx responses then of course we can just do it.

Wondering if we should make it a recommendation for backend agents to only hit the API with POST requests

@simitt pointed out that we need a different endpoint for RUM in any case, so I'd like to propose only accepting POST.

@simitt
Contributor

simitt commented Jun 27, 2019

must-revalidate

will add it

we can use If-Modified-Since or whatever logic in the next versions to understand if the config cache has changed with last fetch.

@vigneshshanmugam we are using Etag for this

The logic in the agents that makes more sense to me is "on error, wait 5 minutes before trying again".
No need to make that configurable IMO, but if you always want a Cache-Control header for 4xx responses then of course we can just do it.

@beniwohli @jalvz can we settle on this please? I have no strong opinion on this; my initial suggestion was to set it to 0, indicating that the resource cannot be used by the agent because there was an error. Technically it is not a problem to return another value, e.g. 5m as suggested. So I'd say whatever is easier for the agents.

@axw
Member

axw commented Jun 28, 2019

I have no strong opinion on this; my initial suggestion was to set it to 0, indicating that the resource cannot be used by the agent because there was an error. Technically it is not a problem to return another value, e.g. 5m as suggested. So I'd say whatever is easier for the agents.

Agents will need to cater for old servers that don't support ACM, as well as broken servers and broken proxies. If the cache-control header is missing from the response, then I think we should fall back to the suggested 5 minutes.

I don't think we should be attaching our own additional semantic meaning to the cache-control value (i.e. no special case 0 value), since it may be interpreted not only by the agents but potentially also by proxies. You can cache a negative response (like a 404), but seeing as the agent will need to handle the other cases anyway, I think we may as well just leave the cache-control header off in this case.

@simitt
Contributor

simitt commented Jul 1, 2019

As agreed offline, a Cache-Control header with max-age=300 will be returned in error cases.
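A tiny sketch of that agreed error behaviour (the handler name is illustrative):

```go
package agentcfg

import "net/http"

// writeNotFound: even when no matching config is found (or an error occurs),
// tell the agent when to come back, so no extra agent-side retry interval is needed.
func writeNotFound(w http.ResponseWriter) {
	w.Header().Set("Cache-Control", "max-age=300")
	http.Error(w, "no matching agent configuration found", http.StatusNotFound)
}
```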

simitt added a commit that referenced this issue Jul 2, 2019
* Add cache to agent config management

closes #2220
simitt added a commit to simitt/apm-server that referenced this issue Jul 2, 2019
@simitt simitt mentioned this issue Oct 4, 2019