[ACM] Optimization / caching #2220
Initial thoughts around cache handling for ACM:

Step 1: Implement a simple LRU cache with fixed size, or a dynamically sized cache with expiring keys, configurable via a server setting.

Currently ACM can only be used from authenticated agents (backend agents) when a token is configured. As long as RUM agents don't support ACM, this seems OK, as we provide our users a way of safely using this endpoint via a token, and the unsecured setup is not recommended.

Step 2: Introduce a dedicated ACM endpoint for RUM agents, or make use of User-Agent information to differentiate RUM and backend agents. This allows forcing token usage for backend agents while applying rate limiting to RUM agent requests. We need to provide a solution where users can enable ACM for RUM without being exposed to the above-mentioned risk by being forced to disable the token. An ACM endpoint used by the RUM agent must be at least as secure as the Intake API endpoint for RUM. The RUM Intake endpoint does not support authentication at the moment, but rate limiting per IP is applied by default. User-Agent information can easily be faked, but as soon as any non-RUM information is provided, the request must provide the token to authenticate (if authentication is enabled). Applying RUM rate limiting to all requests would be feasible but confusing; for one, the config options are nested under the RUM settings.

Step 2b: Implement separate caches for backend and RUM agents, to avoid dropping backend keys from the cache when the cache size doesn't fit the number of RUM queries. This is important, as bad cache config values or malicious RUM requests could overload Kibana, and if no cache information for backend agents is stored and Kibana is not reachable, backend agents could not be served. While only static information will be returned with proper caching, this still improves the situation for newly started or added backend agents when Kibana is not reachable (getting outdated config vs. getting no config from the server).

How can we keep config easy for two different caches?
Update to the above proposal (2b): a dedicated cache size setting would make more sense, as the number of unique parallel users is a different dimension than the number of unique services using the RUM agent.
Update: we should not expect to support ACM for RUM in this first iteration.
Having a smaller poll interval on the agents than the server's caching interval doesn't make any sense, as the server would serve the request from cached objects; polling interval and cache expiration should be aligned. Following up on some offline discussions and @jalvz's suggestion to leverage an HTTP header (`Cache-Control`):
@elastic/apm-agent-devs please review and tick off if you agree with this solution:
You probably already thought of this, but I would make sure that the returned max-age is always the full polling interval, not the time remaining until the server-side cache entry expires. If we did the latter, and the agent polled just 100ms before the cache expired, it would be told to poll again 100ms later. If it's just always 30s (or whatever the value is), then we don't have to deal with that edge case.
I'm not sure if we can quite get rid of a default poll interval on the agent side, as we need to continue polling if the server returns a 404 without a `Cache-Control` header. The good news is that we can use the existence of a `Cache-Control` header to detect whether the server supports this.
Sorry for not being more precise in the description. That's actually what I had planned, as there is no way to retrieve the TTL of a value in the cache with the underlying library we are using for caching. Also, I wouldn't see value in forcing the agent to poll at irregular intervals.
In an error case the header would be set to
@simitt could the server in an error scenario set the
Yes, we can simply always return a `Cache-Control` header.
Is this about supporting APM server upgrades?
Like 5 minutes?
Yes, sorry, should have clarified that. I think we should avoid having to explain to users that they need to restart their app deployments when they upgrade to APM Server 7.3+ if we can.
That sounds reasonable.
If we are going to leverage the `Cache-Control` header, this also helps the RUM agent if we decide to use the same cache expiration logic.
@simitt do we allow multiple APM servers with different versions?
@eyalkoren I'm not aware of anything disallowing multiple servers with different versions. That needs to be handled on the agent side. Why would agents break in such a scenario? The agents need to work with >7.3 and <7.3 branches anyway, unless you release a major version stating that support is dropped, in which case I also don't see how any problem would be related to this caching handling.
I guess one scenario with potentially confusing behaviour could be
But that seems like a pretty borderline scenario. |
@simitt wherever this is the case, users will get the strange behaviour of their configuration being applied at (seemingly) arbitrary times in different agents. You are right, agents shouldn't break and should handle that, but the experience will seem broken. @beniwohli probably not the most common scenario, but I believe this is how any APM server layer composed of multiple instances will behave when upgraded. Bottom line: not sure if we need to do anything, other than being aware...
That's a good point. We should make sure we have really good APM changelog descriptions in place for this feature. Pulling in @bmorelli25 for visibility.
@beniwohli that seems a bit brittle to me; opened #2341 to explore options, but I guess in the meanwhile that will do.
Indeed, e.g. for GET requests, some cache/proxy in between might add their own `Cache-Control` headers.
I interpret that differently. The logic in the agents that makes more sense to me is "if error, wait 5 minutes before trying again".
@simitt pointed out that we need a different endpoint for RUM in any case, so I'd like to propose to only accept
will add it
@vigneshshanmugam we are using
@beniwohli @jalvz can we settle on this please? I have no strong opinion on this; my initial suggestion was to set it to
Agents will need to cater for old servers that don't support ACM, as well as broken servers and broken proxies. If the `Cache-Control` header is missing or invalid, agents should fall back to their default behaviour. I don't think we should be attaching our own additional semantic meaning to the `Cache-Control` value (i.e. no special-case 0 value), since it may be interpreted not only by the agents but potentially also by proxies. You can cache a negative response (like a 404), but seeing as the agent will need to handle the other cases anyway, I think we may as well just leave the `Cache-Control` header off in this case.
As agreed on offline, a `Cache-Control: max-age` header will be used.
* Add cache to agent config management closes #2220
Goal: reduce requests to Kibana as much as possible.
Main risk comes from RUM agent.
This task should include a small benchmark.
Requires #2095
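For the small benchmark mentioned above, the cache hit path could be measured with a micro-benchmark along these lines. The cache here is a plain map stand-in; a real benchmark would target the actual ACM cache implementation:

```go
package main

import (
	"fmt"
	"testing"
)

// Measures the cost of a cache hit, i.e. serving agent config without a
// round trip to Kibana. Runnable without `go test` via testing.Benchmark.
func main() {
	cache := map[string]string{"service-a": "config-a"}
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = cache["service-a"] // hit path: no Kibana request
		}
	})
	fmt.Println(result.N > 0)
}
```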