Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent config GA #2747

Merged
merged 3 commits into from
Oct 15, 2019
Merged

agent config GA #2747

merged 3 commits into from
Oct 15, 2019

Conversation

jalvz
Copy link
Contributor

@jalvz jalvz commented Sep 30, 2019

fixes #2694, fixes #2630 , fixes #2646

depends on elastic/kibana#46995

required by elastic/apm-agent-rum-js#439

@jalvz jalvz force-pushed the agent-config-ga branch 4 times, most recently from 0de5ad6 to 00af2db Compare October 1, 2019 11:36
@codecov-io
Copy link

codecov-io commented Oct 2, 2019

Codecov Report

Merging #2747 into master will decrease coverage by 1.36%.
The diff coverage is 82.82%.

@@            Coverage Diff            @@
##           master   #2747      +/-   ##
=========================================
- Coverage   80.16%   78.8%   -1.37%     
=========================================
  Files          83      82       -1     
  Lines        4589    4288     -301     
=========================================
- Hits         3679    3379     -300     
+ Misses        910     909       -1
Impacted Files Coverage Δ
beater/middleware/rate_limit_middleware.go 0% <0%> (ø) ⬆️
beater/middleware/rum_middleware.go 0% <0%> (ø)
kibana/connecting_client.go 40% <0%> (ø) ⬆️
beater/middleware/cors_middleware.go 100% <100%> (ø) ⬆️
beater/api/config/agent/handler.go 94.82% <100%> (+0.38%) ⬆️
beater/api/intake/handler.go 100% <100%> (ø) ⬆️
beater/request/context.go 85.96% <100%> (+0.25%) ⬆️
sourcemap/mapper.go 87.23% <100%> (ø) ⬆️
agentcfg/fetch.go 88.57% <100%> (+0.69%) ⬆️
beater/api/mux.go 69.07% <72.72%> (-1.9%) ⬇️
... and 9 more

@jalvz
Copy link
Contributor Author

jalvz commented Oct 2, 2019

failing test can't be fixed until Kibana side is merged, as Kibana returns 400 upon unknown attributes (in this case etag)
cc @sqren

@jalvz
Copy link
Contributor Author

jalvz commented Oct 3, 2019

depends on elastic/kibana#46995

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

func dotKey(k1, k2 string) string {
if k1 == "" {
return k2
// ZeroResult creates a Result struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for exporting this.

return c.Request.URL.Query().Get(agentcfg.Etag)
}

func sanitize(isRum bool, settings agentcfg.Settings) agentcfg.Settings {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agentcfg package has knowledge about RUM, and the Query has an IsRum flag. IMO this sanitize method should be part of fetching the result from the agentcfg package.

agentcfg/model.go Show resolved Hide resolved
if !c.logger.IsDebug() {
return
if !authorized(q.IsRum, result.Source.Agent) {
return ZeroResult(), nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this implementation non authorized RUM results are not cached. IMO this should be changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. If then the same request comes trough the right endpoint, the result would be a wrong empty result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kibana is responding the same way to this query, no matter who requests it, right? So I don't see why caching the result here would lead to wrong results.
Ofc I did not mean to cache the ZeroResult but the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case i don't know why i would want to cache something that i am not even returning, but im fine either way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning is the following:
What is the purpose of caching here?
To avoid unnecessary calls to Kibana based on legitimate requests from agents. But as soon as those requests can be triggered from requests to unsecured RUM endpoint, also to prevent overloading Kibana on any kind of attacks.
The authorized check would only return false if the agents had a bug or for malicious requests. In the case of malicious requests it makes sense to also avoid sending the same requests over and over to Kibana.

However, if you add it to the cache, you would probably also need to size limit the cache, as otherwise you open up another attack vector.

I don't see why the decision in which cases something should be added to the cache is in direct relation on whether or not you are returning this information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure you find this useful, but I gave this some thought when introducing the caching #2220 (comment).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I am changing the caching key to be service name + env, instead of Etag.
When there are many RUM agents, they will not send an ifnonematch arg until they get one from a previous call, which means that all of them would have to hit Kibana at least one.
Caching the service name solves this problem.

On the other side, I don't think is necessary to have a cache limit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agentcfg/fetch.go Show resolved Hide resolved
@@ -167,6 +177,7 @@ func backendMiddleware(cfg *config.Config, m map[request.ResultID]*monitoring.In

func rumMiddleware(cfg *config.Config, m map[request.ResultID]*monitoring.Int) []middleware.Middleware {
return append(apmMiddleware(m),
middleware.SetRumFlagMiddleware(),
middleware.SetRateLimitMiddleware(cfg.RumConfig.EventRate),
middleware.CORSMiddleware(cfg.RumConfig.AllowOrigins),
middleware.KillSwitchMiddleware(cfg.RumConfig.IsEnabled()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still bothers me that all of these middleware functions are run if someone has RUM disabled; as a user I'd expect the KillswitchMiddleware to be run before the CORS and rate limiter middleware at least.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually think we should do the CORS dance properly and return to the browsers the response they expect, otherwise the browser might show an error in the lines of "domain not allowed" when the root cause is other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I lack knowledge about what is the standard behavior related to CORS and browsers. If the advantage is worth the potential additional costs I'm ok with it.

@@ -31,7 +31,7 @@ func SetRateLimitMiddleware(cfg *config.EventRate) Middleware {

return func(h request.Handler) (request.Handler, error) {
return func(c *request.Context) {
c.RateLimiter = store
c.RateLimiter = store.ForIP(c.Request)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the name to include IP now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ha! i was about to do that, but then i thought you will reply with something like this: #2706 (comment)

not sure what is the criteria now, but will change as I agree

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow. The logic that the rate limiter limits by IP is now in the middleware. It was not before. So now it makes sense to reflect it in the middleware name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As clarified offline, this is a comment on L29, not L34.
Changed in 0788847

@jalvz
Copy link
Contributor Author

jalvz commented Oct 7, 2019

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

they will just not get any config updates, so they won't apply it

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanitize method is based on the querie's isRum, the cache key is based on Service attributes. You call sanitize before adding the value to the cache. In case a request is made for a backend service via the RUM endpoint and then again via the backend endpoint this will lead to wrong results.

The authorized check occurs only for non-cached information. As soon as information is cached, which is based on service attributes and ignoring RUM/backend, the information is just returned. If an unauthorized RUM request follows an authorized backend request, this will result in non-authorized response.

On the other side, I don't think is necessary to have a cache limit.

From what I see the cache can potentially grow infinite with requests for non-existing services. Since we had a bit of an offline discussion regarding caching, please describe why you decided a cache limit is not necessary.

@jalvz
Copy link
Contributor Author

jalvz commented Oct 8, 2019

You call sanitize before adding the value to the cache

Not true, as far I can see.

The authorized check occurs only for non-cached information

Right, will update.

the cache can potentially grow infinite with requests for non-existing services

Lets talk about that offline

@simitt
Copy link
Contributor

simitt commented Oct 8, 2019

You call sanitize before adding the value to the cache

Not true, as far I can see.

Within the cache.fetch you call fetch(), which leads to running the sanitize. The returned, sanitized result is then added to the cache.

@jalvz
Copy link
Contributor Author

jalvz commented Oct 8, 2019

This test 8bbae34#diff-6756162984ecc44fbead19b809440eaeR305 covers that scenario and passes locally, I have to see what's up

@jalvz
Copy link
Contributor Author

jalvz commented Oct 8, 2019

it always worked because each handler has its own cache, added another test completeness.
pushed another change because i didn't like important logic buried in a cache method any ways

Copy link
Contributor

@simitt simitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for one comment related to removing sourrinding " of etags I only left nitpick comments.

func (c *cache) fetchAndAdd(q Query, fn func(Query) (*Doc, error)) (doc *Doc, err error) {
id := q.ID()

func (c *cache) fetch(q Query, fetch func() (Result, error)) (Result, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand from your arguments in an offline discussion, you are in favor of not using abbreviations, so please use query here instead of q.

if !c.logger.IsDebug() {
return
if c.logger.IsDebug() {
c.logger.Debugf("Cache size %v. Added ID %v.", c.gocache.ItemCount(), result.Source.Etag)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging result.Source.Etag is not correct anymore.

return nil, found
}
return val.(*Doc), found
func authorized(isRum bool, agent string) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's only used in tests now

if err != nil {
return nil, err
}
func (f *Fetcher) request(r io.Reader) ([]byte, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove empty line 72

func (f *Fetcher) Fetch(query Query) (Result, error) {
req := func() (Result, error) {
result, error := newResult(f.request(convert.ToReader(query)))
return result, error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can directly say return newResult... now

if err := parse(settings, out, "", h); err != nil {
return nil, err
}
func (q Query) toString() string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer calling this ID as that's better describing what it does, and toString is usually used for a nicer presentation of an instance.

config_overrides = {
"logging_json": "true",
"kibana_enabled": "true",
"acm_cache_expiration": "1s",
}

@classmethod
def setUpClass(cls):
super(AgentConfigurationTest, cls).setUpClass()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

@@ -43,7 +49,10 @@ def create_service_config(self, settings, name, env=None, _id="new"):
)

def update_service_config(self, settings, _id, name, env=None):
return self.create_service_config(settings, name, env, _id=_id)
return self.create_service_config(settings, name, agent="python", env=env, _id=_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: No need for setting agent="python", it reflects the default.

tests/system/test_integration_acm.py Outdated Show resolved Hide resolved
etag = r1.headers["Etag"]

r2 = requests.get(self.rum_agent_config_url,
params={"service.name": service_name, "ifnonematch": etag.replace('"', '')},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is it necessary to remove surrounding " of etag? It should be possible to just use the etag that is returned by the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because the rum agent will actually send it without double quotes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But shouldn't the server be able to process with or without quotes? And why is the RUM agent sending without quotes (I probably missed that discussion somewhere).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approved the PR, as I haven't followed discussions about this. Please ensure then RUM agent properly removes the quotes from the response they get if you don't change it in the server.

@jalvz
Copy link
Contributor Author

jalvz commented Oct 11, 2019

Revisiting this question:

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

What do you actually meant with "break"?

I bumped the min version from 7.3 to 7.5, so the agents would get an error response code, and the should handle it, therefore, not applying any config

@jalvz jalvz added this to the 7.5 milestone Oct 11, 2019
@jalvz jalvz added the feature label Oct 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants