agent config GA #2747

jalvz · 2019-09-30T14:55:13Z

fixes #2694, fixes #2630, fixes #2646

required by elastic/apm-agent-rum-js#439

codecov-io · 2019-10-02T03:01:12Z

Codecov Report

Merging #2747 into master will decrease coverage by 1.36%.
The diff coverage is 82.82%.

@@            Coverage Diff            @@
##           master   #2747      +/-   ##
=========================================
- Coverage   80.16%   78.8%   -1.37%     
=========================================
  Files          83      82       -1     
  Lines        4589    4288     -301     
=========================================
- Hits         3679    3379     -300     
+ Misses        910     909       -1

Impacted Files	Coverage Δ
beater/middleware/rate_limit_middleware.go	`0% <0%> (ø)`	⬆️
beater/middleware/rum_middleware.go	`0% <0%> (ø)`
kibana/connecting_client.go	`40% <0%> (ø)`	⬆️
beater/middleware/cors_middleware.go	`100% <100%> (ø)`	⬆️
beater/api/config/agent/handler.go	`94.82% <100%> (+0.38%)`	⬆️
beater/api/intake/handler.go	`100% <100%> (ø)`	⬆️
beater/request/context.go	`85.96% <100%> (+0.25%)`	⬆️
sourcemap/mapper.go	`87.23% <100%> (ø)`	⬆️
agentcfg/fetch.go	`88.57% <100%> (+0.69%)`	⬆️
beater/api/mux.go	`69.07% <72.72%> (-1.9%)`	⬇️
... and 9 more

jalvz · 2019-10-02T11:54:12Z

failing test can't be fixed until Kibana side is merged, as Kibana returns 400 upon unknown attributes (in this case etag)
cc @sqren

jalvz · 2019-10-03T12:44:29Z

depends on elastic/kibana#46995

simitt

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

simitt · 2019-10-04T07:40:31Z

agentcfg/model.go

-func dotKey(k1, k2 string) string {
-	if k1 == "" {
-		return k2
+// ZeroResult creates a Result struct


No need for exporting this.

simitt · 2019-10-04T07:45:06Z

beater/api/config/agent/handler.go

+	return c.Request.URL.Query().Get(agentcfg.Etag)
+}
+
+func sanitize(isRum bool, settings agentcfg.Settings) agentcfg.Settings {


The agentcfg package has knowledge about RUM, and the Query has an IsRum flag. IMO this sanitize method should be part of fetching the result from the agentcfg package.

agentcfg/model.go

simitt · 2019-10-04T08:05:22Z

agentcfg/cache.go

-	if !c.logger.IsDebug() {
-		return
+	if !authorized(q.IsRum, result.Source.Agent) {
+		return ZeroResult(), nil


With this implementation non authorized RUM results are not cached. IMO this should be changed.

I don't think so. If the same request then comes through the right endpoint, the result would be a wrong empty result.

Kibana is responding the same way to this query, no matter who requests it, right? So I don't see why caching the result here would lead to wrong results.
Of course I did not mean caching the ZeroResult, but the actual result.

In that case I don't know why I would want to cache something that I am not even returning, but I'm fine either way.

My reasoning is the following:
What is the purpose of caching here?
To avoid unnecessary calls to Kibana for legitimate requests from agents. But since those calls can also be triggered through the unsecured RUM endpoint, the cache should also prevent Kibana from being overloaded by any kind of attack.
The authorized check only returns false if an agent has a bug or the request is malicious. For malicious requests it makes sense to also avoid sending the same request to Kibana over and over.

However, if you add it to the cache, you would probably also need to size limit the cache, as otherwise you open up another attack vector.

I don't see why the decision about when something should be added to the cache is directly related to whether or not you are returning this information.

Not sure whether you'll find this useful, but I gave this some thought when introducing the caching #2220 (comment).

Thanks.

I am changing the caching key to be service name + env, instead of Etag.
When there are many RUM agents, they will not send an ifnonematch arg until they get one from a previous call, which means that all of them would have to hit Kibana at least once.
Caching the service name solves this problem.

On the other hand, I don't think it is necessary to have a cache limit.
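The cache-key change described in this comment can be illustrated with a minimal Go sketch. All names here (`query`, `cacheKey`, `configCache`, `upstreamCalls`) are hypothetical stand-ins and not the PR's actual types; the sketch only shows why a key built from service name and environment lets agents that have no etag yet share one cached entry. It is not safe for concurrent use.

```go
package main

import "fmt"

// query is a hypothetical stand-in for the request an agent sends.
type query struct {
	serviceName string
	serviceEnv  string
	etag        string // empty until the agent has received a first response
}

// cacheKey deliberately leaves out the etag, so requests with and without
// an etag for the same service map to the same entry.
type cacheKey struct {
	name string
	env  string
}

// configCache is a toy cache; upstreamCalls counts simulated Kibana calls.
type configCache struct {
	entries       map[cacheKey]string
	upstreamCalls int
}

func newConfigCache() *configCache {
	return &configCache{entries: make(map[cacheKey]string)}
}

// fetch returns the cached value for the service, calling upstream only
// when no entry exists for that service name and environment.
func (c *configCache) fetch(q query, upstream func(query) string) string {
	key := cacheKey{name: q.serviceName, env: q.serviceEnv}
	if cfg, ok := c.entries[key]; ok {
		return cfg
	}
	c.upstreamCalls++
	cfg := upstream(q)
	c.entries[key] = cfg
	return cfg
}

func main() {
	c := newConfigCache()
	kibana := func(q query) string { return "settings for " + q.serviceName }

	c.fetch(query{serviceName: "web"}, kibana)              // first agent, no etag yet
	c.fetch(query{serviceName: "web", etag: "abc"}, kibana) // later call with an etag

	fmt.Println("upstream calls:", c.upstreamCalls)
}
```

With an etag-based key, the two calls above would produce two different keys and therefore two upstream calls.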

agentcfg/fetch.go

simitt · 2019-10-04T08:25:56Z

beater/api/mux.go

@@ -167,6 +177,7 @@ func backendMiddleware(cfg *config.Config, m map[request.ResultID]*monitoring.In

 func rumMiddleware(cfg *config.Config, m map[request.ResultID]*monitoring.Int) []middleware.Middleware {
 	return append(apmMiddleware(m),
+		middleware.SetRumFlagMiddleware(),
 		middleware.SetRateLimitMiddleware(cfg.RumConfig.EventRate),
 		middleware.CORSMiddleware(cfg.RumConfig.AllowOrigins),
 		middleware.KillSwitchMiddleware(cfg.RumConfig.IsEnabled()))


It still bothers me that all of these middleware functions are run if someone has RUM disabled; as a user I'd expect the KillSwitchMiddleware to be run before the CORS and rate limiter middleware at least.

I actually think we should do the CORS dance properly and return to the browsers the response they expect; otherwise the browser might show an error along the lines of "domain not allowed" when the root cause is something else.

I lack knowledge about the standard behavior of browsers with regard to CORS. If the advantage is worth the potential additional cost, I'm OK with it.
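The ordering concern in this thread can be made concrete with a small, generic Go sketch. The `handler`, `middleware`, `record`, `killSwitch`, `wrap`, and `run` names are all hypothetical, and the sketch makes no claim about how apm-server's own wrapping function orders its chain; it assumes a `wrap` in which the first middleware in the list is the outermost one.

```go
package main

import "fmt"

// handler and middleware are simplified, hypothetical types; the trace
// slice records which steps ran and in which order.
type handler func(trace *[]string)
type middleware func(handler) handler

// record returns a middleware that notes its name and always continues.
func record(name string) middleware {
	return func(next handler) handler {
		return func(trace *[]string) {
			*trace = append(*trace, name)
			next(trace)
		}
	}
}

// killSwitch stops the chain when the feature is disabled.
func killSwitch(enabled bool) middleware {
	return func(next handler) handler {
		return func(trace *[]string) {
			*trace = append(*trace, "killswitch")
			if enabled {
				next(trace)
			}
		}
	}
}

// wrap applies middlewares so that ms[0] is the outermost and runs first
// (an assumption of this sketch).
func wrap(h handler, ms ...middleware) handler {
	for i := len(ms) - 1; i >= 0; i-- {
		h = ms[i](h)
	}
	return h
}

// run executes a chain around a final handler and returns the trace.
func run(ms ...middleware) []string {
	var trace []string
	final := func(trace *[]string) { *trace = append(*trace, "handler") }
	wrap(final, ms...)(&trace)
	return trace
}

func main() {
	// Kill switch first: nothing else runs when the feature is disabled.
	fmt.Println(run(killSwitch(false), record("ratelimit"), record("cors")))
	// Kill switch last: rate limiting and CORS run before the request is rejected.
	fmt.Println(run(record("ratelimit"), record("cors"), killSwitch(false)))
}
```

The second ordering is the one that lets a browser complete the CORS exchange before being told the endpoint is disabled, at the cost of running the earlier steps.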

simitt · 2019-10-04T08:26:48Z

beater/middleware/rate_limit_middleware.go

@@ -31,7 +31,7 @@ func SetRateLimitMiddleware(cfg *config.EventRate) Middleware {

 	return func(h request.Handler) (request.Handler, error) {
 		return func(c *request.Context) {
-			c.RateLimiter = store
+			c.RateLimiter = store.ForIP(c.Request)


Can you change the name to include IP now?

Ha! I was about to do that, but then I thought you would reply with something like this: #2706 (comment)

Not sure what the criteria are now, but I will change it, as I agree.

I don't follow. The logic that the rate limiter limits by IP is now in the middleware. It was not before. So now it makes sense to reflect it in the middleware name.

As clarified offline, this is a comment on L29, not L34.
Changed in 0788847

jalvz · 2019-10-07T11:12:05Z

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

they will just not get any config updates, so they won't apply it

simitt

The sanitize method is based on the query's IsRum flag, while the cache key is based on the Service attributes. You call sanitize before adding the value to the cache. If a request is made for a backend service via the RUM endpoint and then again via the backend endpoint, this will lead to wrong results.

The authorized check occurs only for non-cached information. As soon as information is cached (keyed on service attributes, ignoring RUM vs. backend), it is just returned. If an unauthorized RUM request follows an authorized backend request, this will result in an unauthorized response.
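One way to avoid both problems is to cache the raw upstream result and apply the per-request check on every return path, cached or not. The Go sketch below illustrates that idea under stated assumptions: `result`, `fetcher`, and `newFetcher` are hypothetical, and the rule inside `authorized` (RUM requests may only see configs for an agent named `rum-js`) is invented for the example; only the `authorized(isRum, agent)` signature is taken from the diff.

```go
package main

import "fmt"

// result is a hypothetical stand-in for a fetched configuration.
type result struct {
	agent    string
	settings map[string]string
}

// authorized uses an invented rule for illustration: backend requests may
// see everything, RUM requests only configs created for the "rum-js" agent.
func authorized(isRum bool, agent string) bool {
	if !isRum {
		return true
	}
	return agent == "rum-js"
}

// fetcher caches raw upstream results per service. Not safe for concurrent use.
type fetcher struct {
	cache    map[string]result
	upstream func(service string) result
}

func newFetcher(upstream func(string) result) *fetcher {
	return &fetcher{cache: make(map[string]result), upstream: upstream}
}

// fetch stores the unfiltered result in the cache and applies the
// authorization check on every call, so a cached entry cannot leak to an
// unauthorized caller and an unauthorized call cannot poison the cache.
func (f *fetcher) fetch(service string, isRum bool) result {
	res, ok := f.cache[service]
	if !ok {
		res = f.upstream(service)
		f.cache[service] = res
	}
	if !authorized(isRum, res.agent) {
		return result{}
	}
	return res
}

func main() {
	f := newFetcher(func(string) result {
		return result{agent: "python", settings: map[string]string{"transaction_sample_rate": "0.5"}}
	})

	backend := f.fetch("opbeans", false) // authorized, fills the cache
	rum := f.fetch("opbeans", true)      // served from cache, but filtered

	fmt.Println(len(backend.settings), len(rum.settings))
}
```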

On the other hand, I don't think it is necessary to have a cache limit.

From what I see, the cache can potentially grow infinitely with requests for non-existing services. Since we had a bit of an offline discussion regarding caching, please describe why you decided a cache limit is not necessary.
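To make the growth concern concrete, here is a minimal sketch of a size-limited cache with first-in, first-out eviction. It is independent of whatever cache library the PR uses; `boundedCache`, `newBoundedCache`, `add`, and `get` are hypothetical names, and FIFO is chosen only because it is the shortest policy to show. It is not safe for concurrent use.

```go
package main

import "fmt"

// boundedCache keeps at most limit entries and evicts the oldest one
// when a new key arrives and the cache is full.
type boundedCache struct {
	limit int
	order []string // insertion order, oldest first
	items map[string]string
}

func newBoundedCache(limit int) *boundedCache {
	if limit < 1 {
		limit = 1 // guard so that add never indexes an empty order slice
	}
	return &boundedCache{limit: limit, items: make(map[string]string)}
}

// add inserts or updates a key, evicting the oldest entry when full.
func (c *boundedCache) add(key, value string) {
	if _, ok := c.items[key]; ok {
		c.items[key] = value
		return
	}
	if len(c.order) >= c.limit {
		oldest := c.order[0]
		c.order = c.order[1:]
		delete(c.items, oldest)
	}
	c.order = append(c.order, key)
	c.items[key] = value
}

func (c *boundedCache) get(key string) (string, bool) {
	v, ok := c.items[key]
	return v, ok
}

func main() {
	c := newBoundedCache(100)
	// Simulate many requests for services that do not exist.
	for i := 0; i < 10000; i++ {
		c.add(fmt.Sprintf("no-such-service-%d", i), "")
	}
	fmt.Println("entries:", len(c.items))
}
```

An expiration time on entries bounds how long each one lives, but not how many can exist at once; a size limit like the one above bounds the count.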

jalvz · 2019-10-08T10:57:17Z

You call sanitize before adding the value to the cache

Not true, as far as I can see.

The authorized check occurs only for non-cached information

Right, will update.

the cache can potentially grow infinitely with requests for non-existing services

Let's talk about that offline.

simitt · 2019-10-08T11:04:54Z

You call sanitize before adding the value to the cache

Not true, as far as I can see.

Within the cache.fetch you call fetch(), which leads to running the sanitize. The returned, sanitized result is then added to the cache.

jalvz · 2019-10-08T11:24:09Z

This test 8bbae34#diff-6756162984ecc44fbead19b809440eaeR305 covers that scenario and passes locally, I have to see what's up

jalvz · 2019-10-08T13:22:23Z

It always worked because each handler has its own cache; added another test for completeness.
Pushed another change because I didn't like important logic buried in a cache method anyway.

tests/system/test_integration_acm.py

simitt

Except for one comment related to removing the surrounding " of etags, I only left nitpick comments.

simitt · 2019-10-10T09:20:31Z

agentcfg/cache.go

-func (c *cache) fetchAndAdd(q Query, fn func(Query) (*Doc, error)) (doc *Doc, err error) {
-	id := q.ID()
-
+func (c *cache) fetch(q Query, fetch func() (Result, error)) (Result, error) {


As I understand from your arguments in an offline discussion, you are in favor of not using abbreviations, so please use query here instead of q.

simitt · 2019-10-10T09:22:21Z

agentcfg/cache.go

-	if !c.logger.IsDebug() {
-		return
+	if c.logger.IsDebug() {
+		c.logger.Debugf("Cache size %v. Added ID %v.", c.gocache.ItemCount(), result.Source.Etag)


logging result.Source.Etag is not correct anymore.

simitt · 2019-10-10T09:35:49Z

agentcfg/cache.go

-		return nil, found
-	}
-	return val.(*Doc), found
+func authorized(isRum bool, agent string) bool {


That's only used in tests now

simitt · 2019-10-10T09:37:13Z

agentcfg/fetch.go

-	if err != nil {
-		return nil, err
-	}
+func (f *Fetcher) request(r io.Reader) ([]byte, error) {


please remove empty line 72

simitt · 2019-10-10T09:39:59Z

agentcfg/fetch.go

+func (f *Fetcher) Fetch(query Query) (Result, error) {
+	req := func() (Result, error) {
+		result, error := newResult(f.request(convert.ToReader(query)))
+		return result, error


you can directly say return newResult... now

simitt · 2019-10-10T10:15:08Z

agentcfg/model.go

-	if err := parse(settings, out, "", h); err != nil {
-		return nil, err
-	}
+func (q Query) toString() string {


I'd prefer calling this ID as that's better describing what it does, and toString is usually used for a nicer presentation of an instance.

simitt · 2019-10-10T10:25:06Z

tests/system/test_integration_acm.py

    config_overrides = {
        "logging_json": "true",
        "kibana_enabled": "true",
        "acm_cache_expiration": "1s",
    }

+    @classmethod
+    def setUpClass(cls):
+        super(AgentConfigurationTest, cls).setUpClass()


Why is this necessary?

simitt · 2019-10-10T10:25:53Z

tests/system/test_integration_acm.py

@@ -43,7 +49,10 @@ def create_service_config(self, settings, name, env=None, _id="new"):
        )

    def update_service_config(self, settings, _id, name, env=None):
-        return self.create_service_config(settings, name, env, _id=_id)
+        return self.create_service_config(settings, name, agent="python", env=env, _id=_id)


nit: No need for setting agent="python", it reflects the default.

simitt · 2019-10-10T10:30:27Z

tests/system/test_integration_acm.py

+        etag = r1.headers["Etag"]
+
+        r2 = requests.get(self.rum_agent_config_url,
+                          params={"service.name": service_name, "ifnonematch": etag.replace('"', '')},


why is it necessary to remove surrounding " of etag? It should be possible to just use the etag that is returned by the server.

Because the RUM agent will actually send it without double quotes.

But shouldn't the server be able to process with or without quotes? And why is the RUM agent sending without quotes (I probably missed that discussion somewhere).

I approved the PR, as I haven't followed discussions about this. Please ensure that the RUM agent properly removes the quotes from the response it gets if you don't change it in the server.
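If the server were to accept the etag with or without quotes, as asked in this thread, the comparison could look roughly like the sketch below. `normalizeEtag` and `notModified` are hypothetical helpers, not functions from the PR, and the sketch ignores weak validators (`W/"..."`) entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeEtag strips surrounding whitespace and double quotes so that
// `"abc"` and `abc` compare as equal.
func normalizeEtag(etag string) string {
	return strings.Trim(strings.TrimSpace(etag), `"`)
}

// notModified reports whether the client's ifnonematch value refers to the
// current etag. An empty client value never matches.
func notModified(ifNoneMatch, current string) bool {
	client := normalizeEtag(ifNoneMatch)
	return client != "" && client == normalizeEtag(current)
}

func main() {
	current := `"7d1e"` // as a server might return it in the Etag header

	fmt.Println(notModified(`"7d1e"`, current)) // quoted, as returned by the server
	fmt.Println(notModified(`7d1e`, current))   // unquoted, as the RUM agent sends it
	fmt.Println(notModified(``, current))       // first request, no etag yet
}
```

With this kind of normalization the test would not need `etag.replace('"', '')`, since either form would be treated the same.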

jalvz · 2019-10-11T10:38:58Z

Revisiting this question:

You explained that GA changes will be breaking. Does that mean that the agents using the beta implementation would break when talking to APM Server or would they just stop applying ACM?

What did you actually mean by "break"?

I bumped the min version from 7.3 to 7.5, so the agents would get an error response code, and they should handle it, therefore not applying any config.

jalvz force-pushed the agent-config-ga branch 4 times, most recently from 0de5ad6 to 00af2db Compare October 1, 2019 11:36

jalvz force-pushed the agent-config-ga branch from 00af2db to 9001182 Compare October 2, 2019 12:37

jalvz added the [zube]: In Progress label Oct 2, 2019

jalvz force-pushed the agent-config-ga branch from 9001182 to 23dc36a Compare October 3, 2019 11:27

jalvz added [zube]: In Review and removed [zube]: In Progress labels Oct 3, 2019

jalvz force-pushed the agent-config-ga branch from 23dc36a to b588430 Compare October 3, 2019 12:55

jalvz mentioned this pull request Oct 3, 2019

use real agent name for rum events #2763

Merged

simitt reviewed Oct 4, 2019

simitt requested changes Oct 8, 2019

jalvz force-pushed the agent-config-ga branch from 8155615 to 8bbae34 Compare October 8, 2019 10:42

simitt reviewed Oct 8, 2019

simitt reviewed Oct 10, 2019

simitt approved these changes Oct 10, 2019

jalvz force-pushed the agent-config-ga branch from 07e4e65 to 520fc45 Compare October 11, 2019 08:25

jalvz added this to the 7.5 milestone Oct 11, 2019

jalvz added the feature label Oct 11, 2019

jalvz force-pushed the agent-config-ga branch from 8f94157 to 3ced356 Compare October 14, 2019 07:08

jalvz added [zube]: Blocked and removed [zube]: In Review labels Oct 14, 2019

agent config GA

fe651fd

jalvz force-pushed the agent-config-ga branch from 3ced356 to 5e0a3b2 Compare October 15, 2019 07:23

jalvz added 2 commits October 15, 2019 09:24

changelog

d2fae79

update gitignore

e62788f

jalvz force-pushed the agent-config-ga branch from 5e0a3b2 to e62788f Compare October 15, 2019 07:25

jalvz merged commit 8949faa into elastic:master Oct 15, 2019

zube bot added [zube]: Done and removed [zube]: Blocked labels Oct 15, 2019

jalvz mentioned this pull request Oct 15, 2019

[7.x] agent config GA #2800

Merged

vigneshshanmugam mentioned this pull request Oct 15, 2019

feat(rum-core): use etag for fetching config elastic/apm-agent-rum-js#439

Merged

jalvz mentioned this pull request Oct 17, 2019

[7.x] update gitignore #2821

Merged
