Agent Configuration in Kibana, GA #138

jalvz · 2019-08-23T13:38:16Z

eIs your feature request related to a problem? Please describe.

APM delivered Agent Configuration in Kibana as beta in 7.3, intentionally deferring some aspects for later phases. The goal of this issue is to pick up where we left and have a more compelling feature towards GA. We are aiming 7.5.

Two lines of work stand out:

Provide support for more settings.
Provide feedback to the user in Kibana.

Describe the solution you'd like

On the first point, ideally we add support for 2-3 more fields. Some reasonable candidates are:

ACTIVE/RECORDING (depends on [agents] definition of ACTIVE/DISABLED_INSTRUMENTATION #92)
CAPTURE_BODY
METRICS_INTERVAL
IGNORE_URLS
TRANSACTION_MAX_SPANS

~~- SPAN_FRAMES_MIN_DURATION~~

On the second point, a good start would be to show in Kibana whether some configuration was applied by some agent or not. For this, we could use the Etags sent from the agents to know what is the last good value they have. More precisely:

Kibana will be generating the Etags (either as content hashes or UIDs), instead of APM Server.
APM Server will simply pass the Etags from agents to Kibana back and forth.
Kibana will add a boolean field to the elasticsearch documents, initially false, and flip it to true when it receives an Etag from any one agent matching the Etag in the document.
That boolean indicates to the UI that some agent is up to date. When the user changes some setting in Kibana, it resets the boolean.

This approach entails a feedback delay of up to 2 times the POLL_INTERVAL, which is acceptable.

One downside is that if some agents succeed and others fail, those failures would be silently ignored. This is (hopefully!) an unlikely scenario.

Another downside is that it is not possible for Server/Kibana to distinguish failure from missing (agent never querying) unless we keep track of how many agents are around. This would add significant complexity. A workaround is trough documentation, warning users that if they don't see feedback in 2xPOLL_INTERVAL seconds it is probably because something went wrong.

Another option is that agents send the ephemeral_id upstream, and Kibana shows a count of agents that applied the last configuration based on the ephemeral ids (or maybe even show the ids themselves). This requires agents to calculate/generate their ephemeral_id. Alternatively, agents can use their IP address.

Describe alternatives you've considered

We could come up with a way to aggregate data coming from different agents. For instance:

If all agents "so far" report success, show a "success" visual indicator the UI.
If all agents "so far" report failure, show a "failed" visual indicator the UI.
If some report success and some report failure, show both indicators.
- We could further help users to dig into which ones failed via the aforementioned ephemeral_id, IP address, or similar.

This would require agents to send data to apm-server (probably to the same endpoint) and define a new schema. A new schema would also allows us to show more information than a boolean, eg: timestamp of config application, error messages, etc.

I suggest to not introduce a new data model until we know more how the feature is used, what problems users run into, how/why exactly agents might fail to apply configuration, etc.

We also need to decide if we want to support RUM in GA, later, or never. User feedback is probably not practical for the RUM agent.

Note that at the moment Agent Configuration in Kibana is unusable for customers with a Distributed Tracing setup, as the only available setting is SAMPLING_RATE, which in case of DT will be dictated by the RUM agent (almost always).

RUM status is tracked in elastic/apm-agent-rum-js#253

Finally, during the first design we briefly touched on auditing. Some sort of audit logs could be achieved eg. by creating one elasticsearch document per update, and adding information about the user that did that change. However if we do not plan to add an UI interface for it, it might be enough to simply log configuration updates in Kibana.

Implementation issues

See background in #4 and #76

The text was updated successfully, but these errors were encountered:

watson · 2019-08-23T13:54:54Z

Regarding which config options to support, I just want to mention that of the ones listed above, IGNORE_URLS and SPAN_FRAMES_MIN_DURATION are not yet supported by all the agents. I think it's pretty trivial to add support for IGNORE_URLS across the board, but SPAN_FRAMES_MIN_DURATION is a little trickier to support in async languages. Currently, we don't have support for it in Node.js and we, unfortunately, don't have a path forward for how to support it.

Whether or not all agents support a given config option might of course not be a dealbreaker for using it in the central configuration. But if we do, do we know how it would look if say the Node.js agent tries and fails to set SPAN_FRAMES_MIN_DURATION?

jalvz · 2019-08-23T15:22:25Z

Whether or not all agents support a given config option might of course not be a dealbreaker

Agreed, but given the range of options available is probably better to maximize utility by choosing settings that all agents support.

Thanks!

axw · 2019-08-26T03:01:45Z

I'd opt for:

CAPTURE_BODY, since it could be useful to turn it on while debugging
RECORDING (no particular reason, could be useful to disable in case the agent/app is on fire)

Not sure about IGNORE_URLS. Maybe it's just me, but I'd define that closer to the application (i.e. using env vars or a config file) rather than closer to Kibana/ES, since it's relatively static and tied to the app's routes.

Also, not sure about METRICS_INTERVAL or TRANSACTION_MAX_SPANS. How often do we expect people to use these? Is it likely that anyone would want to change them on the fly?

graphaelli · 2019-08-26T20:45:42Z

Is it likely that anyone would want to change them on the fly?

I could see users eventually using only central configuration to push agent configurations eventually so the more options available to prevent that, the better IMO, regardless of whether they seem useful to change often.

jalvz · 2019-08-28T14:31:31Z

After discussion this morning with @sqren we concluded that it makes sense that Kibana keeps track of Etags instead (right now they are generated by apm-server).
I Updated the proposal accordingly.

As per the suggested fields, is there any one not in my initial proposal that makes more sense to consider now?

felixbarny · 2019-08-28T15:00:04Z

Any configuration which is marked as reloadable would be interesting: https://docs.google.com/spreadsheets/d/1JJjZotapacA3FkHc2sv_0wiChILi3uKnkwLTjtBmxwU

sorenlouv · 2019-08-30T09:15:27Z

For anyone interested: I created an issue for targeting multiple services/environments with a single configuration: elastic/kibana#44475.

jalvz · 2019-09-06T12:14:43Z

@elastic/apm-agent-devs you can link your implementation issue above

watson · 2019-09-06T12:18:42Z

Don't we need to first solve #92 before we can continue with this one?

jalvz · 2019-09-06T12:22:46Z

Sure. What I meant is that when you have an implementation issue, you can link it above.
Also: if 7.5 is not realistic (considering at least one agent should be done by FF), this is the time to speak - so the rest of the teams can plan accordingly.

beniwohli · 2019-09-06T12:50:06Z

I'm a bit unclear what needs to be done. Is there a final list of configuration options that we should implement?

felixbarny · 2019-09-09T10:05:39Z

The Java agent has generic support for any configuration option marked as dynamic, including span_frames_min_duration, for example.

axw · 2019-09-12T08:03:41Z

Given that recording isn't yet a thing, and span_frames_min_duration isn't support by the Node.js agent, how about we go with these next:

CAPTURE_BODY
IGNORE_URLS
TRANSACTION_MAX_SPANS

beniwohli · 2019-09-12T08:38:52Z

IGNORE_URLS seems to be somewhat problematic IMO, since how the agents interpret the value varies wildly:

Java and Go: a list of wildcard matchers
Node: A list of items that can be either regexes or strings
Ruby: a list of regexes. Also, the option is called ignore_url_patterns
Python: A list of regexes. Also, the option is called transaction_ignore_parrerns.
then I stopped...

That looks mightily confusing for our users. I think we first need to harmonize on a common name and semantics. Should I open a new issue for that?

axw · 2019-09-12T09:32:30Z

@beniwohli 😩 yes please

beniwohli · 2019-09-12T09:55:52Z

#144

formgeist · 2019-09-19T18:11:40Z

Just a nudge; we're awaiting which options we're aiming at adding in for upcoming release to the config UI. Are we going with what @axw suggested above?

beniwohli · 2019-09-19T18:56:38Z

@formgeist we decided in yesterday's meeting on CAPTURE_BODY and TRANSACTION_MAX_SPANS, but then noticed that there are some small differences in the interpretation of the latter setting. I think we should be able to sort that out though, so we should be good with those two.

Note that CAPTURE_BODY can have four discrete values, errors, transactions, all and off. It would be great if the user would be limited to those values, maybe a select box or similar.

formgeist · 2019-09-19T19:02:47Z

@beniwohli Excellent! Will we need any specific validation needed on the TRANSACTION_MAX_SPANS?

beniwohli · 2019-09-19T19:42:32Z

Hah, that's where there are some discrepancies amongst the agents. Some have special meanings for 0 and -1.

Can we go with >=-1 and possibly change it to >0 in a few days when us agent devs resolved our disagreement, possibly in a deathmatch?

formgeist · 2019-09-20T07:32:50Z

Sure, it was only to get some ideas around how we should validate that field. Thanks! I'll move forward with this and get the design in order for implementation. We can sort out the validation details when you folks have made up your minds 😉

beniwohli · 2019-09-20T13:55:46Z

Just a note, both of these options don't apply for RUM.

axw · 2019-09-27T01:44:37Z

@formgeist the deathmatch happened, and this is where we landed: all backend agents will treat TRANSACTION_MAX_SPANS=0 the same (it means "don't create any spans"), but we won't align on negative values for now. The UI should disallow negative values.

formgeist · 2019-09-27T07:10:00Z

@axw Thanks for the update, I'll make sure to update the description.

jalvz · 2019-09-27T13:15:23Z

I had a chat with @sqren and we concluded that now is a good time to allow users to select "all services" for which to apply a configuration. This config will be stored with an empty service name in Elasticsearch.

When querying config, if there is a conflict, the most explicit one (ie with no empty service ) takes precedence. When new services are added, they will be affected by the "all services" config (if it exists) right away.

No changes are required for agents.

felixbarny · 2019-09-27T13:37:09Z

Have you thought about merging configurations?

Example:
service: "" (all services)

transaction_sample_rate=0.1

service: opbeans-java

transaction_max_spans=100

service: opbeans-go

transaction_sample_rate=1

Effective configuration
service: opbeans-java

transaction_sample_rate=0.1 # merged
transaction_max_spans=100

service: opbeans-go

transaction_sample_rate=1  # overridden

A while ago we also discussed allowing to create different configurations blocks for specific service.environments or other metadata fields like global labels.

All of that is ofc. something we can add later as well. I'm just curious 🙂

sorenlouv · 2019-09-30T08:05:40Z

@felixbarny Have you thought about merging configurations?

Yes, we did talk about it. We need to think more about it to avoid it becoming very confusing to users - so it won't happen this release.

I did play around with it though and one thing I couldn't figure out how to handle was the "applied flag". In this version we are adding the applied flag, so that agents can report back which configuration they have applied, and the user will be able to see in the UI which configurations have been applied, and which have not.

However, if we merge configurations the agent will not apply a single configuration but multiple. Which configuration should the UI then indicate as being applied? One or all of them? Or should the apply mechanism not be on a configuration level but on a field level? (that would require the agent to report back which configuration id it got each field from...)

beniwohli · 2019-09-30T08:16:47Z

@sqren by "this version" you mean 7.5? Also, has the method the agents report back which configuration they applied already been spec'd out somewhere? I couldn't find it on a cursory search.

sorenlouv · 2019-09-30T09:04:33Z

by "this version" you mean 7.5?

Yes, 7.5.

Also, has the method the agents report back which configuration they applied already been spec'd out somewhere?

Yes, it's in the description of this issue. Agents don't have to do anything if they already sent back the etag they receive from apm-server.

SergeyKleyman · 2019-10-26T08:52:04Z

Regarding CAPTURE_BODY for the next batch of central config options: there's an accompanying option CAPTURE_BODY_CONTENT_TYPES - wouldn't it be reasonable to make it part of central config as well?

mikker · 2019-10-26T14:30:30Z

the Ruby agent doesn't have CAPTURE_BODY_CONTENT_TYPES, I think it's just an implementation detail of the Java agent?

eyalkoren · 2019-10-27T14:24:17Z

Not really an implementation detail, it is just a whitelist of content types which is made configurable (not required- has a default).
Speaking of this, maybe the next enhancement for the central config should allow for arbitrary config options as free text. This will not only solve the problem of those agent-type-specific configurations, but will also remove the delivery limitations. The Java agent would then automatically support all config options that are implemented as dynamic.

beniwohli · 2019-10-28T09:54:49Z

To be honest, given the experience we had with just agreeing on name and behavior of two config options, I'm against a free-text configuration system. Before that can happen, we need a systematic review of all common config options, and ensure that all agents agree on common behavior. Otherwise, the user experience will be terrible for people who use more than one language.

I suggest that we set up a working group for this, with 3 people from 3 different agents. For each config option, the group members write up the exact behavior of the option, and compare notes. If all three agree, it's a good indicator that it's the correct behavior, and should be put in the spec. If not, the members can then discuss how a common spec for that option should look, and suggest a spec to the wider agent team.

For future config options, a spec should be written from the beginning.

WDYT?

jalvz added the design label Sep 6, 2019

beniwohli mentioned this issue Sep 13, 2019

add tests to check update-ability of transaction_max_spans and capture_body elastic/apm-agent-python#589

Merged

eyalkoren mentioned this issue Oct 3, 2019

Add new remote config options elastic/apm-agent-java#871

Closed

SergeyKleyman mentioned this issue Oct 4, 2019

Add support for the next batch of central config options (CAPTURE_BODY and TRANSACTION_MAX_SPANS) elastic/apm-agent-dotnet#534

Closed

axw mentioned this issue Oct 8, 2019

Add central config support for CAPTURE_BODY, TRANSACTION_MAX_SPANS elastic/apm-agent-go#647

Closed

jalvz mentioned this issue Nov 15, 2019

[docs] Document Agent Config GA #168

Closed

axw mentioned this issue Nov 29, 2019

feat(config): extend central config support elastic/apm-agent-nodejs#1555

Merged

3 tasks

jalvz closed this as completed Dec 5, 2019

lreuven mentioned this issue Feb 20, 2020

Enable Remote Config for all dynamic parameters. #213

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Agent Configuration in Kibana, GA #138

Agent Configuration in Kibana, GA #138

jalvz commented Aug 23, 2019 •

edited by hmdhk

Loading

watson commented Aug 23, 2019

jalvz commented Aug 23, 2019 •

edited

Loading

axw commented Aug 26, 2019

graphaelli commented Aug 26, 2019

jalvz commented Aug 28, 2019

felixbarny commented Aug 28, 2019

sorenlouv commented Aug 30, 2019 •

edited

Loading

jalvz commented Sep 6, 2019

watson commented Sep 6, 2019

jalvz commented Sep 6, 2019

beniwohli commented Sep 6, 2019

felixbarny commented Sep 9, 2019

axw commented Sep 12, 2019

beniwohli commented Sep 12, 2019

axw commented Sep 12, 2019

beniwohli commented Sep 12, 2019

formgeist commented Sep 19, 2019

beniwohli commented Sep 19, 2019

formgeist commented Sep 19, 2019

beniwohli commented Sep 19, 2019

formgeist commented Sep 20, 2019

beniwohli commented Sep 20, 2019

axw commented Sep 27, 2019 •

edited

Loading

formgeist commented Sep 27, 2019

jalvz commented Sep 27, 2019

felixbarny commented Sep 27, 2019

sorenlouv commented Sep 30, 2019

beniwohli commented Sep 30, 2019

sorenlouv commented Sep 30, 2019 •

edited

Loading

SergeyKleyman commented Oct 26, 2019

mikker commented Oct 26, 2019

eyalkoren commented Oct 27, 2019

beniwohli commented Oct 28, 2019

Agent Configuration in Kibana, GA #138

Agent Configuration in Kibana, GA #138

Comments

jalvz commented Aug 23, 2019 • edited by hmdhk Loading

Kibana

Server

Agents

watson commented Aug 23, 2019

jalvz commented Aug 23, 2019 • edited Loading

axw commented Aug 26, 2019

graphaelli commented Aug 26, 2019

jalvz commented Aug 28, 2019

felixbarny commented Aug 28, 2019

sorenlouv commented Aug 30, 2019 • edited Loading

jalvz commented Sep 6, 2019

watson commented Sep 6, 2019

jalvz commented Sep 6, 2019

beniwohli commented Sep 6, 2019

felixbarny commented Sep 9, 2019

axw commented Sep 12, 2019

beniwohli commented Sep 12, 2019

axw commented Sep 12, 2019

beniwohli commented Sep 12, 2019

formgeist commented Sep 19, 2019

beniwohli commented Sep 19, 2019

formgeist commented Sep 19, 2019

beniwohli commented Sep 19, 2019

formgeist commented Sep 20, 2019

beniwohli commented Sep 20, 2019

axw commented Sep 27, 2019 • edited Loading

formgeist commented Sep 27, 2019

jalvz commented Sep 27, 2019

felixbarny commented Sep 27, 2019

sorenlouv commented Sep 30, 2019

beniwohli commented Sep 30, 2019

sorenlouv commented Sep 30, 2019 • edited Loading

SergeyKleyman commented Oct 26, 2019

mikker commented Oct 26, 2019

eyalkoren commented Oct 27, 2019

beniwohli commented Oct 28, 2019

jalvz commented Aug 23, 2019 •

edited by hmdhk

Loading

jalvz commented Aug 23, 2019 •

edited

Loading

sorenlouv commented Aug 30, 2019 •

edited

Loading

axw commented Sep 27, 2019 •

edited

Loading

sorenlouv commented Sep 30, 2019 •

edited

Loading