Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Agent Configuration in Kibana, GA #138

Closed
2 of 7 tasks
jalvz opened this issue Aug 23, 2019 · 33 comments
Closed
2 of 7 tasks

Agent Configuration in Kibana, GA #138

jalvz opened this issue Aug 23, 2019 · 33 comments
Labels

Comments

@jalvz
Copy link
Contributor

jalvz commented Aug 23, 2019

eIs your feature request related to a problem? Please describe.

APM delivered Agent Configuration in Kibana as beta in 7.3, intentionally deferring some aspects for later phases. The goal of this issue is to pick up where we left and have a more compelling feature towards GA. We are aiming 7.5.

Two lines of work stand out:

  1. Provide support for more settings.

  2. Provide feedback to the user in Kibana.

Describe the solution you'd like

On the first point, ideally we add support for 2-3 more fields. Some reasonable candidates are:

- SPAN_FRAMES_MIN_DURATION

On the second point, a good start would be to show in Kibana whether some configuration was applied by some agent or not. For this, we could use the Etags sent from the agents to know what is the last good value they have. More precisely:

  • Kibana will be generating the Etags (either as content hashes or UIDs), instead of APM Server.
  • APM Server will simply pass the Etags from agents to Kibana back and forth.
  • Kibana will add a boolean field to the elasticsearch documents, initially false, and flip it to true when it receives an Etag from any one agent matching the Etag in the document.
  • That boolean indicates to the UI that some agent is up to date. When the user changes some setting in Kibana, it resets the boolean.

This approach entails a feedback delay of up to 2 times the POLL_INTERVAL, which is acceptable.

One downside is that if some agents succeed and others fail, those failures would be silently ignored. This is (hopefully!) an unlikely scenario.

Another downside is that it is not possible for Server/Kibana to distinguish failure from missing (agent never querying) unless we keep track of how many agents are around. This would add significant complexity. A workaround is trough documentation, warning users that if they don't see feedback in 2xPOLL_INTERVAL seconds it is probably because something went wrong.

Another option is that agents send the ephemeral_id upstream, and Kibana shows a count of agents that applied the last configuration based on the ephemeral ids (or maybe even show the ids themselves). This requires agents to calculate/generate their ephemeral_id. Alternatively, agents can use their IP address.

Describe alternatives you've considered

We could come up with a way to aggregate data coming from different agents. For instance:

  • If all agents "so far" report success, show a "success" visual indicator the UI.
  • If all agents "so far" report failure, show a "failed" visual indicator the UI.
  • If some report success and some report failure, show both indicators.
    • We could further help users to dig into which ones failed via the aforementioned ephemeral_id, IP address, or similar.

This would require agents to send data to apm-server (probably to the same endpoint) and define a new schema. A new schema would also allows us to show more information than a boolean, eg: timestamp of config application, error messages, etc.

I suggest to not introduce a new data model until we know more how the feature is used, what problems users run into, how/why exactly agents might fail to apply configuration, etc.


We also need to decide if we want to support RUM in GA, later, or never. User feedback is probably not practical for the RUM agent.

Note that at the moment Agent Configuration in Kibana is unusable for customers with a Distributed Tracing setup, as the only available setting is SAMPLING_RATE, which in case of DT will be dictated by the RUM agent (almost always).

RUM status is tracked in elastic/apm-agent-rum-js#253

Finally, during the first design we briefly touched on auditing. Some sort of audit logs could be achieved eg. by creating one elasticsearch document per update, and adding information about the user that did that change. However if we do not plan to add an UI interface for it, it might be enough to simply log configuration updates in Kibana.

Implementation issues

Kibana
Server
Agents

See background in #4 and #76

@watson
Copy link
Contributor

watson commented Aug 23, 2019

Regarding which config options to support, I just want to mention that of the ones listed above, IGNORE_URLS and SPAN_FRAMES_MIN_DURATION are not yet supported by all the agents. I think it's pretty trivial to add support for IGNORE_URLS across the board, but SPAN_FRAMES_MIN_DURATION is a little trickier to support in async languages. Currently, we don't have support for it in Node.js and we, unfortunately, don't have a path forward for how to support it.

Whether or not all agents support a given config option might of course not be a dealbreaker for using it in the central configuration. But if we do, do we know how it would look if say the Node.js agent tries and fails to set SPAN_FRAMES_MIN_DURATION?

@jalvz
Copy link
Contributor Author

jalvz commented Aug 23, 2019

Whether or not all agents support a given config option might of course not be a dealbreaker

Agreed, but given the range of options available is probably better to maximize utility by choosing settings that all agents support.

Thanks!

@axw
Copy link
Member

axw commented Aug 26, 2019

I'd opt for:

  • CAPTURE_BODY, since it could be useful to turn it on while debugging
  • RECORDING (no particular reason, could be useful to disable in case the agent/app is on fire)

Not sure about IGNORE_URLS. Maybe it's just me, but I'd define that closer to the application (i.e. using env vars or a config file) rather than closer to Kibana/ES, since it's relatively static and tied to the app's routes.

Also, not sure about METRICS_INTERVAL or TRANSACTION_MAX_SPANS. How often do we expect people to use these? Is it likely that anyone would want to change them on the fly?

@graphaelli
Copy link
Member

Is it likely that anyone would want to change them on the fly?

I could see users eventually using only central configuration to push agent configurations eventually so the more options available to prevent that, the better IMO, regardless of whether they seem useful to change often.

@jalvz
Copy link
Contributor Author

jalvz commented Aug 28, 2019

After discussion this morning with @sqren we concluded that it makes sense that Kibana keeps track of Etags instead (right now they are generated by apm-server).
I Updated the proposal accordingly.

As per the suggested fields, is there any one not in my initial proposal that makes more sense to consider now?

@felixbarny
Copy link
Member

Any configuration which is marked as reloadable would be interesting: https://docs.google.com/spreadsheets/d/1JJjZotapacA3FkHc2sv_0wiChILi3uKnkwLTjtBmxwU

@sorenlouv
Copy link
Member

sorenlouv commented Aug 30, 2019

For anyone interested: I created an issue for targeting multiple services/environments with a single configuration: elastic/kibana#44475.

@jalvz jalvz added the design label Sep 6, 2019
@jalvz
Copy link
Contributor Author

jalvz commented Sep 6, 2019

@elastic/apm-agent-devs you can link your implementation issue above

@watson
Copy link
Contributor

watson commented Sep 6, 2019

Don't we need to first solve #92 before we can continue with this one?

@jalvz
Copy link
Contributor Author

jalvz commented Sep 6, 2019

Sure. What I meant is that when you have an implementation issue, you can link it above.
Also: if 7.5 is not realistic (considering at least one agent should be done by FF), this is the time to speak - so the rest of the teams can plan accordingly.

@beniwohli
Copy link
Contributor

I'm a bit unclear what needs to be done. Is there a final list of configuration options that we should implement?

@felixbarny
Copy link
Member

The Java agent has generic support for any configuration option marked as dynamic, including span_frames_min_duration, for example.

@axw
Copy link
Member

axw commented Sep 12, 2019

Given that recording isn't yet a thing, and span_frames_min_duration isn't support by the Node.js agent, how about we go with these next:

  • CAPTURE_BODY
  • IGNORE_URLS
  • TRANSACTION_MAX_SPANS

@beniwohli
Copy link
Contributor

IGNORE_URLS seems to be somewhat problematic IMO, since how the agents interpret the value varies wildly:

  • Java and Go: a list of wildcard matchers
  • Node: A list of items that can be either regexes or strings
  • Ruby: a list of regexes. Also, the option is called ignore_url_patterns
  • Python: A list of regexes. Also, the option is called transaction_ignore_parrerns.
  • then I stopped...

That looks mightily confusing for our users. I think we first need to harmonize on a common name and semantics. Should I open a new issue for that?

@axw
Copy link
Member

axw commented Sep 12, 2019

@beniwohli 😩 yes please

@beniwohli
Copy link
Contributor

#144

@formgeist
Copy link
Contributor

Just a nudge; we're awaiting which options we're aiming at adding in for upcoming release to the config UI. Are we going with what @axw suggested above?

@beniwohli
Copy link
Contributor

@formgeist we decided in yesterday's meeting on CAPTURE_BODY and TRANSACTION_MAX_SPANS, but then noticed that there are some small differences in the interpretation of the latter setting. I think we should be able to sort that out though, so we should be good with those two.

Note that CAPTURE_BODY can have four discrete values, errors, transactions, all and off. It would be great if the user would be limited to those values, maybe a select box or similar.

@formgeist
Copy link
Contributor

@beniwohli Excellent! Will we need any specific validation needed on the TRANSACTION_MAX_SPANS?

@beniwohli
Copy link
Contributor

Hah, that's where there are some discrepancies amongst the agents. Some have special meanings for 0 and -1.

Can we go with >=-1 and possibly change it to >0 in a few days when us agent devs resolved our disagreement, possibly in a deathmatch?

@formgeist
Copy link
Contributor

Sure, it was only to get some ideas around how we should validate that field. Thanks! I'll move forward with this and get the design in order for implementation. We can sort out the validation details when you folks have made up your minds 😉

@beniwohli
Copy link
Contributor

Just a note, both of these options don't apply for RUM.

@axw
Copy link
Member

axw commented Sep 27, 2019

@formgeist the deathmatch happened, and this is where we landed: all backend agents will treat TRANSACTION_MAX_SPANS=0 the same (it means "don't create any spans"), but we won't align on negative values for now. The UI should disallow negative values.

@formgeist
Copy link
Contributor

@axw Thanks for the update, I'll make sure to update the description.

@jalvz
Copy link
Contributor Author

jalvz commented Sep 27, 2019

I had a chat with @sqren and we concluded that now is a good time to allow users to select "all services" for which to apply a configuration. This config will be stored with an empty service name in Elasticsearch.

When querying config, if there is a conflict, the most explicit one (ie with no empty service ) takes precedence. When new services are added, they will be affected by the "all services" config (if it exists) right away.

No changes are required for agents.

@felixbarny
Copy link
Member

Have you thought about merging configurations?

Example:
service: "" (all services)

transaction_sample_rate=0.1

service: opbeans-java

transaction_max_spans=100

service: opbeans-go

transaction_sample_rate=1

Effective configuration
service: opbeans-java

transaction_sample_rate=0.1 # merged
transaction_max_spans=100

service: opbeans-go

transaction_sample_rate=1  # overridden

A while ago we also discussed allowing to create different configurations blocks for specific service.environments or other metadata fields like global labels.

All of that is ofc. something we can add later as well. I'm just curious 🙂

@sorenlouv
Copy link
Member

@felixbarny Have you thought about merging configurations?

Yes, we did talk about it. We need to think more about it to avoid it becoming very confusing to users - so it won't happen this release.

I did play around with it though and one thing I couldn't figure out how to handle was the "applied flag". In this version we are adding the applied flag, so that agents can report back which configuration they have applied, and the user will be able to see in the UI which configurations have been applied, and which have not.

However, if we merge configurations the agent will not apply a single configuration but multiple. Which configuration should the UI then indicate as being applied? One or all of them? Or should the apply mechanism not be on a configuration level but on a field level? (that would require the agent to report back which configuration id it got each field from...)

@beniwohli
Copy link
Contributor

@sqren by "this version" you mean 7.5? Also, has the method the agents report back which configuration they applied already been spec'd out somewhere? I couldn't find it on a cursory search.

@sorenlouv
Copy link
Member

sorenlouv commented Sep 30, 2019

by "this version" you mean 7.5?

Yes, 7.5.

Also, has the method the agents report back which configuration they applied already been spec'd out somewhere?

Yes, it's in the description of this issue. Agents don't have to do anything if they already sent back the etag they receive from apm-server.

@SergeyKleyman
Copy link
Contributor

Regarding CAPTURE_BODY for the next batch of central config options: there's an accompanying option CAPTURE_BODY_CONTENT_TYPES - wouldn't it be reasonable to make it part of central config as well?

@mikker
Copy link
Contributor

mikker commented Oct 26, 2019

the Ruby agent doesn't have CAPTURE_BODY_CONTENT_TYPES, I think it's just an implementation detail of the Java agent?

@eyalkoren
Copy link
Contributor

Not really an implementation detail, it is just a whitelist of content types which is made configurable (not required- has a default).
Speaking of this, maybe the next enhancement for the central config should allow for arbitrary config options as free text. This will not only solve the problem of those agent-type-specific configurations, but will also remove the delivery limitations. The Java agent would then automatically support all config options that are implemented as dynamic.

@beniwohli
Copy link
Contributor

To be honest, given the experience we had with just agreeing on name and behavior of two config options, I'm against a free-text configuration system. Before that can happen, we need a systematic review of all common config options, and ensure that all agents agree on common behavior. Otherwise, the user experience will be terrible for people who use more than one language.

I suggest that we set up a working group for this, with 3 people from 3 different agents. For each config option, the group members write up the exact behavior of the option, and compare notes. If all three agree, it's a good indicator that it's the correct behavior, and should be put in the spec. If not, the members can then discuss how a common spec for that option should look, and suggest a spec to the wider agent team.

For future config options, a spec should be written from the beginning.

WDYT?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests