Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTLP Exporter: Make gzip compression the default #1895

Closed
wants to merge 9 commits into from

Conversation

johnnyshields
Copy link

@johnnyshields johnnyshields commented Aug 27, 2021

  • In specification/protocol/exporter.md, make gzip compression the default.
  • To allow disabling compression, I have added null as a supported value (currently the only way is to leave the env var blank)

AFAIK there is no practical use case where a user would not want to use gzip compression when exporting traces. It greatly reduces network load and data costs.

After approval, this change should be done in all the client libraries.

@johnnyshields johnnyshields requested review from a team August 27, 2021 11:16
@johnnyshields johnnyshields changed the title Make gzip compression the default OTLP Exporter: Make gzip compression the default Aug 27, 2021
@arminru arminru added area:sdk Related to the SDK area:configuration Related to configuring the SDK labels Aug 31, 2021
@github-actions
Copy link

github-actions bot commented Sep 8, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 8, 2021
@johnnyshields
Copy link
Author

johnnyshields commented Sep 8, 2021

@arminru @bogdandrutu can you unmark this as stale? AFAIK this is pending review from OTEL core team.

@github-actions github-actions bot removed the Stale label Sep 9, 2021
@tigrannajaryan
Copy link
Member

AFAIK there is no practical use case where a user would not want to use gzip compression when exporting traces. It greatly reduces network load and data costs.

gzip compression is possibly pointless in localhost communication (e.g. between Otel SDK and Otel Collector running on localhost) and likely is just a CPU overhead. I did not test for this but this may be a practical use case which disproves the above statement.

@yurishkuro
Copy link
Member

gzip compression is possibly pointless in localhost communication (e.g. between Otel SDK and Otel Collector running on localhost)

probably, but is the localhost-only setup a prevalent configuration or an edge case that can get by with proactively disabling gzip?

@johnnyshields
Copy link
Author

johnnyshields commented Sep 13, 2021

the localhost-only setup a prevalent configuration

It is most certainly an edge case; the standard usage is to send OTEL data directly to a third-party service such as DataDog or Lightstep.

Even in the case of local collectors, presumably these collectors forward requests to an external host, in which case it would probably still make sense to gzip your requests before they are forwarded. For example, Nginx/Apache will forward gzipped requests "as-is" without decompressing the payload. (Note that gzip applies only to the request body and not to the HTTP headers.)

@Oberon00
Copy link
Member

It is most certainly an edge case; the standard usage is to send OTEL data directly to a third-party service such as DataDog or Lightstep.

AFAIK AWS Lambda is one setup where you typically send to a local collector.

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Sep 22, 2021
@fbogsany
Copy link
Contributor

Even in the case of local collectors, presumably these collectors forward requests to an external host, in which case it would probably still make sense to gzip your requests before they are forwarded.

This is not true for the OpenTelemetry Collector when used as an "agent" (e.g. as a sidecar or daemonset). The collector will decompress, deserialize, process (perhaps including batching), re-serialize, and re-compress before forwarding. In this scenario, compression in the app and decompression in the Collector are almost certainly needless overhead.

That said, at Shopify we ended up forwarding "as-is" before abandoning the Collector daemonset altogether.

@johnnyshields
Copy link
Author

johnnyshields commented Sep 22, 2021

@fbogsany there are two questions we should think about here:

  1. Which is more common: using a local "collector" daemon, or directly forwarding OTEL to an external host? I would assert in 90%+ of cases it's the latter, and that seems to be the pattern favored by the majority of Otel analytics vendors (DataDog, Lightstep, etc.)--and as you ended up doing at Shopify.

  2. What is the penalty for "getting it wrong"? If we compress when going to a local collector when we don't actually need to, it's a miniscule bit of CPU wasted (ideally it might even save the collector the compression work.) However, if we don't compress when going to an external sink, the user's network and data ingestion bills will be 10x higher.

@github-actions github-actions bot removed the Stale label Sep 23, 2021
@github-actions
Copy link

github-actions bot commented Oct 1, 2021

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 1, 2021
@anuraaga
Copy link
Contributor

anuraaga commented Oct 1, 2021

and that seems to be the pattern favored by the majority of Otel analytics vendors (DataDog, Lightstep, etc.)

FWIW AWS encourages the sidecar collector to allow rich resource detection and attributes processing. My impression had been this is the common recommendation among vendors but could be wrong. But up until now when thinking of tradeoffs, my mental model has always considered sidecar to be a large majority.

Just an aside that it's always pretty annoying working with OTLP - the tree structure is harder to serialize by needing more state, and parsing with several loops every single time just to get some spans is tedious. It is some builtin in compression in the structs but if gziping anyways, which indeed will probably shrink it even more, then the tedium of the tree structure becomes quite disappointing.

@DragonStuff
Copy link

@tigrannajaryan There are three approvals on this MR and it seems like we’ve gathered all possible information to make a decision. Let’s not lose this to the stale bot. 🙇

Copy link
Contributor

@anuraaga anuraaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's been a spec change since the three approvals to make the variable defaults align with localhost usage

#2014

I propose not changing this compression default in light of this, but either way we need to reset the reviews I think.

@johnnyshields
Copy link
Author

johnnyshields commented Oct 21, 2021

@anuraaga #2014 was merged without anyone taking a holistic view on whether or not “sidecar” should be the basis for default parameters. We need the OTEL team to make a decision on that and then set appropriate values for default host, scheme, and compression at once. So we should leave this PR open until that decision is made.

@tigrannajaryan
Copy link
Member

@johnnyshields changes to the specification require a significant level of consensus which I do not see in this thread. Unless the objecting people change their opinion I do not think this PR can be merged.

@johnnyshields
Copy link
Author

johnnyshields commented Oct 21, 2021

@tigrannajaryan well this is considerably frustrating that #2014 was merged with zero concensus, yet this PR is for some reason held up despite having 3 approvals.

I was under the impression that OpenTelemetry had governance structures to resolve such design questions. Is there a way we could escalate this issue to the appropriate committee?

@tigrannajaryan
Copy link
Member

@tigrannajaryan well this is considerably frustrating that #2014 was merged with zero concensus, yet this PR is for some reason held up despite having 3 approvals.

I was under the impression that OpenTelemetry had governance structures to resolve such design questions. Is there a way we could escalate this issue to the appropriate committee?

@johnnyshields if you believe the code of conduct was violated please contact the governance committee. If you disagree on the technical merit of the changes and would like a technical design review you can contact the technical committee. See the community repo for more details and Github handles of GC and TC members to contact: https://github.com/open-telemetry/community

@github-actions github-actions bot removed the Stale label Oct 22, 2021
@carlosalberto
Copy link
Contributor

@johnnyshields Usually if there's at least a changes request we don't merge things. And definitely this can be escalated to the TC so we take a second look ;) I'd strongly encourage you to do so.

On a more technical note: what's reason to have a SHOULD use gzip instead? At this very moment SIGs can use either gzip or no compression as the default, and with SHOULD use gzip they could still decide to use either of them.

If you really want them to use gzip by default, a MUST would be required instead.

@johnnyshields
Copy link
Author

johnnyshields commented Oct 22, 2021

@carlosalberto thank you for the reply.

The core issue at stake here, which needs to be escalated to the TC, is whether we assume that default mode of trace transmission is a) direct to a third-party server (e.g. how NewRelic, Lightstep, etc. operate) OR b) via a "sidecar", e.g. a collector daemon process presumably running on localhost which then forwards the traces.

This has implications for how we set the default values of several parameters mentioned here in the thread

The use of SHOULD as opposed to MUST is for the reasons that even if we agree (a) to be the default, certain SIGs may still have a strong reason to go with (b) for platform-specific reasons (or vice-versa)

@tigrannajaryan
Copy link
Member

The core issue at stake here, which needs to be escalated to the TC, is whether we assume that default mode of trace transmission is a) direct to a third-party server (e.g. how NewRelic, Lightstep, etc. operate) OR b) via a "sidecar", e.g. a collector daemon process presumably running on localhost which then forwards the traces.

I think you are looking for one "true" way to use all SDKs, and I think that simply does not exist. There are different valid ways to use SDKs. I do not see how we can decide on a single blessed approach in this situation. Because of that we delegated the decision to SDK maintainers to make the best decision based on what's more appropriate for the most common scenario for that particular SDK. For example the JS SDK may decide to enable gzip by default (since sending from browser), also mobile builds of SDKs may enable gzip by default, while the rest keep the compression off by default (and assume sending to local Collector) and let the user enable gzip if needed.

I understand that this may not be the ideal outcome your are looking for, but it is probably the best we can do at the moment.

If you want further opinion from TC on this I can add it to the next TC meeting agenda.

@johnnyshields
Copy link
Author

@tigrannajaryan, yes, I as noted I am looking for a "SHOULD", not a "MUST".

@yurishkuro
Copy link
Member

@tigrannajaryan The current wording still allows SIG-initiated overrides. I see the value in this PR going in because it centrally resolves the question that otherwise every SIG needs to address while having no more data than we have now. Yes, when we talk about special cases like browser or mobile SDK, the choice might be more obvious. But for any server-side SDK there's always going to be a choice between local and remote collector. So why do we think that say Python or Java SIG would be able to choose the default any better than we can do centrally? I think by making the choice centrally we make the SIG maintainers' jobs easier.

@trask
Copy link
Member

trask commented Oct 23, 2021

would it be horrible for the default to depend on whether the endpoint is localhost or not?

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 30, 2021
@johnnyshields
Copy link
Author

Keep alive

@github-actions github-actions bot removed the Stale label Oct 31, 2021
@tigrannajaryan
Copy link
Member

The TC has discussed this. I am posting on behalf of the TC:

  • We believe that having a locally running Collector is common.
  • The SDK defaults are optimized for this scenario. For example the default endpoints that SDKs send to are "localhost".
  • In line with this we also believe that the default setting for compression options should be "none". Enabling compression when sending to localhost increases CPU consumption and is unnecessary.
  • We think that it is not desirable to make one option's default value depend on another option's value (e.g. make "gzip" default compression if the destination is not "localhost").
  • Some SDKs will know for certain or with high probability that the typical destination they send to is not local. For example this may be the case for in-browser JS SDK or for mobile Android SDK. For such SDKs we want to allow to choose a different default setting for compression.

Therefore the TC believes that:

  • The default recommended value for OTEL_EXPORTER_OTLP_*COMPRESSION option in the spec should be "none".
  • The spec will allow SDKs to deviate from this recommendation as explained above.

@johnnyshields
Copy link
Author

OK. Scratching my head because none of the major commercial platforms (New Relic, Datadog, Lightstep) use a sidecar. I guess Splunk does? Who were the representatives on the TC?

@bogdandrutu
Copy link
Member

@johnnyshields just for my curiosity by what measurement did use to determine "major commercial platforms"?

https://github.com/open-telemetry/community/blob/main/community-members.md#technical-committee

@johnnyshields
Copy link
Author

My comment was not meant to be an exhaustive list--the 3 I listed are indeed major platforms, among others. Should have written "e.g."

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 12, 2021
@johnnyshields
Copy link
Author

I'm going to close this for now because the current language is adequate. I still don't understand the TC's logic given the reality of most commercial platforms 🤷‍♂️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:configuration Related to configuring the SDK area:sdk Related to the SDK Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.