-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Address #337 #344
Address #337 #344
Conversation
My 2 cents:
|
@bogdandrutu You know that there exist mixed instrumentation. you have a dog in the race of instrumentation. Most people on this spec are too biased because literally same people here are developing new instrumentation they desire to be used: open-telemetry Step away from this bias, and realize even if you wanted the world to change to your tools, you need to provide a path to that. There are instrumentation today and for at least 3 years that are in different places, some may never change, but some may. To allow them to upgrade is better than "big bang" where you pretend entire environments can change immediately. The downgrade works with backfilling zeros. For example, proxy can rewrite old b3 into w3c trace context mechanically by backfill zeros same as we used to upgrade people into google stackdriver format back in 2016. It is really dangerous to have zero desire to work with existing systems when designing a w3c spec. Please be reasonable and don't break what already works. |
I would like to voice support for 0 backfill as opposed to random backfill as I believe it is more intuitive to understand what is going on. It is also far more compatible with 64 bit systems. I don’t really understand the downside. |
Random backfill was added to make sure that systems don't take dependency on zeroes and make the best effort to propagate an entire thing even if they only use the half of the |
I think what's misunderstood here is what the "system" is. Random backfill is intentionally breaking, which would be surprising to have as a goal in a spec whose purpose is interop. the propagation system is a collection of instrumentation which have different capacities. 64bit is too prevalent to either ignore or to intentionally break. The indexing, re-fitting etc capacity of the data pipeline is out of scope for this spec, as that is about data not defined here. It is relevant to suggest that there are systems who are able to conditionally handle things in a mixed mode by looking at the 64bit part. You should know that this doesn't mean they ignore the high bits. What literally happens is that traces are grouped by 64 bits, then re-qualified by the upper bits. When the upper bits are zero (unknown) the traces are kept in a generic grouping. The "backfill random" aspect of this spec wrecks this qualification and correction heuristic. Why this was added, if to intentionally break systems today, seems literally reckless. |
I understand the problem of changing backend. Spec suggesting that you can only look at subset of bytes of a trace-id. But must propagate the rest bytes thru the process. No need to save them. It is written in an assumption that If system makes an assumption that it will never be called by anything that will have full trace-id set with random bytes and will always be called from the zero backfill systems, than zero backfill can be used, it is not forbidden by a spec. Let me know if I'm missing something. |
what led someone to tell me about this issue was bogdan's comment here:
the idea to say that nothing can use tracecontext unless every individual part of that system is 128-bit is a big problem. There's a continued misunderstanding that "propagation system" is homogeneous and it is certainly not nor likely to be for many sites. Implying you have to change all instrumentation to use tracecontext is severe and more to helping force people into open telemetry vs solving the goal of propagation header. The other thing, and more to the point here, is that the text is very confused and misguided. Ex it says 64 bit system should add random data to make believe it is not. This hints at a control that doesn't exist. Also the example is backwards |
As a user, I think backfilling the traceID with a random trace is counter intuitive and confusing for people debugging issues in production. More so, backfilling from the rightside insted of the left side is not only counter intuitive but unnatural, given that we are talking about numbers (in a hex representation) and when a number need to fulfill a certain length we add |
I think the requirement to “always propagate" leaves a lot of legacy systems in the dust, even though they could've been partially compatible with left 0 padding. I could even wrap a legacy app that uses 64bit zipkin or Jaeger header in a sidecar and automatically rewrite those headers into w3c traceparent with empty tracestate. Left 0 padding gives a clear signal that the transaction passed through a legacy layer. Randomization requirement makes it impossible. I think randomization requirement does more harm than good (the good part isn't even that clear to me). |
I would even add that if a modern 128bit capable system generates a new trace id, it MUST ensure that the first 64bits are not zeros. |
We balancing many requirements and strive to encourage interoperability. I've added two more sections into rationale doc and expanded the paragraph. Please comment - it must be easier to leave comment to the document than in PR description. |
This paragraph came from the discussion with NewRelic. @victorNewRelic can you please comment on your experience of a transition |
This paragraph came from the discussion with NewRelic. @victorNewRelic
<https://github.com/victorNewRelic> can you please comment on your
experience of a transition
new relic's tracing offering is one data point, but as far as I know it is
a relatively new product and closed source. I still don't understand why
years of existing OSS practice should be ignored over this.
|
When addressing interoperability with these systems requirements the following | ||
is taken into consideration: | ||
|
||
1. The main objective of the specification is to promote interoperability of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these points are true, then shouldn't the recommendation be to backfill with zeros, not another random 64 bits? Wouldn't that promote better interoperability with brown field ecosystems that are currently on 64 bit, with the way forward being that those systems would organically move to 128 bit as they can? Currently, by recommending backfilling with random 64 bits, the spec is encouraging breakage (and it's just surprising/confusing/unexpected to see backfilling with randomness - I would expect something deterministic, and the brown field has chosen the deterministic path of backfilling with zeros).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
further in the doc there is a note that backfilling with random numbers will encourage you to test that those numbers will be propagated. And 64 system will not break the 128 bit system passing trace thru it. Spec is encouraging interoperability of different systems, which don't know about each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
encourage you to test
This doesn't seem like a good enough reason to break compatibility with some 64-bit systems. The spec can encourage you to test by stating so in the spec. I'm not sure I can see many spec implementors who would only test because the random numbers encouraged testing, and would fail to test if it was simply stated in the spec.
`trace-id` generation it SHOULD fill-in the extra bytes with random values | ||
rather than zeroes. Let's say the system works with an 8-byte `trace-id` like | ||
`23ce929d0e0e4736`. Instead of setting `trace-id` value to | ||
`23ce929d0e0e47360000000000000000` (note, that random part is kept on the left |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned by others, this example is backwards and doesn't match the comment in parenthesis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random part is kept to the left. Why is it not matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the big problem here is the words "random part" can be interpreted multiple ways. The text says that the "random part is kept on the left" but I think it means to say "the original 8-byte trace ID is kept on the left". Yes/no? This section is discussing backfilling with random values vs. zeros, so when you say "random part is kept on the left" I'm reading that as the random backfilling on the left. But you're using the original 8-byte trace ID on the left, not random-backfill values, and the right side is all zeros, which nobody here is arguing should be done.
And looking at it from the zeros point of view: why is the 0000000000000000
on the right here? The text says "Instead of setting trace-id
value to
23ce929d0e0e47360000000000000000
". Who would ever set it to 23ce929d0e0e47360000000000000000
? If it was zero-backfilled wouldn't it be 000000000000000023ce929d0e0e4736
? i.e. the backfilled zeros would be on the left.
|
||
Specification uses "SHOULD language" when asking to fill up extra bytes with | ||
random numbers instead of zeros. Typically, tracing systems that will not | ||
propagate extra bytes of incoming `trace-id` will not follow this ask and will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, but an upstream system that follows the SHOULD
advice to fill up the left side with random 64 bits will break a downstream that is using brown field 64 bit tracers. Wouldn't it be better to have the SHOULD
advice recommend filling with zeros to maximize default interoperability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
systems that fail to propagate extra bytes of trace id already breaking couple MUST of a spec =). This comment is simply saying that IF system uses 64 bits, has no way to propagate extra bits unchanged, but still wants to follow the spec, while pretty confident that all components are controlled by this system - zero backfill is what this system will implement.
|
||
When tracing system has this capability, specification suggests that extra bytes | ||
are fill out with random numbers instead of zeros. This requirement helps to | ||
validate that tracing systems implements propagation from incoming request to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What use is this validation? Isn't it better to have de-facto interoperability by default? Why is this validation more beneficial and outweigh default interoperability with existing brown field systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the key here is that spec is addressing interoperabilty of different tracing systems. So if 64 bit system receives 128bit trace-id, it must do the best effort to propagate the entire 128 bits. The best way to ensure that this is implemented is ask to fill up spaces with random numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, actual users are telling you that the best way to ensure that is to backfill with zeros.
|
||
Specification explains the "left padding" requirement for the random `trace-id` | ||
generation. Tracing systems will implement various algorithms that use | ||
`trace-id` as a base to make a sampling decision. Typically, it will be some |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhhh ... is this the main reason for recommending random 64 bits for backfill? I can understand how this would be beneficial for such sampling algorithms, but I would suggest that interoperability-by-default with existing brown field systems is a much better tradeoff.
(Also, how is "typically" determined here? How common is this really? How many major large-userbase tracers have this functionality? How many users actually use it? I'd be surprised if the userbase using this kind of sampling algorithm is (1) really big enough to warrant breaking default interoperability with 64 bit systems, and (2) would mostly use the leftmost bytes such that they would actually be negatively affected by backfilling with zeros.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it's the only reason, and it doesn't hold water for me. Someone mentioned elsewhere that some systems may use UUID algorithms for generating trace IDs, and the left-most bytes could be composed of a timestamp - not random at all.
If systems want to avoid RNG by reusing randomness in the trace ID, an easier way to do that is XOR the left and right 8 bytes of the full 16-bytes ID and stop caring about which side of it contains more randomness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhhh ... is this the main reason for recommending
we need to keep this rationale file up to date. It really helps I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for the recommendation - we had this issue in Microsoft with the "bad ordering" of randomness and I saw a few implementations which only looking at 4 bytes to calculate sampling priority. However I personally don't feel strongly about this clause. Exactly because of these reasons that it is hard to enforce randomness.
Specification explains the "left padding" requirement for the random `trace-id` | ||
generation. Tracing systems will implement various algorithms that use | ||
`trace-id` as a base to make a sampling decision. Typically, it will be some | ||
variation of hash calculation. Those algorithms may be giving different "weight" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Breaking compatibility by default for a "may be" seems wrong. How common is this really?
`trace-id` as a base to make a sampling decision. Typically, it will be some | ||
variation of hash calculation. Those algorithms may be giving different "weight" | ||
to different bytes of a `trace-id`. So the requirement to keep randomness to the | ||
left helps interoperability between tracing systems by suggesting which bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that this is prioritizing a suggestion of interoperability with a much smaller number of systems over the definite breakage with a large volume of brown field systems.
(I'm assuming numbers on hash-based-left-bytes sampling algorithm are small compared to brown field 64 bit systems. I'm open to being proven wrong if anyone has solid numbers - I could just be out of the loop.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean that many systems uses trace-ID
where the right most part is random, not the left most? Or we are still talking about zero backfill? If backfill - why not backfill other bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zero-backfill.
Because of all the reasons multiple users have stated in this thread. They are telling you that backfilling with zeros will provide better compatibility and interoperability than backfilling with randomness.
left helps interoperability between tracing systems by suggesting which bytes | ||
carry a bigger weight in hash calculation. | ||
|
||
Practically, there will be tracing systems filling first bytes with zeros (see |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather see the specification reverse expectations here. IMO systems SHOULD
backfill with zeros, and there could be a note that since it's only a SHOULD
and not a requirement, then systems could choose to do the random 64 bit backfill if they feel they need to due to a hash-based-left-bytes sampling algorithm.
In other words, default to maximum interoperability, and point out where there might be valid use cases for doing something else. As it stands, this feels backwards.
The reality is that implementors will see SHOULD
, and the language around 64 bit trace IDs being "violations", and they will blindly backfill with random 64 bits. As a result we'll see breakage by default instead of interoperability by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't follow what system will break. And why the randomness requirement will break it? It felt almost like you suggesting all systems only fill up 64 bits.
In the doc there are three types of systems described. First, which simply ignores extra bytes. These systems definitely will be better off by filling up with zeros. And they can. They already breaking couple MUST of this spec. Ignoring second. Third, that only recording 64 but respecting others and propagates 128. For this tracing systems - why not ask to fill up extra bytes with random numbers? What difference does it make? It still will do it when some external 128bit system will call into it and extra bytes are needed to be propagated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not ask to fill up extra bytes with random numbers?
Try looking at this from the other direction - why not fill up with zeros? Multiple people are telling you that backfilling with zeros will allow for greater compatibility with 64 bit systems.
I added review comments on the diffs as requested. I agree with the arguments of others in this PR that backfilling with zeros is both more compatible by default with existing brown field systems and more intuitive. I've seen other arguments here that essentially say backfilling with randomness isn't a big deal, because it's only a suggestion and not a spec requirement. That's not cover for unnecessary interoperability breakage by default. One of my review comments talks directly to this:
If this spec is truly about maximizing interoperability, I don't see how we could pick random-backfilling instead of zeros-backfilling. |
Sorry for the late addition. We are actively implementing this, and I would prefer zeros as well. The reason has to do with split traces. If Service A calls two services, B and C, with a 64-bit trace, and both B and C each backfill with different random bits, then the trace is now split, as I now have two separate trace IDs. SO the trace has become split. Now, consider that the analysis system consuming this data accepts 128-bit IDs. When the analysis system receives 64-bit IDs directly from A, it must likewise pad the IDs with zeros. I do not see another choice, when some services only report 64 bits, and others report 128. This is a serious issue for us. However, if there is a clear definition in the spec for how to backfill with zeros, this solves these split-trace problems very cleanly. Unfortunately, I see no solution where every service pads with random bits; currently I am watching it create a host of issues as I try to implement it. |
If B and C could both update |
@reyang unfortunately no, as information placed in tracestate is not reported by any system we interoperate with. Also, there are scenarios where we must switch headers - to propagate to a service configured with B3, for example - and thus there is no tracestate. |
@nicmunroe hi! Glad to see you. Thank you for review! |
@tedsuo I think you are missing a point. First, behavior when system recieves a trace is not specified. Random backfill is only for new trace-id. Second, it is recommended that B and C will propagate extra bytes. If you talking about any of opentracing implementations - there is probably some sort of baggage mechanism, so propagating of extra 64 bits shouldn't be a major problem. Please let me know what you think |
Absolutely! And I'm glad to see this discussion here. I only want them to share their experience of migrating to 128 bits as this entire paragraph was added after discussion of NewRelic's implementation of trace context. |
Summary
Is this correct so far? |
@SergeyKanzhelev After responding to a few other comments, this really caught my eye, and may be one of the main reasons people in this PR are having trouble hearing each other. I'm not sure you can specify one thing (how to generate the trace ID) without considering the other (how to receive a 128 bit trace when you're a 64 bit system, and then how to propagate it again later). Why? If a 64 bit system receives a 128 bit trace ID and downgrades it to 64 bit (assuming that's the best it can do), then what is it supposed to do when it goes to propagate it on outbound calls? The spec is very clear about all trace IDs being 128 bit, and this section talks about backfilling with random values. So they'll probably backfill with random values when they propagate downstream. This is really bad, as @tedsuo 's example showed. It doesn't matter that there's a small side note buried in the text saying "it's only SHOULD, you can backfill with zeros if you want" - implementors who are trying to be good citizens will backfill with random, because it's what the spec recommends. And it doesn't matter that the random backfill stuff is only supposed to be for ID generation - the 64 bit system has to do something to conform to the w3c 128 bit trace ID requirement when propagating, so it's logical that they would follow the section talking about what 64 bit systems are supposed to do with trace IDs. If, on the other hand, the recommendation was to backfill with zeros both for generation and 64-bit downgrade+propagation, then you'd get some level of interoperability and compatibility by default. You'd get this base level of interoperability even if a trace bounced back and forth through multiple 64 bit and multiple 128 bit systems on a single request. |
@nicmunroe thank you for the note. I'm also trying to understand how to make sure we are not talking pass each other. Yes, spec is only talking about new ID generation. In case of propagation - either fill up with zeroes or try to pass by extra bytes unchanged if you can. It is definitely not clear from the text as it raises so much confusion. This is how this PR is actually started. As for the inter-oparability and left padding - the prior art reference goes to how, for example,
So asking for left padding of randomness breaks this scenario. You'd want some randomness in the right part of a trace so inter-operability with 64bit systems would work. |
@danielkhan as I see it - in an order of importance there are three questions:
1st is not specified in a spec. So we can add it without normative change to the spec. We have a strange SHOULD-MUST combination of randomness generation. Reading it is unclear what it actually asks for. One option is to nuke it completely from the spec. Again, I'd argues not a normative change as the entire paragraph is confusing. For the item 3 "Zero vs. random backfill" on a new If we all agree with this - there will be (from my perspective) no normative changes and everybody's interests will be addressed. |
I recommend filing separate issues and holding a discussion / vote there before proposing a PR to change.
I think most people on this thread disagree with this. If a request goes through several 64bit systems, then zero-padding on egress from those systems allows almost the whole request to be associated with a single trace ID. Random padding will break it into many traces, with no way of correlating. |
I agree with @yurishkuro. I hardly felt heard in this discussion, and I do not see my concerns being clearly addressed. I would like to see this hashed out more fully. |
@tedsuo so you mentioning two issues. First, split trace. In this example you assume that B and C will fill up trace-id with zeroes:
When spec is saying:
Second concern is about analytical system. And rationale doc addresses this concern saying that:
So this is not really a concern - if system breaks this spec requirement anyway, you can break one more "SHOULD". In fact spec is not addressing these type of services at all - it uses MUST langugage that the entire Furthermore if you look at the same example and assume that What am I missing? |
Absolutely! This PR simply removes the uncertancy in the current spec w.r.t. how it was written. Mostly around the fact that random backfill is only requested for new trace-id, not the case when you propagate somebody else's So the way I'd suggest we approach it is by merging this PR and filing these three issues:
My comment was written in response to @danielkhan's summary trying to break down the problem in smaller issues. |
@SergeyKanzhelev "random backfill is only requested for new trace-id" kind-of doesn't make sense to me as a use case. If a system is truly 64bit, and not upgradeable, it most likely won't have any place to store the other random part of the trace. The backfill only applies on egress from a 64bit system, which is where many are arguing a zero-padding is better than random padding. Note that the padding may be done not by the 64bit system itself, but by a sidecar that rewrites the headers. |
@yurishkuro these systems
breaks a lot (like not propagating Spec uses MUST language on propagation with a small not on what to do when it's absolutely impossible. However, once the systems is capable to propagate all 128 bit (even though 64 of these bits may not be recorded), the ask is to do random generation. For downstream 64bit only systems these "slightly upgraded" 64 bit systems may look like a truly 128 systems that still needs to be supported. Are you suggesting to add another issue for discussion:
|
Also I realized we may need a names to differentiate systems:
|
so spec mostly talking about
|
I think so. Most of the feedback on this thread was about "legacy" systems. If a system can be upgraded to support |
Good. We are on the same page here. |
@SergeyKanzhelev to be clear, this is the scenario I am raising. I believe the confusion relates to whether existing systems had somewhere to put the extra 64 bits. The
I admit I don't see the logic here. As long as we are accepting data from
Thank you. Having looked at it from several directions now, I don't believe there is a complete solution for backwards compatibility, except to stick with only 64 bits in the trace id until every |
I'm closing this PR in favor of this issue: #349 Thank you everybody for a valuable feedback! Let's make sure the spec is fixed to address compatibility scenarios identified in this discussion. |
Fixes #337
The intent of the paragraph was to describe the ID generation. It fails to mention that the whole trace-id MUST be propagated.