Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process.command_line accepts multiple value types #831

Closed
anuraaga opened this issue Aug 19, 2020 · 17 comments · Fixed by #1137
Closed

process.command_line accepts multiple value types #831

anuraaga opened this issue Aug 19, 2020 · 17 comments · Fixed by #1137
Labels
area:semantic-conventions Related to semantic conventions priority:p3 Lowest priority level question Question for discussion release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:resource Related to the specification/resource directory

Comments

@anuraaga
Copy link
Contributor

I noticed process.command_line accepts different types of values. I don't think we generally do this but was this intended? I would just use a string for it in all cases.

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/resource/semantic_conventions/process.md#process

@anuraaga anuraaga added the spec:resource Related to the specification/resource directory label Aug 19, 2020
@Oberon00
Copy link
Member

This was intentional and was discussed at #635 (comment). In some environments (Windows) a single string (space separated or however it was typed in the shell) is the raw form of this. In others (Linux) a NUL-separated string that is easily translated to a list is the raw form. Additionally, some runtimes support only obtaining the commandline as list of arguments, which would need to be escaped (according to which shell's syntax?) if you want them as single string.
If we only support a single string, we would additionally need an attribute that tells us the format the string is in.

@Oberon00 Oberon00 added area:semantic-conventions Related to semantic conventions question Question for discussion labels Aug 19, 2020
@Oberon00
Copy link
Member

Actually, a much nicer summary of the pros & cons is at #635 (comment) by @james-bebbington :

There is also concern around using two different data types to represent the same attribute value: #635 (comment). It might create additional complexity on back end systems to have to manage potentially different data types for the same attribute, though I'm not sure how much of a concern this would be? I guess parsing null-characters on the back end would likely be equivalently complex.

Note this isn't strictly ambiguous btw. If that was a command with no arguments, and you were looking at the Unix representation, it would have a null-character as the last character in the string. On Windows it would not. Admittedly that is not a great representational difference, and could be difficult to distinguish in some programming languages.

After thinking about this some more, I think I do prefer the list of strings approach, but I'm not sure I'm aware of all the pros/cons. For now I have changed it back to that, and updated the example for each OS accordingly. But for the record here's the three options considered and pros/cons:

  1. Join arguments with a string:
    Advantages: Most consistent representation across OSs.
    Disadvantages: Potentially lossy/non-trivial conversion on Unix - do we need to enquote parameters with spaces in them (and escape internal quotes), etc?
  2. Pass through string returned from OS as is (null-delimited on Unix):
    Advantages: No processing needed by code producing telemetry, can just provide what OS returns.
    Disadvantages: Back end needs to parse null-characters.
  3. Convert to list of strings (for Unix only):
    Advantages: Clean ("properly typed") representation.
    Disadvantages: Inconsistent data type for the same attribute between OSs, and back end needs to handle this differently.

@anuraaga
Copy link
Contributor Author

anuraaga commented Aug 19, 2020

Yeah I can definitely see the motivation for this. At the same time, instrumentation and backend does get quite a bit more complex because of the dual-type so I don't know if this will be easy to implement - generating code from the semantic spec yaml seems especially problematic.

@thisthat Can you check if your semantic spans work be able to cover this case? #571

@Oberon00
Copy link
Member

At the same time, instrumentation and backend does get quite a bit more complex because of the dual-type

Instrumentation gets simpler because it has a choice, and can do whatever is easier. Or am I missing something?

@Oberon00
Copy link
Member

Oberon00 commented Aug 19, 2020

@thisthat Can you check if your semantic spans work be able to cover this case? #571

Currently the tool doesn't support this, but it can be extended. In fact we had the possibility of such union types in mind from the beginning.

@anuraaga
Copy link
Contributor Author

I see - currently in Java we use an abstraction like StringAttributeSetter for operating on attributes. We don't yet use it for resource attributes but I think we should and probably will eventually. So we would need to introduce a clunky StringOrArrayAttributeSetter - it's not the end of the world, but this is what I meant by instrumentation getting more complex (probably only applies to typesafe language).

@Oberon00
Copy link
Member

You can also have two constants for this semantic attribute, e.g. a StringAttributeSetter PROCESS_COMMANDLINE_STR and PROCESS_COMMANDLINE_LIST. Though I begin to see your point. We could instead have two alternative semantic attributes with two different names in the semantic conventions. It would be a bit clunky in the spec then though.

@anuraaga
Copy link
Contributor Author

Anyways it's a very small point - I was surprised to see it but since it was already discussed it's fine with me. Thanks for clarifying it.

@anuraaga anuraaga reopened this Aug 19, 2020
@anuraaga
Copy link
Contributor Author

Actually let me keep this open. The spec text says either primitive or array

https://github.com/open-telemetry/opentelemetry-specification/blob/master/specification/common/common.md#attributes

So I think we need to tweak that to more formalize the possibility of union types.

@Oberon00
Copy link
Member

The actual Attribute will not have an union type, it will have either one or the other type. Only the semantic convention has the union type.

@anuraaga
Copy link
Contributor Author

It took me a minute to parse that but I understand what you mean :) It's true, but I don't think it hurts to clarify this example explicitly. I think dual type convention is going to cause confusion when a newcomer reads and tries to understand so we can probably help. That newcomer this time was me, hence the issue filed ;)

@thisthat
Copy link
Member

I might miss something but, what is the reason for not using array as the only available type and in windows put it as a single element array?

@jmacd
Copy link
Contributor

jmacd commented Aug 19, 2020

@thisthat I think the opposite argument could also be made. Why not convert the array representation into a string in the backend, if it's so important to have a consistent type?

@Oberon00 wrote:

Instrumentation gets simpler because it has a choice, and can do whatever is easier. Or am I missing something?

and I agree. I don't think OTel should be in the business of saying what is and is not a valid data representation for these semantic conventions. The conventions are describing semantics. There is no loss of semantics by allowing multiple representations, but there are well-documented benefits to the instrumentation modules that can simply observe their inputs as opposed to transforming their inputs to match a specified data representation.

@anuraaga
Copy link
Contributor Author

I don't think OTel should be in the business of saying what is and is not a valid data representation for these semantic conventions.

I'm not against having conventions with multiple types like this if we clearly lay it out. I'm not sure I'd go so far as saying we aren't in the business of describing the data representation though. If we are ok with numeric values being reported as strings, for example, all processors of data whether collector or backend would have to check both types or risk missing the data, which is unnecessary complexity.

@andrewhsu andrewhsu added the release:after-ga Not required before GA release, and not going to work on before GA label Aug 21, 2020
@Oberon00
Copy link
Member

Oberon00 commented Sep 24, 2020

I might miss something but, what is the reason for not using array as the only available type and in windows put it as a single element array?

Because the meaning of a single array element currently is: After splitting the argument list, this was the only argument.

E.g. when you have a value of ["myapp myarg"] that would actually mean that the original commandline invocation was something like $ myapp\ myarg (note the escaped space).

@Oberon00
Copy link
Member

Oberon00 commented Oct 23, 2020

@open-telemetry/technical-committee I think this issue should be allowed-for-ga. EDIT: I made a PR #1137 that would fix this.

@Oberon00 Oberon00 removed the release:after-ga Not required before GA release, and not going to work on before GA label Oct 23, 2020
@andrewhsu andrewhsu added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs priority:p3 Lowest priority level labels Oct 27, 2020
@andrewhsu
Copy link
Member

from the spec sig mtg today, re-triaged for allowed-for-ga

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:semantic-conventions Related to semantic conventions priority:p3 Lowest priority level question Question for discussion release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:resource Related to the specification/resource directory
Projects
None yet
5 participants