-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add design doc for processing query language. #4444
Add design doc for processing query language. #4444
Conversation
For a reminder, notes from an initial review - I meant to mostly capture them but if missed anything feel free to point out!
|
Codecov Report
@@ Coverage Diff @@
## main #4444 +/- ##
==========================================
+ Coverage 90.69% 90.72% +0.03%
==========================================
Files 178 179 +1
Lines 10356 10690 +334
==========================================
+ Hits 9392 9699 +307
- Misses 746 770 +24
- Partials 218 221 +3
Continue to review full report at Codecov.
|
Thanks @anuraaga Adding to Collector SIG meeting agenda for discussion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading again the examples I fear that we try too much to reduce the number of statements/words and because of that we need to make some "assumptions"/"tradeoffs". I would suggest to avoid any kind of optimizations like that in the first version and ask users to always fully specify (sometimes duplicate) the statements. Then we can start adding "optimizations" like if you don't have a from applies to all etc.
docs/processing.md
Outdated
mechanism could be used for selecting metrics for temporality conversion as other cases, but it is expected that in | ||
practice configuration will be limited. | ||
|
||
The processors implementing this use case are `cumulativetodeltaprocessor` and `deltatorateprocessor`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deltatorateprocessor
is not temporality related :D the name is a bit wrong to include "delta", it is just a rate calculation value/time_interval, this is more meaningful for delta I agree, but the operation can be done on cumulative as well.
docs/processing.md
Outdated
Remove a forbidden attribute such as `http.request.header.authorization` from all telemetry. | ||
|
||
`delete(attributes["http.request.header.authorization"])` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is nice, but a bit confusing. Does this remove the attribute from the "span.events" as well? What about the "resource.attributes"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, as on line 119 I defined all field accesses as fully specified, so it is not ambiguous I think. Let me know if I should phrase it differently
docs/processing.md
Outdated
|
||
Remove all attributes except for some | ||
|
||
`keep(attributes, "http.method", "http.status_code") from metric` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call the "signals" as [traces/metrics/logs] may want to be consistent here and use plural at least, and probably use "traces" instead of "spans"?
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Thanks @bogdandrutu - sorry for the delay, I descoped things like "from all telemetry" since as you mentioned before, the collector has no support for this at all anyways. And the paths are absolute paths as defined here, hopefully that's the reduced scope you're talking about. Also have started on a very hacky prototype of the AST processing anuraaga/opentelemetry-collector-contrib#1 If anyone has any suggestions on test cases to solve / benchmark in the first versions, I can list them up in the doc |
@anuraaga this is very interesting. I have been thinking about a query language in the background for a while, would be good to discuss this. Before you go ahead with the current design let's discuss the alternates and possible amendments. A couple thing I would be interested to explore: SQL or NoIs SQL-like language preferable or we can go with imperative style (e.g. more Python-like)? It doesn't seem like we are going to support complicated SQL-like facilities (e.g. joins) and SQL's verbosity seems superfluous. Compare SQL-like vs Python-like below. Example 1: Example 2: The Special syntax/lexical rulesCan we define the syntax such that it makes usage with Otel data specifically convenient? You mentioned a shorthand for accessing scoped data. This is the right idea and I think we can make spill/populate all "attributes" as top-level variables. Maybe we can also combine this will tailored lexical rules for identifiers that allows dot as a valid identifier character and then we can do this instead of the above: Example 1: The idea being that the most common cases are much more concise and pleasant to write. The downside is you loose the dot as an operator, but maybe we don't need it and brackets are enough for traversal of objects. |
docs/processing.md
Outdated
create_histogram("duration", end_time_nanos - start_time_nanos) from traces | ||
keep(attributes, "http.method") from metrics where descriptor.metric_name = "duration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify what this does? Does this read a span and produce a metric? How is this possible with current processors interface where the processor's input/output data type is the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I had meant to remove these cases but forgot this one. When originally writing the doc I forgot about the one signal-per-pipeline limitation.
In the future, I think shared pipelines could be cool but not for now, especially because collector core has no way of supporting that
docs/processing.md
Outdated
Create utilization metric from base metrics. Because navigation expressions only operate on a single piece of telemetry, | ||
helper functions for reading values from other metrics need to be provided. | ||
|
||
`create_gauge("pod.cpu.utilized", read_gauge("pod.cpu.usage") / read_gauge("node.cpu.limit") from metric`**s** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this possible the processor will need to store metric data points referenced by read_gauge
, right? Would that be the responsibility of read_gauge
implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's the idea, though need to confirm better with code indeed
This builds on top of open-telemetry#4444 but uses Python-like syntax instead of SQL-like.
For easier comparison I posted the alternate with Python-like syntax here: #4499 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments @tigrannajaryan!
One of the main motivations of SQL-like was to take advantage of existing tooling or muscle memory. SQL itself has a grammar that is pretty general, I think the proposed syntax is actually a true subset of it. We probably don't want to provide a real programming language in these statements (or at least not yet :) ), so worried if the Python-derived syntax may end up too flexible. We also need to make sure to keep open query engine optimizations, e.g. combining transformations with the same where
clause. Of course, if it's just a difference in formatting, then if:
and where
could be implemented similarly, but the former seems like it could open a larger can of worms.
The from
clause mostly comes from my original idea of having a pipeline with all the signals, forgetting this isn't possible. However, I think having it even from the beginning may be worth it to allow for this in the future, and even in an initial rendition, it seems like UX can be simpler by only configuring the processor once and applying it to each signal pipeline - the processor would smartly only use the rules that actually make sense.
Let me know what you think of these thoughts. /cc @punya also who has been a proponent of SQL-like syntax I believe
docs/processing.md
Outdated
create_histogram("duration", end_time_nanos - start_time_nanos) from traces | ||
keep(attributes, "http.method") from metrics where descriptor.metric_name = "duration" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I had meant to remove these cases but forgot this one. When originally writing the doc I forgot about the one signal-per-pipeline limitation.
In the future, I think shared pipelines could be cool but not for now, especially because collector core has no way of supporting that
docs/processing.md
Outdated
Create utilization metric from base metrics. Because navigation expressions only operate on a single piece of telemetry, | ||
helper functions for reading values from other metrics need to be provided. | ||
|
||
`create_gauge("pod.cpu.utilized", read_gauge("pod.cpu.usage") / read_gauge("node.cpu.limit") from metric`**s** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's the idea, though need to confirm better with code indeed
@anuraaga I am not opposed to SQL-like but I think it may be worth to spend a bit more time with exploring what can give us best usability. I am not completely sure SQL is the most familiar language for our target audience. What do you think if we put together a side-by-side comparison of the examples that you have, plus some other interesting use-cases and see what it looks like? I had also looked into SPL (Splunk's query language) as a source of inspiration, so it may be useful to see if we can borrow any useful ideas from there. If we are totally brave it may even make sense to compare 3 possibilities side-by-side: SQL-like, Python-like and homegrown. I am happy to help/work on this together if you want. |
@tigrannajaryan That sounds like a good idea, let me try to merge the snippets from your PR to let us see things side-by-side |
@tigrannajaryan Sorry for the delay I added snippets following your form as well how does it look? Also feel free to make edits directly to the branch if any easy improvements |
Thanks @anuraaga . I think we need to compare and see what we gain by using one or the other approach. Some arguments in favour of SQL-like:
Arguments in favour of Python-like:
To comments on some arguments you brought up earlier:
I think it is arguable whether SQL or Python has more tooling or muscle memory. Besides it is not clear what tooling we can reuse. Editors with syntax highlighting perhaps, provided that the expression is in a separate file (not possible today but maybe in the future)?
I agree that we need to be careful with this. However given that both with SQL-like and Python-like we are only providing a severely limited subset of the language I think we are going to be exposing roughly the same level of complexity, so I think it is a wash here.
I would like to explore this more. Do we believe that SQL syntax is inherently easier to optimize for our use cases? I am not sure why is it so. The optimizations SQL engines do are typically all about execution plans. However, in our case we have a stream of data, not data at rest, which inherently limits what sort variations you may have in the "execution plan". I doubt that a lot of optimizations are possible, in the majority of case we are going to be literally executing the statements as is, while iterating over the data records. I would like to see a specific example where we think a particular SQL syntax is easier to optimize that the equivalent Python syntax. I am having a bit of a hard time coming up with an example myself. This is probably not very helpful. I feel like there is not enough arguments in favour or against one or the other approach yet. Let me think a bit to see if I can come up with some additional arguments and please feel free to add some yourself. Some other topics to try to explore:
|
Thanks @tigrannajaryan - I agree that the differences aren't that large and there isn't any clear winner.
I guess I am thinking in terms of forwards compatibility, which may be too much early optimization anyways. I do expect a way to process multiple signals at the same time to be useful in the future, and it would be nice if the syntax supports it. The syntax could change to support it in the future though.
It's true that this probably ties into the previous section, if the language exposes too much complexity then it would become harder to optimize. But if the complexity is kept down for even a python-like language, then the optimization possibilities should be similar. They would both probably be frontends into the same AST backend that actually runs the commands. I guess from reading them, and maybe just personal preference, the SQL form seems to map more precisely to single statements. |
You are right, potentially with python like syntax we can allow: if condition:
statement1
statement2
.... |
@anuraaga Can you perhaps show and example of what you mean by this? I feel that I may be missing some nice way to support multiple signals that you are envisioning. |
@tigrannajaryan "A way to support multiple signals"? Nah I don't have that :P Would be a large rewrite in the pipeline I think. I do have a desire though. I think there are many cross-signal use cases like "Drop all health check telemetry", "Redact auth header on all telemetry", "Reduce cardinality for all telemetry of certain attribute in same way". So I am hoping that at least a future configuration language allows this. Ah, if you meant syntax-wise, then I was thinking
Former is only traces, latter is for all telemetry. Explicitly requiring the signal name in the statement means in the later it could be omitted when applying to all telemetry. Even now it could be, the same instantiated transform processor could be applied to all signals. |
I see. You can achieve it today by either including the processor in a particular pipeline or not. This does not seem to enable any new interesting functionality that is not possible with the current philosophy of using a processor in one or more pipelines as applicable. I was wondering if it could somehow enable more interesting scenarios where signal type is getting converted in the middle of the pipeline and somehow one pipeline temporarily happens to hold more than one signal type and on which you execute statements selectively using the "from" statement. This would essentially mean a pipeline can contain a data of multiple signals. I can see how in that case the "from" statement would be really necessary. However, I am having a hard time imagining how exactly we could change the Collector's pipelines to work like that. Anyway, I was just trying to find stronger arguments in favour of having the "from" clause, but I do not seem to be able :-) |
@punya can you review the list of examples and add any missing ones. |
// Assuming this is not in "core" | ||
processors.register("replace_wildcards", replace_wildcards) | ||
|
||
func replace_wildcards(pattern regexp.Regexp, replacement string, path processors.TelemetryPath) processors.Result { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does the string in the call get converted to a regexp.Regexp here? The example above shows:
replace_wildcards("/user/*/list/*", "/user/{userId}/list/{listId}", attributes["http.target"])
Is reflection used for every invocation of of replace_wildcards to figure out that "/user/*/list/*"
must be compiled into regexp.Regexp? Or is it the responsibility of the compiler to figure this out?
Related to this: do we adopt weak typing, when you can pass a string when regexp is expected? Weak typing may be more complicated to optimize (i.e. precompile regexp patterns once) so it may be worth thinking about this aspect of the language.
Do we intend to make the compiler smart enough to figure out that the regexp compilation can be done ahead of time and not for every invocation for every span? That's a fairly big ask for a simpler compiler, but without that the execution can be quite slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great callout - while writing the POC I realized the functions need to be factories of the actual logic function.
(a bit of shorthand)
func replace_wildcards(string, string) func(Span, Resource) {
r := regex.MustCompile
return func(span, resource) {
return doReplace(r, s, span, resource)
}
}
func doReplace(regex, string, span, resource) {
go()!
}
The UX for defining a function is a bit reduced but seems quite reasonable. And with this it will be possible to use reflection to convert types passed into the factory so instread of string, it could be defined to accept regex. The framework would only need to reflect, convert regex, and invoke factory, once during config parse time.
I'll add a note in the doc about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, is it the same smartness you're referring to? 😅 I think it achieves some balance by creating the factory / logic function split to make the optimization easier without making the compiler have to be too complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The factory approach requires that some arguments are compile time constants. We loose the ability to use a value computed at runtime as the regular expression, for example. It also requires us to declare which of the function parameters are compile-time parameters and which are runtime parameters. In this example the fist 2 arguments of replace_wildcards are compile-time constants.
What if I want to replace a span attribute by a value of another attribute and not by a constant string? It doesn't seem to be possible. To be honest, I have reservations about the approach you suggest. I think it limits expressiveness of the language. Implementation-wise this approach is deviating from how language compilers and VMs are typically implemented and it may be difficult to fix it without major rewrites in the future.
Alternative approach is to have Regexp type as a first class citizen in the language. So you can do this instead:
replace_wildcards(regexp("/user/*/list/*"), "/user/{userId}/list/{listId}", attributes["http.target"])
where regexp
is a function that takes a string and returns a Regexp. Then a sufficiently smart compiler can perform the Regexp value computation at compile time and avoid calling the regexp()
function at time if the argument to regexp()
is a compile-time constant. See how our current expr evaluator added support for that: https://github.com/antonmedv/expr/pull/96/files
We have another unclear moment with the 3rd parameter of replace_wildcards. It is not clear how attributes["http.target"]
becomes a TelemetryPath
(or a Span
and a Resource
pair that is passed to doReplace
). Is the semantic of the language that attributes["http.target"]
is an lvalue of type TelemetryPath
that can be either evaluated to get the value of or can be passed as a reference to a function to store a value into?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The factory approach requires that some arguments are compile time constants. We loose the ability to use a value computed at runtime as the regular expression, for example.
Could you describe an example of a regex computed at runtime? As this is a static config (well possibly dynamically updated via a remote), I figured that by nature the config elements, such as a regex, would be static.
is an lvalue of type TelemetryPath that can be either evaluated to get the value of or can be passed as a reference to a function to store a value into?
Yeah - since it's not a string, it is parsed as a telemetry path. If the function's argument in that position does not accept a telemetry path, it would be a configuration error and cause a warning. It's true that currently only telemetry paths are runtime values - I think this hits all of the use cases for telemetry transform (the only runtime data is the telemetry itself) but let me know what other ones there are that may need something more.
Currently the value
type is a literal, path expression, or invocation of a function
It would be relatively easy to expand support for special syntax, e.g. /regex/
or even regexp("foo")
if it's useful I think. That being said, I don't think we'd avoid factories (though they wouldn't necessarily need to be required). We probably can't support all types of config-time resolved values. For keep
, we need to convert a slice of strings into a map for quick presence tests
There doesn't seem to be a way to handle this generically, but the factory approach seems to work well for it.
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merging this as draft. After we progress with the implementation we may update it.
@anuraaga I found a couple interesting alternate approaches that we could probably look into: I think the syntax of either of these is nicer than SQL. I like that the order of syntax elements matches the order of logical execution. |
@tigrannajaryan Thanks a lot for the pointers! I will look through those to see what we can take out of them. Do you think we should continue to iterate on this PR or followup with another one? |
Works either way for me. Whatever you prefer. |
This has been out for a while so I think it'll be good to get this in as a draft and iterate on this / the initial implementation together if it's ok |
This captures ongoing discussions about a revamp of the processing pipeline.
@bogdandrutu @punya