Plan validator/consumer reference implementation #131
Comments
Hey @jvanstraten, thanks for bringing this up. Yes, this is something many of us have discussed and agree is very important. I also agree that some early Substrait implementations have been casual about which rules from the spec they follow. :) If you have the time and interest in coming up with an outline/proposal/spec/thought piece on how this would work, and in implementing it, that would be great! Want to start a Google doc of thoughts/approach?

WRT the implementation language, I think it is important to be able to expose this functionality as a shared library with a C interface that can be used from nearly anywhere. This would allow us to avoid implementing it multiple times. On that point, avoiding interpreted or GC'd languages is important, since their runtimes are generally much less suitable for embedding. So among the options you proposed, C++ sounds like the better option. That being said, I'd actually love to see this done in Rust if that's something you'd be interested in doing. Validation like this seems like a classic place to have to fight memory bugs. (Curious if others like @cpcloud have an opinion here.)

The other note I'd make is that it would be great if we could find a way to make at least part of these validity rules more structured than code. For example, maybe we can use annotations in the proto files to define at least some of the simpler rules (validate that this field is set, or that one and only one of these fields is set, etc.). I'd like to make modifications to the specification, and the subsequent requirement to update the validation tool, as straightforward as possible.

Lastly, we may want to think about whether the validator could validate multiple versions of a plan. (I'm not convinced on this.) As we develop new rules/requirements, we're likely to have producers and consumers on differing plan versions. Allowing a single validator to support multiple plan versions would be nice.
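To make the shared-library idea concrete, here is a minimal sketch of what a C-callable entry point could look like if the validator were written in Rust and built as a `cdylib`. The function name, signature, and return codes are purely illustrative assumptions, not an agreed interface.

```rust
// Hypothetical C ABI for a validator shipped as a shared library (cdylib).
// The symbol name, argument convention, and return codes are assumptions.

use std::os::raw::c_int;
use std::slice;

/// Validate a serialized substrait.Plan message.
/// Returns 0 if the plan is valid, nonzero otherwise.
#[no_mangle]
pub unsafe extern "C" fn substrait_validate(buf: *const u8, len: usize) -> c_int {
    if buf.is_null() {
        return -1; // caller error: null buffer
    }
    let bytes = slice::from_raw_parts(buf, len);
    match validate_plan(bytes) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

// Stand-in for the actual validation logic discussed in this thread.
fn validate_plan(_serialized_plan: &[u8]) -> Result<(), String> {
    Ok(())
}
```

An entry point like this could be called from C, Python (ctypes), Java, and so on without pulling a protobuf runtime into the caller.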
I hadn't even considered implementing this as a library, much less with a language-agnostic interface. It seems much simpler and lower-maintenance to me to just build a standalone application, analogous to what https://validator.w3.org/ is for HTML. I suppose a library form might be easier to use from within test cases, since you'd just call a function instead of starting a child process and inspecting its exit code. But even then I think I'd prefer writing a utility function that does exactly that over the headaches of adding a link-time dev dependency that transitively depends on protobuf, especially when it defines the same messages that you already define in your application (there is, apparently, no good way to privately or statically link to protobuf stuff, because the protobuf library itself maintains a global registry of message types). You'd also only be able to link one version of the validator per test executable that way. That said... if you'd want to provide both options, the base form would be a library.

Also, unless I'm missing a use case, I don't think it ever makes sense for a producer/consumer to link to a validator under normal (non-test) conditions. A consumer would always have its own "validation" as a natural consequence of converting the Substrait messages to its own internal representation, which doesn't necessarily relate to the actual Substrait spec. After all, if it doesn't support literally everything that the spec supports, it will have to reject more than the validator, and it might well want to accept things that aren't strictly valid as well (like how browsers accept nonsense HTML because they're usually the ones held responsible for not rendering a page "correctly," or like how most C compilers support their own language extensions on top of the C standard). I suppose a producer might want to self-check itself at runtime via some kind of assertions, but I personally wouldn't consider that worth the extra runtime dependencies. I do see uses for a Substrait library in general though, like:
but to me this seems like something that's much lower priority than making a validation tool to help disambiguate the spec itself, so I think it'd be better to just start with a standalone app in a single language. That said, it does make sense to choose a language for the validator that allows bits of it (or all of it) to be easily reused in a library later. So, like you said, not Python.
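For reference, the "utility function that starts a child process and inspects its exit code" mentioned above could be as small as the sketch below. The binary name `substrait-validator` and its command-line interface (reading a serialized plan from stdin) are assumptions.

```rust
// Hypothetical test helper: run a standalone validator on a serialized plan
// and treat a zero exit code as "valid". Binary name and CLI are assumed.

use std::io::Write;
use std::process::{Command, Stdio};

fn plan_is_valid(serialized_plan: &[u8]) -> std::io::Result<bool> {
    let mut child = Command::new("substrait-validator")
        .arg("-") // assumed convention: read the plan from stdin
        .stdin(Stdio::piped())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    child
        .stdin
        .as_mut()
        .expect("stdin was requested above")
        .write_all(serialized_plan)?;
    let status = child.wait()?; // wait() closes stdin before blocking
    Ok(status.success())
}
```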
It's been a while since I did anything with Rust, mostly because I tend to get very frustrated fighting rustc when it inevitably says "no" to my OO/inheritance/C++-esque intuition for solving problems, but I'm willing to give it a shot.
I'm generally in favor of this as well (the more concise, the better), but after some googling this doesn't seem to be as easy as I had hoped it would be. It's possible to add arbitrary annotations to messages, fields, and so on using protoc itself, but only by defining custom options that extend protobuf's internal descriptor messages, which goes through protobuf's global extension registry. There is already a project in that extension registry that deals with basic protobuf validation. However, it's still in an alpha state, so I'm not sure if we should be relying on that already. Using it would also not allow for defining any Substrait-specific constraints. I suppose we could also make our own protobuf file parser with better support for annotations (via special comments or something), or a parser for some DSL that we can generate protobuf files from. But that seems excessive.
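As an illustration of what "rules more structured than code" could mean without proto annotations, the simpler structural constraints could live in a declarative table inside the validator itself. Everything below (the `Rule` enum and the example entries) is a hypothetical sketch; the field names happen to come from the Substrait protos, but the rule set is invented for illustration.

```rust
// Illustrative only: a declarative table of simple structural rules that a
// validator could interpret, as a stand-in for annotations in the .proto files.

enum Rule {
    /// The named field must be present.
    Required { message: &'static str, field: &'static str },
    /// Exactly one of the named fields must be set.
    ExactlyOne { message: &'static str, fields: &'static [&'static str] },
}

// Example entries; the rule set itself is hypothetical.
const RULES: &[Rule] = &[
    Rule::Required { message: "ReadRel", field: "base_schema" },
    Rule::ExactlyOne {
        message: "ReadRel",
        fields: &["virtual_table", "local_files", "named_table", "extension_table"],
    },
];

fn main() {
    println!("{} structural rules registered", RULES.len());
}
```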
I think this would be quite valuable. A consumer probably wouldn't want to restrict itself to accepting only a single version of Substrait, either. There's a maintenance trade-off to make here, though. Either way, I don't think we need to make a decision about this until we figure out versioning in general.
I started something here, but it's not very extensive or structured yet (and it has no information content beyond what's in this thread already).
Okay, that's fair enough and indeed shouldn't be all that complicated. For the first version I'll probably make it a Rust library + application then. C bindings for it are a bit more work, but with an interface that simple it can't be that hard to add later.
I'm also absolutely not a Rust expert, but I doubt any pure-X reimplementation of protobuf would have the same problems. prost was recommended to me by a colleague who's much more into Rust than I am, and indeed its generated code doesn't link against Google's implementation at all. Unfortunately, however, it seems to lack the ability to get information about custom field options at runtime, so I don't know how suitable it is for single-point-of-truth'ing constraints for the validator in the protobuf files. Maybe there's a supported way to hook into its code generator and handle it there, but I haven't found the time to look into that yet (I've been working on getting the initial Substrait consumer merged into Arrow). Hopefully I'll have more time to investigate options tomorrow.
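For context on how prost stays free of the C++ protobuf runtime: code generation happens in a build script, roughly like the sketch below. The proto paths are placeholders for wherever the Substrait files end up in the crate.

```rust
// build.rs sketch: prost-build generates pure-Rust structs for the Substrait
// protos at compile time, so the resulting crate does not link against the
// C++ protobuf runtime. The paths below are placeholders.

fn main() -> Result<(), Box<dyn std::error::Error>> {
    prost_build::compile_protos(
        &["proto/substrait/plan.proto"], // files to generate code for
        &["proto/"],                     // include path for imports
    )?;
    Ok(())
}
```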
Given your comments above about the weak solutions in this space, I think I was already resigned to the fact that a single source of truth may be difficult. If you figure it out, that would be great! Looking forward to the direction you take this.
Having looked into it a bit more, I don't think prost-build can easily be hooked into to change code generation without forking prost. But after giving it a bit more thought, I'm also not sure if it's worth putting lots of effort into this at all. The kind of checks that can easily be described this way in .proto files is pretty limited, so I think most (if not all) changes to the .proto files and format specification would require changing the validator in some way anyway. The opposite direction (i.e. generating the .proto files from Rust or some higher-level DSL) might be more useful at some point, but that's quite a bit more work than I'm interested in doing right now, and I don't know how new-collaborator-friendly that would be either.

I did figure out that even a pure-Rust lib like prost still relies on protoc for parsing: protoc can output a serialized set of its well-known descriptor messages (a FileDescriptorSet), which prost-build then consumes to generate code.

I've started work on the validator (just project setup for now) here: https://github.com/jvanstraten/substrait/tree/validator. Once I have an MVP I'll submit a PR for it.
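As an aside, that descriptor output is also something the validator itself could consume: the `FileDescriptorSet` that `protoc --descriptor_set_out` produces can be decoded with prost-types, which is one conceivable hook for reading message/field metadata (and, eventually, custom options) without writing a protoc plugin. The file path below is a placeholder.

```rust
// Sketch: decode a FileDescriptorSet produced by
//   protoc --descriptor_set_out=substrait_descriptor.bin --include_imports ...
// using prost-types, and walk the message/field metadata it contains.

use prost::Message;
use prost_types::FileDescriptorSet;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = std::fs::read("substrait_descriptor.bin")?; // placeholder path
    let set = FileDescriptorSet::decode(bytes.as_slice())?;
    for file in &set.file {
        for message in &file.message_type {
            println!(
                "{}.{}: {} fields",
                file.package(),
                message.name(),
                message.field.len()
            );
        }
    }
    Ok(())
}
```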
Your thinking sounds good to me. Looking forward to the MVP. Thanks @jvanstraten!
Closing this as it has moved to substrait-io/substrait-validator#1 |
tl;dr: requesting a tool that can give an authoritative answer to the question "does this plan conform to the Substrait spec?"
I've been studying the Substrait specification and protobuf files for a few days now, with the purpose of contributing more extensively to the consumer for Arrow's query engine. While I think I understand the basic ideas for most of it, I'm finding that the specification is not very precise about some of the nitty-gritty details. For example, is the `type_variation_reference` field of the various built-in type messages mandatory? If yes, what's the type variation of, for example, a literal (which doesn't define a reference)? Based on that I figure the answer is "no," but that means that `type_variation_reference == 0` (or possibly some other reserved value?) must mean "no variation," as protobuf just substitutes a default value for primitive types when a field isn't specified. Does that then also mean that anchor 0 is reserved for other such links? And so on. The most annoying part of these types of questions is that I don't know if I just glossed over some part of the spec. I don't like asking pedantic questions when I'm not 100% sure I'm not the problem myself.

The core issue here, in my opinion, is that there is no objective, authoritative way to determine for some given Substrait plan whether A) it is valid and B) what it means. The only way to do that right now is to interpret a plan by hand, which, well... is open to interpretation. Even if a spec is completely precise about everything, it's still very easy for a human to make mistakes here. If however there was a way to do that, I could just feed it some corner cases to resolve the above question, or look through its source code.
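The default-substitution behavior that makes the `type_variation_reference` question ambiguous is easy to demonstrate. In the hedged sketch below, a simplified stand-in for one of the type messages is declared directly with prost; the field numbers and the simplified nullability field are illustrative, not taken verbatim from the spec.

```rust
// Sketch of the ambiguity: under proto3 semantics an unset scalar field and a
// field explicitly set to its default encode identically. The message below is
// a simplified stand-in for one of Substrait's type messages; field numbers
// and the simplified nullability field are illustrative.

use prost::Message;

#[derive(Clone, PartialEq, Message)]
struct I32Type {
    #[prost(uint32, tag = "1")]
    type_variation_reference: u32,
    #[prost(bool, tag = "2")]
    nullable: bool, // simplified; the real field is an enum
}

fn main() {
    let unset = I32Type::default();
    let explicit_zero = I32Type { type_variation_reference: 0, nullable: false };
    // Both encode to the same (empty) byte string, so a consumer cannot tell
    // them apart; 0 therefore has to carry a meaning such as "no variation".
    assert_eq!(unset.encode_to_vec(), explicit_zero.encode_to_vec());
}
```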
What's worse, though, is how easy it is to make a plan that looks sensible and that protobuf gives a big thumbs-up to, but that actually isn't valid at all according to the spec. For example, protobuf won't complain when you make a `ReadRel` with no schema, as for instance the folks over at https://github.com/duckdblabs/duckdb-substrait-demo seem to have done (for now, anyway). Now, that one pretty obviously contradicts the "required" tag in the spec, and they seem to be aware of it. But my point is that relying solely on protobuf's validation makes it very easy to come up with some interpretation of what the Substrait protobuf messages mean that makes sense to you from your frame of reference, to the point where everything seems to work, as DuckDB has already done for the majority of TPC-H... until you try to use Substrait for what it's built for by connecting to some other engine/parser, and find out that nothing actually works together. Or worse (IMO), that it only works for a subset of queries and/or only fails sometimes or for some versions of either tool due to UB, and it's not obvious why or whose fault it is. An issue like that could easily devolve into a finger-pointing contest between projects, especially if a third party finds the problem.

So, my suggestion is to make a Substrait consumer that tells you with authority whether a plan is:
Note that when I say "with authority," I mean that once things stabilize, any disagreement between the tool and the spec for a given version should be resolved by patching/adding errata to the spec. If the tool doesn't provide the definitive answer, the answer again becomes based on interpretation, and the tool loses much of its value.
An initial version of the tool can be as simple as just traversing the message tree, verifying that all mandatory fields are present, and verifying that all the anchor/reference links match up. That can then at least tell you when something is obviously invalid or may be valid. After that it can be made smarter incrementally, with things like type checking and cross-validating YAML extension files.
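As a rough sketch of that first pass, the anchor/reference check could look something like this. The struct definitions are simplified stand-ins for the generated protobuf types; only the field names are taken from the Substrait protos.

```rust
// Sketch of a first-pass link check: collect the function anchors declared in
// the plan's extension section and verify that every function reference used
// in an expression resolves to one of them. Types are simplified stand-ins.

use std::collections::HashSet;

struct ExtensionFunction { function_anchor: u32 }
struct ScalarFunction { function_reference: u32 }

fn check_function_links(
    declared: &[ExtensionFunction],
    used: &[ScalarFunction],
) -> Result<(), String> {
    let anchors: HashSet<u32> = declared.iter().map(|f| f.function_anchor).collect();
    for call in used {
        if !anchors.contains(&call.function_reference) {
            return Err(format!(
                "function_reference {} does not match any declared function_anchor",
                call.function_reference
            ));
        }
    }
    Ok(())
}
```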
I can try to start this effort, but I don't know if I'd want to maintain it long-term, so I don't know if I'm the right person for this. If I am to start it though, I'm on the fence about using Python or C++, so let me know if there's a preference.