
Plan validator/consumer reference implementation #131

Closed
jvanstraten opened this issue Jan 28, 2022 · 8 comments

@jvanstraten
Contributor

tl;dr: requesting a tool that can give an authoritative answer to the question "does this plan conform to the Substrait spec?"

I've been studying the Substrait specification and protobuf files for a few days now, with the purpose of contributing more extensively to the consumer for Arrow's query engine. While I think I understand the basic ideas for most of it, I'm finding that the specification is not very precise about some of the nitty-gritty details. For example, is the type_variation_reference field of the various builtin type messages mandatory? If yes, what's the type variation of, for example, a literal (which doesn't define a reference)? Based on that I figure the answer is "no," but that means that type_variation_reference == 0 (or possibly some other reserved value?) must mean "no variation," as protobuf just substitutes a default value for primitive types when a field isn't specified. Does that then also mean that anchor 0 is reserved for other such links? And so on. The most annoying part of these types of questions is that I don't know if I just glossed over some part of the spec. I don't like asking pedantic questions when I'm not 100% sure I'm not the problem myself.
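
For reference, proto3 scalar fields indeed have no notion of presence, so an absent field decodes to its zero default. A minimal prost sketch of that behavior, using a hypothetical stand-in struct rather than the actual generated Substrait code:

```rust
use prost::Message;

// Hypothetical stand-in for a generated Substrait type message.
#[derive(Clone, PartialEq, Message)]
struct TypeStandIn {
    // proto3 scalar field: when absent on the wire it decodes to 0, so
    // "no variation" and "variation anchor 0" are indistinguishable unless
    // the spec reserves 0.
    #[prost(uint32, tag = "1")]
    type_variation_reference: u32,
}

fn main() {
    // An empty byte string is a valid encoding of a message with no fields set.
    let bytes: &[u8] = &[];
    let decoded = TypeStandIn::decode(bytes).unwrap();
    assert_eq!(decoded.type_variation_reference, 0);
}
```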

The core issue here, in my opinion, is that there is no objective, authoritative way to determine for some given Substrait plan whether A) it is valid and B) what it means. The only way to do that right now is to interpret a plan by hand, which, well... is open to interpretation. Even if a spec is completely precise about everything, it's still very easy for a human to make mistakes here. If however there was a way to do that, I could just feed it some corner cases to resolve above question, or look through its source code.

What's worse though, is how easy it is to make a plan that looks sensible and that protobuf gives a big thumbs-up to, but actually isn't valid at all according to the spec. For example, protobuf won't complain when you make a ReadRel with no schema, as for instance the folks over at https://github.com/duckdblabs/duckdb-substrait-demo seem to have done (for now, anyway). Now, that one pretty obviously contradicts the "required" tag in the spec and they seem to be aware of it, but my point is that relying solely on protobuf's validation makes it very easy to come up with some interpretation of what the Substrait protobuf messages mean that makes sense to you from your frame of reference, to the point where everything seems to work, as DuckDB has already done for the majority of TPC-H... until you try to use Substrait for what it's built for by connecting to some other engine/parser, and find out that nothing actually works together. Or worse (IMO), that it only works for a subset of queries and/or only fails sometimes or for some versions of either tool due to UB, and it's not obvious why or whose fault it is. An issue like that could easily devolve into a finger-pointing contest between projects, especially if a third party finds the problem.

So, my suggestion is to make a Substrait consumer that tells you with authority whether a plan is:

  • valid, ideally with some human-friendly representation of what it does (I have had a few ideas for this, but none of them strike me as particularly good; even if the tool just gives a thumbs up it's already very useful though);
  • invalid, and why; or
  • possibly valid, when validity depends on unknown context (like YAML files that aren't accessible by the validator), or, at least initially, when some check isn't implemented in the validator yet.
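
For concreteness, the result above could be modeled as a three-valued outcome along these lines (a hypothetical sketch, not an existing API):

```rust
/// Hypothetical outcome type for the validator; the names are illustrative.
pub enum Validity {
    /// The plan conforms to the spec as far as the validator can tell,
    /// optionally with a human-friendly description of what it does.
    Valid { description: Option<String> },
    /// The plan definitely violates the spec; carries the reasons why.
    Invalid { diagnostics: Vec<String> },
    /// Validity depends on context the validator can't see (e.g. extension
    /// YAML files it can't access) or on checks that aren't implemented yet.
    PossiblyValid { warnings: Vec<String> },
}
```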

Note that when I say "with authority," I mean that once things stabilize, any disagreement between the tool and the spec for a given version should be resolved by patching/adding errata to the spec. If the tool doesn't provide the definitive answer, the answer again becomes based on interpretation, and the tool loses much of its value.

An initial version of the tool can be as simple as just traversing the message tree, verifying that all mandatory fields are present, and verifying that all the anchor/reference links match up. That can then at least tell you when something is obviously invalid or may be valid. After that it can be made smarter incrementally, with things like type checking and cross-validating YAML extension files.
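
A sketch of what one such anchor/reference check might look like, with hypothetical function and field names rather than the real generated messages:

```rust
use std::collections::HashSet;

/// Hypothetical check: every function reference used in the plan must point
/// at a function anchor that was declared in the extensions section.
fn check_function_references(
    declared_anchors: &HashSet<u32>,
    used_references: &[u32],
) -> Result<(), String> {
    for &reference in used_references {
        if !declared_anchors.contains(&reference) {
            return Err(format!(
                "function reference {} does not match any declared anchor",
                reference
            ));
        }
    }
    Ok(())
}

fn main() {
    let declared: HashSet<u32> = [1, 2, 3].into_iter().collect();
    assert!(check_function_references(&declared, &[1, 3]).is_ok());
    assert!(check_function_references(&declared, &[4]).is_err());
}
```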

I can try to start this effort, but I don't know if I'd want to maintain it long-term, so I don't know if I'm the right person for this. If I am to start it though, I'm on the fence about using Python or C++, so let me know if there's a preference.

@jacques-n
Contributor

Hey @jvanstraten, thanks for bringing this up. Yes, this is something many of us have discussed and agree is very important. I also agree that some early Substrait implementations have been casual about which rules from the spec they do or don't follow. :)

If you have the time and interest to come up with an outline/proposal/spec/thought piece on how this would work and to implement it, that would be great! Want to start a Google doc of thoughts/approach?

WRT the impl language, I think it is important to be able to expose this functionality as a shared library with a C interface that can be used from nearly anywhere. This will allow us to avoid implementing this multiple times. On that point, avoiding interpreted or GC'd languages is important since their runtimes are generally much less optimal for embedding. So among the options you proposed, C++ sounds like the better option. That being said, I'd actually love to see this done in Rust if that's something you'd be interested in doing. Validation like this seems like a classic place to have to fight memory bugs. (Curious if others like @cpcloud have an opinion here.)

The other note I'd make is that it would be great if we could find a way to make at least part of these validity rules more structured than code. For example, maybe we can use annotations in the proto files to define at least some of the simpler rules (validate that this is set, or that one and only one of these is set, etc.). I'd like to keep modifications to the specification, and the corresponding updates to the validation tool, as straightforward as possible.

Lastly, we may want to think about whether the validator could validate multiple versions of a plan. (I'm not convinced on this.) As we develop new rules/requirements, we're likely to have producers and consumers on differing plan versions. Allowing a single validator to support multiple plan versions would be nice.

@jvanstraten
Contributor Author

WRT the impl language, I think it is important to be able to expose this functionality as a shared library with a C interface that can be used from nearly anywhere. This will allow us to avoid implementing this multiple times.

I hadn't even considered implementing this as a library, much less with a language-agnostic interface. It seems much simpler and lower-maintenance to me to just build a standalone application, analogous to what https://validator.w3.org/ is for HTML. I suppose a library form might be easier to use from within test cases, since you'd just call a function instead of starting a child process and inspecting its exit code, but even then I think I'd prefer writing a utility function that does that over the headaches of adding a link-time dev dependency that transitively depends on protobuf, especially when it defines the same messages that you already define in your application (there is, apparently, no good way to privately or statically link to protobuf stuff, because the protobuf library itself maintains a global registry of message types). You'd also only be able to link one version of the validator per test executable that way. That said... if you'd want to provide both options, the base form would be a library.
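
For what it's worth, the utility function I have in mind would be roughly this simple, assuming a hypothetical `substrait-validator` binary that takes a plan file and reports validity through its exit code:

```rust
use std::path::Path;
use std::process::Command;

/// Run the (hypothetical) standalone validator on a plan file from a test.
fn plan_is_valid(plan_file: &Path) -> std::io::Result<bool> {
    let status = Command::new("substrait-validator")
        .arg(plan_file)
        .status()?; // spawn the validator and wait for it to exit
    Ok(status.success()) // exit code 0 == valid, anything else == invalid
}
```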

Also, unless I'm missing a use case, I don't think it ever makes sense for a producer/consumer to link to a validator under normal (non-test) conditions. A consumer would always have its own "validation" as a natural consequence of converting the Substrait messages to its own internal representation, which doesn't necessarily relate to the actual Substrait spec. After all, if it doesn't support literally everything that the spec supports, it will have to reject more than the validator, and it might well want to accept things that aren't strictly valid as well (like how browsers accept nonsense HTML because they're usually the ones held responsible for not rendering a page "correctly," or like how most C compilers support their own language extensions on top of the C standard). I suppose a producer might want to self-check itself at runtime via some kind of assertions, but I personally wouldn't consider that worth the extra runtime dependencies.

I do see uses for a Substrait library in general though, like:

  • version up- and down-conversions for plans (where possible), such that a consumer only needs to implement one version of the spec and can get backward- and forward-compatibility for free;
  • a parser and interpreter for things like type expressions, and tools to aid type annotations in general;
  • tools to implement plan optimizers and other transformations;
  • some way to work around protobuf's versioning/linkage issues (as far as I know, everything explodes if a program depends on two libraries compiled with slightly different versions of protobuf, especially if they link statically, and also especially if two libraries happen to link slightly different versions of Substrait's message types);
  • etc

but to me this seems like something that's much lower priority than making a validation tool to help disambiguate the spec itself, so I think it'd be better to just start with a standalone app in a single language. That said, it does make sense to choose a language for the validator that allows bits of it (or all of it) to be easily reused in a library later. So, like you said, not Python.

That being said, I'd actually love to see this done in Rust if that's something you'd be interested in doing.

It's been a while since I did anything with Rust, mostly because I tend to get very frustrated fighting rustc when it inevitably says "no" to my OO/inheritance/C++-esque intuition for solving problems, but I'm willing to give it a shot.

The other note I'd make is that it would be great if we could find a way to make at least part of these validity rules more structured than code. [...]

I'm generally in favor of this as well (the more concise, the better), but after some googling this doesn't seem to be as easy as I had hoped it'd be.

It's possible to add arbitrary annotations to messages, fields, and so on using protoc itself, but only by extending protobuf's internal *Options messages with additional fields. Like any other protobuf message fields, these are identified only by an integer, and since this extends a message defined by protobuf itself, the field numbers must be globally unique. There's a registry here, but... judging by the projects that made it into that list, it seems intended more for extensions to protobuf itself, and not so much for projects that just happen to use protobuf for something. It would have been much more convenient if they'd just provided something like repeated Any under the hood, with some additional syntax to specify options via (namespaced) message name, but alas, they didn't.

There is already a project in that extension registry that deals with basic protobuf validation. However, it's still in an alpha state, so I'm not sure if we should be relying on that already. Using that would also not allow for defining any Substrait-specific constraints.

I suppose we could also make our own protobuf file parser with better support for annotations (via special comments or something), or a parser for some DSL that we can generate protobuf files from. But that seems excessive.

Lastly, we may want to think about whether the validator could validate multiple versions of a plan. [...]

I think this would be quite valuable. A consumer probably wouldn't want to restrict itself to accepting only a single version of Substrait, either. There's a maintenance trade-off to make here, though. Either way, I don't think we need to make a decision about this until we figure out versioning in general.

If you have the time and interest and interest in coming up with an outline/proposal/spec/thought piece on how this would work and implement it, that would be great! Want to start a google doc of thoughts/approach?

I started something here, but it's not very extensive or structured yet (and it has no information content beyond what's in this thread already).

@jacques-n
Contributor

...much less with a language-agnostic interface

I wasn't imagining something especially complex. I could imagine something as simple as bytes in, success out (similar to the app you're envisioning).

adding a link-time dev dependency that transitively depends on protobuf

The shared library without conflicts may be easier (?) with something like Rust, where I don't think you run into the same set of pain points that you do if you source the protobuf C++ impl. It seems like there are several pure Rust impls that could be used. (I'm not in any way a Rust expert so I may be totally wrong on this front.)

It seems much simpler and lower-maintenance to me to just build a standalone application

I'm okay with an independent application. I just thought it was likely that, especially early on, most consumers would want to invoke this as an input filter before consuming data to simplify the code doing the consuming (I definitely want to). Otherwise, a bunch of this validation/checking has to be replicated in each application as consumption occurs. In those situations, having to invoke it as an external application seems less optimal.

@jvanstraten
Contributor Author

I wasn't imagining something especially complex. I could imagine something as simple as bytes in, success out (similar to the app you're envisioning). [...] I'm okay with an independent application. I just thought it was likely that, especially early on, most consumers would want to invoke this as an input filter before consuming data to simplify the code doing the consuming (I definitely want to). [...]

Okay, that's fair enough and indeed shouldn't be all that complicated. For the first version I'll probably make it a Rust library + application then. C bindings for it are a bit more work, but with an interface that simple it can't be that hard to add later.
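
For the record, the "bytes in, success out" C binding could eventually be as simple as the following sketch; the symbol name and error codes are hypothetical:

```rust
use std::slice;

/// Returns 0 if the serialized plan validates, a nonzero error code otherwise.
/// `data` must point to `len` readable bytes.
#[no_mangle]
pub unsafe extern "C" fn substrait_validator_check(data: *const u8, len: usize) -> i32 {
    if data.is_null() {
        return -1;
    }
    let bytes = slice::from_raw_parts(data, len);
    match validate(bytes) {
        Ok(()) => 0,
        Err(_) => 1,
    }
}

// Placeholder for the actual validation logic.
fn validate(_plan: &[u8]) -> Result<(), String> {
    Ok(())
}
```

Built as a cdylib, something like that should be loadable from basically any language with a C FFI, which I think covers the "input filter" use case.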

The shared library without conflicts may be easier (?) with something like Rust, where I don't think you run into the same set of pain points that you do if you source the protobuf C++ impl. It seems like there are several pure Rust impls that could be used. (I'm not in any way a Rust expert so I may be totally wrong on this front.)

I'm also absolutely not a Rust expert, but indeed I doubt any pure-X reimplementation of protobuf would have the same problems. I was recommended prost by a colleague who's much more into Rust than I am, and the code it generates indeed doesn't link to Google's implementation at all. Unfortunately, however, it seems to lack the ability to get information about custom field options at runtime, so I don't know how suitable it is for single-point-of-truth'ing the validator's constraints in the protobuf files. Maybe there's a supported way to hook into its code generator and handle it there, but I haven't found the time to look into that yet (I've been working on getting the initial Substrait consumer merged into Arrow). Hopefully I'll have more time to investigate options tomorrow.
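
For context, prost's code generation is normally driven from a build.rs along these lines (the proto paths are hypothetical); any hook into the generator would have to live somewhere around this call, e.g. via prost_build::Config:

```rust
// build.rs — minimal sketch of the usual prost-build invocation.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // protoc parses the .proto files; prost-build turns the resulting
    // descriptors into Rust structs at build time.
    prost_build::compile_protos(
        &["proto/substrait/plan.proto"], // hypothetical path
        &["proto/"],
    )?;
    Ok(())
}
```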

@jacques-n
Contributor

...I don't know how suitable it is for single-point-of-truth'ing constraints

Given your comments in the prior post about the weak solutions in this space, I think I was already resigned to the fact that a single source of truth may be difficult. If you figure it out, that would be great!

Looking forward to the direction you take this.

@jvanstraten
Contributor Author

Having looked into it a bit more, I don't think prost-build can easily be hooked into to change code generation without forking prost. But after giving it a bit more thought, I'm also not sure it's worth putting lots of effort into this at all. The kinds of checks that can easily be described this way in .proto files are pretty limited, so I think most (if not all) changes to the .proto files and format specification would require changing the validator in some way anyway. The opposite direction (i.e. generating the .proto files from Rust or some higher-level DSL) might be more useful at some point, but that's quite a bit more work than I'm interested in doing right now, and I don't know how new-collaborator-friendly that would be either.

I did figure out that even a pure-Rust lib like prost still relies on protoc for parsing: protoc can output well-known FileDescriptorSet messages describing the parse result of a set of protobuf files, which means a custom protobuf code generator doesn't also need to reinvent the parser. That's probably good to know in advance if we ever do go this route.
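
To illustrate, consuming that parse result from Rust is just a matter of decoding a well-known message; the descriptor file name and protoc invocation below are examples:

```rust
// Assumes a descriptor set produced with something like:
//   protoc --descriptor_set_out=substrait.desc --include_imports *.proto
use prost::Message;
use prost_types::FileDescriptorSet;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // protoc serializes its parse result as a FileDescriptorSet, so a custom
    // generator only needs to decode it rather than re-parse .proto syntax.
    let bytes = std::fs::read("substrait.desc")?;
    let descriptors = FileDescriptorSet::decode(bytes.as_slice())?;
    for file in &descriptors.file {
        println!(
            "parsed {} with {} message types",
            file.name(),
            file.message_type.len()
        );
    }
    Ok(())
}
```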

I've started work on the validator (just project setup for now) here: https://github.com/jvanstraten/substrait/tree/validator. Once I have an MVP I'll submit a PR for it.

@jacques-n
Contributor

Your thinking sounds good to me. Looking forward to the MVP. Thanks @jvanstraten!

@jacques-n
Contributor

Closing this as it has moved to substrait-io/substrait-validator#1
