Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: introduce compound (parameterizable) extension types and variations #196

Merged
merged 2 commits into from
Jul 4, 2022

Conversation

jvanstraten
Copy link
Contributor

#190 prompted me to start thinking about what would be needed to support parameterizable user-defined types. Since the use case in #190 kind of suits a parameterizable type variation of timezone_tz more so than a new type, I figured I'd apply the same logic to both extension types and type variations at the same time.

This commit is really just me spitballing ideas and is intended as a request for comments of sorts rather than a PR; I figured it'd probably be more succinct to do this via code than a long-winded unstructured issue. This certainly couldn't be merged as such even if we'd wanted to because substrait.extensions now depends on substrait.Type, which protobuf doesn't support (sigh), and because I'm sure the commit message is not compliant.

Anyway, please shoot holes!

Copy link
Contributor

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks superb.

uniqueItems: true
items:
type: string
optional: # when set to true, the parameter may be omitted at the end or skipped using null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do min, max, options and optional need to exist in the proto implementation as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean the files from #193, yeah, probably, I didn't touch those (yet, pending the outcome of that issue, and I've also been working on more accurately specifying type expressions in the background, for which they'd probably change as well). Other than that, the proto messages in substrait.Plan only bind values to the parameters, so, I don't think so, no.

@cpcloud
Copy link
Contributor

cpcloud commented May 12, 2022

On a practical note, you'd have to combine the recursively-referring/referred-to files into a single file to get this to work. We did this for expression.proto and relation.proto which became algebra.proto, to support subqueries.

Yes, it's really annoying that proto doesn't support this.

@jvanstraten
Copy link
Contributor Author

I don't think these additions would create a new cycle between files, thankfully; it'd be plan -> extensions -> type and plan -> algebra -> type.

@@ -67,6 +76,23 @@ message SimpleExtensionDeclaration {
// more than one impl per name in the YAML.
string name = 3;
}

message TypeParameter {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be part of the type variable as opposed to being normalized in the extensiontypevariation.

It means we should probably deprecate this line:
https://github.com/substrait-io/substrait/blob/main/proto/substrait/type.proto#L39

uint32 user_defined_type_reference = 31;

And instead introduce something more like

UserDefinedType user_defined = 32;

message UserDefinedType {
  uint32 user_defined_type_reference = 1;

  // Parameterization, if this is a compound type variation.
  repeated TypeParameter parameters = 2; 

  message TypeParameter {
    oneof parameter {
      // Explicitly null/unspecified parameter, to select the default value (if
      // any).
      google.protobuf.Empty null = 1;

      // Data type parameters, like the i32 in LIST<i32>.
      Type data_type = 2;

      // Value parameters, like the 10 in VARCHAR<10>.
      bool boolean = 3;
      int64 integer = 4;
      string enum = 5;
      string string = 6;
    }
  }
}

This would make user defined definitions consistent with all other compound types, defining them within the operations rather than having to constantly normalize them. E.g. creating list<i8> would be basically the same as creating mylist<i8>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I did it this way for consistency with function overloads (i.e. you'd need separate anchors for each overload), to make it a bit easier to compare types, and to save some repetition in the plan, but yours is probably the stronger argument.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any status update on the comments here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just spitballing with this and haven't looked at this for a while to be honest, but I can pick it back up next week.

@jvanstraten jvanstraten force-pushed the compound-extension-types branch from 2f85dd1 to 9f23575 Compare May 12, 2022 18:29
@jvanstraten jvanstraten force-pushed the compound-extension-types branch 2 times, most recently from b7a4f52 to a284d34 Compare June 21, 2022 17:54
@jvanstraten jvanstraten changed the title Compound extension types and variations feat: introduce compound (parameterizable) extension types and variations Jun 21, 2022
@jvanstraten jvanstraten force-pushed the compound-extension-types branch from a284d34 to df1df3c Compare June 21, 2022 17:55
@jvanstraten
Copy link
Contributor Author

Okay, rebased, updated according to @jacques-n's comment, and got fixed things up to get CI to pass.

I tried to update function.proto, type_expressions.proto, and parameterized_types.proto but gave up due to lack of documentation and inconsistencies with what little documentation there is. I really, really want to overhaul everything related to type expressions (not to change the intended behavior, but to actually implement the intended behavior with correct protos, a correct grammar, correct logic for solving for the return types, round-tripping conversion logic between YAML and those protos, and hopefully wrapping it all up in a static library that other projects can depend on) because it's been blocking me in one way or another for upwards of three months now. This is probably also why I originally left this in draft form in the first place. In my mind #221 is a transitive dependency for even beginning to think about that, but apparently I caused a deadlock on that one... d'oh.

@jvanstraten jvanstraten marked this pull request as ready for review June 21, 2022 18:35
@jacques-n
Copy link
Contributor

When I first reviewed this I didn't realize that you were also proposing compound type variations. That feels like an entirely different discussion that I can't say I see a really pressing need for. Let's definitely remove it from the original intention here of supporting compound extension types.

As a note, (per my comments elsewhere), I specifically avoided using "parameterized" to reference compound types since I reserved that use for a different use case. Let's hold off on starting to use that language here until we have a good dictionary elsewhere.

@jvanstraten
Copy link
Contributor Author

As a note, (per my comments elsewhere), I specifically avoided using "parameterized" to reference compound types since I reserved that use for a different use case. Let's hold off on starting to use that language here until we have a good dictionary elsewhere.

Uh... what use case would that be if not to specify "something with parameters"? I'm confused. I'm also not sure how to rewrite the docs if that's a somehow forbidden word.

When I first reviewed this I didn't realize that you were also proposing compound type variations. That feels like an entirely different discussion that I can't say I see a really pressing need for. Let's definitely remove it from the original intention here of supporting compound extension types.

I figured as much, since it didn't get any comments; again, this whole thing was just supposed to be me writing down whatever came to mind. That said, I do have a use case in the back of my mind for them, namely to represent timezone-aware timestamps stored with a particular timezone offset that isn't necessarily UTC, which Arrow uses IIRC. It also just seemed like low-hanging fruit to support both, since the logic for compound type classes can be copied one-to-one to variations, though in general I've found that I interpreted variations to be much more general of a concept than they were apparently intended to be. Anyway, I don't feel that strongly about this so I'll remove it once I figure out how to reformulate the docs...

Removed/reverted:
 - parameterizable type variations
 - changes to how types are referred to in the YAML
 - the word "parameterizable" in this context

Changed:
 - rename "type" metatype to "dataType" to disambiguate a bit better

Fixed:
 - doc examples using "enum" instead of "enumeration"
@jvanstraten jvanstraten force-pushed the compound-extension-types branch from 27e18f7 to a8ec3bb Compare June 27, 2022 08:46
Copy link
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jvanstraten . Looks good.

@jacques-n jacques-n merged commit a79eb07 into substrait-io:main Jul 4, 2022
@jvanstraten jvanstraten deleted the compound-extension-types branch July 7, 2022 10:20
curino pushed a commit to curino/substrait that referenced this pull request Jul 20, 2022
…ions (substrait-io#196)

* feat: introduce compound (parameterizable) extension types and variations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants