[SPARK-27181][SQL]: Add public transform API #24117

Closed
wants to merge 6 commits into apache:master from rdblue:add-public-transform-api

Conversation

rdblue
Contributor

@rdblue rdblue commented Mar 17, 2019

What changes were proposed in this pull request?

This adds a public Expression API that can be used to pass partition transformations to data sources.

How was this patch tested?

Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite.
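
As a rough illustration of the DDL this enables, here is a hedged sketch: testcat, the provider foo, and the table name are made up, and a spark-shell session with a v2 catalog that accepts transforms is assumed. The transform calls mirror those exercised by the new DDL suite cases.

// Hypothetical names throughout; assumes a spark-shell with a v2 catalog configured.
spark.sql("""
  CREATE TABLE testcat.db.events (id BIGINT, b STRING, ts TIMESTAMP)
  USING foo
  PARTITIONED BY (bucket(16, b), days(ts))
""")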

@rdblue rdblue force-pushed the add-public-transform-api branch 2 times, most recently from 80c3966 to 5d192b9 Compare March 25, 2019 22:30
@rdblue rdblue changed the title [SPARK-27181][SQL]: Add public transform API (WIP) [SPARK-27181][SQL]: Add public transform API Mar 25, 2019

;

transform
: qualifiedName #identityTransform
Contributor

It's used only in the CREATE TABLE statement, so do we really need qualifiedName? I think identifier is good enough here.

Contributor Author

I think it is better to use qualifiedName. This may be a logical name in the current use, but later Spark may need to resolve the transform using this name. For example, this could be set to builtin.bucket to tell Spark that it is the built-in bucket transform function. Using that information, Spark would know it can run a bucketed join.

Contributor

How about the transform arguments? Do they need to be qualifiedName as well?

Contributor Author

Yes, arguments need to be qualifiedName because they may reference nested fields.
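
For example (a hedged illustration; the catalog, table, and column names are made up), a transform argument that refers to a nested field needs a multi-part name, which is what qualifiedName allows:

// Hypothetical names; point.ts is a nested field reference used as a transform argument.
spark.sql("""
  CREATE TABLE testcat.db.t (id BIGINT, point STRUCT<ts: TIMESTAMP, x: DOUBLE>)
  USING foo
  PARTITIONED BY (days(point.ts))
""")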

package org.apache.spark.sql.catalyst.logical.expressions;

/**
* Helper methods to create logical transforms to pass into Spark.
Contributor

Who will call these helper methods? Spark or the data source?

Contributor Author

Data sources that need to pass transforms back to Spark through Table.partitioning. Spark uses the internal LogicalExpressions.
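
A minimal sketch of that flow, using locally defined stand-ins because the helper's exact class and method names may differ from what lands in Spark:

// Local stand-ins for the public types; the real interfaces live in Spark.
trait Transform { def name: String; def describe: String }

// Hypothetical factory helpers of the kind a data source would call.
object Transforms {
  def identity(col: String): Transform = new Transform {
    val name = "identity"
    def describe: String = col
  }
  def bucket(numBuckets: Int, col: String): Transform = new Transform {
    val name = "bucket"
    def describe: String = s"bucket($numBuckets, $col)"
  }
}

// A source's Table implementation would return something like this from partitioning().
val partitioning: Array[Transform] =
  Array(Transforms.identity("region"), Transforms.bucket(16, "id"))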

* @param <T> the Java type of a value held by the literal
*/
@Experimental
public interface Literal<T> extends Expression {
Contributor

I'm a little hesitant to add a type parameter to an expression interface. I'm not sure how useful it is. When I deal with expressions, my method parameters and return types are usually Expression. Because of type erasure, I won't get the type parameter of a literal unless the method deals only with literals.

Contributor Author

What is the downside to using this? We have a typed literal in Iceberg and it is useful for maintaining type safety.

Contributor

The downside is that we may need to add a cast to read the value from the literal, e.g.

def func(e: Expression) = e match {
  case lit: Literal[_] => lit.asInstanceOf[Literal[Any]].value
}

Contributor

Actually, it would be good to see some examples. In general, my feeling is that adding a type parameter to a subclass but not to the base class is not going to be very useful.

Contributor Author

The alternative is to cast the value instead, so you have to cast either way. You can't get around casting when the type is discarded. I don't think it is a good idea to throw away type information in all cases just because it isn't useful in some cases.

Here's an example of how it is used in Iceberg in expression evaluation:

    public <T> Boolean lt(BoundReference<T> ref, Literal<T> lit) {
      Comparator<T> cmp = lit.comparator();
      return cmp.compare(ref.get(struct), lit.value()) < 0;
    }

In Iceberg, expression binding guarantees that the literal's type matches the reference's type. With that information, this code knows that the value returned by the reference's get method matches the type of the comparator and of the literal value.
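
The same idea as a self-contained Scala sketch, with local stand-ins for the Java Literal<T> and BoundReference<T> interfaces: because both sides carry the same T after binding, the comparison itself needs no cast.

// Local stand-ins for illustration; not the actual interfaces from this PR.
trait TypedLiteral[T] { def value: T; def comparator: Ordering[T] }
trait BoundRef[T] { def get(row: Map[String, Any]): T }

// No cast needed here: the compiler knows both sides share T.
def lt[T](ref: BoundRef[T], lit: TypedLiteral[T], row: Map[String, Any]): Boolean =
  lit.comparator.lt(ref.get(row), lit.value)

// Example instances for an integer column "a"; the cast happens once, at the untyped row boundary.
val tenLit = new TypedLiteral[Int] { val value = 10; val comparator = Ordering.Int }
val aRef   = new BoundRef[Int] { def get(row: Map[String, Any]): Int = row("a").asInstanceOf[Int] }

lt(aRef, tenLit, Map("a" -> 3))   // true: 3 < 10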

override def toString: String = describe
}

private[sql] final case class ApplyTransform(
Contributor

What are the semantics of this? An arbitrary function?

Contributor Author

Some unknown transform.
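
In other words (a hedged sketch with local stand-in classes), ApplyTransform carries a transform that Spark does not model explicitly: just a name plus arguments that the source is free to interpret. The foo(a, 'bar', 34) case from the new tests would arrive roughly as:

// Local stand-ins for illustration; the real classes are the catalyst ones in this PR.
sealed trait Expr
case class FieldReference(parts: Seq[String]) extends Expr
case class LiteralValue(value: Any) extends Expr
case class ApplyTransform(name: String, args: Seq[Expr]) extends Expr

// Roughly what a source would receive for foo(a, 'bar', 34).
val unknown = ApplyTransform("foo",
  Seq(FieldReference(Seq("a")), LiteralValue("bar"), LiteralValue(34)))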

@rxin
Contributor

rxin commented Mar 29, 2019 via email

@SparkQA

SparkQA commented Mar 30, 2019

Test build #104098 has finished for PR 24117 at commit 0aa3533.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 1, 2019

@cloud-fan, I've updated the uses of lazy val to def. I think that's the last problem. Ready to commit?

@rdblue
Contributor Author

rdblue commented Apr 5, 2019

Retest this please.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104323 has finished for PR 24117 at commit 76a4067.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 5, 2019

Retest this please.

@SparkQA

SparkQA commented Apr 6, 2019

Test build #104334 has finished for PR 24117 at commit 76a4067.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 8, 2019

Retest this please.

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104396 has finished for PR 24117 at commit 76a4067.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104397 has finished for PR 24117 at commit a4a87ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor Author

rdblue commented Apr 8, 2019

@cloud-fan, tests passed. Is this ready to merge?

test("create table - partitioned by transforms") {
val transforms = Seq(
"bucket(16, b)", "years(ts)", "months(ts)", "days(ts)", "hours(ts)", "foo(a, 'bar', 34)",
"bucket(32, b), days(ts)")
Contributor

This reminds me of one thing: shall we resolve the references before passing the transforms to the table catalog? Right now we create transforms on the parser side, so we may pass years(ts) to the table catalog even if ts doesn't exist. Do we expect the table catalog itself to resolve references?

Contributor Author

@rdblue rdblue Apr 9, 2019

Spark will validate that the columns exist in the table schema using an analysis rule. That isn't going into this PR. This PR updates the parser and adds tests for that. We will add analysis rules later, when this PR makes it in.

The reason why we can't add them here is that we don't want to write validations against the parsed SQL plans. We want to write them against the v2 create table commands, which won't be added until after the table catalog makes it in (#24246).
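
To make the deferred step concrete, here is a hedged sketch with stand-in types (not the actual rule): such a rule would walk each transform's references and reject any column that the table schema does not contain.

// Stand-in shapes; a real rule would work on catalyst plans and Spark's schema types.
case class FieldReference(parts: Seq[String])
case class NamedTransform(name: String, refs: Seq[FieldReference])

// Fails if a transform references a top-level column missing from the schema.
def validateTransforms(transforms: Seq[NamedTransform], topLevelColumns: Set[String]): Unit =
  for (t <- transforms; ref <- t.refs if !topLevelColumns.contains(ref.parts.head)) {
    throw new IllegalArgumentException(
      s"Transform ${t.name} references a missing column: ${ref.parts.mkString(".")}")
  }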

Contributor

SGTM

@cloud-fan
Contributor

LGTM except one comment

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 58674d5 Apr 10, 2019
@rdblue
Contributor Author

rdblue commented Apr 10, 2019

Thank you @cloud-fan!

@rxin
Contributor

rxin commented Apr 15, 2019 via email

@rdblue
Contributor Author

rdblue commented Apr 16, 2019

mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019

Closes apache#24117 from rdblue/add-public-transform-api.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member

HyukjinKwon commented Dec 19, 2019

@rdblue, sorry that I came here late and am leaving a comment like this, but do you mind if I ask what this PR targets (or for any discussion link I can refer to)? I happened to come here while tracking the history and I am lost.

So, does this PR target supporting transform(col) in the partition clause, for instance:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY transform(col)

For a transform, should the DSv2 source implement its logic? Like, are you planning to implement YearsTransform, MonthsTransform, and DaysTransform on the Spark side,
and then expose ApplyTransform to the data source implementation?

It looks like it's going to be super confusing.

  1. Users would expect to be able to use the same Spark expressions, but the behavior actually differs per data source implementation. For instance, I would expect hours(col + col), hours(current_timestamp()) or trunc(...) to work.

  2. If we have to define YearsTransform, MonthsTransform, DaysTransform, ... then it's going to be a copy of our expressions in Spark.

It seems like we should instead try to push the expression itself (or the subset of expressions that the source can handle) and let the data source implementation interpret it.

WDYT? Please let me know if I am getting something completely wrong somewhere.

@rdblue
Contributor Author

rdblue commented Dec 19, 2019

@HyukjinKwon, these transforms are passed to the data source as table configuration and it is up to the sources to implement them.

These are not expressions like other Spark SQL syntax. They are named transformations that accept a list of literals and columns.
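
A hedged sketch of what that means on the source side (stand-in types; the names are illustrative): each transform arrives as a name plus arguments that are only literals or column references, and the source decides what to do with it.

// Stand-in shapes for illustration only.
sealed trait Arg
case class ColumnRef(name: String) extends Arg
case class LitArg(value: Any) extends Arg
case class NamedTransform(name: String, args: Seq[Arg])

// PARTITIONED BY (bucket(16, b)) would reach the source roughly as:
val bucketByB = NamedTransform("bucket", Seq(LitArg(16), ColumnRef("b")))

// The source, not Spark, decides what the layout means.
def describe(t: NamedTransform): String = t match {
  case NamedTransform("bucket", Seq(LitArg(n: Int), ColumnRef(col))) =>
    s"hash column $col into $n buckets"
  case NamedTransform(other, _) =>
    s"transform $other is interpreted by the source"
}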

@HyukjinKwon
Member

HyukjinKwon commented Dec 20, 2019

Then, I think at the very least the way of calling it should be different from calling Spark functions.
The form ... PARTITIONED BY transform(col) looks exactly like a Spark expression.

This is the root cause of the confusion: users cannot distinguish transforms from Spark expressions, yet the two work differently.

@rdblue
Contributor Author

rdblue commented Dec 20, 2019

These are only allowed within a PARTITIONED BY clause, where previously no expressions were allowed (only identifiers).

@HyukjinKwon
Member

The problem is that, while this previously wasn't allowed, it now looks like it will be allowed (while it's actually not). I was confused about this myself, so I had to track the history.

@HyukjinKwon
Member

Can we have at least a slightly different syntax for this? This:

CREATE TABLE table(col INT) USING parquet PARTITIONED BY transform(col)

looks like it will allow arbitrary expressions in the PARTITIONED BY clause.

@HyukjinKwon
Member

@rdblue, shall we bring this topic up at the next DSv2 meeting if you don't currently have an idea of how to handle it and/or don't think it matters?

@rdblue
Contributor Author

rdblue commented Dec 23, 2019

@HyukjinKwon, we discussed this a while ago and I don't see much reason to reopen it. You're welcome to bring it up at the next sync, but I don't consider this a problem.

These look like expressions because they are limited expressions for transforming data to produce partition values. Expressions should look like expressions. If you want to improve some of the cases where more complex expressions aren't supported, then let's do that.

We also don't yet know how much expression syntax we will pass to sources. This is why we started a public expression API: so we can pass expressions to sources where they are needed. I expect that API to expand as we solve more complicated use cases.

@HyukjinKwon
Member

we discussed this a while ago and I don't see much reason to reopen it

Can you point me to a link or a summary on the mailing list? Maybe I have missed some discussions, so I wanted to read and follow them. This was my original intention, actually.

These look like expressions because they are limited expressions for transforming data to produce partition values. Expressions should look like expressions. If you want to improve some of the cases where more complex expressions aren't supported, then let's do that.

We also don't yet know how much expression syntax we will pass to sources. This is why we started a public expression API: so we can pass expressions to sources where they are needed. I expect that API to expand as we solve more complicated use cases.

If this is going to provide expression-like support, it should at least work like an expression (even though we don't do partial pushdown like the DSv1 filter APIs).

From a cursory look, the current transform API just looks like a half-baked one: it looks like we will have to keep a copy of the Spark expressions, and we are unable to extend it to support other expression-like cases such as transform(col + col).

I do see a problem here. My impression was that the point of DSv2 is to avoid half-baked or just-works-for-now APIs. How do we plan to extend this support? The current status looks definitely half-baked, and some changes against this API are already being proposed, for example #26929, which was why I had to follow the history.

* Base class of the public logical expression API.
*/
@Experimental
public interface Expression {
Member

I am confused about why we should expose a custom API that is currently only used by the transform API.
Maybe there was a discussion about this plan. Do we plan to switch the current Spark expressions to this expression API entirely in the future, and how is it different from using a UDF?

Contributor

It's better to give the data source the semantics instead of a concrete UDF. Data sources can implement the partitioning semantics efficiently if they don't need to call a Java UDF here and there.

@cloud-fan
Contributor

The partitioning expressions need to be public because they are used in DS v2; that's why we create a public Expression interface. It's kind of a copy of the internal catalyst expressions, but for now there are only a few public expressions, and we plan to add more in the future. Adding new public expressions is backward compatible.

But I do agree with the concern from @HyukjinKwon about how we are going to extend this in the future. For now the parser is pretty strict about the partitioning expression: it can only be a column name or a function call with column names. I think that's good enough; it looks weird to me to support "partitioned by a + b". However, I'm a little worried about ApplyTransform, which just passes arbitrary function names specified by end users to the data source, without a well-defined semantic. Imagine we add a new transform called Second, whose function name is "second". Then in the new version data sources would get Second, while in the old version they got ApplyTransform. This is not backward compatible.

@rdblue what do you think?
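
To make the compatibility concern concrete, here is a purely hypothetical sketch (Second/SecondTransform does not exist in this PR): a source written against a version where second(ts) arrives as ApplyTransform would stop matching once a dedicated class is introduced, unless it also matches by name.

// Hypothetical stand-ins to illustrate the versioning concern.
case class FieldReference(parts: Seq[String])
sealed trait Transform
case class ApplyTransform(name: String, args: Seq[Any]) extends Transform
case class SecondTransform(ref: FieldReference) extends Transform  // imagined future addition

// A source that only expects ApplyTransform("second", ...) would miss SecondTransform,
// so it has to handle both shapes to stay compatible across Spark versions.
def secondsColumn(t: Transform): Option[String] = t match {
  case ApplyTransform("second", Seq(FieldReference(parts))) => Some(parts.mkString("."))
  case SecondTransform(ref)                                 => Some(ref.parts.mkString("."))
  case _                                                    => None
}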

@HyukjinKwon
Member

@rdblue, do you plan to hold a DSv2 meeting this month? The code freeze is soon. I would like to take a step back here and revisit and rethink this. Let me send an email to dev to collect more feedback.
