[SPARK-31694][SQL] Add SupportsPartitions APIs on DataSourceV2 #28617
Conversation
  */
  void dropPartitions(
      Identifier ident,
      Map<String, String>[] partitions,
There are a few cases here where partitions are referred to as Map<String, String> and a few times where they use the TablePartition class. I think it would probably make more sense if they were all TablePartition (the class) unless there is a significant reason for them not to be.
TablePartition contains the partition metadata; it's too heavy for this. As for Transform, it might be a good choice if it could pass partition values.
Then is this partitionSpec and not "partition"?
I think we need to decide how to pass the data that identifies a partition.
There have been a lot of problems over the years working with Hive partitions because values are coerced to and from String. Often, people get the conversions slightly wrong. I think a better approach is to use a row of values to pass partition data between Spark and a source. We already pass typed rows in the read and write APIs, so it would be reasonable to do so here as well.
One benefit of using a typed row to represent partition data is that we can directly use a listPartitions call for metadata queries.
This would also align more closely with how Spark handles partitions internally. From PartitioningUtils:
/**
 * Holds a directory in a partitioned collection of files as well as the partition values
 * in the form of a Row. Before scanning, the files at `path` need to be enumerated.
 */
case class PartitionPath(values: InternalRow, path: Path)

case class PartitionSpec(
    partitionColumns: StructType,
    partitions: Seq[PartitionPath])
A table that implements SupportsPartitions could return a partitionType(): StructType that describes the partition rows it accepts and produces.
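A minimal sketch of what that row-based shape could look like; the names here (partitionType, listPartitions, and so on) are illustrative stand-ins suggested by this discussion, not a final API:

import java.util.Map;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch of a table-level, row-based partition API.
interface SupportsPartitions extends Table {

  // Describes the partition rows this table accepts and produces.
  StructType partitionType();

  // Creates a partition identified by a typed row, with string metadata.
  void createPartition(InternalRow ident, Map<String, String> properties);

  // Idempotently drops a partition; returns whether one existed.
  boolean dropPartition(InternalRow ident);

  // Lists partition rows, optionally filtered by a partial identifier.
  InternalRow[] listPartitions(InternalRow filter);
}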
One more thing: using a row to pass a tuple would make it possible to also get the partition that an input split (e.g., file) belongs to. That would be useful for storage-partitioned joins.
Good point! The partition identifiers can be written like PartitionSpec.
I was actually thinking that partitions would be identified more like PartitionPath. In this API, I'm not sure if the Path part of PartitionPath is needed, since sources may not need to expose it to Spark. (In Iceberg, for example, there is no partition path.)
I think just using an InternalRow to identify a partition is a good idea.
Yep, I have changed it to InternalRow. The definition in PartitioningUtils is a very good idea, thanks.
Thinking about this further: InternalRow only contains the partition identifier values, not the partition column names. That means users must always provide partition data in schema order. Is that reasonable for users? @rdblue
I think it is reasonable. While this is a public API, the user here is a connector developer, and they are expected to be able to produce InternalRow in other places. I think this is actually a good thing because we don't need to pass the names every time.
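A short sketch of what that looks like for a connector developer, assuming a partition schema of (dt STRING, hr INT); the helper below is hypothetical, not code from this PR:

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical helper: builds a partition identifier whose values follow
// the field order of the partition schema, here (dt STRING, hr INT).
final class PartitionIdents {
  static InternalRow ident(String dt, int hr) {
    // Field 0 is dt, field 1 is hr; the order carries the meaning, not names.
    return new GenericInternalRow(new Object[] {UTF8String.fromString(dt), hr});
  }
}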
import java.util.HashMap;
import java.util.Map;

public class TablePartition {
Partitions are already referred to in other parts of the catalog with the Transform class; do we need this as well?
Thanks for your CR.
Transform does not contain actual partition values or partition metadata. In many cases, we need to know the metadata of a partition.
Please correct me if there is something wrong.
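To illustrate the distinction (with hypothetical values, not code from this PR): a Transform describes how a table is laid out, while a partition instance carries concrete values and metadata, which a Transform cannot express.

import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.Transform;

// Hypothetical contrast between table layout and a partition instance.
final class TransformVsPartition {
  // Table-level layout: partitioned by identity(dt). No values appear here.
  static final Transform LAYOUT = Expressions.identity("dt");

  // Partition-level instance: a concrete value such as dt='2020-05-01',
  // plus metadata like a location; this is what TablePartition carries.
  static final String EXAMPLE_PARTITION = "dt=2020-05-01";
}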
This class and its methods need documentation; it might also help clarify how this differs from Transform.
Thanks for the reply, I will do it.
void createPartitions(
    Identifier ident,
    TablePartition[] partitions,
    Boolean ignoreIfExists);
nit: should this be boolean? Also, please add this to the javadoc.
This can be renamed to createPartition(identifier, properties)
    TablePartition[] partitions);

/**
 * Retrieve the metadata of a table partition, assuming it exists.
What happens if it doesn't exist? Is an exception thrown? I don't know if this documentation style is consistent with how Spark does it, but I would expect something like:

Retrieve the metadata of a table partition.
@throws PartitionNotFoundException ...
Sorry, I will rewrite the documentation. A NoSuchPartitionException should be thrown.
 * Retrieve the metadata of a table partition, assuming it exists.
 *
 * @param ident a table identifier
 * @param partition a list of string map for existing partitions
It isn't exactly clear what the keys and values of the map are here. It also does not appear to be a list, which I find confusing.
Sorry, I will rewrite the documentation. Thanks for paying attention to this PR.
public class TablePartition {
  private Map<String, String> partitionSpec;
  private Map<String, String> parametes;
nit: typo "parameters"?
Ah, thanks
 * Catalog methods for working with Partitions.
 */
@Experimental
public interface SupportsPartitions extends TableCatalog {
What is the reason to extend TableCatalog instead of Table? I think it would be better to support partitions at a table level.

Doing it this way creates more complexity for implementations because they need to handle more cases. For example, if the table doesn't exist, this should throw NoSuchTableException just like loadTable. It would be simpler for the API if these methods were used to manipulate a table, not to load and manipulate a table. Loading should be orthogonal to partition operations.

Another issue is that this assumes a table catalog contains tables that support partitions, or tables that do not. But Spark's built-in catalog supports some sources that don't expose partitions and some that do. This would cause more work for many catalogs, which would need to detect whether a table has support and throw UnsupportedOperationException if it does not. That also makes integration more difficult for Spark because it can't check a table in the analyzer to determine whether it supports the operation or not. Instead, Spark would need to handle exceptions at runtime.
Sounds reasonable to me.
The reason I wanted to define it as a Catalog API is that I think the Catalog API is meant to manage partition metadata, while the Table API is meant for actual data operations.
However, as you said, some sources, such as MySQL or FileTable, will use the partition API to manage partition data. Thus making the partition API part of the Table API is a better way.
Thanks
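A small sketch of the analyzer-side benefit described above, with hypothetical names; planning code can test the capability on the resolved table instead of catching exceptions at runtime:

import org.apache.spark.sql.connector.catalog.Table;

// Hypothetical planning-time capability check for a table-level interface.
final class PartitionSupportCheck {
  static SupportsPartitions requirePartitionSupport(Table table) {
    if (table instanceof SupportsPartitions) {
      // Safe to plan ALTER TABLE ... ADD/DROP PARTITION against this table.
      return (SupportsPartitions) table;
    }
    throw new UnsupportedOperationException(
        "Table " + table.name() + " does not support partition management");
  }
}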
 * @param partitions a list of string map for existing partitions
 * @param ignoreIfNotExists
 */
void dropPartitions(
In other places in the v2 API, we leave ifNotExists to Spark so that we don't need to pass it. Instead of passing an extra parameter, the method returns a boolean to indicate whether the partition was dropped or if no action was taken. Either way, the partition should not exist after the method call, so it is idempotent.

Then, Spark decides whether to throw an exception because the partition did not exist. We prefer that pattern of doing more in Spark to make behavior standard across sources and to make the requirements as simple as we can.
Good point, I will change it.
This can be renamed to boolean dropPartition(identifier)
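A caller-side sketch of that pattern, with hypothetical names (dropPartitionOrFail is not in the PR); the connector call is idempotent, and Spark owns the error decision:

import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical Spark-side helper illustrating the boolean-return pattern.
final class DropPartitionHelper {
  static void dropPartitionOrFail(SupportsPartitions table,
                                  InternalRow partitionIdent,
                                  boolean ifExists) {
    // Idempotent: the partition is gone after this call either way.
    boolean dropped = table.dropPartition(partitionIdent);
    if (!dropped && !ifExists) {
      // In Spark this would be a NoSuchPartitionException; simplified here.
      throw new IllegalArgumentException(
          "Partition does not exist: " + partitionIdent);
    }
  }
}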
 */
String[] listPartitionNames(
    Identifier ident,
    Map<String, String> partition);
It isn't clear what this does. What is a partition name? Are you referring to the "key" in Hive?
It will return the partition identifiers in this table, possibly filtered by a partial partition identifier. Not just the key in Hive.
This can be changed to Row[] listPartitionIdentifiers(identifier). The identifier parameter is used to find the partition identifiers that it is part of.
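A sketch of those partial-match semantics over a hypothetical in-memory list of identifiers; none of these names come from the PR:

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.StructType;

// Hypothetical illustration: a partial identifier such as (dt='2020-05-01')
// matches every full identifier whose leading fields equal it.
final class ListPartitionsSketch {
  static InternalRow[] listPartitionIdentifiers(List<InternalRow> allIdents,
                                                StructType partitionSchema,
                                                InternalRow partialIdent,
                                                int numPartialFields) {
    List<InternalRow> matches = new ArrayList<>();
    for (InternalRow full : allIdents) {
      boolean matched = true;
      for (int i = 0; i < numPartialFields; i++) {
        Object expected = partialIdent.get(i, partitionSchema.fields()[i].dataType());
        Object actual = full.get(i, partitionSchema.fields()[i].dataType());
        if (expected == null ? actual != null : !expected.equals(actual)) {
          matched = false;
          break;
        }
      }
      if (matched) {
        matches.add(full);
      }
    }
    return matches.toArray(new InternalRow[0]);
  }
}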
Thanks for working on this, @stczwd! I think it would be great to get this into 3.1, if possible, to support some of the existing SQL that doesn't work with v2 (like ADD/DROP PARTITION). The main things I think should change are: …
I like that your …
@rdblue Thanks for your attention and advice. It would be great if we can have this in 3.1; a lot of things can be done after it. BTW, the reason for using …
 * These APIs are used to modify table partition or partition metadata,
 * they will change the table data as well.
 * ${@link #createPartitions}:
 *   add an array of partitions and any data that their location contains to the table
nit: "their location contains" -> "they contain", to be consistent with the doc of dropPartitions.
ok
 * @throws NoSuchPartitionsException If any partition identifier to alter doesn't exist
 * @throws UnsupportedOperationException If partition property is not supported
 */
void replacePartitionMetadatas(
which command needs it?
Currently, AlterTableSerDePropertiesCommand and AlterTableSetLocationCommand use this API.
I checked the parser rules; they can't change multiple partitions at once:

| ALTER TABLE multipartIdentifier (partitionSpec)?
    SET SERDE STRING (WITH SERDEPROPERTIES tablePropertyList)?    #setTableSerDe
| ALTER TABLE multipartIdentifier (partitionSpec)?
    SET SERDEPROPERTIES tablePropertyList                         #setTableSerDe

I think we don't need this batch API.
I added this because replacePartitionMetadata also operates on partition data; it should also be atomic if multiple partition operations are supported. If we don't need it currently, it can be deleted.
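For reference, a sketch of the all-or-nothing behavior the atomic batch variants imply, reusing the hypothetical interface from earlier; a real implementation would usually get atomicity from its metastore transaction rather than manual rollback:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical illustration of atomic batch creation with manual rollback.
final class AtomicBatchSketch {
  static void createPartitionsAtomically(SupportsPartitions table,
                                         InternalRow[] idents,
                                         Map<String, String>[] properties) {
    List<InternalRow> created = new ArrayList<>();
    try {
      for (int i = 0; i < idents.length; i++) {
        table.createPartition(idents[i], properties[i]);
        created.add(idents[i]);
      }
    } catch (RuntimeException e) {
      // Roll back partial progress so the batch appears all-or-nothing.
      for (InternalRow ident : created) {
        table.dropPartition(ident);
      }
      throw e;
    }
  }
}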
 * These APIs are used to modify table partition identifier or partition metadata.
 * In some cases, they will change the table data as well.
 * ${@link #createPartition}:
 *   add a partition and any data that its location contains to the table
ditto: "its location contains" -> "it contains"
ok
 * the operation of dropPartitions need to be safely rolled back.
 *
 * @param idents an array of partition identifiers
 * @throws NoSuchPartitionsException If any partition identifier to drop doesn't exist
Shall we be consistent with dropPartition, which doesn't require you to throw an exception for non-existing partitions?
The partitions will be checked before dropPartitions in AlterTableDropPartitionExec, thus NoSuchPartitionsException isn't needed here.
It's OK to return boolean.
/**
 * @return the partition schema of table
 */
StructType partitionSchema();
Shall we mention that this must be consistent with Table.partitioning?
OK, sounds reasonable to me.
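A small sketch of that consistency requirement for a hypothetical table partitioned by day; the class and column names are illustrative:

import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical table: partitionSchema() must line up with partitioning().
final class DailyPartitionedTable /* implements Table, SupportsPartitions */ {
  public Transform[] partitioning() {
    return new Transform[] {Expressions.identity("dt")};
  }

  public StructType partitionSchema() {
    // One field per partition transform source column, in the same order.
    return new StructType().add("dt", DataTypes.StringType);
  }
}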
boolean dropPartition(InternalRow ident);

/**
 * Test whether a partition exists using an {@link Identifier identifier} from the table.
"an {@link Identifier identifier} from the table": this has nothing to do with Identifier. I think you mean "partition identifier"?
Ah, wrong comment, I will change it.
LGTM except several minor comments
thanks, merging to master!
Thank you all!
Thanks for your help and support, @rdblue @cloud-fan @dongjoon-hyun
…asourcev2

What changes were proposed in this pull request?
This patch is trying to add AlterTableAddPartitionExec and AlterTableDropPartitionExec with the new table partition API, defined in #28617.

Does this PR introduce any user-facing change?
Yes. Users can use alter table add partition or alter table drop partition to create/drop partitions in a V2 table.

How was this patch tested?
Run suites and fix old tests.

Closes #29339 from stczwd/SPARK-32512-new.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
There are no partition commands, such as AlterTableAddPartition, supported in DataSourceV2, even though they are widely used with MySQL, Hive, and other data sources. Thus it is necessary to define partition APIs to support these commands.
We defined the partition API as part of the Table API, as it will sometimes change table data as well. A partition is composed of an identifier and properties: the identifier is defined as an InternalRow and the properties are defined as a Map.
Does this PR introduce any user-facing change?
Yes. This PR will enable users to use some partition commands.

How was this patch tested?
Ran all tests and added some partition API tests.
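As a closing illustration of the shape described above (hypothetical helper and property names; only the identifier/properties split comes from the PR description):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.unsafe.types.UTF8String;

// Hypothetical end-to-end use: a partition is an identifier row (typed,
// in partition-schema field order) plus a string-to-string properties map.
final class PartitionApiExample {
  static void addDailyPartition(SupportsPartitions table) {
    InternalRow ident =
        new GenericInternalRow(new Object[] {UTF8String.fromString("2020-05-12")});
    Map<String, String> properties = new HashMap<>();
    properties.put("location", "/warehouse/t/dt=2020-05-12"); // illustrative metadata
    table.createPartition(ident, properties);
  }
}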