[SPARK-25528][SQL] data source v2 API refactor (batch read) #23086
Conversation
Force-pushed from f06b5c5 to 77a2c08.
*/
// TODO: micro-batch should be handled by `DataSourceV2ScanExec`, after we finish the API refactor
// completely.
case class DataSourceV2StreamingScanExec(
I have to use two physical nodes, since batch and streaming have different APIs now.
Force-pushed from 77a2c08 to 207b0b9.
Test build #99003 has finished for PR 23086.
Test build #99004 has finished for PR 23086.
Test build #99005 has finished for PR 23086.
* topic name, etc. It's an immutable case-insensitive string-to-string map.
* @param schema the user-specified schema.
*/
default Table getTable(DataSourceOptions options, StructType schema) {
I know that this is from prior DataSourceV2 semantics, but what's the difference between providing the schema here and the column pruning aspect of `ScanBuilder`?
Basically I'm just saying we should push down this requested schema into the `ScanBuilder`.
It's a different thing. Imagine you are reading a Parquet file: you know exactly what its physical schema is, and you don't want Spark to waste a job on inferring the schema, so you specify the schema when reading.
Next, Spark analyzes the query and figures out what the required schema is. This step is automatic and driven by Spark.
I agree with @cloud-fan. These are slightly different uses.
Here, it is supplying a schema for how to interpret data files. Say you have CSV files with columns `id`, `ts`, and `data`, and no headers. This tells the CSV reader what the columns are and how to convert the data to useful types (bigint, timestamp, and string). Column projection will later request those columns, maybe just `id` and `data`. If you only passed the projection schema, then the `ts` values would be returned for the `data` column.
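To make the distinction concrete, here is a minimal sketch; the file path and column names are hypothetical. The user-specified schema describes the physical files, while the projection is derived later by Spark from the query:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("schema-vs-projection").getOrCreate()

// User-specified schema: tells the reader how to interpret the headerless CSV
// files, so Spark does not run a job to infer the schema.
val fileSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("ts", TimestampType),
  StructField("data", StringType)))

val df = spark.read.schema(fileSchema).csv("/path/to/headerless/csv")

// Projection: Spark prunes to the required columns during analysis/optimization.
df.select("id", "data").show()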
* records from the partitions.
*/
@InterfaceStability.Evolving
public interface Batch {
`BatchScan`, perhaps?
I don't have a strong preference. I feel it's a little clearer to distinguish between scan and batch.
Force-pushed from 207b0b9 to 83818fa.
Test build #99038 has finished for PR 23086.
Force-pushed from 83818fa to 4407d51.
Test build #99042 has finished for PR 23086.
Force-pushed from 4407d51 to 188be4f.
Test build #99088 has finished for PR 23086.
No serious problems I see, but I've mostly looked to ensure the streaming components will still work.
project/MimaExcludes.scala
@@ -149,7 +149,8 @@ object MimaExcludes {
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.reader.streaming.MicroBatchReader"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.writer.DataSourceWriter"),
ProblemFilters.exclude[ReversedMissingMethodProblem]("org.apache.spark.sql.sources.v2.writer.DataWriterFactory.createWriter"),
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter")
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter"),
This list of exclusions is getting kinda silly. Is there some way to just completely exclude this package from compatibility checks until we've stabilized it?
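For example, MiMa's problem filters accept wildcard patterns, so a single package-wide rule along these lines could work (a sketch only, not necessarily the rule to adopt):

import com.typesafe.tools.mima.core._

// Excludes every compatibility problem reported under the evolving DSv2 package.
ProblemFilters.exclude[Problem]("org.apache.spark.sql.sources.v2.*")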
Test build #99135 has finished for PR 23086.
* limitations under the License.
*/

package org.apache.spark.sql.sources.v2;
#21306 (TableCatalog support) adds this class as `org.apache.spark.sql.catalog.v2.Table` in the `spark-catalyst` module. I think it needs to be in the catalyst module and should probably be in the `o.a.s.sql.catalog.v2` package as well.
The important one is moving this to the catalyst module. The analyzer is in catalyst and all of the v2 logical plans and analysis rules will be in catalyst as well, because we are standardizing behavior. The standard validation rules should be in catalyst, not in a source-specific or hive-specific package in the sql-core or hive modules.
Because the logical plans and validation rules are in the catalyst package, the `TableCatalog` API needs to be there as well. For example, when a catalog table identifier is resolved for a read query, one of the results is a `TableCatalog` instance for the catalog portion of the identifier. That catalog is used to load the v2 table, which is then wrapped in a v2 relation for further analysis. Similarly, the write path should also validate that the catalog exists during analysis by loading it, and would then pass the catalog in a v2 logical plan for `CreateTable` or `CreateTableAsSelect`.
I also think that it makes sense to use the `org.apache.spark.sql.catalog.v2` package for `Table` because `Table` is more closely tied to the `TableCatalog` API than to the data source API. The link to DSv2 is that `Table` carries `newScanBuilder`, but the rest of the methods exposed by `Table` are for catalog functions, like inspecting a table's partitioning or table properties.
Moving this class would make adding `TableCatalog` less intrusive.
Moving this to the Catalyst package would set a precedent for user-overridable behavior to live in the catalyst project. I'm not aware of anything in the Catalyst package being considered as public API right now. Are we allowed to start such a convention at this juncture?
Everything in catalyst is considered private (even though it has public visibility, for debugging), and it's best to keep it that way.
Why does this `Table` API need to be in catalyst? It's not even a plan. We can define a table LogicalPlan interface in catalyst, and implement it in the SQL module with this `Table` API.
I can understand wanting to keep everything in Catalyst private. That's fine with me, but I think that Catalyst does need to be able to interact with tables and catalogs that are supplied by users.
For example: Our tables support schema evolution. Specifically, reading files that were written before a column was added. When we add a column, Spark shouldn't start failing in analysis for an AppendData operation in a scheduled job (as it would today). We need to be able to signal to the validation rule that the table supports reading files that are missing columns, so that Spark can do the right validation and allow writes that used to work to continue.
How would that information -- support for reading missing columns -- be communicated to the analyzer?
Also, what about my example above: how will the analyzer load tables using a user-supplied catalog if catalyst can't use any user-supplied implementations?
We could move all of the v2 analysis rules, like ResolveRelations, into the core module, but it seems to me that this requirement is no longer providing value if we have to do that. I think that catalyst is the right place for common plans and analysis rules to live because it is the library of common SQL components.
Wherever the rules and plans end up, they will need access to the `TableCatalog` API.
It's unclear to me what would be the best choice:
- move data source API to catalyst module
- move data source related rules to SQL core module
- define private catalog related APIs in catalyst module and implement them in SQL core
Can we delay this discussion until we have a PR that adds catalog support, after the refactor?
> Can we delay this discussion until we have a PR that adds catalog support, after the refactor?
Yes, that works.
But can we move `Table` to the `org.apache.spark.sql.catalog.v2` package where `TableCatalog` is defined in the other PR? I think `Table` should be defined with the catalog API, and moving it later would require import changes to any file that references `Table`.
For other reviewers: in the DSv2 community sync, we decided to move data source v2 into a new module, `sql-api`, and make catalyst depend on it. This will be done in a follow-up.
I just went to make this change, but it requires moving any SQL class from catalyst referenced by the API into the API module as well... Let's discuss the options more on the dev list thread.
* The builder can take some query-specific information to do operator pushdown, and keep this
* information in the created {@link Scan}.
*/
ScanBuilder newScanBuilder(DataSourceOptions options);
`DataSourceOptions` isn't simply a map for two main reasons that I can tell: first, it forces options to be case-insensitive, and second, it exposes helper methods to identify tables, like `tableName`, `databaseName`, and `paths`. In the new abstraction, the second use of `DataSourceOptions` is no longer needed. The table is already instantiated by the time this is called.
We should reconsider `DataSourceOptions`. The `tableName` methods aren't needed, and we also no longer need to forward properties from the session config because the way tables are configured has changed (catalogs handle that). I think we should remove this class and instead use the more direct implementation, `CaseInsensitiveStringMap` from #21306. The behavior of that class is obvious from its name, and it would be shared between the v2 APIs, both catalog and data source.
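For illustration, a minimal sketch of the case-insensitive behavior in question (this is not the class from #21306, just the idea):

import java.util.Locale

// Keys are normalized to lower case on both insertion and lookup.
class CaseInsensitiveStringMap(original: Map[String, String]) {
  private val delegate: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  def get(key: String): Option[String] =
    delegate.get(key.toLowerCase(Locale.ROOT))
}

// Usage: lookups ignore the caller's casing.
// new CaseInsensitiveStringMap(Map("Path" -> "/tmp/data")).get("path")  // Some("/tmp/data")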
Makes sense to me. `DataSourceOptions` was carrying along identifiers that really belong to a table identifier and that should be interpreted at the catalog level, not the data read level. In other words, the implementation of this `Table` should already know what locations to look up (e.g., "the files comprising dataset D"); now it's a matter of how (e.g., pushdown, filter predicates).
I agree. Since `CaseInsensitiveStringMap` is not in the code base yet, shall we do it in the follow-up?
Either in a follow-up or you can add the class in this PR. Either way works for me.
/**
* Return a {@link Table} instance to do read/write with user-specified schema and options.
*
* By default this method throws {@link UnsupportedOperationException}, implementations should
Javadoc would normally also add `@throws` with this information. I agree it should be here as well.
What I learned is that we should only declare checked exceptions. See http://www.javapractices.com/topic/TopicAction.do?Id=171
Strange, that page links to one with the opposite advice: http://www.javapractices.com/topic/TopicAction.do?Id=44
I think that `@throws` is a good idea whenever you want to document an exception type as part of the method contract. Since it is expected that this method isn't always implemented and may throw this exception, I think you were right to document it. And documenting exceptions is best done with `@throws` to highlight them in Javadoc.
The page you linked to makes the argument that unchecked exceptions aren't part of the method contract and cannot be relied on. But documenting this shows that it is part of the contract or expected behavior, so I think docs are appropriate.
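For instance, a doc-comment sketch of declaring an expected but unchecked exception with @throws (the trait and method names are purely illustrative):

trait ExampleTable {
  /**
   * Returns a human-readable name for this table.
   *
   * @throws UnsupportedOperationException if the implementation does not expose a name
   */
  def name(): String =
    throw new UnsupportedOperationException("This table does not expose a name.")
}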
added the throw clause.
/**
* Return a {@link Table} instance to do read/write with user-specified schema and options.
*
Minor: Javadoc doesn't automatically parse empty lines as new paragraphs. If you want to have one in documentation, then use `<p>`.
Thanks for the hint about new paragraphs!
*
* Note that, this may not be a full scan if the data source supports optimization like filter
* push-down. Implementations should check the status of {@link Scan} that creates this batch,
* and adjust the resulting {@link InputPartition input partitions}.
I think this is a little unclear. Implementations do not necessarily check the scan. This Batch is likely configured with a filter and is responsible for creating splits for that filter.
* {@link Table} that creates this scan implements {@link SupportsBatchRead}.
*/
default Batch toBatch() {
throw new UnsupportedOperationException("Do not support batch scan.");
Nit: text should be "Batch scans are not supported". Starting with "Do not" makes the sentence a command.
@@ -38,7 +38,7 @@ import org.apache.spark.sql.execution.datasources.jdbc._
import org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Utils
import org.apache.spark.sql.sources.v2.{BatchReadSupportProvider, DataSourceOptions, DataSourceV2}
import org.apache.spark.sql.sources.v2._
Nit: using wildcard imports makes it harder to review without an IDE because it is more difficult to find out where symbols come from.
I do think this one is too nitpicky. If the import list gets long, it should be a wildcard. Use an IDE for large reviews like this if needed.
It's the IDE that turns it into a wildcard, because the import list gets too long.
I am using an IDE for this review, but this makes future reviews harder. I realize it isn't a major issue, but I think it is a best practice to not use wildcard imports.
@@ -40,8 +40,8 @@ import org.apache.spark.sql.types.StructType
* @param userSpecifiedSchema The user-specified schema for this scan.
*/
case class DataSourceV2Relation(
source: DataSourceV2,
readSupport: BatchReadSupport,
source: TableProvider,
May want to note that `TableProvider` will be removed when the write side is finished, since it is only used for `createWriteSupport`, which will be exposed through `Table`.
done
* </ul>
*/
@Evolving
public interface Table {
It would be helpful for a `Table` to also expose a name or identifier of some kind. The `TableIdentifier` passed into `DataSourceV2Relation` is only used in `name` to identify the relation's table. If the name (or location, for path-based tables) were supplied by the table instead, it would remove the need to pass it in the relation.
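A minimal sketch of that suggestion (illustrative only, not the actual interface change):

// If the table exposes its own identifying information, the relation no longer
// needs a separate TableIdentifier argument.
trait NamedTable {
  /** A table identifier such as "db.table", or a path for path-based tables. */
  def name(): String
}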
provider, table, output, options, ident, userSpecifiedSchema)
}

def createRelationForWrite(
Also note that this is temporary until the write side is finished?
done
case other: DataSourceV2ScanExec =>
output == other.output && readSupport.getClass == other.readSupport.getClass &&
case other: DataSourceV2StreamingScanExec =>
output == other.output && source.getClass == other.source.getClass &&
Should this implement identity instead of equality? When would two ScanExec nodes be equal instead of identical?
Also, I don't think that this equals implementation is correct. First, it should not check for the streaming class. Second, it should check whether the scan is equal, not whether the options and the source are the same (plus, source will be removed).
Unfortunately, implementing true equality (not just identity) must in some way rely on a user-supplied class. A scan is the same if it will produce the same set of rows and columns in those rows. That means equality depends on the filter, projection, and source data (i.e. table). We can use `pushedFilters` and `output` for the filter and projection. But checking that the source data is the same requires using either the scan's `equals` method (which would also satisfy the filter and projection checks) or checking that the partitions are the same. Both `Scan` and `InputPartition` implementations are provided by sources, so their `equals` methods may not be implemented.
Because this must depend on checking equality of user-supplied objects, I think it would be much easier to make this depend only on equality of the `Scan`:
override def equals(other: Any): Boolean = other match {
case scanExec: DataSourceV2ScanExec => scanExec.scan == this.scan
case _ => false
}
That may fall back to identity if the user hasn't supplied an equals method, but I don't see a way to avoid it.
/**
* Physical plan node for scanning data from a data source.
* Physical plan node for scanning a batch of data from a data source.
*/
case class DataSourceV2ScanExec(
output: Seq[AttributeReference],
@transient source: DataSourceV2,
I think we can remove source by updating `equals` and `hashCode` to check just the `Scan`.
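A hedged companion to the equals sketch above, with the surrounding class elided:

// Pairs with the scan-based equals shown earlier; both ignore `source` and `options`.
override def hashCode(): Int = scan.hashCode()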
/**
* Physical plan node for scanning data from a data source.
* Physical plan node for scanning a batch of data from a data source.
*/
case class DataSourceV2ScanExec(
output: Seq[AttributeReference],
@transient source: DataSourceV2,
@transient options: Map[String, String],
Similarly, options were used to create the `Scan`, so they don't need to be passed here if they are not used in `equals` and `hashCode`.
`source` and `options` are also used to define the string format of this plan, as it extends `DataSourceV2StringFormat`.
Maybe we don't need a pretty string format for the physical scan node?
With a catalog, there is no expectation that a `source` will be passed. This could be a string that identifies either the source or the catalog, for a good string representation of the physical plan. This is another area where I think `Table.name` would be helpful because the table's identifying information is really what should be shown instead of its source or catalog.
For options, these are part of the scan and aren't used to affect the behavior of this physical node. I think that means that they shouldn't be part of the node's arguments.
I think a good way to solve this problem is to change the pretty string format to use `Scan` instead. That has the information that defines what this node is doing, like the filters, projection, and options. And being able to convert a logical scan to text would be useful across all 3 execution modes.
@@ -54,27 +53,17 @@ case class DataSourceV2ScanExec(
Seq(output, source, options).hashCode()

override def outputPartitioning: physical.Partitioning = readSupport match {
override def outputPartitioning: physical.Partitioning = scan match {
Should `SupportsReportPartitioning` extend `Batch` instead of `Scan`? Then this physical node could just be passed the `Batch` and not the `Scan`, `PartitionReaderFactory`, and partitions.
In fact, I think that this node only requires `output: Seq[AttributeReference], batch: Batch`.
Filter pushdown happens at the planning phase, so the physical plan is the only place where users can see which filters were pushed. Shall we keep `pushedFilters` in the scan node?
If you take my suggestion above to inspect the `Scan` to build the string representation of this node, then I think the arguments should be `scan` and `output`. Then the batch can be fetched here.
For `pushedFilters`, I think they should be fetched from the configured scan to build the string representation.
Force-pushed from c24aeab to 38fdac6.
@transient scanConfig: ScanConfig)
extends LeafExecNode with DataSourceV2StringFormat with ColumnarBatchScan {
scanDesc: String,
@transient batch: Batch)
@rdblue I want to reuse this plan for batch and micro-batch. Here the plan doesn't take a `Scan` but just a `Batch`, so the caller side is free to decide how to produce batch(es) from a scan.
Sounds good to me.
Test build #99428 has finished for PR 23086.
Test build #99459 has finished for PR 23086.
Test build #99461 has finished for PR 23086.
* meaningful description.
* </p>
*/
default String description() {
I would have expected the default implementation to show both pushed filters and the read schema, along with the implementation class name. Read schema can be accessed by `readSchema`. Should there also be a way to access the pushed filters? `pushedFilters` seems like a good idea to me. (This can be added later.)
Since this is an interface, and filter pushdown is optional, I'm not sure how to report `pushedFilters` here.
The read schema is always reported, see `DataSourceV2ScanExec.simpleString`. Maybe we should still keep `pushedFilters` in `DataSourceV2ScanExec`, and display it in the plan string format. What do you think?
What about adding `pushedFilters` that defaults to `new Filter[0]`? Then users should override that to add filters to the description, if they are pushed. I think a Scan should be able to report its options, especially those that distinguish it from other scans, like pushed filters.
I guess we could have some wrapper around the user-provided Scan that holds the Scan options. I would want to standardize that instead of doing it in every scan exec node.
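A rough sketch of that default, written in Scala rather than the Java interface and with a hypothetical trait name:

import org.apache.spark.sql.sources.Filter

trait ScanWithFilters {
  /** Filters pushed to this scan; empty unless the implementation overrides it. */
  def pushedFilters(): Array[Filter] = Array.empty[Filter]

  /** A description that includes the pushed filters, as suggested above. */
  def description(): String =
    s"${getClass.getName} [filters: ${pushedFilters().mkString(", ")}]"
}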
+1. There are only minor suggestions left from me. I'd like to see the default implementation of …
Test build #99493 has finished for PR 23086.
I still do not think we should mix the catalog support with the data source APIs. Catalog is a well-defined concept in database systems, and Spark SQL follows that. The so-called "table catalog" is not a catalog to me. The data source APIs in this PR look good to me. Merged to master.
We are trying to keep these separate.
I'm glad that you're interested in joining the discussion on multi-catalog support. Let's have that discussion on the catalog issues or discussion threads on the dev list, not here on an update to the read API.
@cloud-fan, thanks for getting this done! I'll wait for the equivalent write-side PR.
## What changes were proposed in this pull request?

Adjust the batch write API to match the read API refactor after #23086

The doc with high-level ideas: https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing

Basically it renames `BatchWriteSupportProvider` to `SupportsBatchWrite`, and make it extend `Table`. Renames `WriteSupport` to `Write`. It also cleans up some code as batch API is completed. This PR also removes the test from #22688. Now data source must return a table for read/write.

A few notes about future changes:
1. We will create `SupportsStreamingWrite` later for streaming APIs
2. We will create `SupportsBatchReplaceWhere`, `SupportsBatchAppend`, etc. for the new end-user write APIs. I think streaming APIs would remain to use `OutputMode`, and new end-user write APIs will apply to batch only, at least in the near future.
3. We will remove `SaveMode` from data source API: https://issues.apache.org/jira/browse/SPARK-26356

## How was this patch tested?

existing tests

Closes #23208 from cloud-fan/refactor-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

This is the first step of the data source v2 API refactor [proposal](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

It adds the new API for batch read, without removing the old APIs, as they are still needed for streaming sources.

More concretely, it adds
1. `TableProvider`, works like an anonymous catalog
2. `Table`, represents a structured data set.
3. `ScanBuilder` and `Scan`, a logical represents of data source scan
4. `Batch`, a physical representation of data source batch scan.

## How was this patch tested?

existing tests

Closes apache#23086 from cloud-fan/refactor-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

Following apache#23086, this PR does the API refactor for micro-batch read, w.r.t. the [doc](https://docs.google.com/document/d/1uUmKCpWLdh9vHxP7AWJ9EgbwB_U6T3EJYNjhISGmiQg/edit?usp=sharing)

The major changes:
1. rename `XXXMicroBatchReadSupport` to `XXXMicroBatchReadStream`
2. implement `TableProvider`, `Table`, `ScanBuilder` and `Scan` for streaming sources
3. at the beginning of micro-batch streaming execution, convert `StreamingRelationV2` to `StreamingDataSourceV2Relation` directly, instead of `StreamingExecutionRelation`.

followup: support operator pushdown for stream sources

## How was this patch tested?

existing tests

Closes apache#23430 from cloud-fan/micro-batch.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Dataset.ofRows(sparkSession, DataSourceV2Relation.create(
provider, s, finalOptions, userSpecifiedSchema = userSpecifiedSchema))

case _ => loadV1Source(paths: _*)
Hi @cloud-fan. I have a minor question here: how do we load data from a source that only extends `SupportsRead`?
Things have changed now with the table capability API. Please check the new code.
OK, thanks a lot.
What changes were proposed in this pull request?
This is the first step of the data source v2 API refactor proposal
It adds the new API for batch read, without removing the old APIs, as they are still needed for streaming sources.
More concretely, it adds (a sketch of how these pieces fit together follows at the end of this description):
1. `TableProvider`, which works like an anonymous catalog
2. `Table`, which represents a structured data set
3. `ScanBuilder` and `Scan`, a logical representation of a data source scan
4. `Batch`, a physical representation of a data source batch scan

How was this patch tested?
existing tests
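As referenced in the description above, an illustrative sketch of how these new abstractions relate (simplified signatures; not the exact Spark interfaces):

import org.apache.spark.sql.types.StructType

trait InputPartition
trait PartitionReaderFactory

// Physical representation of a batch scan: plans input partitions and readers.
trait Batch {
  def planInputPartitions(): Array[InputPartition]
  def createReaderFactory(): PartitionReaderFactory
}

// Logical scan; batch-capable tables turn it into a Batch.
trait Scan {
  def readSchema(): StructType
  def toBatch: Batch =
    throw new UnsupportedOperationException("Batch scans are not supported.")
}

// Created per query; pushdown mix-ins would hang off of this.
trait ScanBuilder {
  def build(): Scan
}

// A structured data set; the entry point for reads.
trait Table {
  def schema(): StructType
  def newScanBuilder(options: Map[String, String]): ScanBuilder
}

// Works like an anonymous catalog: returns a Table for the given options.
trait TableProvider {
  def getTable(options: Map[String, String]): Table
}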