Apply data source v2 changes #576
Commits on May 15, 2019
[SPARK-26865][SQL] DataSourceV2Strategy should push normalized filters

## What changes were proposed in this pull request?
This PR makes `DataSourceV2Strategy` normalize filters the same way [FileSourceStrategy](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L150-L158) does before pushing them into `SupportsPushDownFilters.pushFilters`.

## How was this patch tested?
Passes Jenkins with the newly added test case.

Closes apache#23770 from dongjoon-hyun/SPARK-26865.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: ae0e2ca
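For context, a minimal sketch of the interface that the normalized filters are pushed into. The package is the one the API later settled on in mainline Spark and may differ in this branch; `ExampleScanBuilder` and its accept-list are made up for illustration.

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

// Hypothetical scan builder: accepts only IsNotNull/EqualTo and hands everything
// else back to Spark, which evaluates it as a post-scan filter.
class ExampleScanBuilder extends ScanBuilder with SupportsPushDownFilters {
  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, unsupported) = filters.partition {
      case _: IsNotNull | _: EqualTo => true
      case _                         => false
    }
    pushed = supported
    unsupported // evaluated by Spark after the scan
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = ??? // build the actual scan using `pushed`
}
```

The normalization in this commit matters because the filters handed to `pushFilters` must reference column names exactly as the source exposes them.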
[SPARK-26666][SQL] Support DSv2 overwrite and dynamic partition overwrite

## What changes were proposed in this pull request?
This adds two logical plans that implement the ReplaceData operation from the [logical plans SPIP](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d). These two plans will be used to implement Spark's `INSERT OVERWRITE` behavior for v2.

Specific changes:
* Add `SupportsTruncate`, `SupportsOverwrite`, and `SupportsDynamicOverwrite` to DSv2 write API
* Add `OverwriteByExpression` and `OverwritePartitionsDynamic` plans (logical and physical)
* Add new plans to DSv2 write validation rule `ResolveOutputRelation`
* Refactor `WriteToDataSourceV2Exec` into trait used by all DSv2 write exec nodes

## How was this patch tested?
* The v2 analysis suite has been updated to validate the new overwrite plans
* The analysis suite for `OverwriteByExpression` checks that the delete expression is resolved using the table's columns
* Existing tests validate that overwrite exec plan works
* Updated existing v2 test because schema is used to validate overwrite

Closes apache#23606 from rdblue/SPARK-26666-add-overwrite.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: dc26348
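A hedged sketch of how a sink might surface the three new mix-ins on its write builder; package and interface names are the ones that later landed in mainline Spark, and `ExampleWriteBuilder` is illustrative only.

```scala
import org.apache.spark.sql.connector.write.{BatchWrite, SupportsDynamicOverwrite, SupportsOverwrite, WriteBuilder}
import org.apache.spark.sql.sources.Filter

// Hypothetical builder for a sink that can replace data three ways.
class ExampleWriteBuilder extends SupportsOverwrite with SupportsDynamicOverwrite {
  private var mode: String = "append"

  // Backs OverwriteByExpression: replace the rows matching the filters.
  override def overwrite(filters: Array[Filter]): WriteBuilder = {
    mode = s"overwrite where ${filters.mkString(", ")}"; this
  }

  // SupportsTruncate (inherited through SupportsOverwrite): drop everything, then append.
  override def truncate(): WriteBuilder = { mode = "truncate"; this }

  // Backs OverwritePartitionsDynamic: replace only the partitions present in the new data.
  override def overwriteDynamicPartitions(): WriteBuilder = { mode = "dynamic overwrite"; this }

  override def buildForBatch(): BatchWrite = ??? // construct the actual write for the chosen mode
}
```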
[SPARK-26785][SQL] data source v2 API refactor: streaming write

## What changes were proposed in this pull request?
Continue the API refactor for streaming write, according to the [doc](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing). The major changes:
1. rename `StreamingWriteSupport` to `StreamingWrite`
2. add `WriteBuilder.buildForStreaming`
3. update existing sinks, to move the creation of `StreamingWrite` to `Table`

## How was this patch tested?
existing tests

Closes apache#23702 from cloud-fan/stream-write.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: af5d187
[SPARK-24252][SQL] Add v2 catalog plugin system

## What changes were proposed in this pull request?
This adds a v2 API for adding new catalog plugins to Spark.
* Catalog implementations extend `CatalogPlugin` and are loaded via reflection, similar to data sources
* `Catalogs` loads and initializes catalogs using configuration from a `SQLConf`
* `CaseInsensitiveStringMap` is used to pass configuration to `CatalogPlugin` via `initialize`

Catalogs are configured by adding config properties starting with `spark.sql.catalog.(name)`. The name property must specify a class that implements `CatalogPlugin`. Other properties under the namespace (`spark.sql.catalog.(name).(prop)`) are passed to the provider during initialization along with the catalog name.

This replaces apache#21306, which will be implemented in two parts: the catalog plugin system (this commit) and specific catalog APIs, like `TableCatalog`.

## How was this patch tested?
Added test suites for `CaseInsensitiveStringMap` and for catalog loading.

Closes apache#23915 from rdblue/SPARK-24252-add-v2-catalog-plugins.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 167ffec
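A minimal sketch of the plugin contract and its configuration, assuming the package names that later landed in mainline Spark; `ExampleCatalog` and the `url` property are made-up placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.CatalogPlugin
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical catalog: loaded by reflection and handed its name plus the
// spark.sql.catalog.<name>.* properties through initialize().
class ExampleCatalog extends CatalogPlugin {
  private var catalogName: String = _
  private var url: String = _

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
    catalogName = name
    url = options.get("url") // "jdbc:example://host/db" from the config below
  }

  override def name(): String = catalogName
}

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.example", classOf[ExampleCatalog].getName) // name property -> plugin class
  .config("spark.sql.catalog.example.url", "jdbc:example://host/db")    // namespaced property passed to initialize()
  .getOrCreate()
```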
[SPARK-26946][SQL] Identifiers for multi-catalog

## What changes were proposed in this pull request?
- Support N-part identifier in SQL
- N-part identifier extractor in Analyzer

## How was this patch tested?
- A new unit test suite ResolveMultipartRelationSuite
- CatalogLoadingSuite

rblue cloud-fan mccheah

Closes apache#23848 from jzhuge/SPARK-26946.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 3da923b
[SPARK-27250][TEST-MAVEN][BUILD] Scala 2.11 maven compile should target Java 1.8

## What changes were proposed in this pull request?
Fix Scala 2.11 maven build issue after merging SPARK-26946.

## How was this patch tested?
Maven Scala 2.11 and 2.12 builds with `-Phadoop-provided -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver`.

Closes apache#24184 from jzhuge/SPARK-26946-1.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

Commit: 85d0f08
[SPARK-26673][FOLLOWUP][SQL] File Source V2: check existence of output path before deleting it

## What changes were proposed in this pull request?
This is a followup PR to resolve comment: apache#23601 (review)

When Spark writes a DataFrame with "overwrite" mode, it deletes the output path before the actual write. To safely handle the case where the output path doesn't exist, it is suggested to follow the V1 code by checking the existence first.

## How was this patch tested?
Apply apache#23836 and run unit tests.

Closes apache#23889 from gengliangwang/checkFileBeforeOverwrite.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: 4db1e19
[SPARK-26952][SQL] Row count statistics should respect the data reported by the data source

## What changes were proposed in this pull request?
In data source v2, if the data source scan implements `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row count reported by the data source.

## How was this patch tested?
New unit test.

Closes apache#23853 from ConeyLiu/report-row-count.
Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 2733301
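A sketch of the reporting side, assuming the interface shape as it later landed in mainline Spark (`SupportsReportStatistics` on the scan, returning a `Statistics` object); `ExampleScan` and its numbers are illustrative.

```scala
import java.util.OptionalLong
import org.apache.spark.sql.connector.read.{Statistics, SupportsReportStatistics}
import org.apache.spark.sql.types.StructType

// Hypothetical scan that reports its own numbers; with this commit,
// DataSourceV2Relation uses the reported row count instead of a size-based guess.
class ExampleScan(rows: Long, bytes: Long) extends SupportsReportStatistics {
  override def readSchema(): StructType = new StructType().add("id", "long")

  override def estimateStatistics(): Statistics = new Statistics {
    override def numRows(): OptionalLong = OptionalLong.of(rows)
    override def sizeInBytes(): OptionalLong = OptionalLong.of(bytes)
  }
}
```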
[SPARK-26871][SQL] File Source V2: avoid creating unnecessary FileIndex in the write path

## What changes were proposed in this pull request?
In apache#23383, the file source V2 framework is implemented. In that PR, `FileIndex` is created as a member of `FileTable`, so that we can implement partition pruning like apache@0f9fcab in the future (as the data source V2 catalog is under development, partition pruning was removed from the PR).

However, after the write path of file source V2 was implemented, I found that a simple write creates an unnecessary `FileIndex`, which is required by `FileTable`. This is a sort of regression. And we can see there is a warning message when writing to ORC files:
```
WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
```
This PR makes `FileIndex` a lazy value in `FileTable`, so that we can avoid creating an unnecessary `FileIndex` in the write path.

## How was this patch tested?
Existing unit tests.

Closes apache#23774 from gengliangwang/moveFileIndexInV2.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 4094211
[SPARK-26744][SQL] Support schema validation in FileDataSourceV2 framework

## What changes were proposed in this pull request?
The file source has a schema validation feature, which validates 2 schemas:
1. the user-specified schema when reading.
2. the schema of input data when writing.

If a file source doesn't support the schema, we can fail the query earlier. This PR implements the same feature in the `FileDataSourceV2` framework. Compared to `FileFormat`, `FileDataSourceV2` has multiple layers. The API is added in two places:
1. Read path: the table schema is determined in `TableProvider.getTable`. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in `FileScan`.
2. Write path: validate the actual output schema in `FileWriteBuilder`.

## How was this patch tested?
Unit test

Closes apache#23714 from gengliangwang/schemaValidationV2.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: caa5fab
[SPARK-26956][SS] remove streaming output mode from data source v2 APIs

## What changes were proposed in this pull request?
Similar to `SaveMode`, we should remove streaming `OutputMode` from the data source v2 API and use operations that have clear semantics. The changes are:
1. append mode: create `StreamingWrite` directly. By default, the `WriteBuilder` will create a `Write` to append data.
2. complete mode: call `SupportsTruncate#truncate`. Complete mode means truncating all the old data and appending the new data of the current epoch. `SupportsTruncate` has exactly the same semantics.
3. update mode: fail. The current streaming framework can't propagate the update keys, so v2 sinks are not able to implement update mode. In the future we can introduce a `SupportsUpdate` trait.

The behavior changes:
1. all the v2 sinks (foreach, console, memory, kafka, noop) don't support update mode. The fact is, previously all the v2 sinks implemented update mode incorrectly. None of them can really support it.
2. the kafka sink doesn't support complete mode. The fact is, the kafka sink can only append data.

## How was this patch tested?
existing tests

Closes apache#23859 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: d2f0dd5
[SPARK-26389][SS] Add force delete temp checkpoint configuration

## What changes were proposed in this pull request?
Not all users want to keep temporary checkpoint directories, and it is hard to restore from them. In this PR I've added a force delete flag which defaults to `false`. It is also not clear to users when a temporary checkpoint directory is deleted, so log messages were added to explain this a bit more.

## How was this patch tested?
Existing + additional unit tests.

Closes apache#23732 from gaborgsomogyi/SPARK-26389.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: 49dd067
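A usage sketch under an assumption: the commit message does not spell out the flag's key, so the configuration name below is the one the feature is commonly exposed as in later Spark releases and should be checked against the PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  // Assumed key: delete the auto-created temporary checkpoint directory on query
  // termination instead of keeping it around (default: false).
  .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")
  .getOrCreate()

// No checkpointLocation is set, so Spark creates a temporary one that the flag
// above now removes when the query stops.
val query = spark.readStream.format("rate").load()
  .writeStream.format("console").start()
```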
[SPARK-26824][SS] Fix the checkpoint location and _spark_metadata when it contains special chars

## What changes were proposed in this pull request?
When a user specifies a checkpoint location or a file sink output using a path containing special chars that need to be escaped in a path, the streaming query will store checkpoint and file sink metadata in a wrong place. In this PR, I uploaded a checkpoint that was generated by the following code using Spark 2.4.0 to show this issue:
```
implicit val s = spark.sqlContext
val input = org.apache.spark.sql.execution.streaming.MemoryStream[Int]
input.addData(1, 2, 3)
val q = input.toDF.writeStream.format("parquet").option("checkpointLocation", ".../chk %#chk").start(".../output %#output")
q.stop()
```
Here is the structure of the directory:
```
sql/core/src/test/resources/structured-streaming/escaped-path-2.4.0
├── chk%252520%252525%252523chk
│   ├── commits
│   │   └── 0
│   ├── metadata
│   └── offsets
│       └── 0
├── output %#output
│   └── part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet
└── output%20%25%23output
    └── _spark_metadata
        └── 0
```
In this checkpoint, the user-specified checkpoint location is `.../chk %#chk` but the real path used to store the checkpoint is `.../chk%252520%252525%252523chk` (generated by escaping the original path three times). The user-specified output path is `.../output %#output` but the path used to store `_spark_metadata` is `.../output%20%25%23output/_spark_metadata` (generated by escaping the original path once). The data files are still in the correct path (such as `.../output %#output/part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet`). This checkpoint will be used in unit tests in this PR.

The fix is simply removing the improper `Path.toUri` calls. However, as users may not read the release notes and may not be aware of this checkpoint location change, if they upgrade Spark without moving the checkpoint to the new location, their query will just start from scratch. In order to not surprise users, this PR also adds a check to **detect the impacted paths and throw an error** that includes the migration guide. This check can be turned off by an internal SQL conf `spark.sql.streaming.checkpoint.escapedPathCheck.enabled`. Here are examples of errors that will be reported:

- Streaming checkpoint error:
```
Error: we detected a possible problem with the location of your checkpoint and you likely need to move it before restarting this query.

Earlier version of Spark incorrectly escaped paths when writing out checkpoints for structured streaming. While this was corrected in Spark 3.0, it appears that your query was started using an earlier version that incorrectly handled the checkpoint path.

Correct Checkpoint Directory: /.../chk %#chk
Incorrect Checkpoint Directory: /.../chk%252520%252525%252523chk

Please move the data from the incorrect directory to the correct one, delete the incorrect directory, and then restart this query. If you believe you are receiving this message in error, you can disable it with the SQL conf spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
```
- File sink error (`_spark_metadata`):
```
Error: we detected a possible problem with the location of your "_spark_metadata" directory and you likely need to move it before restarting this query.

Earlier version of Spark incorrectly escaped paths when writing out the "_spark_metadata" directory for structured streaming. While this was corrected in Spark 3.0, it appears that your query was started using an earlier version that incorrectly handled the "_spark_metadata" path.

Correct "_spark_metadata" Directory: /.../output %#output/_spark_metadata
Incorrect "_spark_metadata" Directory: /.../output%20%25%23output/_spark_metadata

Please move the data from the incorrect directory to the correct one, delete the incorrect directory, and then restart this query. If you believe you are receiving this message in error, you can disable it with the SQL conf spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
```

## How was this patch tested?
The new unit tests.

Closes apache#23733 from zsxwing/path-fix.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

Commit: 1f5d3d4
[SPARK-27111][SS] Fix a race that a continuous query may fail with InterruptedException

## What changes were proposed in this pull request?
Before a Kafka consumer gets assigned partitions, its offset will contain 0 partitions. However, runContinuous will still run and launch a Spark job having 0 partitions. In this case, there is a race where an epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally. To handle this case, this PR makes the following changes:
- Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop`, but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread).
- Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect query termination because `runActivatedStream` will check `state` and exit accordingly.

I also updated the clean-up code to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean-up.

## How was this patch tested?
Jenkins

Closes apache#24034 from zsxwing/SPARK-27111.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

Commit: 982df04
[SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution

## What changes were proposed in this pull request?
Continuous processing waits on epochs which are not yet complete (for example, one partition is not making progress) and stores pending items in queues. These queues are unbounded and can easily consume all available memory. In this PR I've added the `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration to make them bounded. If the threshold is reached, the query stops with an `IllegalStateException`.

## How was this patch tested?
Existing + additional unit tests.

Closes apache#23156 from gaborgsomogyi/SPARK-24063.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

Commit: 38556e7
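A small usage sketch of the new threshold with a continuous-trigger query; the value 1000 and the rate/console source-sink pair are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .master("local[*]")
  // Bound the pending-epoch queues in ContinuousExecution; once the backlog exceeds
  // this size the query fails with IllegalStateException instead of accumulating
  // epochs in memory.
  .config("spark.sql.streaming.continuous.epochBacklogQueueSize", "1000")
  .getOrCreate()

val query = spark.readStream.format("rate").load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```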
[SPARK-27064][SS] create StreamingWrite at the beginning of streaming execution

## What changes were proposed in this pull request?
According to the [design](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing), the life cycle of `StreamingWrite` should be the same as the read side `MicroBatch/ContinuousStream`, i.e. each run of the stream query, instead of each epoch. This PR fixes it.

## How was this patch tested?
existing tests

Closes apache#23981 from cloud-fan/dsv2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 3fecdd9
[SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions

It's a little awkward to have 2 different classes (`CaseInsensitiveStringMap` and `DataSourceOptions`) to present the options in the data source and catalog APIs. This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise.

Tested with existing tests.

Closes apache#24025 from cloud-fan/option.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 1609b3f
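A quick sketch of the case-insensitive lookup behavior that makes the merged class suitable for both data source options and catalog properties; the package is the one used in later Spark releases.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Keys are matched case-insensitively, which is what both data source options
// and catalog properties need after the merge.
val options = new CaseInsensitiveStringMap(
  Map("Path" -> "/tmp/data", "inferSchema" -> "true").asJava)

assert(options.get("path") == "/tmp/data")
assert(options.getBoolean("INFERSCHEMA", false))
```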
[SPARK-26811][SQL] Add capabilities to v2.Table

This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks (`V2WriteSupportCheck`) when the table does not support operations, like truncation.

Tested with existing tests for regressions, plus a new analysis suite, `V2WriteSupportCheckSuite`, for the new capability checks.

Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 8f9c5ac
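A hedged sketch of a table advertising its capabilities; package names and the exact capability values follow the API as it later stabilized in mainline Spark, and `ExampleTable` is made up.

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.types.StructType

// Hypothetical table advertising what it can do; the analysis check fails the
// query early if, say, truncation is required but not listed here.
class ExampleTable extends Table {
  override def name(): String = "example"
  override def schema(): StructType = new StructType().add("id", "long")
  override def capabilities(): java.util.Set[TableCapability] =
    Set(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE, TableCapability.TRUNCATE).asJava
}
```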
[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file

Currently in the grammar file the rule `query` is responsible for parsing both SELECT and INSERT statements. As a result, we need more semantic checks in the code to guard against invalid INSERT constructs in a query. A couple of examples are in the `visitCreateView` and `visitAlterView` functions. Another issue is that we don't catch invalid INSERT constructs in all the places until checkAnalysis (and the errors we raise can be confusing as well). Here are a couple of examples:
```SQL
select * from (insert into bar values (2));
```
```
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
      +- Project [cast(col1#18 as int) AS c1#20]
         +- LocalRelation [col1#18]
```
```SQL
select * from foo where c1 in (insert into bar values (2))
```
```
Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: The number of columns in the left hand side of an IN subquery does not match the number of columns in the output of subquery. Left side columns: [default.foo.`c1`]. Right side columns: [].;;
'Project [*]
+- 'Filter c1#6 IN (list#5 [])
   :  +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
   :     +- Project [cast(col1#7 as int) AS c1#9]
   :        +- LocalRelation [col1#7]
   +- SubqueryAlias `default`.`foo`
      +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]
```
For both cases above, we should reject the syntax at the parser level. In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively. I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.

Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.

Closes apache#24150 from dilipbiswal/split-query-insert.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: c46db75
Revert "[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file."

This reverts commit c46db75.

Commit: bc6ece7
[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file

Currently in the grammar file the rule `query` is responsible for parsing both SELECT and INSERT statements. As a result, we need more semantic checks in the code to guard against invalid INSERT constructs in a query. A couple of examples are in the `visitCreateView` and `visitAlterView` functions. Another issue is that we don't catch invalid INSERT constructs in all the places until checkAnalysis (and the errors we raise can be confusing as well). Here are a couple of examples:
```SQL
select * from (insert into bar values (2));
```
```
Error in query: unresolved operator 'Project [*];
'Project [*]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
      +- Project [cast(col1#18 as int) AS c1#20]
         +- LocalRelation [col1#18]
```
```SQL
select * from foo where c1 in (insert into bar values (2))
```
```
Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch: The number of columns in the left hand side of an IN subquery does not match the number of columns in the output of subquery. Left side columns: [default.foo.`c1`]. Right side columns: [].;;
'Project [*]
+- 'Filter c1#6 IN (list#5 [])
   :  +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
   :     +- Project [cast(col1#7 as int) AS c1#9]
   :        +- LocalRelation [col1#7]
   +- SubqueryAlias `default`.`foo`
      +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]
```
For both cases above, we should reject the syntax at the parser level. In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively. I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.

Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.

Closes apache#24150 from dilipbiswal/split-query-insert.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: b9a2061
Revert "[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file."

This reverts commit b9a2061.

Commit: e68a36c
[SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard

## What changes were proposed in this pull request?
This PR defines reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). We assume that they basically follow the ANSI SQL-2011 standard, but they differ slightly from each other. Therefore, this PR documents all the keywords in `docs/sql-reserved-and-non-reserved-key-words.md`.

NOTE: This PR only adds a small set of keywords as reserved ones, namely those that are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and PostgreSQL. This is because there is room to discuss which keywords should be reserved or not; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards, but they are not reserved in PostgreSQL. Therefore, we need more research on the other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up activities.

References:
- The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
- SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes apache#23259 from maropu/SPARK-26215-WIP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

Commit: 942ac18
[SPARK-26215][SQL][FOLLOW-UP][MINOR] Fix the warning from ANTLR4

## What changes were proposed in this pull request?
I see the following new warning from ANTLR4 after SPARK-26215 added the `SCHEMA` keyword to the reserved/unreserved list. This is a minor PR to clean up the warning.
```
WARNING] warning(125): org/apache/spark/sql/catalyst/parser/SqlBase.g4:784:90: implicit definition of token SCHEMA in parser
[WARNING] .../apache/spark/org/apache/spark/sql/catalyst/parser/SqlBase.g4 [784:90]: implicit definition of token SCHEMA in parser
```
## How was this patch tested?
Manually built catalyst after the fix to verify.

Closes apache#23897 from dilipbiswal/minor_parser_token.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: 4a7c007
[SPARK-26982][SQL] Enhance describe framework to describe the output of a query

Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using the SQL interface. Example:

DESCRIBE SELECT * FROM desc_table
DESCRIBE QUERY SELECT * FROM desc_table

```SQL
spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string);
spark-sql> desc select * from desc_table;
c1	int	c1-comment
c2	decimal(10,0)	c2-comment
c3	string	NULL
```
Added a new test under SQLQueryTestSuite and SparkSqlParserSuite.

Closes apache#23883 from dilipbiswal/dkb_describe_query.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 6cb9234
[SPARK-27108][SQL] Add parsed SQL plans for create, CTAS

This moves parsing of `CREATE TABLE ... USING` statements into catalyst. Catalyst produces logical plans with the parsed information, and those plans are converted to v1 `DataSource` plans in `DataSourceAnalysis`.

This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first. It also makes it possible to parse in catalyst instead of breaking the parser across the abstract `AstBuilder` in catalyst and `SparkSqlParser` in core. For more information, see the [mailing list thread](https://lists.apache.org/thread.html/54f4e1929ceb9a2b0cac7cb058000feb8de5d6c667b2e0950804c613%3Cdev.spark.apache.org%3E).

This uses existing tests to catch regressions and introduces no behavior changes.

Closes apache#24029 from rdblue/SPARK-27108-add-parsed-create-logical-plans.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: f0d9915
[SPARK-27181][SQL] Add public transform API

## What changes were proposed in this pull request?
This adds a public Expression API that can be used to pass partition transformations to data sources.

## How was this patch tested?
Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite.

Closes apache#24117 from rdblue/add-public-transform-api.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 0f9ac2a
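A sketch of the kind of partition transforms the public API can express; the `Expressions` factory and its package are taken from later mainline Spark and may be named differently in this branch.

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

// Partition transforms a data source can receive through the public API:
val transforms: Array[Transform] = Array(
  Expressions.identity("region"),    // plain column partitioning
  Expressions.days("event_time"),    // time-based partitioning by day
  Expressions.bucket(16, "user_id")  // hash bucketing into 16 buckets
)

transforms.foreach(t => println(t.describe()))
```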
[SPARK-24252][SQL] Add TableCatalog API

## What changes were proposed in this pull request?
This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).

For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.

## How was this patch tested?
This adds a test implementation and contract tests.

Closes apache#24246 from rdblue/SPARK-24252-add-table-catalog-api.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: d70253e
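A hedged sketch of how the table-level operations layer on top of `CatalogPlugin`; interface and package names follow the API as it later stabilized, and the helper below is purely illustrative.

```scala
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// TableCatalog adds metadata operations on top of CatalogPlugin:
// load/create/alter/drop tables addressed by namespace + name.
def createIfMissing(catalog: TableCatalog, ident: Identifier, schema: StructType): Unit = {
  if (!catalog.tableExists(ident)) {
    catalog.createTable(ident, schema, Array.empty[Transform], new java.util.HashMap[String, String]())
  }
}

// e.g. createIfMissing(cat, Identifier.of(Array("db"), "events"), new StructType().add("id", "long"))
```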
[SPARK-24923][SQL] Implement v2 CreateTableAsSelect

This adds a v2 implementation for CTAS queries:
* Update the SQL parser to parse CREATE queries using multi-part identifiers
* Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
* Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
* Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
* Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
* Add `findNestedField` to `StructType` to support reference validation

We have been running these changes in production for several months.

Also:
* Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
* Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
* Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
* Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation

Closes apache#24570 from rdblue/SPARK-24923-add-v2-ctas.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: d2b526c
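A rough sketch of the kind of statement this plan handles, assuming a SparkSession named `spark` and a v2 catalog registered as `testcat`; the catalog, table, and column names are made up.

```scala
// Multi-part identifier (catalog.db.table), CTAS, and partitioning validated
// against the query schema in CheckAnalysis.
spark.sql("""
  CREATE TABLE testcat.db.events
  USING foo
  PARTITIONED BY (region)
  AS SELECT user_id, region, event_time FROM source_events
""")
```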
Commit: e133b92
Commit: 6dbc1d3
[SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap

Currently, DataFrameReader/DataFrameWriter support setting Hadoop configurations via the `.option()` method. E.g., the following test case should pass in both ORC V1 and V2:
```
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}

withTempPath { dir =>
  val path = dir.getCanonicalPath
  val df = spark.range(2)
  df.write.orc(path + "/p=1")
  df.write.orc(path + "/p=2")

  val extraOptions = Map(
    "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
    "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
  )
  assert(spark.read.options(extraOptions).orc(path).count() === 2)
}
```
While Hadoop configurations are case sensitive, the current data source V2 APIs use `CaseInsensitiveStringMap` in the top-level entry `TableProvider`. To create Hadoop configurations correctly, I suggest:
1. adding a new method `asCaseSensitiveMap` in `CaseInsensitiveStringMap`.
2. making `CaseInsensitiveStringMap` read-only to avoid ambiguous conversion in `asCaseSensitiveMap`.

Tested with a unit test.

Closes apache#24094 from gengliangwang/originalMap.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: d49a179
Commit: affb14b
Commit: 4661671
[SPARK-26744][SQL][HOTFIX] Disable schema validation tests for FileDataSourceV2 (partially revert)

## What changes were proposed in this pull request?
This PR partially reverts SPARK-26744. apache@60caa92 and apache@4dce45a were merged independently around the same time, so the test failures were not caught.

- apache@60caa92 happened to add schema-reading logic in the write path for overwrite mode as well.
- apache@4dce45a added some tests with overwrite modes against the migrated ORC v2, and those tests started to fail.

I guess the discussion won't be short (see apache#23606 (comment)), so this PR proposes to disable the tests added in apache@4dce45a to unblock other PRs for now.

## How was this patch tested?
Existing tests.

Closes apache#23828 from HyukjinKwon/SPARK-26744.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: ee834f7
Commits on Jun 6, 2019
Commit: 294eaef
[SPARK-26811][SQL][FOLLOWUP] fix some documentation

## What changes were proposed in this pull request?
This is a followup of apache#24012, to fix 2 pieces of documentation:
1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
2. `Scan` should link to `BATCH_READ` instead of hardcoding it.

## How was this patch tested?
N/A

Closes apache#24285 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 5e7eb12
[MINOR][TEST][DOC] Execute action missing name message

## What changes were proposed in this pull request?
Some minor updates:
- the `Execute` action misses the `name` message
- typo in the SS document
- typo in SQLConf

## How was this patch tested?
N/A

Closes apache#24466 from uncleGen/minor-fix.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 57153b4
[SPARK-27576][SQL] table capability to skip the output column resolution

Currently we have an analyzer rule which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table. However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.

This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.

Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.

Tested with new test cases.

Closes apache#24469 from cloud-fan/schema-check.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: 0c2d6aa
[SPARK-26356][SQL] remove SaveMode from data source v2

In data source v1, the save mode specified in `DataFrameWriter` is passed to the data source implementation directly, and each data source can define its own behavior for each save mode. This is confusing and we want to get rid of save mode in data source v2.

For data source v2, we expect data sources to implement the `TableCatalog` API, and end-users to use SQL (or the new write API described in [this doc](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718#heading=h.e9v1af12g5zo)) to access data sources. The SQL API has very clear semantics and we don't need save mode at all.

However, for simple data sources that do not have table management (like a JIRA data source, a noop sink, etc.), it's not ideal to ask them to implement the `TableCatalog` API and throw exceptions here and there. The `TableProvider` API was created for simple data sources. It can only get tables, without any other table management methods. This means it can only deal with existing tables.

`TableProvider` fits well with `DataStreamReader` and `DataStreamWriter`, as they can only read/write existing tables. However, `TableProvider` doesn't fit `DataFrameWriter` well, as save mode requires more than just getting a table. More specifically, `ErrorIfExists` mode needs to check if the table exists and create the table, and `Ignore` mode needs to check if the table exists. When end-users specify `ErrorIfExists` or `Ignore` mode and write data to a `TableProvider` via `DataFrameWriter`, Spark fails the query and asks users to use `Append` or `Overwrite` mode.

The file source sits between `TableProvider` and `TableCatalog`: it's simple, but it can check whether a table (path) exists and create a table (path). That said, the file source supports all the save modes. Currently the file source implements `TableProvider`, and that doesn't work because `TableProvider` doesn't support `ErrorIfExists` and `Ignore` modes. Ideally we should create a new API for path-based data sources, but to unblock the work of file source v2 migration, this PR proposes to special-case file source v2 in `DataFrameWriter` to make it work.

This PR also removes `SaveMode` from data source v2, as now only the internal file source v2 needs it.

Tested with existing tests.

Closes apache#24233 from cloud-fan/file.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: d3e9b94
Commit: c0ffa90
Commit: c968388
[SPARK-27521][SQL] Move data source v2 to catalyst module

Currently we are in a strange state where some data source v2 interfaces (catalog related) are in sql/catalyst, and some data source v2 interfaces (Table, ScanBuilder, DataReader, etc.) are in sql/core. I don't see a reason to keep the data source v2 API in 2 modules. If we should pick one module, I think sql/catalyst is the one to go with.

The catalyst module already has some user-facing stuff like DataType, Row, etc. And we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module. This PR solves the problem we have in apache#24246.

Tested with existing tests.

Closes apache#24416 from cloud-fan/move.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: d7e3943
Commit: e0edb6c
[SPARK-27732][SQL] Add v2 CreateTable implementation

## What changes were proposed in this pull request?
This adds a v2 implementation of create table:
* `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
* `CreateTableExec` is the physical plan

## How was this patch tested?
Added resolution and v2 SQL tests.

Closes apache#24617 from rdblue/SPARK-27732-add-v2-create-table.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: c7c5d84
[SPARK-26946][SQL][FOLLOWUP] Require lookup function

## What changes were proposed in this pull request?
Require the lookup function with the interface LookupCatalog. The rationale is in the review comments below.

Make `Analyzer` abstract. BaseSessionStateBuilder and HiveSessionStateBuilder implement lookupCatalog with a call to SparkSession.catalog(). Existing test cases and those that don't need catalog lookup will use a newly added `TestAnalyzer` with a default lookup function that throws `CatalogNotFoundException("No catalog lookup function")`.

Rewrote the unit test for LookupCatalog to demonstrate the interface can be used anywhere, not just Analyzer.

Removed the Analyzer parameter `lookupCatalog` because we can override it in the following manner:
```
new Analyzer() {
  override def lookupCatalog(name: String): CatalogPlugin = ???
}
```
## How was this patch tested?
Existing unit tests.

Closes apache#24689 from jzhuge/SPARK-26946-follow.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 6244b77
[SPARK-27813][SQL] DataSourceV2: Add DropTable logical operation

## What changes were proposed in this pull request?
Support DROP TABLE from V2 catalogs. Move DROP TABLE into catalyst. Move parsing tests for DROP TABLE/VIEW to PlanResolutionSuite to validate existing behavior. Add new tests for the catalyst parser suite. Separate DROP VIEW into a different code path from DROP TABLE. Move DROP VIEW into catalyst as a new operator. Add a meaningful exception to indicate that views are not currently supported in v2 catalogs.

## How was this patch tested?
New unit tests. Existing unit tests in catalyst and sql core.

Closes apache#24686 from jzhuge/SPARK-27813-pr.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 0db2aa0
[SPARK-27103][SQL][MINOR] List SparkSql reserved keywords in alphabetical order

## What changes were proposed in this pull request?
This PR corrects the position of spark-sql reserved keywords in the list where they are not in alphabetical order. In the test suite, some repeated words are removed and some comments are added as reminders.

## How was this patch tested?
Existing unit tests.

Closes apache#23985 from SongYadong/sql_reserved_alphabet.
Authored-by: SongYadong <song.yadong1@zte.com.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: d9e0cca
[SPARK-27857][SQL] Move ALTER TABLE parsing into Catalyst

This moves parsing logic for `ALTER TABLE` into Catalyst and adds parsed logical plans for alter table changes that use multi-part identifiers. This PR is similar to SPARK-27108, PR apache#24029, which created parsed logical plans for create and CTAS.

* Create parsed logical plans
* Move parsing logic into Catalyst's AstBuilder
* Convert to DataSource plans in DataSourceResolution
* Parse `ALTER TABLE ... SET LOCATION ...` separately from the partition variant
* Parse `ALTER TABLE ... ALTER COLUMN ... [TYPE dataType] [COMMENT comment]` [as discussed on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Syntax-for-table-DDL-td25197.html#a25270)
* Parse `ALTER TABLE ... RENAME COLUMN ... TO ...`
* Parse `ALTER TABLE ... DROP COLUMNS ...`

* Added new tests in Catalyst's `DDLParserSuite`
* Moved converted plan tests from SQL `DDLParserSuite` to `PlanResolutionSuite`
* Existing tests for regressions

Closes apache#24723 from rdblue/SPARK-27857-add-alter-table-statements-in-catalyst.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: e1365ba
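A rough sketch of the statement shapes described above, issued through a SparkSession named `spark`; the table and column names are made up and the exact accepted syntax should be checked against the parser tests.

```scala
// Statement shapes this commit teaches the Catalyst parser to produce plans for.
Seq(
  "ALTER TABLE testcat.db.events ALTER COLUMN payload TYPE string COMMENT 'raw json'",
  "ALTER TABLE testcat.db.events RENAME COLUMN user_id TO uid",
  "ALTER TABLE testcat.db.events DROP COLUMNS (tmp_col1, tmp_col2)",
  "ALTER TABLE testcat.db.events SET LOCATION 's3://bucket/events'"
).foreach(stmt => spark.sql(stmt))
```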
Commit: 1a45142
Commit: 8bcc74d
Revert "[SPARK-27857][SQL] Move ALTER TABLE parsing into Catalyst"
This reverts commit e1365ba.
Commit: a3debfd
Revert "[SPARK-27103][SQL][MINOR] List SparkSql reserved keywords in alphabetical order"

This reverts commit d9e0cca.

Commit: 7c1eb92
[SPARK-27675][SQL] do not use MutableColumnarRow in ColumnarBatch

## What changes were proposed in this pull request?
To move the DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`. This PR creates a read-only version of `MutableColumnarRow` and uses it in `ColumnarBatch`.

close apache#24546

## How was this patch tested?
existing tests

Closes apache#24581 from cloud-fan/mutable-row.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

Commit: cfa37b0
[MINOR] Move java file to java directory

## What changes were proposed in this pull request?
Move
```scala
org.apache.spark.sql.execution.streaming.BaseStreamingSource
org.apache.spark.sql.execution.streaming.BaseStreamingSink
```
to the java directory.

## How was this patch tested?
Existing UT.

Closes apache#24222 from ConeyLiu/move-scala-to-java.
Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

Commit: d8f503e
[SPARK-27190][SQL] add table capability for streaming

This is a followup of apache#24012, to add the corresponding capabilities for streaming.

Tested with existing tests.

Closes apache#24129 from cloud-fan/capability.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: b28de53
Commit: a022526
[SPARK-23014][SS] Fully remove V1 memory sink

There is a MemorySink v2 already, so v1 can be removed. In this PR I've removed it completely. What this PR contains:
* V1 memory sink removal
* V2 memory sink renamed to become the only implementation
* Since DSv2 sends exceptions in a chained format (linking them with the cause field), I've made the Python side compliant
* Adapted all the tests

Tested with existing unit tests.

Closes apache#24403 from gaborgsomogyi/SPARK-23014.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

Commit: 4e5087f
Commit: 287e9d7
[SPARK-27579][SQL] remove BaseStreamingSource and BaseStreamingSink

## What changes were proposed in this pull request?
`BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths. This PR removes these 2 interfaces and lets the v1 API extend the v2 API to keep API compatibility.

The motivation is apache#24416. We want to move data source v2 to the catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.

## How was this patch tested?
existing tests

Closes apache#24471 from cloud-fan/streaming.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: d97de74
[SPARK-27642][SS] make v1 offset extend v2 offset

## What changes were proposed in this pull request?
To move DS v2 to the catalyst module, we can't make the v2 offset rely on the v1 offset, as the v1 offset is in sql/core.

## How was this patch tested?
existing tests

Closes apache#24538 from cloud-fan/offset.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

Commit: 12746c1
Commit: c99b896
[SPARK-27693][SQL] Add default catalog property

Add a SQL config property for the default v2 catalog.

Tested with existing tests for regressions.

Closes apache#24594 from rdblue/SPARK-27693-add-default-catalog-config.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

Commit: 5d2096e
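A usage sketch under a stated assumption: the commit message does not name the property, so the key below is the one used in later Spark releases and may not match this branch exactly.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.catalog.prod", "com.example.ExampleCatalog") // register a v2 catalog named "prod"
  .config("spark.sql.defaultCatalog", "prod")                     // assumed key for the new default-catalog property
  .getOrCreate()

// With a default v2 catalog configured, identifiers that do not name a catalog
// can resolve against "prod" instead of the built-in session catalog.
spark.sql("SELECT * FROM db.events")
```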
Commit: f7e63d6
Commit: 876d1a0
Commit: b714508
Commit: c018fba
Commit: 2cd8078
Commits on Jun 7, 2019
[SPARK-27411][SQL] DataSourceV2Strategy should not eliminate subquery

In DataSourceV2Strategy, it seems we eliminate the subqueries by mistake after normalizing filters. We have a SQL query with a scalar subquery:
```scala
val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
plan.explain(true)
```
And we get the log info of DataSourceV2Strategy:
```
Pushing operators to csv:examples/src/main/resources/t2.txt
Pushed Filters:
Post-Scan Filters: isnotnull(t2a#30)
Output: t2a#30, t2b#31
```
The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake.
```
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('t2a > scalar-subquery#56 [])
   :  +- 'Project [unresolvedalias('max('t1a), None)]
   :     +- 'UnresolvedRelation `t1`
   +- 'UnresolvedRelation `t2`

== Analyzed Logical Plan ==
t2a: string, t2b: string
Project [t2a#30, t2b#31]
+- Filter (t2a#30 > scalar-subquery#56 [])
   :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
   :     +- SubqueryAlias `t1`
   :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
   +- SubqueryAlias `t2`
      +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Optimized Logical Plan ==
Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
:  +- Aggregate [max(t1a#13) AS max(t1a)#63]
:     +- Project [t1a#13]
:        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
+- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt

== Physical Plan ==
*(1) Project [t2a#30, t2b#31]
+- *(1) Filter isnotnull(t2a#30)
   +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
```
Tested with a new unit test.

Closes apache#24321 from francis0407/SPARK-27411.
Authored-by: francis0407 <hanmingcong123@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

Commit: 5346dcf
Commit: 17bb20c
Commit: 5a8ea0b