Apply data source v2 changes #576

Closed
wants to merge 71 commits

Commits on May 15, 2019

  1. [SPARK-26865][SQL] DataSourceV2Strategy should push normalized filters

    ## What changes were proposed in this pull request?
    
    This PR aims to make `DataSourceV2Strategy` normalize filters like [FileSourceStrategy](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala#L150-L158) when it pushes them into `SupportsPushDownFilters.pushFilters`.
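
    For illustration, a minimal sketch of a scan builder on the receiving end of those pushed filters (package names follow the final Spark 3.x layout rather than this branch, and `ExampleScanBuilder`/`newScan` are assumptions):

    ```scala
    import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
    import org.apache.spark.sql.sources.{Filter, IsNotNull}

    class ExampleScanBuilder(newScan: Array[Filter] => Scan)
        extends ScanBuilder with SupportsPushDownFilters {

      private var pushed: Array[Filter] = Array.empty

      // Receives the (now normalized) filters from DataSourceV2Strategy; the returned array
      // is the set of filters Spark must still evaluate after the scan. Returning everything
      // keeps the pushdown best-effort.
      override def pushFilters(filters: Array[Filter]): Array[Filter] = {
        pushed = filters.filterNot(_.isInstanceOf[IsNotNull])
        filters
      }

      override def pushedFilters(): Array[Filter] = pushed

      override def build(): Scan = newScan(pushed)
    }
    ```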
    
    ## How was this patch tested?
    
    Pass the Jenkins with the newly added test case.
    
    Closes apache#23770 from dongjoon-hyun/SPARK-26865.
    
    Authored-by: Dongjoon Hyun <dhyun@apple.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dongjoon-hyun authored and mccheah committed May 15, 2019
    ae0e2ca
  2. [SPARK-26666][SQL] Support DSv2 overwrite and dynamic partition overwrite.
    
    ## What changes were proposed in this pull request?
    
    This adds two logical plans that implement the ReplaceData operation from the [logical plans SPIP](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5a987801#heading=h.m45webtwxf2d). These two plans will be used to implement Spark's `INSERT OVERWRITE` behavior for v2.
    
    Specific changes:
    * Add `SupportsTruncate`, `SupportsOverwrite`, and `SupportsDynamicOverwrite` to DSv2 write API
    * Add `OverwriteByExpression` and `OverwritePartitionsDynamic` plans (logical and physical)
    * Add new plans to DSv2 write validation rule `ResolveOutputRelation`
    * Refactor `WriteToDataSourceV2Exec` into trait used by all DSv2 write exec nodes
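
    A minimal sketch of a write builder opting into the new `SupportsOverwrite`/`SupportsTruncate` capabilities (final Spark 3.x package names and the `ExampleWriteBuilder`/`newBatchWrite` names are assumptions, not this branch's exact layout):

    ```scala
    import org.apache.spark.sql.connector.write.{BatchWrite, SupportsOverwrite, WriteBuilder}
    import org.apache.spark.sql.sources.Filter

    class ExampleWriteBuilder(newBatchWrite: Array[Filter] => BatchWrite) extends SupportsOverwrite {

      private var deleteFilters: Array[Filter] = Array.empty

      // OverwriteByExpression hands the delete condition to the source as filters; in the
      // released API, SupportsOverwrite's default truncate() delegates here with an
      // always-true filter, which is how plain truncation is expressed.
      override def overwrite(filters: Array[Filter]): WriteBuilder = {
        deleteFilters = filters
        this
      }

      override def buildForBatch(): BatchWrite = newBatchWrite(deleteFilters)
    }
    ```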
    
    ## How was this patch tested?
    
    * The v2 analysis suite has been updated to validate the new overwrite plans
    * The analysis suite for `OverwriteByExpression` checks that the delete expression is resolved using the table's columns
    * Existing tests validate that overwrite exec plan works
    * Updated existing v2 test because schema is used to validate overwrite
    
    Closes apache#23606 from rdblue/SPARK-26666-add-overwrite.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    dc26348
  3. [SPARK-26785][SQL] data source v2 API refactor: streaming write

    ## What changes were proposed in this pull request?
    
    Continue the API refactor for streaming write, according to the [doc](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing).
    
    The major changes:
    1. rename `StreamingWriteSupport` to `StreamingWrite`
    2. add `WriteBuilder.buildForStreaming`
    3. update existing sinks, to move the creation of `StreamingWrite` to `Table`
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#23702 from cloud-fan/stream-write.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and mccheah committed May 15, 2019
    af5d187
  4. [SPARK-24252][SQL] Add v2 catalog plugin system

    ## What changes were proposed in this pull request?
    
    This adds a v2 API for adding new catalog plugins to Spark.
    
    * Catalog implementations extend `CatalogPlugin` and are loaded via reflection, similar to data sources
    * `Catalogs` loads and initializes catalogs using configuration from a `SQLConf`
    * `CaseInsensitiveStringMap` is used to pass configuration to `CatalogPlugin` via `initialize`
    
    Catalogs are configured by adding config properties starting with `spark.sql.catalog.(name)`. The name property must specify a class that implements `CatalogPlugin`. Other properties under the namespace (`spark.sql.catalog.(name).(prop)`) are passed to the provider during initialization along with the catalog name.
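
    A minimal sketch of the plugin contract described above (final Spark 3.x package names; the `ExampleCatalog` class and the `warehouse` option are illustrative assumptions):

    ```scala
    import org.apache.spark.sql.connector.catalog.CatalogPlugin
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    class ExampleCatalog extends CatalogPlugin {
      private var catalogName: String = _
      private var warehouse: String = _

      // Receives the catalog name plus every spark.sql.catalog.(name).(prop) entry.
      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
        catalogName = name
        warehouse = options.get("warehouse")
      }

      override def name(): String = catalogName
    }

    // Registered purely through configuration, e.g. in spark-defaults.conf:
    //   spark.sql.catalog.example           com.example.ExampleCatalog
    //   spark.sql.catalog.example.warehouse /tmp/example-warehouse
    ```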
    
    This replaces apache#21306, which will be implemented in two parts: the catalog plugin system (this commit) and specific catalog APIs, like `TableCatalog`.
    
    ## How was this patch tested?
    
    Added test suites for `CaseInsensitiveStringMap` and for catalog loading.
    
    Closes apache#23915 from rdblue/SPARK-24252-add-v2-catalog-plugins.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    167ffec
  5. [SPARK-26946][SQL] Identifiers for multi-catalog

    ## What changes were proposed in this pull request?
    
    - Support N-part identifier in SQL
    - N-part identifier extractor in Analyzer
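
    Illustrative only, assuming a `spark` session and a catalog registered as `testcat`: the first part of the identifier now selects the catalog, and the rest is resolved inside it.

    ```scala
    spark.sql("SELECT * FROM testcat.db.events WHERE day = '2019-05-15'")
    ```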
    
    ## How was this patch tested?
    
    - A new unit test suite ResolveMultipartRelationSuite
    - CatalogLoadingSuite
    
    rblue cloud-fan mccheah
    
    Closes apache#23848 from jzhuge/SPARK-26946.
    
    Authored-by: John Zhuge <jzhuge@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jzhuge authored and mccheah committed May 15, 2019
    3da923b
  6. [SPARK-27250][TEST-MAVEN][BUILD] Scala 2.11 maven compile should target Java 1.8
    
    ## What changes were proposed in this pull request?
    
    Fix Scala 2.11 maven build issue after merging SPARK-26946.
    
    ## How was this patch tested?
    
    Maven Scala 2.11 and 2.12 builds with `-Phadoop-provided -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver`.
    
    Closes apache#24184 from jzhuge/SPARK-26946-1.
    
    Authored-by: John Zhuge <jzhuge@apache.org>
    Signed-off-by: Sean Owen <sean.owen@databricks.com>
    jzhuge authored and mccheah committed May 15, 2019
    85d0f08
  7. [SPARK-26673][FOLLOWUP][SQL] File Source V2: check existence of output path before deleting it
    
    ## What changes were proposed in this pull request?
    This is a followup PR to resolve comment: apache#23601 (review)
    
    When Spark writes a DataFrame with "overwrite" mode, it deletes the output path before the actual write. To safely handle the case where the output path doesn't exist, it is suggested to follow the V1 code and check for existence first.
    
    ## How was this patch tested?
    
    Apply apache#23836 and run unit tests
    
    Closes apache#23889 from gengliangwang/checkFileBeforeOverwrite.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    gengliangwang authored and mccheah committed May 15, 2019
    4db1e19
  8. [SPARK-26952][SQL] Row count statistics should respect the data reported by data source
    
    ## What changes were proposed in this pull request?
    
    In data source v2, if the data source scan implements `SupportsReportStatistics`, `DataSourceV2Relation` should respect the row count reported by the data source.
    
    ## How was this patch tested?
    
    A new unit test.
    
    Closes apache#23853 from ConeyLiu/report-row-count.
    
    Authored-by: Xianyang Liu <xianyang.liu@intel.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    ConeyLiu authored and mccheah committed May 15, 2019
    2733301
  9. [SPARK-26871][SQL] File Source V2: avoid creating unnecessary FileIndex in the write path
    
    ## What changes were proposed in this pull request?
    
    In apache#23383, the file source V2 framework is implemented. In that PR, `FileIndex` is created as a member of `FileTable`, so that we can implement partition pruning like apache@0f9fcab in the future (as the data source V2 catalog is under development, partition pruning was removed from that PR).
    
    However, after the write path of file source V2 was implemented, I found that a simple write creates an unnecessary `FileIndex`, which is required by `FileTable`. This is a sort of regression, and we can see a warning message when writing to ORC files:
    ```
    WARN InMemoryFileIndex: The directory file:/tmp/foo was not found. Was it deleted very recently?
    ```
    This PR is to make `FileIndex` as a lazy value in `FileTable`, so that we can avoid creating unnecessary `FileIndex` in the write path.
    
    ## How was this patch tested?
    
    Existing unit test
    
    Closes apache#23774 from gengliangwang/moveFileIndexInV2.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gengliangwang authored and mccheah committed May 15, 2019
    4094211
  10. [SPARK-26744][SQL] Support schema validation in FileDataSourceV2 framework
    
    ## What changes were proposed in this pull request?
    
    The file source has a schema validation feature, which validates 2 schemas:
    1. the user-specified schema when reading.
    2. the schema of input data when writing.
    
    If a file source doesn't support the schema, we can fail the query earlier.
    
    This PR implements the same feature in the `FileDataSourceV2` framework. Compared to `FileFormat`, `FileDataSourceV2` has multiple layers. The API is added in two places:
    1. Read path: the table schema is determined in `TableProvider.getTable`. The actual read schema can be a subset of the table schema. This PR proposes to validate the actual read schema in `FileScan`.
    2. Write path: validate the actual output schema in `FileWriteBuilder`.
    
    ## How was this patch tested?
    
    Unit test
    
    Closes apache#23714 from gengliangwang/schemaValidationV2.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gengliangwang authored and mccheah committed May 15, 2019
    caa5fab
  11. [SPARK-26956][SS] remove streaming output mode from data source v2 APIs

    ## What changes were proposed in this pull request?
    
    Similar to `SaveMode`, we should remove streaming `OutputMode` from the data source v2 API, and use operations that have clear semantics.
    
    The changes are:
    1. append mode: create `StreamingWrite` directly. By default, the `WriteBuilder` will create `Write` to append data.
    2. complete mode: call `SupportsTruncate#truncate`. Complete mode means truncating all the old data and appending new data of the current epoch. `SupportsTruncate` has exactly the same semantic.
    3. update mode: fail. The current streaming framework can't propagate the update keys, so v2 sinks are not able to implement update mode. In the future we can introduce a `SupportsUpdate` trait.
    
    The behavior changes:
    1. All the v2 sinks (foreach, console, memory, kafka, noop) no longer support update mode. The fact is, previously all the v2 sinks implemented update mode incorrectly; none of them could really support it.
    2. The kafka sink no longer supports complete mode. The fact is, the kafka sink can only append data.
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#23859 from cloud-fan/update.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and mccheah committed May 15, 2019
    d2f0dd5
  12. [SPARK-26389][SS] Add force delete temp checkpoint configuration

    ## What changes were proposed in this pull request?
    
    Not all users want to keep temporary checkpoint directories; additionally, it is hard to restore from them.
    
    In this PR I've added a force delete flag which defaults to `false`. It is also not clear to users when the temporary checkpoint directory is deleted, so I added log messages to explain this a bit more.
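
    A hedged sketch of enabling the flag, assuming an existing `spark` session; the key below is the flag's name in released Spark and is an assumption for this branch:

    ```scala
    spark.conf.set("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")

    // No checkpointLocation is given, so a temporary checkpoint directory is created;
    // with the flag above it is deleted when the query stops instead of being left behind.
    val query = spark.readStream.format("rate").load()
      .writeStream
      .format("console")
      .start()
    query.stop()
    ```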
    
    ## How was this patch tested?
    
    Existing + additional unit tests.
    
    Closes apache#23732 from gaborgsomogyi/SPARK-26389.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    gaborgsomogyi authored and mccheah committed May 15, 2019
    49dd067
  13. [SPARK-26824][SS] Fix the checkpoint location and _spark_metadata when it contains special chars
    
    ## What changes were proposed in this pull request?
    
    When a user specifies a checkpoint location or a file sink output using a path containing special chars that need to be escaped, the streaming query will store checkpoint and file sink metadata in the wrong place. In this PR, I uploaded a checkpoint that was generated by the following code using Spark 2.4.0 to show this issue:
    
    ```
    implicit val s = spark.sqlContext
    val input = org.apache.spark.sql.execution.streaming.MemoryStream[Int]
    input.addData(1, 2, 3)
    val q = input.toDF.writeStream.format("parquet").option("checkpointLocation", ".../chk %#chk").start(".../output %#output")
    q.stop()
    ```
    Here is the structure of the directory:
    ```
    sql/core/src/test/resources/structured-streaming/escaped-path-2.4.0
    ├── chk%252520%252525%252523chk
    │   ├── commits
    │   │   └── 0
    │   ├── metadata
    │   └── offsets
    │       └── 0
    ├── output %#output
    │   └── part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet
    └── output%20%25%23output
        └── _spark_metadata
            └── 0
    ```
    
    In this checkpoint, the user specified checkpoint location is `.../chk %#chk` but the real path to store the checkpoint is `.../chk%252520%252525%252523chk` (this is generated by escaping the original path three times). The user specified output path is `.../output %#output` but the path to store `_spark_metadata` is `.../output%20%25%23output/_spark_metadata` (this is generated by escaping the original path once). The data files are still in the correct path (such as `.../output %#output/part-00000-97f675a2-bb82-4201-8245-05f3dae4c372-c000.snappy.parquet`).
    
    This checkpoint will be used in unit tests in this PR.
    
    The fix is simply to remove the improper `Path.toUri` calls.
    
    However, as the user may not read the release note and may not be aware of this checkpoint location change, if they upgrade Spark without moving the checkpoint to the new location, their query will just start from scratch. In order not to surprise the users, this PR also adds a check to **detect the impacted paths and throw an error** that includes the migration guide. This check can be turned off with the internal SQL conf `spark.sql.streaming.checkpoint.escapedPathCheck.enabled`. Here are examples of errors that will be reported:
    
    - Streaming checkpoint error:
    ```
    Error: we detected a possible problem with the location of your checkpoint and you
    likely need to move it before restarting this query.
    
    Earlier version of Spark incorrectly escaped paths when writing out checkpoints for
    structured streaming. While this was corrected in Spark 3.0, it appears that your
    query was started using an earlier version that incorrectly handled the checkpoint
    path.
    
    Correct Checkpoint Directory: /.../chk %#chk
    Incorrect Checkpoint Directory: /.../chk%252520%252525%252523chk
    
    Please move the data from the incorrect directory to the correct one, delete the
    incorrect directory, and then restart this query. If you believe you are receiving
    this message in error, you can disable it with the SQL conf
    spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
    ```
    
    - File sink error (`_spark_metadata`):
    ```
    Error: we detected a possible problem with the location of your "_spark_metadata"
    directory and you likely need to move it before restarting this query.
    
    Earlier version of Spark incorrectly escaped paths when writing out the
    "_spark_metadata" directory for structured streaming. While this was corrected in
    Spark 3.0, it appears that your query was started using an earlier version that
    incorrectly handled the "_spark_metadata" path.
    
    Correct "_spark_metadata" Directory: /.../output %#output/_spark_metadata
    Incorrect "_spark_metadata" Directory: /.../output%20%25%23output/_spark_metadata
    
    Please move the data from the incorrect directory to the correct one, delete the
    incorrect directory, and then restart this query. If you believe you are receiving
    this message in error, you can disable it with the SQL conf
    spark.sql.streaming.checkpoint.escapedPathCheck.enabled.
    ```
    
    ## How was this patch tested?
    
    The new unit tests.
    
    Closes apache#23733 from zsxwing/path-fix.
    
    Authored-by: Shixiong Zhu <zsxwing@gmail.com>
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    zsxwing authored and mccheah committed May 15, 2019
    1f5d3d4
  14. [SPARK-27111][SS] Fix a race where a continuous query may fail with InterruptedException
    
    ## What changes were proposed in this pull request?
    
    Before a Kafka consumer gets assigned partitions, its offset will contain 0 partitions. However, `runContinuous` will still run and launch a Spark job having 0 partitions. In this case, there is a race where an epoch may interrupt the query execution thread after `lastExecution.toRdd`, and either `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` or the next `runContinuous` will get interrupted unintentionally.
    
    To handle this case, this PR has the following changes:
    
    - Clean up the resources in `queryExecutionThread.runUninterruptibly`. This may increase the waiting time of `stop` but should be minor because the operations here are very fast (just sending an RPC message in the same process and stopping a very simple thread).
    - Clear the interrupted status at the end so that it won't impact the `runContinuous` call. We may clear the interrupted status set by `stop`, but it doesn't affect the query termination because `runActivatedStream` will check `state` and exit accordingly.
    
    I also updated the clean up codes to make sure exceptions thrown from `epochEndpoint.askSync[Unit](StopContinuousExecutionWrites)` won't stop the clean up.
    
    ## How was this patch tested?
    
    Jenkins
    
    Closes apache#24034 from zsxwing/SPARK-27111.
    
    Authored-by: Shixiong Zhu <zsxwing@gmail.com>
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
    zsxwing authored and mccheah committed May 15, 2019
    982df04
  15. [SPARK-24063][SS] Add maximum epoch queue threshold for ContinuousExecution
    
    ## What changes were proposed in this pull request?
    
    Continuous processing waits on epochs which are not yet complete (for example, one partition is not making progress) and stores pending items in queues. These queues are unbounded and can easily consume all the memory. In this PR I've added the `spark.sql.streaming.continuous.epochBacklogQueueSize` configuration to make them bounded. If the threshold is reached, the query stops with an `IllegalStateException`.
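
    A sketch of bounding the backlog; the config key comes from the description above, while the threshold value and trigger interval are arbitrary examples:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("continuous-backlog")
      .config("spark.sql.streaming.continuous.epochBacklogQueueSize", "1000")
      .getOrCreate()

    // If more than 1000 epochs pile up behind a stalled partition, the query now fails
    // with IllegalStateException instead of buffering without bound.
    spark.readStream.format("rate").load()
      .writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()
    ```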
    
    ## How was this patch tested?
    
    Existing + additional unit tests.
    
    Closes apache#23156 from gaborgsomogyi/SPARK-24063.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    gaborgsomogyi authored and mccheah committed May 15, 2019
    38556e7
  16. [SPARK-27064][SS] create StreamingWrite at the beginning of streaming execution
    
    ## What changes were proposed in this pull request?
    
    According to the [design](https://docs.google.com/document/d/1vI26UEuDpVuOjWw4WPoH2T6y8WAekwtI7qoowhOFnI4/edit?usp=sharing), the life cycle of `StreamingWrite` should be the same as the read side `MicroBatch/ContinuousStream`, i.e. each run of the stream query, instead of each epoch.
    
    This PR fixes it.
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#23981 from cloud-fan/dsv2.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan authored and mccheah committed May 15, 2019
    3fecdd9
  17. [SPARK-27106][SQL] merge CaseInsensitiveStringMap and DataSourceOptions

    It's a little awkward to have two different classes (`CaseInsensitiveStringMap` and `DataSourceOptions`) to represent the options in the data source and catalog APIs.
    
    This PR merges these 2 classes, while keeping the name `CaseInsensitiveStringMap`, which is more precise.
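
    A small sketch of the merged options class (final package name assumed; keys and values are examples):

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    val options = new CaseInsensitiveStringMap(
      Map("Path" -> "/tmp/data", "inferSchema" -> "true").asJava)

    options.get("path")                      // "/tmp/data" -- lookups ignore case
    options.getBoolean("INFERSCHEMA", false) // true
    ```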
    
    existing tests
    
    Closes apache#24025 from cloud-fan/option.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan authored and mccheah committed May 15, 2019
    1609b3f
  18. [SPARK-26811][SQL] Add capabilities to v2.Table

    This adds a new method, `capabilities`, to `v2.Table` that returns a set of `TableCapability`. Capabilities are used to fail queries during analysis checks, such as `V2WriteSupportCheck`, when the table does not support operations like truncation.
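
    A sketch of a table advertising what it supports (final Spark 3.x package names assumed; the class name is illustrative):

    ```scala
    import java.util
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    class ExampleTable(tableSchema: StructType) extends Table {
      override def name(): String = "example"
      override def schema(): StructType = tableSchema

      // V2WriteSupportCheck consults this set; a missing capability (e.g. TRUNCATE)
      // makes the corresponding query fail at analysis time.
      override def capabilities(): util.Set[TableCapability] =
        Set(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE).asJava
    }
    ```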
    
    Existing tests for regressions, added new analysis suite, `V2WriteSupportCheckSuite`, for new capability checks.
    
    Closes apache#24012 from rdblue/SPARK-26811-add-capabilities.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    8f9c5ac
  19. [SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file.
    
    Currently in the grammar file the rule `query` is responsible for parsing both select and insert statements. As a result, we need more semantic checks in the code to guard against invalid insert constructs in a query. A couple of examples are in the `visitCreateView` and `visitAlterView` functions. Another issue is that we don't catch the invalid insert constructs everywhere until checkAnalysis (and the errors we raise can be confusing as well). Here are a couple of examples:
    
    ```SQL
    select * from (insert into bar values (2));
    ```
    ```
    Error in query: unresolved operator 'Project [*];
    'Project [*]
    +- SubqueryAlias `__auto_generated_subquery_name`
       +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
          +- Project [cast(col1#18 as int) AS c1#20]
             +- LocalRelation [col1#18]
    ```
    
    ```SQL
    select * from foo where c1 in (insert into bar values (2))
    ```
    ```
    Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch:
    The number of columns in the left hand side of an IN subquery does not match the
    number of columns in the output of subquery.
    
    Left side columns:
    [default.foo.`c1`].
    Right side columns:
    [].;;
    'Project [*]
    +- 'Filter c1#6 IN (list#5 [])
       :  +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
       :     +- Project [cast(col1#7 as int) AS c1#9]
       :        +- LocalRelation [col1#7]
       +- SubqueryAlias `default`.`foo`
          +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]
    ```
    
    For both the cases above, we should reject the syntax at parser level.
    
    In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively.
    I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.
    Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.
    
    Closes apache#24150 from dilipbiswal/split-query-insert.
    
    Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dilipbiswal authored and mccheah committed May 15, 2019
    c46db75
  20. Revert "[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file."
    
    This reverts commit c46db75.
    mccheah committed May 15, 2019
    bc6ece7
  21. [SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file.
    
    Currently in the grammar file the rule `query` is responsible for parsing both select and insert statements. As a result, we need more semantic checks in the code to guard against invalid insert constructs in a query. A couple of examples are in the `visitCreateView` and `visitAlterView` functions. Another issue is that we don't catch the invalid insert constructs everywhere until checkAnalysis (and the errors we raise can be confusing as well). Here are a couple of examples:
    
    ```SQL
    select * from (insert into bar values (2));
    ```
    ```
    Error in query: unresolved operator 'Project [*];
    'Project [*]
    +- SubqueryAlias `__auto_generated_subquery_name`
       +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
          +- Project [cast(col1#18 as int) AS c1#20]
             +- LocalRelation [col1#18]
    ```
    
    ```SQL
    select * from foo where c1 in (insert into bar values (2))
    ```
    ```
    Error in query: cannot resolve '(default.foo.`c1` IN (listquery()))' due to data type mismatch:
    The number of columns in the left hand side of an IN subquery does not match the
    number of columns in the output of subquery.
    
    Left side columns:
    [default.foo.`c1`].
    Right side columns:
    [].;;
    'Project [*]
    +- 'Filter c1#6 IN (list#5 [])
       :  +- InsertIntoHiveTable `default`.`bar`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, false, false, [c1]
       :     +- Project [cast(col1#7 as int) AS c1#9]
       :        +- LocalRelation [col1#7]
       +- SubqueryAlias `default`.`foo`
          +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#6]
    ```
    
    For both the cases above, we should reject the syntax at parser level.
    
    In this PR, we create two top-level parser rules to parse `SELECT` and `INSERT` respectively.
    I will create a small PR to allow CTEs in DESCRIBE QUERY after this PR is in.
    Added tests to PlanParserSuite and removed the semantic check tests from SparkSqlParserSuites.
    
    Closes apache#24150 from dilipbiswal/split-query-insert.
    
    Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dilipbiswal authored and mccheah committed May 15, 2019
    b9a2061
  22. Revert "[SPARK-27209][SQL] Split parsing of SELECT and INSERT into two top-level rules in the grammar file."
    
    This reverts commit b9a2061.
    mccheah committed May 15, 2019
    e68a36c
  23. [SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard
    
    ## What changes were proposed in this pull request?
    This PR targets defining reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). We assume that they basically follow the ANSI SQL-2011 standard, but they differ slightly from each other. Therefore, this PR documents all the keywords in `docs/sql-reserved-and-non-reserved-key-words.md`.
    
    NOTE: This PR only adds a small set of keywords as reserved ones, and these keywords are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and PostgreSQL. This is because there is room to discuss which keywords should be reserved or not; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards but not in PostgreSQL. Therefore, we need more research on the other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up activities.
    
    References:
     - The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
     - SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html
    
    ## How was this patch tested?
    Added tests in `TableIdentifierParserSuite`.
    
    Closes apache#23259 from maropu/SPARK-26215-WIP.
    
    Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
    Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
    maropu authored and mccheah committed May 15, 2019
    942ac18
  24. [SPARK-26215][SQL][FOLLOW-UP][MINOR] Fix the warning from ANTLR4

    ## What changes were proposed in this pull request?
    I see the following new warning from ANTLR4 after SPARK-26215 added the `SCHEMA` keyword to the reserved/unreserved list. This is a minor PR to clean up the warning.
    
    ```
    [WARNING] warning(125): org/apache/spark/sql/catalyst/parser/SqlBase.g4:784:90: implicit definition of token SCHEMA in parser
    [WARNING] .../apache/spark/org/apache/spark/sql/catalyst/parser/SqlBase.g4 [784:90]: implicit definition of token SCHEMA in parser
    ```
    ## How was this patch tested?
    Manually built catalyst after the fix to verify
    
    Closes apache#23897 from dilipbiswal/minor_parser_token.
    
    Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    dilipbiswal authored and mccheah committed May 15, 2019
    4a7c007
  25. [SPARK-26982][SQL] Enhance describe framework to describe the output of a query.
    
    Currently we can use `df.printSchema` to discover the schema information for a query. We should have a way to describe the output schema of a query using SQL interface.
    
    Example:
    
    DESCRIBE SELECT * FROM desc_table
    DESCRIBE QUERY SELECT * FROM desc_table
    ```SQL
    
    spark-sql> create table desc_table (c1 int comment 'c1-comment', c2 decimal comment 'c2-comment', c3 string);
    
    spark-sql> desc select * from desc_table;
    c1	int	        c1-comment
    c2	decimal(10,0)	c2-comment
    c3	string	        NULL
    
    ```
    Added a new test under SQLQueryTestSuite and SparkSqlParserSuite
    
    Closes apache#23883 from dilipbiswal/dkb_describe_query.
    
    Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    dilipbiswal authored and mccheah committed May 15, 2019
    6cb9234
  26. [SPARK-27108][SQL] Add parsed SQL plans for create, CTAS.

    This moves parsing `CREATE TABLE ... USING` statements into catalyst. Catalyst produces logical plans with the parsed information and those plans are converted to v1 `DataSource` plans in `DataSourceAnalysis`.
    
    This prepares for adding v2 create plans that should receive the information parsed from SQL without being translated to v1 plans first.
    
    This also makes it possible to parse in catalyst instead of breaking the parser across the abstract `AstBuilder` in catalyst and `SparkSqlParser` in core.
    
    For more information, see the [mailing list thread](https://lists.apache.org/thread.html/54f4e1929ceb9a2b0cac7cb058000feb8de5d6c667b2e0950804c613%3Cdev.spark.apache.org%3E).
    
    This uses existing tests to catch regressions. This introduces no behavior changes.
    
    Closes apache#24029 from rdblue/SPARK-27108-add-parsed-create-logical-plans.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    f0d9915
  27. [SPARK-27181][SQL] Add public transform API

    ## What changes were proposed in this pull request?
    
    This adds a public Expression API that can be used to pass partition transformations to data sources.
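
    A sketch of building partition transforms with the public factory (the final `org.apache.spark.sql.connector.expressions` package is assumed; on this branch the API lives under a different package):

    ```scala
    import org.apache.spark.sql.connector.expressions.{Expressions, Transform}

    // e.g. PARTITIONED BY (years(ts), bucket(16, id)) expressed programmatically --
    // the form a v2 catalog receives when a partitioned table is created.
    val partitioning: Array[Transform] = Array(
      Expressions.years("ts"),
      Expressions.bucket(16, "id")
    )
    ```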
    
    ## How was this patch tested?
    
    Existing tests to validate no regressions. Added transform cases to DDL suite and v1 conversions suite.
    
    Closes apache#24117 from rdblue/add-public-transform-api.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    0f9ac2a
  28. [SPARK-24252][SQL] Add TableCatalog API

    ## What changes were proposed in this pull request?
    
    This adds the TableCatalog API proposed in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d).
    
    For `TableCatalog` to use `Table`, it needed to be moved into the catalyst module where the v2 catalog API is located. This also required moving `TableCapability`. Most of the files touched by this PR are import changes needed by this move.
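
    A sketch of driving the new catalog API directly (final Spark 3.x package names assumed; `catalog` stands for any loaded `TableCatalog` implementation and the table/column names are examples):

    ```scala
    import java.util.Collections
    import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
    import org.apache.spark.sql.connector.expressions.Transform
    import org.apache.spark.sql.types.StructType

    def ensureEventsTable(catalog: TableCatalog): Unit = {
      val ident  = Identifier.of(Array("db"), "events")
      val schema = new StructType().add("id", "long").add("day", "string")

      if (!catalog.tableExists(ident)) {
        catalog.createTable(ident, schema, Array.empty[Transform], Collections.emptyMap[String, String]())
      }
      val table = catalog.loadTable(ident) // returns the v2 Table handle
      assert(table.name() != null)
    }
    ```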
    
    ## How was this patch tested?
    
    This adds a test implementation and contract tests.
    
    Closes apache#24246 from rdblue/SPARK-24252-add-table-catalog-api.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    d70253e
  29. [SPARK-24923][SQL] Implement v2 CreateTableAsSelect

    This adds a v2 implementation for CTAS queries
    
    * Update the SQL parser to parse CREATE queries using multi-part identifiers
    * Update `CheckAnalysis` to validate partitioning references with the CTAS query schema
    * Add `CreateTableAsSelect` v2 logical plan and `CreateTableAsSelectExec` v2 physical plan
    * Update create conversion from `CreateTableAsSelectStatement` to support the new v2 logical plan
    * Update `DataSourceV2Strategy` to convert v2 CTAS logical plan to the new physical plan
    * Add `findNestedField` to `StructType` to support reference validation
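
    Illustrative CTAS against a registered v2 catalog (`testcat`, the `foo` provider, and the table names are assumptions), exercising the new `CreateTableAsSelect` plan:

    ```scala
    spark.sql(
      """CREATE TABLE testcat.db.events_by_day
        |USING foo
        |PARTITIONED BY (day)
        |AS SELECT id, day FROM testcat.db.events
        |""".stripMargin)
    ```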
    
    We have been running these changes in production for several months. Also:
    
    * Add a test suite `CreateTablePartitioningValidationSuite` for new analysis checks
    * Add a test suite for v2 SQL, `DataSourceV2SQLSuite`
    * Update catalyst `DDLParserSuite` to use multi-part identifiers (`Seq[String]`)
    * Add test cases to `PlanResolutionSuite` for v2 CTAS: known catalog and v2 source implementation
    
    Closes apache#24570 from rdblue/SPARK-24923-add-v2-ctas.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed May 15, 2019
    d2b526c
  30. Fix scala 2.11 compilation

    mccheah committed May 15, 2019
    e133b92
  31. Fix style

    mccheah committed May 15, 2019
    6dbc1d3
  32. [SPARK-27162][SQL] Add new method asCaseSensitiveMap in CaseInsensitiveStringMap
    
    Currently, DataFrameReader/DataFrameWriter supports setting Hadoop configurations via the `.option()` method.
    E.g., the following test case should pass with both ORC V1 and V2:
    ```
      import org.apache.hadoop.fs.{Path, PathFilter}

      class TestFileFilter extends PathFilter {
        override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
      }

      // withTempPath is the SQLTestUtils helper used in Spark's test suites.
      withTempPath { dir =>
        val path = dir.getCanonicalPath

        val df = spark.range(2)
        df.write.orc(path + "/p=1")
        df.write.orc(path + "/p=2")
        val extraOptions = Map(
          "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
          "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
        )
        assert(spark.read.options(extraOptions).orc(path).count() === 2)
      }
    ```
    While Hadoop Configurations are case sensitive, the current data source V2 APIs use `CaseInsensitiveStringMap` in the top-level entry `TableProvider`.
    To create Hadoop configurations correctly, I suggest:
    1. adding a new method `asCaseSensitiveMap` to `CaseInsensitiveStringMap`;
    2. making `CaseInsensitiveStringMap` read-only to avoid ambiguous conversions in `asCaseSensitiveMap`.
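
    A sketch of the new accessor: recovering the user's original key casing when building a Hadoop `Configuration` from v2 options (final package name assumed):

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.sql.util.CaseInsensitiveStringMap

    def toHadoopConf(options: CaseInsensitiveStringMap): Configuration = {
      val conf = new Configuration()
      // asCaseSensitiveMap() keeps keys such as "mapreduce.input.pathFilter.class" exactly as given.
      options.asCaseSensitiveMap().asScala.foreach { case (k, v) => conf.set(k, v) }
      conf
    }
    ```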
    
    Unit test
    
    Closes apache#24094 from gengliangwang/originalMap.
    
    Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    gengliangwang authored and mccheah committed May 15, 2019
    d49a179
  33. Fix compilation

    mccheah committed May 15, 2019
    affb14b
  34. More Scala 2.11 stuff

    mccheah committed May 15, 2019
    4661671
  35. [SPARK-26744][SQL][HOTFIX] Disable schema validation tests for FileDataSourceV2 (partially revert)
    
    ## What changes were proposed in this pull request?
    
    This PR partially reverts SPARK-26744.
    
    apache@60caa92 and apache@4dce45a were merged at similar time range independently. So the test failures were not caught.
    
    - apache@60caa92 happened to add a schema reading logic in writing path for overwrite mode as well.
    
    - apache@4dce45a added some tests with overwrite modes with migrated ORC v2.
    
    And the tests started to fail.
    
    I guess the discussion won't be short (see apache#23606 (comment)) and this PR proposes to disable the tests added at apache@4dce45a to unblock other PRs for now.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Closes apache#23828 from HyukjinKwon/SPARK-26744.
    
    Authored-by: Hyukjin Kwon <gurwls223@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    HyukjinKwon authored and mccheah committed May 15, 2019
    ee834f7

Commits on Jun 6, 2019

  1. 294eaef
  2. [SPARK-26811][SQL][FOLLOWUP] fix some documentation

    ## What changes were proposed in this pull request?
    
    It's a followup of apache#24012, to fix two pieces of documentation:
    1. `SupportsRead` and `SupportsWrite` are not internal anymore. They are public interfaces now.
    2. `Scan` should link to `BATCH_READ` instead of hardcoding it.
    
    ## How was this patch tested?
    N/A
    
    Closes apache#24285 from cloud-fan/doc.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    5e7eb12
  3. [MINOR][TEST][DOC] Execute action miss name message

    ## What changes were proposed in this pull request?
    
    Some minor updates:
    - the `Execute` action was missing the `name` in its message
    - a typo in the SS document
    - a typo in SQLConf
    
    ## How was this patch tested?
    
    N/A
    
    Closes apache#24466 from uncleGen/minor-fix.
    
    Authored-by: uncleGen <hustyugm@gmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    uncleGen authored and mccheah committed Jun 6, 2019
    57153b4
  4. [SPARK-27576][SQL] table capability to skip the output column resolution

    Currently we have an analyzer rule which resolves the output columns of data source v2 writing plans, to make sure the schema of the input query is compatible with the table.
    
    However, not all data sources need this check. For example, the `NoopDataSource` doesn't care about the schema of the input query at all.
    
    This PR introduces a new table capability: ACCEPT_ANY_SCHEMA. If a table reports this capability, we skip resolving output columns for it during write.
    
    Note that we already skip resolving output columns for `NoopDataSource` because it implements `SupportsSaveMode`. However, `SupportsSaveMode` is a hack and will be removed soon.
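
    A sketch of a sink table opting out of output-column resolution by reporting ACCEPT_ANY_SCHEMA (final Spark 3.x package names assumed; the class name is illustrative):

    ```scala
    import java.util
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
    import org.apache.spark.sql.types.StructType

    class AnySchemaSinkTable extends Table {
      override def name(): String = "any-schema-sink"
      override def schema(): StructType = new StructType() // the schema is irrelevant to this sink
      override def capabilities(): util.Set[TableCapability] =
        Set(TableCapability.BATCH_WRITE, TableCapability.ACCEPT_ANY_SCHEMA).asJava
    }
    ```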
    
    new test cases
    
    Closes apache#24469 from cloud-fan/schema-check.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    0c2d6aa
  5. [SPARK-26356][SQL] remove SaveMode from data source v2

    In data source v1, the save mode specified in `DataFrameWriter` is passed to the data source implementation directly, and each data source can define its own behavior for save modes. This is confusing, and we want to get rid of save mode in data source v2.
    
    For data source v2, we expect data sources to implement the `TableCatalog` API, and end-users to use SQL (or the new write API described in [this doc](https://docs.google.com/document/d/1gYm5Ji2Mge3QBdOliFV5gSPTKlX4q1DCBXIkiyMv62A/edit?ts=5ace0718#heading=h.e9v1af12g5zo)) to access data sources. The SQL API has very clear semantics and we don't need save mode at all.
    
    However, for simple data sources that do not have table management (like a JIRA data source, a noop sink, etc.), it's not ideal to ask them to implement the `TableCatalog` API and throw exceptions here and there.
    
    `TableProvider` API is created for simple data sources. It can only get tables, without any other table management methods. This means, it can only deal with existing tables.
    
    `TableProvider` fits well with `DataStreamReader` and `DataStreamWriter`, as they can only read/write existing tables. However, `TableProvider` doesn't fit `DataFrameWriter` well, as the save mode requires more than just getting a table. More specifically, `ErrorIfExists` mode needs to check whether the table exists and create the table; `Ignore` mode needs to check whether the table exists. When end-users specify `ErrorIfExists` or `Ignore` mode and write data to a `TableProvider` via `DataFrameWriter`, Spark fails the query and asks users to use `Append` or `Overwrite` mode.
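
    Illustrative only, assuming a `spark` session; `com.example.SimpleSource` stands for a hypothetical `TableProvider`-based source, and the outcomes shown are the ones described above:

    ```scala
    val df = spark.range(10).toDF("id")

    df.write.format("com.example.SimpleSource").mode("append").save()        // supported
    df.write.format("com.example.SimpleSource").mode("overwrite").save()     // supported
    df.write.format("com.example.SimpleSource").mode("errorifexists").save() // fails: use Append or Overwrite
    ```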
    
    The file source is in the middle of `TableProvider` and `TableCatalog`: it's simple but it can check table(path) exists and create table(path). That said, file source supports all the save modes.
    
    Currently file source implements `TableProvider`, and it's not working because `TableProvider` doesn't support `ErrorIfExists` and `Ignore` modes. Ideally we should create a new API for path-based data sources, but to unblock the work of file source v2 migration, this PR proposes to special-case file source v2 in `DataFrameWriter`, to make it work.
    
    This PR also removes `SaveMode` from data source v2, as now only the internal file source v2 needs it.
    
    existing tests
    
    Closes apache#24233 from cloud-fan/file.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    d3e9b94
  6. Fix compilation issues

    mccheah committed Jun 6, 2019
    c0ffa90
  7. Fix scalastyle

    mccheah committed Jun 6, 2019
    c968388
  8. [SPARK-27521][SQL] Move data source v2 to catalyst module

    Currently we are in a strange state where some data source v2 interfaces (catalog related) are in sql/catalyst, and some (Table, ScanBuilder, DataReader, etc.) are in sql/core.
    
    I don't see a reason to keep the data source v2 API in two modules. If we have to pick one module, I think sql/catalyst is the one to go with.
    
    The catalyst module already has some user-facing classes like DataType and Row, and we have to update `Analyzer` and `SessionCatalog` to support the new catalog plugin, which needs to be in the catalyst module.
    
    This PR can solve the problem we have in apache#24246
    
    existing tests
    
    Closes apache#24416 from cloud-fan/move.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    d7e3943
  9. Fix merge conflicts

    mccheah committed Jun 6, 2019
    e0edb6c
  10. [SPARK-27732][SQL] Add v2 CreateTable implementation.

    ## What changes were proposed in this pull request?
    
    This adds a v2 implementation of create table:
    * `CreateV2Table` is the logical plan, named using v2 to avoid conflicting with the existing plan
    * `CreateTableExec` is the physical plan
    
    ## How was this patch tested?
    
    Added resolution and v2 SQL tests.
    
    Closes apache#24617 from rdblue/SPARK-27732-add-v2-create-table.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    rdblue authored and mccheah committed Jun 6, 2019
    c7c5d84
  11. [SPARK-26946][SQL][FOLLOWUP] Require lookup function

    ## What changes were proposed in this pull request?
    
    Require the lookup function via the LookupCatalog interface. The rationale is in the review comments below.
    
    Make `Analyzer` abstract. BaseSessionStateBuilder and HiveSessionStateBuilder implement lookupCatalog with a call to SparkSession.catalog().
    
    Existing test cases and those that don't need catalog lookup will use a newly added `TestAnalyzer` with a default lookup function that throws `CatalogNotFoundException("No catalog lookup function")`.
    
    Rewrote the unit test for LookupCatalog to demonstrate the interface can be used anywhere, not just Analyzer.
    
    Removed Analyzer parameter `lookupCatalog` because we can override in the following manner:
    ```
    new Analyzer() {
      override def lookupCatalog(name: String): CatalogPlugin = ???
    }
    ```
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Closes apache#24689 from jzhuge/SPARK-26946-follow.
    
    Authored-by: John Zhuge <jzhuge@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jzhuge authored and mccheah committed Jun 6, 2019
    6244b77
  12. [SPARK-27813][SQL] DataSourceV2: Add DropTable logical operation

    ## What changes were proposed in this pull request?
    
    Support DROP TABLE from V2 catalogs.
    Move DROP TABLE into catalyst.
    Move parsing tests for DROP TABLE/VIEW to PlanResolutionSuite to validate existing behavior.
    Add new tests for the catalyst parser suite.
    Separate DROP VIEW into different code path from DROP TABLE.
    Move DROP VIEW into catalyst as a new operator.
    Add a meaningful exception to indicate view is not currently supported in v2 catalog.
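
    Illustrative only (`testcat` is an assumed registered v2 catalog, and a `spark` session is assumed): DROP TABLE now resolves through the catalog and maps to the new DropTable logical plan.

    ```scala
    spark.sql("DROP TABLE IF EXISTS testcat.db.events")
    ```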
    
    ## How was this patch tested?
    
    New unit tests.
    Existing unit tests in catalyst and sql core.
    
    Closes apache#24686 from jzhuge/SPARK-27813-pr.
    
    Authored-by: John Zhuge <jzhuge@apache.org>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    jzhuge authored and mccheah committed Jun 6, 2019
    0db2aa0
  13. [SPARK-27103][SQL][MINOR] List SparkSql reserved keywords in alphabetical order
    
    ## What changes were proposed in this pull request?
    
    This PR corrects the position of spark-sql reserved keywords in the list when they are not in alphabetical order.
    In the test suite some repeated words are removed, and some comments are added as reminders.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Closes apache#23985 from SongYadong/sql_reserved_alphabet.
    
    Authored-by: SongYadong <song.yadong1@zte.com.cn>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    SongYadong authored and mccheah committed Jun 6, 2019
    d9e0cca
  14. [SPARK-27857][SQL] Move ALTER TABLE parsing into Catalyst

    This moves parsing logic for `ALTER TABLE` into Catalyst and adds parsed logical plans for alter table changes that use multi-part identifiers. This PR is similar to SPARK-27108, PR apache#24029, that created parsed logical plans for create and CTAS.
    
    * Create parsed logical plans
    * Move parsing logic into Catalyst's AstBuilder
    * Convert to DataSource plans in DataSourceResolution
    * Parse `ALTER TABLE ... SET LOCATION ...` separately from the partition variant
    * Parse `ALTER TABLE ... ALTER COLUMN ... [TYPE dataType] [COMMENT comment]` [as discussed on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Syntax-for-table-DDL-td25197.html#a25270)
    * Parse `ALTER TABLE ... RENAME COLUMN ... TO ...`
    * Parse `ALTER TABLE ... DROP COLUMNS ...`
    
    * Added new tests in Catalyst's `DDLParserSuite`
    * Moved converted plan tests from SQL `DDLParserSuite` to `PlanResolutionSuite`
    * Existing tests for regressions
    
    Closes apache#24723 from rdblue/SPARK-27857-add-alter-table-statements-in-catalyst.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    rdblue authored and mccheah committed Jun 6, 2019
    e1365ba
  15. Fix merge conflicts

    mccheah committed Jun 6, 2019
    1a45142
  16. Revert "Fix merge conflicts"

    This reverts commit 1a45142.
    mccheah committed Jun 6, 2019
    8bcc74d
  17. a3debfd
  18. Revert "[SPARK-27103][SQL][MINOR] List SparkSql reserved keywords in alphabetical order"
    
    This reverts commit d9e0cca.
    mccheah committed Jun 6, 2019
    7c1eb92
  19. [SPARK-27675][SQL] do not use MutableColumnarRow in ColumnarBatch

    ## What changes were proposed in this pull request?
    
    To move DS v2 API to the catalyst module, we can't refer to an internal class (`MutableColumnarRow`) in `ColumnarBatch`.
    
    This PR creates a read-only version of `MutableColumnarRow`, and use it in `ColumnarBatch`.
    
    close apache#24546
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#24581 from cloud-fan/mutable-row.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: HyukjinKwon <gurwls223@apache.org>
    cloud-fan authored and mccheah committed Jun 6, 2019
    cfa37b0
  20. [MINOR] Move java file to java directory

    ## What changes were proposed in this pull request?
    
    move
    ```scala
    org.apache.spark.sql.execution.streaming.BaseStreamingSource
    org.apache.spark.sql.execution.streaming.BaseStreamingSink
    ```
    to java directory
    
    ## How was this patch tested?
    
    Existing UT.
    
    Closes apache#24222 from ConeyLiu/move-scala-to-java.
    
    Authored-by: Xianyang Liu <xianyang.liu@intel.com>
    Signed-off-by: Sean Owen <sean.owen@databricks.com>
    ConeyLiu authored and mccheah committed Jun 6, 2019
    d8f503e
  21. [SPARK-27190][SQL] add table capability for streaming

    This is a followup of apache#24012 , to add the corresponding capabilities for streaming.
    
    existing tests
    
    Closes apache#24129 from cloud-fan/capability.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    Commit b28de53
  22. Fix merge conflicts

    mccheah committed Jun 6, 2019
    Commit a022526
  23. [SPARK-23014][SS] Fully remove V1 memory sink.

    There is already a MemorySink v2, so v1 can be removed. In this PR I've removed it completely.
    What this PR contains:
    * V1 memory sink removal
    * V2 memory sink renamed to become the only implementation
    * Since DSv2 sends exceptions in a chained format (linking them with the cause field), I've made the Python side compliant (see the sketch after this list)
    * Adapted all the tests
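
    A minimal sketch of what "chained format" means here (illustrative helper, not the code added in this PR): the interesting error is reached by walking the `cause` links to the root.
    
    ```scala
    // Walk the cause chain to its root; the reference check guards against
    // a throwable whose cause points back to itself.
    def rootCause(t: Throwable): Throwable = {
      var current = t
      while (current.getCause != null && (current.getCause ne current)) {
        current = current.getCause
      }
      current
    }
    ```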
    
    Existing unit tests.
    
    Closes apache#24403 from gaborgsomogyi/SPARK-23014.
    
    Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
    gaborgsomogyi authored and mccheah committed Jun 6, 2019
    Commit 4e5087f
  24. Fix merge conflicts

    mccheah committed Jun 6, 2019
    Commit 287e9d7
  25. [SPARK-27579][SQL] remove BaseStreamingSource and BaseStreamingSink

    ## What changes were proposed in this pull request?
    
    `BaseStreamingSource` and `BaseStreamingSink` are used to unify the v1 and v2 streaming data source APIs in some code paths.
    
    This PR removes these two interfaces and lets the v1 API extend the v2 API to keep API compatibility.
    
    The motivation is apache#24416. We want to move data source v2 to the catalyst module, but `BaseStreamingSource` and `BaseStreamingSink` are in sql/core.
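
    A minimal sketch of the compatibility pattern (hypothetical trait names, not the real Spark interfaces): instead of both APIs implementing a shared marker in sql/core, the v1 trait extends the v2 one, so code in catalyst can depend on the v2 type alone.
    
    ```scala
    // Hypothetical traits for illustration only.
    trait V2StreamingSource {                              // would live in catalyst
      def name: String
    }
    
    trait V1StreamingSource extends V2StreamingSource {    // would live in sql/core
      def legacySchemaJson: String
    }
    
    // Shared code can accept either version through the v2 type.
    def describe(source: V2StreamingSource): String = source.name
    ```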
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#24471 from cloud-fan/streaming.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    Commit d97de74
  26. [SPARK-27642][SS] make v1 offset extends v2 offset

    ## What changes were proposed in this pull request?
    
    To move DS v2 to the catalyst module, we can't make the v2 offset rely on the v1 offset, as the v1 offset is in sql/core.
    
    ## How was this patch tested?
    
    existing tests
    
    Closes apache#24538 from cloud-fan/offset.
    
    Authored-by: Wenchen Fan <wenchen@databricks.com>
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>
    cloud-fan authored and mccheah committed Jun 6, 2019
    Commit 12746c1
  27. Fix imports

    mccheah committed Jun 6, 2019
    Commit c99b896
  28. [SPARK-27693][SQL] Add default catalog property

    Add a SQL config property for the default v2 catalog.
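
    A hedged usage sketch: the property would be set like any other SQL conf; the key name used below is an assumption and may not match what this patch finally adds.
    
    ```scala
    // Assumed key name for illustration only; check SQLConf for the definitive entry.
    spark.conf.set("spark.sql.default.catalog", "testcat")
    ```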
    
    Existing tests for regressions.
    
    Closes apache#24594 from rdblue/SPARK-27693-add-default-catalog-config.
    
    Authored-by: Ryan Blue <blue@apache.org>
    Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
    rdblue authored and mccheah committed Jun 6, 2019
    Commit 5d2096e
  29. Fix merge conflicts

    mccheah committed Jun 6, 2019
    Commit f7e63d6
  30. Revert "Fix merge conflicts"

    This reverts commit f7e63d6.
    mccheah committed Jun 6, 2019
    Commit 876d1a0
  31. Fix merge conflicts again

    mccheah committed Jun 6, 2019
    Commit b714508
  32. Fix style

    mccheah committed Jun 6, 2019
    Commit c018fba
  33. Fix test build.

    mccheah committed Jun 6, 2019
    Commit 2cd8078

Commits on Jun 7, 2019

  1. [SPARK-27411][SQL] DataSourceV2Strategy should not eliminate subquery

    In `DataSourceV2Strategy`, it seems we eliminate subqueries by mistake after normalizing filters.
    We have a SQL query with a scalar subquery:
    
    ``` scala
    val plan = spark.sql("select * from t2 where t2a > (select max(t1a) from t1)")
    plan.explain(true)
    ```
    
    And we get the following log output from `DataSourceV2Strategy`:
    ```
    Pushing operators to csv:examples/src/main/resources/t2.txt
    Pushed Filters:
    Post-Scan Filters: isnotnull(t2a#30)
    Output: t2a#30, t2b#31
    ```
    
    The `Post-Scan Filters` should contain the scalar subquery, but we eliminate it by mistake.
    ```
    == Parsed Logical Plan ==
    'Project [*]
    +- 'Filter ('t2a > scalar-subquery#56 [])
       :  +- 'Project [unresolvedalias('max('t1a), None)]
       :     +- 'UnresolvedRelation `t1`
       +- 'UnresolvedRelation `t2`
    
    == Analyzed Logical Plan ==
    t2a: string, t2b: string
    Project [t2a#30, t2b#31]
    +- Filter (t2a#30 > scalar-subquery#56 [])
       :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
       :     +- SubqueryAlias `t1`
       :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
       +- SubqueryAlias `t2`
          +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
    
    == Optimized Logical Plan ==
    Filter (isnotnull(t2a#30) && (t2a#30 > scalar-subquery#56 []))
    :  +- Aggregate [max(t1a#13) AS max(t1a)#63]
    :     +- Project [t1a#13]
    :        +- RelationV2[t1a#13, t1b#14] csv:examples/src/main/resources/t1.txt
    +- RelationV2[t2a#30, t2b#31] csv:examples/src/main/resources/t2.txt
    
    == Physical Plan ==
    *(1) Project [t2a#30, t2b#31]
    +- *(1) Filter isnotnull(t2a#30)
       +- *(1) BatchScan[t2a#30, t2b#31] class org.apache.spark.sql.execution.datasources.v2.csv.CSVScan
    ```
    
    Unit test.
    
    Closes apache#24321 from francis0407/SPARK-27411.
    
    Authored-by: francis0407 <hanmingcong123@hotmail.com>
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>
    francis0407 authored and mccheah committed Jun 7, 2019
    Commit 5346dcf
  2. Fix merge conflict

    mccheah committed Jun 7, 2019
    Commit 17bb20c
  3. Fix build

    mccheah committed Jun 7, 2019
    Commit 5a8ea0b