[SPARK-4394][SQL] Data Sources API Improvements #3260

Closed
marmbrus wants to merge 4 commits from marmbrus/sourcesImprovements

marmbrus
Contributor

This PR adds two features to the data sources API:

  • Support for pushing down IN filters
  • The ability for relations to optionally provide information about their sizeInBytes.
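
As a rough illustration of the two features, here is a minimal, self-contained sketch that does not depend on Spark. The `In` case class mirrors the signature reported by the test build; `EqualTo`, `InMemoryRelation`, `bytesPerRow`, and the `Row` alias are hypothetical stand-ins invented for this example, and the real relation and planner interfaces may well differ.

```scala
// Stand-in filter types. `In` follows the signature listed for this PR;
// `EqualTo` is included only so the relation has more than one filter to handle.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class In(attribute: String, values: Array[Any]) extends Filter

object PushdownSketch {
  // Hypothetical row representation: column name -> value.
  type Row = Map[String, Any]

  // Hypothetical relation backed by an in-memory sequence of rows.
  class InMemoryRelation(rows: Seq[Row], bytesPerRow: Long) {
    // Optional size estimate that a planner could consult,
    // for example when choosing a join strategy.
    def sizeInBytes: Long = rows.size * bytesPerRow

    // Apply every pushed-down filter at the source so that
    // non-matching rows are never handed to the engine.
    def buildScan(filters: Seq[Filter]): Seq[Row] =
      rows.filter(row => filters.forall(f => matches(row, f)))

    private def matches(row: Row, filter: Filter): Boolean = filter match {
      case EqualTo(attr, value) => row.get(attr).contains(value)
      case In(attr, values)     => row.get(attr).exists(v => values.contains(v))
    }
  }

  def main(args: Array[String]): Unit = {
    val relation = new InMemoryRelation(
      Seq(Map("id" -> 1), Map("id" -> 2), Map("id" -> 3)),
      bytesPerRow = 8L
    )
    // Keeps only the rows whose id is 1 or 3.
    println(relation.buildScan(Seq(In("id", Array[Any](1, 3)))))
    // 3 rows * 8 bytes per row = 24.
    println(relation.sizeInBytes)
  }
}
```

The point of the sketch is that an `IN` predicate handled by the source shrinks the scan before any rows reach the engine, while the size estimate is purely advisory and never changes the result.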


SparkQA commented Nov 14, 2014

Test build #23343 has started for PR 3260 at commit 9a5e171.

  • This patch merges cleanly.


SparkQA commented Nov 14, 2014

Test build #23343 has finished for PR 3260 at commit 9a5e171.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSet(value: Expression, hset: Set[Any])
    • case class In(attribute: String, values: Array[Any]) extends Filter
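
The two classes sit at different layers: `InSet` is an expression over another expression, while `In` is a filter that names a column by string. A translation between them might look like the sketch below. This is an assumption about the general shape, not the code in this PR; `Attribute`, `Literal`, and `FilterTranslation.translate` are simplified, hypothetical stand-ins for the real planner types.

```scala
// Simplified stand-ins for engine-side expressions.
sealed trait Expression
case class Attribute(name: String) extends Expression
case class Literal(value: Any) extends Expression
case class InSet(value: Expression, hset: Set[Any]) extends Expression

// Simplified stand-in for the source-side filter.
// Note: case-class equality compares the Array field by reference.
sealed trait Filter
case class In(attribute: String, values: Array[Any]) extends Filter

object FilterTranslation {
  // Only a membership test on a plain column is translated.
  // Anything else returns None and would be evaluated by the engine instead.
  def translate(predicate: Expression): Option[Filter] = predicate match {
    case InSet(Attribute(name), hset) => Some(In(name, hset.toArray))
    case _                            => None
  }
}
```

Returning `Option[Filter]` keeps the translation conservative: a predicate the source cannot express is simply not pushed down.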

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23343/

Contributor

rxin commented Nov 14, 2014

LGTM.

Contributor

rxin commented Nov 14, 2014

Merging in master & branch-1.2. Thanks.

asfgit closed this in 77e845c Nov 14, 2014
asfgit pushed a commit that referenced this pull request Nov 14, 2014
This PR adds two features to the data sources API:
 - Support for pushing down `IN` filters
 - The ability for relations to optionally provide information about their `sizeInBytes`.

Author: Michael Armbrust <michael@databricks.com>

Closes #3260 from marmbrus/sourcesImprovements and squashes the following commits:

9a5e171 [Michael Armbrust] Use method instead of configuration directly
99c0e6b [Michael Armbrust] Add support for sizeInBytes.
416f167 [Michael Armbrust] Support for IN in data sources API.
2a04ab3 [Michael Armbrust] Simplify implementation of InSet.

(cherry picked from commit 77e845c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
marmbrus deleted the sourcesImprovements branch November 19, 2014 02:47
yaooqinn pushed a commit that referenced this pull request Aug 26, 2024
…42.7.4 and `mssql` to 12.8.1.jre11

### What changes were proposed in this pull request?

This PR aims to upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11.

### Why are the changes needed?

1. For `h2`, several issues are fixed in version 2.3.232 (full release notes: https://www.h2database.com/html/changelog.html):

    - [Issue #3945](h2database/h2database#3945): Column not found in correlated subquery, when referencing outer column from LEFT JOIN .. ON clause
    - [Issue #4097](h2database/h2database#4097): StackOverflowException when using multiple SELECT statements in one query (2.3.230)
    - [Issue #3982](h2database/h2database#3982): Potential issue when using ROUND
    - [Issue #3894](h2database/h2database#3894): Race condition causing stale data in query last result cache
    - [Issue #4075](h2database/h2database#4075): infinite loop in compact
    - [Issue #4091](h2database/h2database#4091): Wrong case with linked table to postgresql
    - [Issue #4088](h2database/h2database#4088): BadGrammarException when the same alias is used within two different CTEs

2. For `postgresql`, version 42.7.4 includes several fixes and improvements (full release notes: https://jdbc.postgresql.org/changelogs/2024-08-22-42.7.4-release/):

    - fix: PgInterval ignores case for represented interval string [PR #3344](pgjdbc/pgjdbc#3344)
    - perf: Avoid extra copies when receiving int4 and int2 in PGStream [PR #3295](pgjdbc/pgjdbc#3295)
    - fix: Add support for Infinity::numeric values in ResultSet.getObject [PR #3304](pgjdbc/pgjdbc#3304)
    - fix: Ensure order of results for getDouble [PR #3301](pgjdbc/pgjdbc#3301)
    - perf: Replace BufferedOutputStream with unsynchronized PgBufferedOutputStream, allow configuring different Java and SO_SNDBUF buffer sizes [PR #3248](pgjdbc/pgjdbc#3248)
    - fix: Fix SSL tests [PR #3260](pgjdbc/pgjdbc#3260)
    - fix: Support bytea in preferQueryMode=simple [PR #3243](pgjdbc/pgjdbc#3243)
    - fix: Fix [Issue #3234](pgjdbc/pgjdbc#3234) - Return -1 as update count for stored procedure calls [PR #3235](pgjdbc/pgjdbc#3235)
    - fix: Fix [Issue #3224](pgjdbc/pgjdbc#3224) - conversion for TIME ‘24:00’ to LocalTime breaks in binary-mode [PR #3225](pgjdbc/pgjdbc#3225)

3. For `mssql`, several issues are fixed in 12.8.1.jre11 (full release notes: https://github.com/microsoft/mssql-jdbc/releases/tag/v12.8.1):

    - Adjusted DESTINATION_COL_METADATA_LOCK, in SQLServerBulkCopy, so that it is properly released in all cases [PR #2492](microsoft/mssql-jdbc#2492)
    - Reverted "Execute Stored Procedures Directly" feature, as well as subsequent changes related to the feature [PR #2493](microsoft/mssql-jdbc#2493)
    - Changed driver behavior to allow prepared statement objects to be reused, preventing a "multiple queries are not allowed" error [PR #2494](microsoft/mssql-jdbc#2494)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47810 from wayneguow/ug_h2.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024