Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Table Scan: Add Row Selection Filtering #565

Merged
merged 5 commits into from
Sep 24, 2024

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Aug 19, 2024

Row Selection Skipping

This PR adds the PageIndexEvaluator, which applies an iceberg filter predicate to a parquet Page Index, allowing the fetching into memory and decoding of entire slices of rows within a row group to be skipped if the PageIndexEvaluator determines from the Page Index that they don't match the filter. The PageIndexEvaluator builds a RowSelection that it passes to ParquetRecordBatchStreamBuilder::with_row_selection.

Unlike the row group metadata, the Page Index is not parsed by default when reading in the parquet file metadata. When present, especially for files with many columns, it can be quite large and intensive to parse. In fact, during experiments on the performance testing suite on my other PR, it turns out that often the additional time required to parse the Page Index outweighs the time saved fetching, decoding, and filtering all of the rows within a row group. For this reason, I've kept Row Selection Skipping disabled by default. It can be enabled by calling with_row_selection_enabled(true) when building the scan (There's a corresponding method on the ArrowReaderBuilder if you are using that directly). If you're interested in this feature, you may want to experiment with the write.parquet.page-size-bytes and write.parquet.page-row-limit table settings (which default to 1Mb and 20,000 rows respectively) in order to get optimal performance, as well as the sort settings for your table.

More info can be found in the preceding PR to this one, #558

@sdd sdd changed the title Row group and page filtering part 2 Table Scan: Add Row Selection Filtering Aug 19, 2024
@sdd sdd force-pushed the row-group-and-page-filtering-part-2 branch from 18dd9f9 to 7d61858 Compare August 29, 2024 07:24
@sdd sdd force-pushed the row-group-and-page-filtering-part-2 branch from 9e12056 to c2e118b Compare August 29, 2024 18:43
@sdd sdd marked this pull request as ready for review August 29, 2024 18:43
@sdd
Copy link
Contributor Author

sdd commented Sep 3, 2024

@liurenjie1024 / @Xuanwo - PTAL when you get chance, thanks 👍🏼

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd for this great pr! Genarally LGTM, and I've left some minor points to improve, but overall it looks good!

crates/iceberg/src/expr/visitors/page_index_evaluator.rs Outdated Show resolved Hide resolved
crates/iceberg/src/expr/visitors/page_index_evaluator.rs Outdated Show resolved Hide resolved
));
};

let row_counts = self.calc_row_counts(offset_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be computed only once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot. I think the optimal approach here would not be to calc row counts for all columns up-front because only the ones for columns that appear in the predicate would be needed. Caching the result would make more sense. If it is ok with you I'll add a TODO for this rather than address in this PR, and we can address in a follow-up, as it is not broken and caching would not save us a huge amount of time here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me.


if datum.is_nan() {
// NaN indicates unreliable bounds.
return Ok(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is correct, but java's implementation follows jdk's default behavior: https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#compareTo-java.lang.Double-

I think in future we could unify the logic of datum comparison, but it doesn't have to be a blocker here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's defer then considering we want this in a 0.4.0 release.

@sdd
Copy link
Contributor Author

sdd commented Sep 6, 2024

Back to you @liurenjie1024, thanks for the thorough review - some good spots! 😄

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sdd for this great pr!

));
};

let row_counts = self.calc_row_counts(offset_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me.

@liurenjie1024 liurenjie1024 merged commit 420b4e2 into apache:main Sep 24, 2024
17 checks passed
c-thiel added a commit to hansetag/iceberg-rust that referenced this pull request Oct 2, 2024
commit cda4a0c
Author: congyi wang <58715567+wcy-fdu@users.noreply.github.com>
Date:   Mon Sep 30 11:07:36 2024 +0800

    chore: fix typo in FileIO Schemes  (apache#653)

    * fix typo

    * fix typo

commit af9609d
Author: Scott Donnelly <scott@donnel.ly>
Date:   Mon Sep 30 04:06:14 2024 +0100

    fix: page index evaluator min/max args inverted (apache#648)

    * fix: page index evaluator min/max args inverted

    * style: fix clippy lint in test

commit a6a3fd7
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Sat Sep 28 10:10:08 2024 +0300

    test (datafusion): add test for table provider creation (apache#651)

    * add test for table provider creation

    * fix formatting

    * fixing yet another formatting issue

    * testing schema using data fusion

    ---------

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit 87483b4
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Fri Sep 27 04:40:08 2024 +0300

    making table provider pub (apache#650)

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit 984c91e
Author: ZENOTME <43447882+ZENOTME@users.noreply.github.com>
Date:   Thu Sep 26 17:56:02 2024 +0800

    avoid to create memory schema operator every time (apache#635)

    Co-authored-by: ZENOTME <st810918843@gmail.com>

commit 4171275
Author: Matheus Alcantara <mths.dev@pm.me>
Date:   Wed Sep 25 08:28:42 2024 -0300

    scan: change ErrorKind when table dont have spanshots (apache#608)

commit ab51355
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Sep 24 21:25:45 2024 +0800

    fix: compile error due to merge stale PR (apache#646)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 420b4e2
Author: Scott Donnelly <scott@donnel.ly>
Date:   Tue Sep 24 08:20:23 2024 +0100

    Table Scan: Add Row Selection Filtering (apache#565)

    * feat(scan): add row selection capability via PageIndexEvaluator

    * test(row-selection): add first few row selection tests

    * feat(scan): add more tests, fix bug where min/max args swapped

    * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler

    * feat: changes suggested from PR review

commit b3709ba
Author: Christian <christian@hansetag.com>
Date:   Tue Sep 24 04:47:04 2024 +0200

    feat: Add NamespaceIdent.parent() (apache#641)

    * Add NamespaceIdent.parent()

    * Use split_last

commit 1533c43
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Mon Sep 23 13:39:46 2024 +0300

    feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588)

    * adding main function and tests

    * adding tests, removing integration test for now

    * fixing typos and lints

    * fixing typing issue

    * - added support in schmema to convert Date32 to correct arrow type
    - refactored scan to use new predicate converter as visitor and seperated it to a new mod
    - added support for simple predicates with column cast expressions
    - added testing, mostly around date functions

    * fixing format and lic

    * reducing number of tests (17 -> 7)

    * fix formats

    * fix naming

    * refactoring to use TreeNodeVisitor

    * fixing fmt

    * small refactor

    * adding swapped op and fixing CR comments

    ---------

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit e967deb
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Sep 23 18:34:59 2024 +0800

    feat: expose remove_all in FileIO (apache#643)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit d03c4f8
Author: Scott Donnelly <scott@donnel.ly>
Date:   Mon Sep 23 08:28:52 2024 +0100

    Migrate to arrow-* v53 (apache#626)

    * chore: migrate to arrow-* v53

    * chore: update datafusion to 42

    * test: fix incorrect test assertion

    * chore: update python bindings to arrow 53

commit 88e5e4a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Sep 23 15:26:18 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.5...v1.24.6)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit c354983
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Sep 23 14:50:18 2024 +0800

    doc: improve FileIO doc (apache#642)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 12e12e2
Author: xxchan <xxchan22f@gmail.com>
Date:   Fri Sep 20 19:59:55 2024 +0800

    feat: expose arrow type <-> iceberg type (apache#637)

    * feat: expose arrow type <-> iceberg type

    Previously we only exposed the schema conversion.

    Signed-off-by: xxchan <xxchan22f@gmail.com>

    * add tests

    Signed-off-by: xxchan <xxchan22f@gmail.com>

    ---------

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 3b27c9e
Author: xxchan <xxchan22f@gmail.com>
Date:   Fri Sep 20 18:32:31 2024 +0800

    feat: add Sync to TransformFunction (apache#638)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 34cb81c
Author: Xuanwo <github@xuanwo.io>
Date:   Wed Sep 18 20:18:40 2024 +0800

    chore: Bump opendal to 0.50 (apache#634)

commit cde35ab
Author: FANNG <xiaojing@datastrato.com>
Date:   Fri Sep 13 10:01:16 2024 +0800

    feat: support projection pushdown for datafusion iceberg (apache#594)

    * support projection pushdown for datafusion iceberg

    * support projection pushdown for datafusion iceberg

    * fix ci

    * fix field id

    * remove depencences

    * remove depencences

commit eae9464
Author: Xuanwo <github@xuanwo.io>
Date:   Thu Sep 12 02:06:31 2024 +0800

    refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628)

commit 8a3de4e
Author: Christian <christian@hansetag.com>
Date:   Mon Sep 9 14:45:16 2024 +0200

    Feat: Normalize TableMetadata (apache#611)

    * Normalize Table Metadata

    * Improve readability & comments

commit e08c0e5
Author: Renjie Liu <liurenjie2008@gmail.com>
Date:   Mon Sep 9 11:57:22 2024 +0800

    fix: Correctly calculate highest_field_id in schema (apache#590)

commit f78c59b
Author: Jack <56563911+jdockerty@users.noreply.github.com>
Date:   Mon Sep 9 03:35:16 2024 +0100

    feat: add `client.region` (apache#623)

commit a5aba9a
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 18:36:05 2024 +0200

    feat: SortOrder methods should take schema ref if possible (apache#613)

    * SortOrder methods should take schema ref if possible

    * Fix test type

    * with_order_id should not take reference

commit 5812399
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 18:18:41 2024 +0200

    feat: partition compatibility (apache#612)

    * Partition compatability

    * Partition compatability

    * Rename compatible_with -> is_compatible_with

commit ede4720
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 16:49:39 2024 +0200

    fix: Less Panics for Snapshot timestamps (apache#614)

commit ced661f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Sep 8 22:43:38 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.3...v1.24.5)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit cbbd086
Author: Xuanwo <github@xuanwo.io>
Date:   Sun Sep 8 10:29:31 2024 +0800

    feat: Add more fields in FileScanTask (apache#609)

    Signed-off-by: Xuanwo <github@xuanwo.io>

commit 620d58e
Author: Callum Ryan <19956159+callum-ryan@users.noreply.github.com>
Date:   Thu Sep 5 03:44:55 2024 +0100

    feat: SQL Catalog - namespaces (apache#534)

    * feat: SQL Catalog - namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * feat: use transaction for updates and creates

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: pull out query param builder to fn

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * feat: add drop and tests

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: String to str, remove pub and optimise query builder

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: nested match, remove ok()

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: remove pub, add set, add comments

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: refactor list_namespaces slightly

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: add default properties to all new namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: remove check for nested namespace

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * chore: add more comments to the CatalogConfig to explain bind styles

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: edit test for nested namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    ---------

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

commit ae75f96
Author: Søren Dalby Larsen <sdlarsen@gmail.com>
Date:   Tue Sep 3 13:46:48 2024 +0200

    chore: bump crate-ci/typos to 1.24.3 (apache#598)

commit 7aa8bdd
Author: Scott Donnelly <scott@donnel.ly>
Date:   Thu Aug 29 04:37:48 2024 +0100

    Table Scan: Add Row Group Skipping (apache#558)

    * feat(scan): add row group and page index row selection filtering

    * fix(row selection): off-by-one error

    * feat: remove row selection to defer to a second PR

    * feat: better min/max val conversion in RowGroupMetricsEvaluator

    * test(row_group_filtering): first three tests

    * test(row_group_filtering): next few tests

    * test: add more tests for RowGroupMetricsEvaluator

    * chore: refactor test assertions to silence clippy lints

    * refactor: consolidate parquet stat min/max parsing in one place

commit da08e8d
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Aug 28 14:35:55 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583)

commit ecbb4c3
Author: Sung Yun <107272191+sungwy@users.noreply.github.com>
Date:   Mon Aug 26 23:57:01 2024 -0400

    Expose Transforms to Python Binding (apache#556)

    * bucket transform rust binding

    * format

    * poetry x maturin

    * ignore poetry.lock in license check

    * update bindings_python_ci to use makefile

    * newline

    * python-poetry/poetry#9135

    * use hatch instead of poetry

    * refactor

    * revert licenserc change

    * adopt review feedback

    * comments

    * unused dependency

    * adopt review comment

    * newline

    * I like this approach a lot better

    * more tests

commit 905ebd2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Aug 26 20:49:07 2024 +0800

    chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582)

    ---
    updated-dependencies:
    - dependency-name: typed-builder
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit f9c92b7
Author: FANNG <xiaojing@datastrato.com>
Date:   Sun Aug 25 22:31:36 2024 +0800

    fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584)

commit ba66665
Author: FANNG <xiaojing@datastrato.com>
Date:   Sat Aug 24 12:35:36 2024 +0800

    fix: correct partition-id to field-id in UnboundPartitionField (apache#576)

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * xx
c-thiel added a commit to hansetag/iceberg-rust that referenced this pull request Oct 2, 2024
commit 2c9d7e4
Author: Christian Thiel <christian@hansetag.com>
Date:   Wed Oct 2 07:51:50 2024 +0200

    New builder

commit fea1817
Author: Christian Thiel <christian@hansetag.com>
Date:   Tue Oct 1 12:56:10 2024 +0200

    Merge branch 'feat/safe-partition-spec'

commit 1f49c95
Author: Christian Thiel <christian@hansetag.com>
Date:   Tue Oct 1 12:55:02 2024 +0200

    Merge branch 'feat/schema-reassign-field-ids'

commit cda4a0c
Author: congyi wang <58715567+wcy-fdu@users.noreply.github.com>
Date:   Mon Sep 30 11:07:36 2024 +0800

    chore: fix typo in FileIO Schemes  (apache#653)

    * fix typo

    * fix typo

commit af9609d
Author: Scott Donnelly <scott@donnel.ly>
Date:   Mon Sep 30 04:06:14 2024 +0100

    fix: page index evaluator min/max args inverted (apache#648)

    * fix: page index evaluator min/max args inverted

    * style: fix clippy lint in test

commit a6a3fd7
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Sat Sep 28 10:10:08 2024 +0300

    test (datafusion): add test for table provider creation (apache#651)

    * add test for table provider creation

    * fix formatting

    * fixing yet another formatting issue

    * testing schema using data fusion

    ---------

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit 87483b4
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Fri Sep 27 04:40:08 2024 +0300

    making table provider pub (apache#650)

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit 984c91e
Author: ZENOTME <43447882+ZENOTME@users.noreply.github.com>
Date:   Thu Sep 26 17:56:02 2024 +0800

    avoid to create memory schema operator every time (apache#635)

    Co-authored-by: ZENOTME <st810918843@gmail.com>

commit 4171275
Author: Matheus Alcantara <mths.dev@pm.me>
Date:   Wed Sep 25 08:28:42 2024 -0300

    scan: change ErrorKind when table dont have spanshots (apache#608)

commit ab51355
Author: xxchan <xxchan22f@gmail.com>
Date:   Tue Sep 24 21:25:45 2024 +0800

    fix: compile error due to merge stale PR (apache#646)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 420b4e2
Author: Scott Donnelly <scott@donnel.ly>
Date:   Tue Sep 24 08:20:23 2024 +0100

    Table Scan: Add Row Selection Filtering (apache#565)

    * feat(scan): add row selection capability via PageIndexEvaluator

    * test(row-selection): add first few row selection tests

    * feat(scan): add more tests, fix bug where min/max args swapped

    * fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler

    * feat: changes suggested from PR review

commit b3709ba
Author: Christian <christian@hansetag.com>
Date:   Tue Sep 24 04:47:04 2024 +0200

    feat: Add NamespaceIdent.parent() (apache#641)

    * Add NamespaceIdent.parent()

    * Use split_last

commit 1533c43
Author: Alon Agmon <54080741+a-agmon@users.noreply.github.com>
Date:   Mon Sep 23 13:39:46 2024 +0300

    feat (datafusion integration): convert datafusion expr filters to Iceberg Predicate (apache#588)

    * adding main function and tests

    * adding tests, removing integration test for now

    * fixing typos and lints

    * fixing typing issue

    * - added support in schmema to convert Date32 to correct arrow type
    - refactored scan to use new predicate converter as visitor and seperated it to a new mod
    - added support for simple predicates with column cast expressions
    - added testing, mostly around date functions

    * fixing format and lic

    * reducing number of tests (17 -> 7)

    * fix formats

    * fix naming

    * refactoring to use TreeNodeVisitor

    * fixing fmt

    * small refactor

    * adding swapped op and fixing CR comments

    ---------

    Co-authored-by: Alon Agmon <alon.agmon@appsflyer.com>

commit e967deb
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Sep 23 18:34:59 2024 +0800

    feat: expose remove_all in FileIO (apache#643)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit d03c4f8
Author: Scott Donnelly <scott@donnel.ly>
Date:   Mon Sep 23 08:28:52 2024 +0100

    Migrate to arrow-* v53 (apache#626)

    * chore: migrate to arrow-* v53

    * chore: update datafusion to 42

    * test: fix incorrect test assertion

    * chore: update python bindings to arrow 53

commit 88e5e4a
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Sep 23 15:26:18 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.5 to 1.24.6 (apache#640)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.5 to 1.24.6.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.5...v1.24.6)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit c354983
Author: xxchan <xxchan22f@gmail.com>
Date:   Mon Sep 23 14:50:18 2024 +0800

    doc: improve FileIO doc (apache#642)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 12e12e2
Author: xxchan <xxchan22f@gmail.com>
Date:   Fri Sep 20 19:59:55 2024 +0800

    feat: expose arrow type <-> iceberg type (apache#637)

    * feat: expose arrow type <-> iceberg type

    Previously we only exposed the schema conversion.

    Signed-off-by: xxchan <xxchan22f@gmail.com>

    * add tests

    Signed-off-by: xxchan <xxchan22f@gmail.com>

    ---------

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 3b27c9e
Author: xxchan <xxchan22f@gmail.com>
Date:   Fri Sep 20 18:32:31 2024 +0800

    feat: add Sync to TransformFunction (apache#638)

    Signed-off-by: xxchan <xxchan22f@gmail.com>

commit 34cb81c
Author: Xuanwo <github@xuanwo.io>
Date:   Wed Sep 18 20:18:40 2024 +0800

    chore: Bump opendal to 0.50 (apache#634)

commit cde35ab
Author: FANNG <xiaojing@datastrato.com>
Date:   Fri Sep 13 10:01:16 2024 +0800

    feat: support projection pushdown for datafusion iceberg (apache#594)

    * support projection pushdown for datafusion iceberg

    * support projection pushdown for datafusion iceberg

    * fix ci

    * fix field id

    * remove depencences

    * remove depencences

commit eae9464
Author: Xuanwo <github@xuanwo.io>
Date:   Thu Sep 12 02:06:31 2024 +0800

    refactor(python): Expose transform as a submodule for pyiceberg_core (apache#628)

commit 8a3de4e
Author: Christian <christian@hansetag.com>
Date:   Mon Sep 9 14:45:16 2024 +0200

    Feat: Normalize TableMetadata (apache#611)

    * Normalize Table Metadata

    * Improve readability & comments

commit e08c0e5
Author: Renjie Liu <liurenjie2008@gmail.com>
Date:   Mon Sep 9 11:57:22 2024 +0800

    fix: Correctly calculate highest_field_id in schema (apache#590)

commit f78c59b
Author: Jack <56563911+jdockerty@users.noreply.github.com>
Date:   Mon Sep 9 03:35:16 2024 +0100

    feat: add `client.region` (apache#623)

commit a5aba9a
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 18:36:05 2024 +0200

    feat: SortOrder methods should take schema ref if possible (apache#613)

    * SortOrder methods should take schema ref if possible

    * Fix test type

    * with_order_id should not take reference

commit 5812399
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 18:18:41 2024 +0200

    feat: partition compatibility (apache#612)

    * Partition compatability

    * Partition compatability

    * Rename compatible_with -> is_compatible_with

commit ede4720
Author: Christian <christian@hansetag.com>
Date:   Sun Sep 8 16:49:39 2024 +0200

    fix: Less Panics for Snapshot timestamps (apache#614)

commit ced661f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Sun Sep 8 22:43:38 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.24.3 to 1.24.5 (apache#616)

    Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.3 to 1.24.5.
    - [Release notes](https://github.com/crate-ci/typos/releases)
    - [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
    - [Commits](crate-ci/typos@v1.24.3...v1.24.5)

    ---
    updated-dependencies:
    - dependency-name: crate-ci/typos
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit cbbd086
Author: Xuanwo <github@xuanwo.io>
Date:   Sun Sep 8 10:29:31 2024 +0800

    feat: Add more fields in FileScanTask (apache#609)

    Signed-off-by: Xuanwo <github@xuanwo.io>

commit 620d58e
Author: Callum Ryan <19956159+callum-ryan@users.noreply.github.com>
Date:   Thu Sep 5 03:44:55 2024 +0100

    feat: SQL Catalog - namespaces (apache#534)

    * feat: SQL Catalog - namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * feat: use transaction for updates and creates

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: pull out query param builder to fn

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * feat: add drop and tests

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: String to str, remove pub and optimise query builder

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: nested match, remove ok()

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: remove pub, add set, add comments

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: refactor list_namespaces slightly

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: add default properties to all new namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: remove check for nested namespace

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * chore: add more comments to the CatalogConfig to explain bind styles

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    * fix: edit test for nested namespaces

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

    ---------

    Signed-off-by: callum-ryan <19956159+callum-ryan@users.noreply.github.com>

commit ae75f96
Author: Søren Dalby Larsen <sdlarsen@gmail.com>
Date:   Tue Sep 3 13:46:48 2024 +0200

    chore: bump crate-ci/typos to 1.24.3 (apache#598)

commit 7aa8bdd
Author: Scott Donnelly <scott@donnel.ly>
Date:   Thu Aug 29 04:37:48 2024 +0100

    Table Scan: Add Row Group Skipping (apache#558)

    * feat(scan): add row group and page index row selection filtering

    * fix(row selection): off-by-one error

    * feat: remove row selection to defer to a second PR

    * feat: better min/max val conversion in RowGroupMetricsEvaluator

    * test(row_group_filtering): first three tests

    * test(row_group_filtering): next few tests

    * test: add more tests for RowGroupMetricsEvaluator

    * chore: refactor test assertions to silence clippy lints

    * refactor: consolidate parquet stat min/max parsing in one place

commit da08e8d
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Aug 28 14:35:55 2024 +0800

    chore(deps): Bump crate-ci/typos from 1.23.6 to 1.24.1 (apache#583)

commit ecbb4c3
Author: Sung Yun <107272191+sungwy@users.noreply.github.com>
Date:   Mon Aug 26 23:57:01 2024 -0400

    Expose Transforms to Python Binding (apache#556)

    * bucket transform rust binding

    * format

    * poetry x maturin

    * ignore poetry.lock in license check

    * update bindings_python_ci to use makefile

    * newline

    * python-poetry/poetry#9135

    * use hatch instead of poetry

    * refactor

    * revert licenserc change

    * adopt review feedback

    * comments

    * unused dependency

    * adopt review comment

    * newline

    * I like this approach a lot better

    * more tests

commit 905ebd2
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Aug 26 20:49:07 2024 +0800

    chore(deps): Update typed-builder requirement from 0.19 to 0.20 (apache#582)

    ---
    updated-dependencies:
    - dependency-name: typed-builder
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit f9c92b7
Author: FANNG <xiaojing@datastrato.com>
Date:   Sun Aug 25 22:31:36 2024 +0800

    fix: Update sqlx from 0.8.0 to 0.8.1 (apache#584)

commit ba66665
Author: FANNG <xiaojing@datastrato.com>
Date:   Sat Aug 24 12:35:36 2024 +0800

    fix: correct partition-id to field-id in UnboundPartitionField (apache#576)

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * correct partition-id to field id in PartitionSpec

    * xx
shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* feat(scan): add row selection capability via PageIndexEvaluator

* test(row-selection): add first few row selection tests

* feat(scan): add more tests, fix bug where min/max args swapped

* fix: ad test and fix for logic bug in PageIndexEvaluator in-clause handler

* feat: changes suggested from PR review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants