Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: merge operation #1522

Merged
merged 24 commits into from
Sep 11, 2023
Merged

feat: merge operation #1522

merged 24 commits into from
Sep 11, 2023

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Jul 9, 2023

Description

Implement the Merge operation using Datafusion.

Currently the implementation rewrites the entire DeltaTable limiting the files that are rewritten will be performed in future work.

Related Issue(s)

Documentation

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Jul 9, 2023
@github-actions
Copy link

github-actions bot commented Jul 9, 2023

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@Blajda Blajda changed the title feat: Merge Operation feat: merge operation Jul 9, 2023
@Blajda Blajda marked this pull request as ready for review July 19, 2023 02:34
@ion-elgreco
Copy link
Collaborator

@Blajda @wjones127 does that mean the full table will be loaded in memory first so it can rewrite the whole table?

@Blajda
Copy link
Collaborator Author

Blajda commented Jul 29, 2023

does that mean the full table will be loaded in memory first so it can rewrite the whole table?

Yes since this is using a nested loop join both the source and target need to be fully loaded into memory. There would need to be follow up improvements to perform pruning and to use a Hash Join / Merge Join.

@ion-elgreco
Copy link
Collaborator

does that mean the full table will be loaded in memory first so it can rewrite the whole table?

Yes since this is using a nested loop join both the source and target need to be fully loaded into memory. There would need to be follow up improvements to perform pruning and to use a Hash Join / Merge Join.

Good to know, still useful for datasets that fit in memory then. Hopefully it can be merged soon ;) Then I could try taking a go on the Python bindings and start using it as :)

@wjones127
Copy link
Collaborator

What do you think about putting this operation behind a feature flag?

I don't feature flag is necessary, but we could label it "experimental" maybe. I think as long as the docs give up-to-date notes on limitations this should be fine.

wjones127
wjones127 previously approved these changes Aug 15, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Aug 15, 2023

Thanks @wjones127 for taking the time to review this. I'll break future enhancements into smaller chunks to make it easier.

@wjones127 wjones127 enabled auto-merge (squash) August 15, 2023 18:52
@roeap
Copy link
Collaborator

roeap commented Aug 17, 2023

Hoping that merging main will resolve some of the failures, which seem to be related to recently added deletion vector fields...

auto-merge was automatically disabled August 17, 2023 23:30

Head branch was pushed to by a user without write access

@Blajda
Copy link
Collaborator Author

Blajda commented Aug 17, 2023

Updated code to add new dv field. hopefully it passes all test now.

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2023

There is one more failure - probably related to a mypy update o.a. - we could just # type: ignore that line for now.

Other then that we are good to go :)

@github-actions github-actions bot added the binding/python Issues for the Python package label Aug 18, 2023
@roeap roeap enabled auto-merge (squash) August 18, 2023 05:40
@Blajda
Copy link
Collaborator Author

Blajda commented Aug 18, 2023

😕 Seems to fail with the ignore and without. check-python is passes locally for me

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2023

yea, this is a bit annoying :( - unfortunately I do not have permissions to disable or bypass required checks ...

Should we try once more without? Local checks are passing for me as well without the comment.

Alternatively we could try # noqa, which is quite common, but not sure if mypy honors it.

@roeap
Copy link
Collaborator

roeap commented Aug 18, 2023

or maybe even both, not sure how make works. i.e. would it fail early if now the ruff check is complaining? otherwise # noqa would silence ruff ...

auto-merge was automatically disabled August 22, 2023 04:08

Head branch was pushed to by a user without write access

@Blajda
Copy link
Collaborator Author

Blajda commented Aug 25, 2023

@roeap @wjones127 Please provide approval on the workflows. The lint change should hopefully allow everything to pass now.

@ion-elgreco
Copy link
Collaborator

@roeap @wjones127 can you approve the workflow, so this can get merged in the next release?

@github-actions github-actions bot removed the binding/python Issues for the Python package label Sep 10, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Sep 10, 2023

@wjones127 Please approve the workflow. Merged with the latest changes in main.

@wjones127 wjones127 merged commit 6e00c75 into delta-io:main Sep 11, 2023
@Blajda
Copy link
Collaborator Author

Blajda commented Sep 11, 2023

@wjones127 Thanks for reviewing and merging.

@Blajda Blajda deleted the merge-op branch September 11, 2023 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants