Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python connector CDF reads kernel integration #613

Merged

Conversation

PatrickJin-db
Copy link
Collaborator

@PatrickJin-db PatrickJin-db commented Dec 11, 2024

Main differences compared to the prototype #609 include:

  • Not using a prototype branch of delta-kernel-rs
  • Better code organization
  • Ensuring that files are closed and tempfiles are cleaned up even upon failure
  • Fixed some of the tests

Regarding performance, most of the time is spent in kernel, and writing the delta log is negligible in comparison.
image

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 7 times, most recently from eda025a to 75835d3 Compare December 11, 2024 18:01
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most pressing issue is updating to the main branch.

python/delta-kernel-rust-sharing-wrapper/Cargo.toml Outdated Show resolved Hide resolved
@@ -4,17 +4,19 @@ use arrow::compute::filter_record_batch;
use arrow::datatypes::SchemaRef;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Both arrow and kernel have their own SchemaRef. This is up to personal preference, but you may want to give this an alias so that the type of SchemaRef doesn't become confusing.

Suggested change
use arrow::datatypes::SchemaRef;
use arrow::datatypes::SchemaRef as ArrowSchemaRef;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may also want to do the same for TableChangesScan:

use delta_kernel::table_changes::scan::TableChangesScan as KernelTableChangesScan

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ArrowSchemaRef makes sense, but TableChangesScan is only used once so I don't think using an alias for it is as beneficial.

).table_changes_to_pandas(CdfOptions(
starting_version=starting_version,
ending_version=ending_version,
starting_timestamp=starting_timestamp,
ending_timestamp=ending_timestamp,
include_historical_metadata=use_delta_format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm? why include_historical_metadata=use_delta_format?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why, let's add a comment to this line about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, PTAL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

# os.utime accepts seconds while delta log timestamp is in ms
os.utime(log_file_path, times=(0, version_to_timestamp[version] // 1000))

if min_version > 0 and num_versions_with_action > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case will num_versions_with_action be 0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One case could be if only metadata changes occurred during the queried version range.

protocol_json = json.loads(next(lines))
metadata_json = json.loads(next(lines))
actions: List[FileAction] = []
for line in lines:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and let's try some logs for the ease of debugging in the future when issues happen.
Such as printing out # of lines received from the server.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 8 times, most recently from 6cb527d to e72d39f Compare December 12, 2024 07:31
Copy link
Collaborator

@linzhou-db linzhou-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python code changes look good!
Please get approval for the rust code before merging!
And could you also check the failure in the wheel build jobs and see if it's possible to fix them?

).table_changes_to_pandas(CdfOptions(
starting_version=starting_version,
ending_version=ending_version,
starting_timestamp=starting_timestamp,
ending_timestamp=ending_timestamp,
include_historical_metadata=use_delta_format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 8 times, most recently from 22b77cc to 4032dc6 Compare December 18, 2024 01:46
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch from 4032dc6 to 4a9f4db Compare December 18, 2024 02:42
@PatrickJin-db PatrickJin-db merged commit 73dea97 into delta-io:main Dec 18, 2024
5 checks passed
@PatrickJin-db PatrickJin-db mentioned this pull request Dec 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants