-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python connector CDF reads kernel integration #613
Python connector CDF reads kernel integration #613
Conversation
eda025a
to
75835d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most pressing issue is updating to the main branch.
@@ -4,17 +4,19 @@ use arrow::compute::filter_record_batch; | |||
use arrow::datatypes::SchemaRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Both arrow and kernel have their own SchemaRef
. This is up to personal preference, but you may want to give this an alias so that the type of SchemaRef
doesn't become confusing.
use arrow::datatypes::SchemaRef; | |
use arrow::datatypes::SchemaRef as ArrowSchemaRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also want to do the same for TableChangesScan
:
use delta_kernel::table_changes::scan::TableChangesScan as KernelTableChangesScan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ArrowSchemaRef
makes sense, but TableChangesScan
is only used once so I don't think using an alias for it is as beneficial.
).table_changes_to_pandas(CdfOptions( | ||
starting_version=starting_version, | ||
ending_version=ending_version, | ||
starting_timestamp=starting_timestamp, | ||
ending_timestamp=ending_timestamp, | ||
include_historical_metadata=use_delta_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm? why include_historical_metadata=use_delta_format
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why, let's add a comment to this line about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
# os.utime accepts seconds while delta log timestamp is in ms | ||
os.utime(log_file_path, times=(0, version_to_timestamp[version] // 1000)) | ||
|
||
if min_version > 0 and num_versions_with_action > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case will num_versions_with_action be 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One case could be if only metadata changes occurred during the queried version range.
protocol_json = json.loads(next(lines)) | ||
metadata_json = json.loads(next(lines)) | ||
actions: List[FileAction] = [] | ||
for line in lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and let's try some logs for the ease of debugging in the future when issues happen.
Such as printing out # of lines received from the server.
6cb527d
to
e72d39f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python code changes look good!
Please get approval for the rust code before merging!
And could you also check the failure in the wheel build jobs and see if it's possible to fix them?
).table_changes_to_pandas(CdfOptions( | ||
starting_version=starting_version, | ||
ending_version=ending_version, | ||
starting_timestamp=starting_timestamp, | ||
ending_timestamp=ending_timestamp, | ||
include_historical_metadata=use_delta_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
22b77cc
to
4032dc6
Compare
4032dc6
to
4a9f4db
Compare
Main differences compared to the prototype #609 include:
Regarding performance, most of the time is spent in kernel, and writing the delta log is negligible in comparison.