Python PyArrow Dataset Writer #542
Comments
API Draft
Standalone function for writing, to allow for idempotent create or append/overwrite:

def write_deltalake(
    table: Union[str, DeltaTable],
    data,
    mode: Literal['append', 'overwrite'] = 'append',
    backend: str = 'pyarrow'
):
    pass

I'm thinking of adding these methods:

class DeltaTable:
    ...
    def write(self, data, mode: Literal['append', 'overwrite'] = 'append', backend: str = 'pyarrow'):
        write_deltalake(self, data, mode, backend)
    def delete_where(self, where_expr, backend: str = 'pyarrow'):
        '''Delete rows matching the expression'''
        pass
    def update(self, where_expr, set_values: Dict[str, Any], backend: str = 'pyarrow'):
        '''Modify values in rows matching the expression'''
        pass

I'll leave the signature for merge for later; it likely involves a builder.

Draft Usage Docs
For overwrites and appends, use

from deltalake.writer import write_deltalake
df = pd.DataFrame({'x': [1, 2, 3]})
write_deltalake('path/to/table', df)

By default, writes append to the table. To overwrite, pass in

write_deltalake('path/to/table', df, mode='overwrite')

If you have a

DeltaTable('path/to/table').write(df, mode='overwrite')

To delete rows based on an expression, use

from deltalake.writer import delete_deltalake
import pyarrow.dataset as ds
DeltaTable('path/to/table').delete(ds.field('x') == 2)

To update a subset of rows with new values, use

from deltalake.writer import delete_deltalake
import pyarrow.dataset as ds
# Increment y where x = 2
DeltaTable('path/to/table').update(
    where_expr=ds.field('x') == 2,
    set_values={
        'y': ds.field('y') + 1
    }
)
|
I'm not sure what the convention is, but it might be a good idea to have overwrite be the default argument for mode of .write() |
The default in PySpark (which I think most users will be coming from) is to error if any data already exists. That makes sense for the standalone function |
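For reference, PySpark's DataFrameWriter accepts the save modes append, overwrite, ignore, and error (the default, also spelled errorifexists). A rough sketch of how those modes could translate into a decision for the standalone function is below; plan_write is a hypothetical helper for illustration only, not something in deltalake or PySpark.

```python
def plan_write(table_exists: bool, mode: str = "error") -> str:
    """Hypothetical helper: decide what a write should do for a given save mode."""
    valid_modes = {"error", "append", "overwrite", "ignore"}
    if mode not in valid_modes:
        raise ValueError(f"mode must be one of {sorted(valid_modes)}, got {mode!r}")
    if not table_exists:
        # Every mode creates the table when nothing is there yet.
        return "create"
    if mode == "error":
        raise FileExistsError("table already exists; pass mode='append' or mode='overwrite'")
    if mode == "ignore":
        # Leave the existing data untouched and write nothing.
        return "skip"
    return mode  # 'append' or 'overwrite'


print(plan_write(table_exists=False))
print(plan_write(table_exists=True, mode="append"))
```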
@wjones127 I'm interested in helping support this but haven't contributed to the project before. Are any of these reasonable for a first time contributor? |
@GraemeCliffe-inspirato One good first issue might be the #576 would be a good one if you want to get more familiar with the Rust part of the project. |
@wjones127 I have submitted a small PR for the first part of #575. I'm interested in learning more about the invariants part of #575. |
Is the functionality of "table creation" still a WIP? I know that the grid shows transactions are not yet up and running. Note that "./test_deltalake_table" does not exist on the filesystem for the below code example:

import deltalake
import pandas
import numpy as np
df = pandas.DataFrame(np.random.uniform(0,1, (40,3)))
df.columns = ['X','Y','Z']
deltalake.writer.write_deltalake('./test_deltalake_table',df)

yields
referencing the following: I know the grid shows that "write transactions" is not yet enabled. I'm posting to check if this includes initial creation of the deltalake folder/file structure too. It seems like it does, but just checking. |
No that part should work now. Could you create a new issue for the error you are showing? Make sure to provide the version of |
Description
We have a PyArrow Dataset reader that works for Delta tables. Looking through the writer, I think we might have enough functionality to create one.
Here are my rough notes on how that might work:
- pyarrow.dataset.write_dataset to write the parquet files.
  - basename_template could be set to a UUID, guaranteeing file uniqueness.
  - existing_data_behavior could be set to overwrite_or_ignore. (Not great behavior if there's ever a UUID collision, though. Might make a ticket to give a better option in PyArrow.)
  - file_visitor will be set to a callback that will push the filename and metadata to the back of a list. The metadata contains the file statistics.
- CommitInfo in delta-rs?
- create_transaction to create the transaction, using the file path and stats retrieved earlier.
- try_commit_transaction to commit it.

There's probably some protocol details I am overlooking, so would welcome any guidance.