Support write operation #44
Comments
That's great to hear :) Let me know if you need any help; feel free to reach out to me on the DuckDB Discord or at sam@duckdblabs.com |
Hello duckdb team, nice to see that this has already been posted as an issue. My team and I would also love to have this as a feature. Just to add some context here - we are working with ETL pipelines in my company that mostly use
Just a small example:
CREATE TABLE weather (
city VARCHAR,
temp_lo INTEGER, -- minimum temperature on a day
temp_hi INTEGER, -- maximum temperature on a day
prcp REAL,
date DATE
);
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
INSERT INTO weather VALUES ('San Francisco', 46, 50, 0.25, '1994-11-27');
COPY weather TO 'az://⟨my_container⟩/⟨my_file⟩.⟨parquet_or_csv⟩'
Additionally, would love to receive advice on any temporary workarounds that can enable us to write from |
@csubhodeep you can try using fsspec if you're on python, they should have azure support |
ok thanks a lot. I will try it. |
Thanks again! I tried to use the
Here is what I tried:
>>> storage_account_name = "our_account"
>>> container_name = "our_container"
>>> account_creds = <our_key>
>>> duckdb.register_filesystem(filesystem('abfs', connection_string=account_creds))
>>> duckdb.sql("CREATE OR REPLACE TABLE test_table (a INTEGER, b VARCHAR(100))")
>>> duckdb.sql("INSERT INTO test_table VALUES (1, 'a'), (2, 'b'), (3, 'c')")
>>> duckdb.sql("SELECT * FROM test_table")
┌───────┬─────────┐
│ a │ b │
│ int32 │ varchar │
├───────┼─────────┤
│ 1 │ a │
│ 2 │ b │
│ 3 │ c │
└───────┴─────────┘
>>> write_query = f"COPY test_table TO 'https://{storage_account_name}.blob.core.windows.net/{container_name}/test.parquet' (FORMAT 'parquet')"
---------------------------------------------------------------------------
IOException Traceback (most recent call last)
Cell In[41], [line 2](vscode-notebook-cell:?execution_count=41&line=2)
[1](vscode-notebook-cell:?execution_count=41&line=1) # dump it as parquet
----> [2](vscode-notebook-cell:?execution_count=41&line=2) duckdb.sql(write_query)
IOException: IO Error: Cannot open file "https://<storage_account_name>.blob.core.windows.net/<container_name>/test.parquet": No such file or directory
>>> write_query = f"COPY test_table TO 'az://{storage_account_name}.blob.core.windows.net/{container_name}/test.parquet' (FORMAT 'parquet')"
---------------------------------------------------------------------------
NotImplementedException Traceback (most recent call last)
Cell In[43], [line 2](vscode-notebook-cell:?execution_count=43&line=2)
[1](vscode-notebook-cell:?execution_count=43&line=1) # dump it as parquet
----> [2](vscode-notebook-cell:?execution_count=43&line=2) duckdb.sql(write_query)
NotImplementedException: Not implemented Error: Writing to Azure containers is currently not supported
Please let me know if I am doing something wrong. |
could you try the |
After trying the suggestion above, here are the results:
Exception ignored in: <function AzureBlobFile.__del__ at 0x7feb3d5a4280>
Traceback (most recent call last):
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 2166, in __del__
self.close()
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 1983, in close
super().close()
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1932, in close
self.flush(force=True)
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/spec.py", line 1803, in flush
if self._upload_chunk(final=force) is not False:
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 118, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/workspaces/rev_man_sys/venv/lib/python3.8/site-packages/adlfs/spec.py", line 2147, in _async_upload_chunk
raise RuntimeError(f"Failed to upload block: {e}!") from e
RuntimeError: Failed to upload block: The specifed resource name contains invalid characters.
RequestId:3545381a-d01e-0083-2346-6efd88000000
Time:2024-03-04T15:11:28.7881722Z
ErrorCode:InvalidResourceName
Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidResourceName</Code><Message>The specifed resource name contains invalid characters.
RequestId:3545381a-d01e-0083-2346-6efd88000000
Time:2024-03-04T15:11:28.7881722Z</Message></Error>!
Is it more of an |
I think you're not setting the connection string correctly; it appears you're setting it to your key. Let's move this discussion elsewhere, though, as it is no longer about this issue. Please check whether you're actually using fsspec correctly. If things are still wrong and the problem appears to be on the DuckDB side, feel free to open an issue in duckdb/duckdb |
I have managed to make it work. The issue at my end was this part of the path - |
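For readers hitting the same confusion between an account key and a connection string, here is a minimal sketch of a sanity check, not something posted in the thread. It assumes the common account-key style of Azure Storage connection string (semicolon-separated `Key=Value` pairs such as `AccountName=...;AccountKey=...`); SAS-based connection strings use different fields and would not pass this check. The function names are hypothetical.

```python
def parse_connection_string(conn_str: str) -> dict:
    """Split semicolon-separated 'Key=Value' pairs into a dict."""
    parts = {}
    for segment in conn_str.split(";"):
        if not segment.strip():
            continue  # skip empty segments, e.g. from a trailing ';'
        # Split on the first '=' only: base64 account keys often end in '='.
        key, sep, value = segment.partition("=")
        if not sep:
            continue  # a token with no '=' at all is not a Key=Value pair
        parts[key.strip()] = value.strip()
    return parts


def has_account_key_fields(conn_str: str) -> bool:
    """True if the string carries both AccountName and AccountKey fields.

    Assumption: only the account-key style of connection string is checked.
    """
    parts = parse_connection_string(conn_str)
    return "AccountName" in parts and "AccountKey" in parts
```

Passing a bare key to this check returns False, which is a hint that the value handed to `connection_string=` is probably not a full connection string.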
Hi. I guess I am facing the same issue. I am using kotlin. Is there any workaround for this? Thanks |
Hello, For kotlin I'm not aware of a workaround :( |
+1 for this feature |
Did the write operation feature via duckdb extensions end up being merged in the 1.0.0 release? I am currently using the 1.0.0 release, and write operations via the azure extension still fail when using the duckdb node.js libraries. I have tried both az:// and abfss:// when writing back to a parquet file hosted in an Azure Storage Account (az://) or Azure Data Lake (abfss://).
Write operation error message:
Error: Not implemented Error: AzureDfsStorageFileSystem: FileExists is not implemented! {errno: -1, code: 'DUCKDB_NODEJS_ERROR', errorType: 'Not implemented', stack: 'Error: Not implemented Error: AzureDfsStorageFile
Any clarification on the state of write support to Azure via node.js is highly appreciated. Thank you. |
@IlijaStankovski It has not. I would like to pick this up at some point but I can't give a timeline here unfortunately. |
@samansmink thanks for the quick update, if you have a branch of code to build from that you want testers for, please let me know. Cheers ... |
+1 I support this request. It is really necessary. Thank you |
it will be nice to have :) |
@samansmink any updates on this? |
@davidsteinar not yet, sorry! I agree this is one of the higher-priority features we should look into, but I cannot give a timeline yet |
@samansmink for sure, it would enable me to throw away our existing data warehouse completely! Do you have any idea on the size/complexity of implementing this? |
+1 to support the feature |
I don't think duckdb has azure at the top of its list. I don't know why, though. Azure gen2 is absolutely the top storage service for now. Hope someday somebody implements this |
+1 for support to writing a parquet to azure |
I'm trying to do the workaround @csubhodeep did and I'm having some issues. I have an azure storage account
Here is a simple python code example to replicate my issue:
import duckdb
import os
from fsspec import filesystem
duckdb.register_filesystem(filesystem("abfs", connection_string=os.getenv("AZURE_CONN_STR")))
con = duckdb.connect()
con.execute("CREATE TABLE parqdata AS SELECT * FROM read_parquet('C:\\Users\\dre11620\\Downloads\\NY\\smallNYTaxi.parquet\\*.parquet', hive_partitioning = true);")
con.execute(f'''COPY parqdata TO 'abfs://main/test.parquet' (FORMAT PARQUET, PER_THREAD_OUTPUT);''')
But I keep getting back a duckdb error:
What am I missing? |
Figured out the issue, I needed to do |
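As a hedged sketch of the fsspec workaround discussed in this thread (not a confirmed fix from any poster): the snippet below registers the fsspec filesystem on the same connection that later runs `COPY`, and targets an `abfs://<container>/<path>` URL. The helper names, the `conn_str` parameter, and the choice to register per connection are assumptions for illustration; the `abfs` protocol for fsspec is provided by the adlfs package, which must be installed separately.

```python
def build_copy_statement(table: str, container: str, blob_path: str) -> str:
    """Build a COPY ... TO statement targeting abfs://<container>/<blob_path>."""
    # Normalize slashes so the URL has exactly one '/' between container and path.
    target = f"abfs://{container.strip('/')}/{blob_path.lstrip('/')}"
    return f"COPY {table} TO '{target}' (FORMAT PARQUET)"


def copy_table_to_azure(con, table: str, container: str, blob_path: str, conn_str: str) -> None:
    """Write an existing DuckDB table to Azure Blob Storage through fsspec.

    `con` is expected to be a duckdb connection that already holds `table`.
    """
    # Imported lazily so build_copy_statement works without fsspec installed.
    from fsspec import filesystem

    # Assumption: register on the connection that will execute the COPY.
    con.register_filesystem(filesystem("abfs", connection_string=conn_str))
    con.execute(build_copy_statement(table, container, blob_path))
```

Usage would look like `copy_table_to_azure(duckdb.connect(), "parqdata", "main", "test.parquet", os.environ["AZURE_CONN_STR"])`, with the table created on that connection first.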
+1 vote for this feature |
@DrewScatterday is the connection string the storage account URL? And if not, how can I find it for a given storage container? |
@natemcintosh the connection string is like an API key, so it's not an account URL. To find it, you have to go into the Azure web portal and open your storage account. Then go to
Also, you may need permissions on the account to see it, as it's sort of an API key |
Hi,
It's not really an issue but more an insight on what I plan to work on.
I haven't started yet, but when I do I will post a message here to let everyone know. If someone starts before me, please let me know ;)