
Support Writing Arrow files #8608

Merged 9 commits into apache:main on Dec 24, 2023
Conversation

devinjdangelo (Contributor):

Which issue does this PR close?

Closes #8504

Rationale for this change

What changes are included in this PR?

Implements initial support for writing out arrow files via COPY TO and INSERT INTO for listing tables.

Are these changes tested?

Adds new sqllogictests to cover the new functionality.

Are there any user-facing changes?

Writing Arrow files is now possible.
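For example, from the Rust API (a sketch: the `COPY` statement is the new write path, while `register_arrow` and `ArrowReadOptions` are the pre-existing Arrow read support, and the scaffolding here is illustrative rather than taken from the PR):

```rust
use datafusion::error::Result;
use datafusion::execution::options::ArrowReadOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // COPY query results out to a single Arrow IPC file (new in this PR);
    // collect() executes the statement
    ctx.sql("COPY (VALUES (1), (2)) TO '/tmp/foo.arrow'")
        .await?
        .collect()
        .await?;

    // Read the file back via the existing Arrow listing-table support
    ctx.register_arrow("foo", "/tmp/foo.arrow", ArrowReadOptions::default())
        .await?;
    ctx.sql("SELECT * FROM foo").await?.show().await?;
    Ok(())
}
```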

@github-actions bot added the core (Core DataFusion crate) and sqllogictest (SQL Logic Tests (.slt)) labels on Dec 20, 2023
1 Foo
2 Bar

# Copy from dict encoded values to single arrow file
devinjdangelo (Contributor, Author):

I know @tustvold's main concern was dictionaries. I think this test shows we are OK, but let me know if I am overlooking something.
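As background for the dictionary concern, a standalone arrow-rs sketch (not this PR's code) showing that the IPC file format preserves dictionary encoding across a write/read round trip:

```rust
use arrow_array::{types::Int32Type, ArrayRef, DictionaryArray, RecordBatch};
use arrow_ipc::reader::FileReader;
use arrow_ipc::writer::FileWriter;
use std::io::Cursor;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A dictionary-encoded string column, like the values in the test above
    let dict: DictionaryArray<Int32Type> =
        vec!["Foo", "Bar", "Foo"].into_iter().collect();
    let batch = RecordBatch::try_from_iter(vec![("column1", Arc::new(dict) as ArrayRef)])?;

    // Write the batch to an in-memory IPC file, then read it back
    let mut buf = Vec::new();
    {
        let mut writer = FileWriter::try_new(&mut buf, batch.schema().as_ref())?;
        writer.write(&batch)?;
        writer.finish()?;
    }
    let mut reader = FileReader::try_new(Cursor::new(buf), None)?;
    let round_tripped = reader.next().expect("one batch")?;

    // The dictionary encoding (and the data) survives the round trip
    assert_eq!(batch, round_tripped);
    Ok(())
}
```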

datafusion-proto = { path = "datafusion/proto", version = "34.0.0" }
datafusion-sql = { path = "datafusion/sql", version = "34.0.0" }
Member:

Hmm, are these changes related? It looks like lines were just moved around. Maybe you can revert the unrelated changes to keep the diff smaller?

devinjdangelo (Contributor, Author):

I added arrow-ipc as a dependency for core, and ran cargo tomlfmt. I'm not sure why cargo tomlfmt changed so much of the formatting 🤔

@alamb (Contributor) left a comment:

Thank you for this PR @devinjdangelo -- I tried it out and it looks quite awesome.

I also tried it out locally and it works great.

I left a few suggestions that would be nice to address, but I don't think they are required to merge. We can do them in a follow-on PR as well.

```shell
❯ copy (values (1), (2)) to '/tmp/foo.arrow';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.030 seconds.

$ datafusion-cli -c "select * from '/tmp/foo.arrow'";
DataFusion CLI v34.0.0
+---------+
| column1 |
+---------+
| 1       |
| 2       |
+---------+
2 rows in set. Query took 0.028 seconds.
```

```shell
$ datafusion-cli -c "select arrow_typeof(column1) from '/tmp/foo.arrow'";
DataFusion CLI v34.0.0
+--------------------------------------+
| arrow_typeof(/tmp/foo.arrow.column1) |
+--------------------------------------+
| Int64                                |
| Int64                                |
+--------------------------------------+
```

👍 

) -> Result<u64> {
    // No props are supported yet, but can be by updating FileTypeWriterOptions
    // to populate this struct and use those options to initialize the arrow_ipc::writer::FileWriter
    let _arrow_props = self.config.file_type_writer_options.try_into_arrow()?;
Contributor:

I think we should track this as a follow-on ticket, and ideally leave a comment in the code pointing to the ticket so it eventually gets cleaned up.

devinjdangelo (Contributor, Author):

Agreed. When we update FileTypeWriterProperties for Arrow files, we should also take care to maintain serialization support, which @andygrove has been working on.

devinjdangelo (Contributor, Author) on Dec 22, 2023:

Filed #8635 and added a comment linking to it.

devinjdangelo (Contributor, Author):

Also filed apache/arrow-rs#5236, which would help with #8635 (though it is not blocking).

let mut file_write_tasks: JoinSet<std::result::Result<usize, DataFusionError>> =
    JoinSet::new();
while let Some((path, mut rx)) = file_stream_rx.recv().await {
    let shared_buffer = SharedBuffer::new(1048576);
Contributor:

Does this mean that if any record batch takes more than 1MB to write out, we'll get an error?

Would it be possible to make this constant and the 1024000 below into named constants, with comments that explain what they do?

devinjdangelo (Contributor, Author):

The initial buffer size is just a size hint for efficiency. It will grow beyond the set value if needed.

We can definitely make it a named constant, or even make it configurable.
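To illustrate the size-hint behavior with a plain `Vec` (an analogy, not the PR's code; `SharedBuffer` wraps a growable in-memory buffer similarly):

```rust
fn main() {
    // Pre-allocate ~1 MiB, matching the hint used in the PR
    let mut buf: Vec<u8> = Vec::with_capacity(1_048_576);
    assert!(buf.capacity() >= 1_048_576);

    // Writing past the hint does not error; the buffer reallocates and grows
    buf.resize(4 * 1_048_576, 0);
    assert_eq!(buf.len(), 4 * 1_048_576);
    println!("capacity grew to {} bytes", buf.capacity());
}
```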

devinjdangelo (Contributor, Author):

This buffer holds serialized bytes in memory until they are periodically uploaded to an object store. This is similar to how the AsyncArrowWriter for parquet is implemented.
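For context, a rough sketch of the stage-then-upload flow being described; `upload_part` is a hypothetical placeholder rather than DataFusion's actual object-store API:

```rust
/// Rough sketch of the stage-then-upload pattern: serialized bytes
/// accumulate in an in-memory buffer and are shipped out in chunks.
async fn write_with_staging(batches: Vec<Vec<u8>>) -> std::io::Result<()> {
    // Flush the staged bytes once roughly this many have accumulated
    const UPLOAD_THRESHOLD: usize = 1_048_576;

    let mut staging: Vec<u8> = Vec::with_capacity(UPLOAD_THRESHOLD);
    for serialized in batches {
        staging.extend_from_slice(&serialized);
        if staging.len() >= UPLOAD_THRESHOLD {
            // Drain the buffer and ship it; serialization can continue after
            upload_part(std::mem::take(&mut staging)).await?;
        }
    }
    if !staging.is_empty() {
        upload_part(staging).await?; // final partial chunk
    }
    Ok(())
}

// Hypothetical placeholder: a real implementation would PUT this chunk
// to object storage as one part of a multipart upload.
async fn upload_part(_chunk: Vec<u8>) -> std::io::Result<()> {
    Ok(())
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    write_with_staging(vec![vec![0u8; 700_000], vec![1u8; 700_000]]).await
}
```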

Contributor:

Filed #8642 to name the constants

    &self.get_writer_schema(),
)?;
let mut object_store_writer = create_writer(
    FileCompressionType::UNCOMPRESSED,
Contributor:

We probably want to default this to a compressed version (and make it configurable later)? The pyarrow default is lz4.

devinjdangelo (Contributor, Author):

I took a quick look at the arrow_ipc::FileWriter code, and it appears that the writer manages compression internally in batches. The referenced line controls whole-file compression (as it does for CSV and JSON).

Since we are not setting a compression explicitly in DataFusion in this PR, we are inheriting the arrow-rs default. I think the arrow-rs default is also lz4, but I am not 100% sure from glancing over the code.

devinjdangelo (Contributor, Author):

Looking more closely, I was wrong: arrow_ipc defaults to uncompressed Arrow files. I pushed up a change to default to lz4.
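For reference, enabling lz4 on arrow-ipc's FileWriter looks roughly like this (a standalone arrow-rs sketch; it requires the crate's `lz4` feature, and the exact wiring inside DataFusion differs):

```rust
use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_ipc::writer::{FileWriter, IpcWriteOptions};
use arrow_ipc::CompressionType;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let batch = RecordBatch::try_from_iter(vec![(
        "column1",
        Arc::new(Int64Array::from(vec![1, 2])) as ArrayRef,
    )])?;

    // Opt in to lz4-frame compression of the IPC record batch buffers
    let options = IpcWriteOptions::default()
        .try_with_compression(Some(CompressionType::LZ4_FRAME))?;

    let file = File::create("/tmp/foo.arrow")?;
    let mut writer = FileWriter::try_new_with_options(file, batch.schema().as_ref(), options)?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```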

@alamb merged commit d5704f7 into apache:main on Dec 24, 2023
22 checks passed