Update to arrow 32 and switch to RawDecoder for JSON #5056
Conversation
Switch to RawDecoder for JSON
Ok(futures::stream::iter(reader).boxed())
}
GetResult::Stream(s) => {
    let mut decoder = RawDecoder::try_new(schema, batch_size)?;
I think this interface is pretty cool: it avoids needing to scan the byte stream looking for newlines, and consequently should add some additional performance on top of the already faster RawDecoder in general.

🥳 🦜
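For readers skimming this thread, here is a minimal sketch of the push-based loop being praised. It assumes the arrow 32 era RawDecoder contract (decode returns the number of bytes consumed and stops once batch_size rows are buffered; flush drains the buffered rows); it is an illustration, not the exact DataFusion code.

use arrow::error::ArrowError;
use arrow::json::RawDecoder; // assumed arrow 32 API; renamed in later releases
use arrow::record_batch::RecordBatch;

// Feed whole network/object-store chunks straight to the decoder: it tracks
// record boundaries internally, so the caller never scans for newlines.
fn push_chunk(
    decoder: &mut RawDecoder,
    mut buf: &[u8],
    out: &mut Vec<RecordBatch>,
) -> Result<(), ArrowError> {
    while !buf.is_empty() {
        let consumed = decoder.decode(buf)?;
        if consumed == 0 {
            // a full batch is buffered; drain it, then keep decoding
            match decoder.flush()? {
                Some(batch) => out.push(batch),
                // defensive: no progress and nothing buffered
                None => break,
            }
            continue;
        }
        buf = &buf[consumed..];
    }
    Ok(())
}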
Looks great @tustvold -- the only question I have is on feeding in empty buffers to the csv reader -- but perhaps I am misreading something
// walk the next level
root_error = source;
// remember the lowest datafusion error so far
if let Some(e) = root_error.downcast_ref::<DataFusionError>() {
    last_datafusion_error = e;
} else if let Some(e) = root_error.downcast_ref::<Arc<DataFusionError>>() {
    // As `Arc<T>::source()` calls through to `T::source()` we need to
👍
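As context for the Arc special case above: when the concrete type behind the dyn Error is Arc<DataFusionError> rather than DataFusionError itself, a direct downcast returns None, so both types must be tried. A minimal illustration follows (a hypothetical helper, not the committed code; DataFusionError here is datafusion::error::DataFusionError):

use std::error::Error;
use std::sync::Arc;

use datafusion::error::DataFusionError;

// A source stored as Arc<DataFusionError> does not downcast to
// DataFusionError directly; the Arc must be peeled off explicitly.
fn as_datafusion_error(e: &(dyn Error + 'static)) -> Option<&DataFusionError> {
    e.downcast_ref::<DataFusionError>()
        .or_else(|| e.downcast_ref::<Arc<DataFusionError>>().map(|e| e.as_ref()))
}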
@@ -217,6 +217,9 @@ fn default_field_name(dt: &DataType) -> &str {
     DataType::Union(_, _, _) => "union",
     DataType::Dictionary(_, _) => "map",
     DataType::Map(_, _) => unimplemented!("Map support not implemented"),
+    DataType::RunEndEncoded(_, _) => {
+        unimplemented!("RunEndEncoded support not implemented")
@@ -21,7 +21,6 @@ use crate::datasource::file_format::file_type::FileCompressionType;
 use crate::error::{DataFusionError, Result};
 use crate::execution::context::{SessionState, TaskContext};
 use crate::physical_plan::expressions::PhysicalSortExpr;
-use crate::physical_plan::file_format::delimited_stream::newline_delimited_stream;
Can the corresponding newline_delimited_stream module be deleted too?
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion%20newline_delimited_stream&type=code
Unfortunately it is still used by the schema inference logic; I'll see about resurrecting the PR to move to using the upstream implementation
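For context, the module in question frames a byte stream into complete lines so that a newline-delimited parser never sees a partial record, which is why schema inference still depends on it. A rough synchronous sketch of the idea (the real module operates over an async byte stream):

// Buffer incoming chunks and emit only whole, newline-terminated frames,
// holding back a trailing partial line until more bytes arrive.
fn split_complete_lines(buffer: &mut Vec<u8>) -> Vec<Vec<u8>> {
    let mut frames = Vec::new();
    while let Some(pos) = buffer.iter().position(|b| *b == b'\n') {
        let mut frame: Vec<u8> = buffer.drain(..=pos).collect();
        frame.pop(); // drop the newline itself
        frames.push(frame);
    }
    frames
}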
}
let decoded = match decoder.decode(buffered.as_ref()) {
    // Note: the decoder needs to be called with an empty
    // array to delimt the final record
Suggested change:
-    // array to delimt the final record
+    // array to delimit the final record
I must be missing how the code is called with an empty buffer. If all data in buffered was consumed and then the next poll was empty, won't that break out of the loop prior to calling decode()? 🤔
You are quite correct; I'm investigating how this is working...
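To make the subtlety under investigation concrete, here is the termination step in isolation (a sketch under the same assumed arrow 32 decoder contract, not the code in this PR): once the stream is exhausted, an empty decode call is what delimits a trailing record with no terminator, and only then does the final flush drain it.

use arrow::error::ArrowError;
use arrow::json::RawDecoder; // assumed arrow 32 API
use arrow::record_batch::RecordBatch;

// Called once after the byte stream reports end-of-input. If the read loop
// breaks before reaching this point, the empty decode call never happens,
// which is exactly the concern raised in this thread.
fn finish(decoder: &mut RawDecoder, out: &mut Vec<RecordBatch>) -> Result<(), ArrowError> {
    decoder.decode(&[])?; // empty input delimits any final unterminated record
    if let Some(batch) = decoder.flush()? {
        out.push(batch); // emit the last, possibly partial, batch
    }
    Ok(())
}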
@@ -218,7 +218,7 @@ impl TryFrom<&DataType> for protobuf::arrow_type::ArrowTypeEnum {
     DataType::Decimal256(_, _) => {
         return Err(Error::General("Proto serialization error: The Decimal256 data type is not yet supported".to_owned()))
     }
-    DataType::Map(_, _) => {
+    DataType::Map(_, _) | DataType::RunEndEncoded(_, _) => {
I recommend either updating the error message here or adding a separate clause for RunEndEncoded
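Concretely, the separate clause could look like this (a sketch of the reviewer's suggestion, not the committed code):

DataType::RunEndEncoded(_, _) => {
    return Err(Error::General(
        "Proto serialization error: The RunEndEncoded data type is not yet supported".to_owned(),
    ))
}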
Benchmark runs are scheduled for baseline = a218b70 and contender = bb699eb. bb699eb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #.
Rationale for this change
An integration test for apache/arrow-rs#3479, and preparation for the next arrow release (apache/arrow-rs#3584).
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?