Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lazy array display (#3638) #3647

Merged
merged 18 commits into from
Feb 8, 2023
Merged

Lazy array display (#3638) #3647

merged 18 commits into from
Feb 8, 2023

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Feb 1, 2023

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

The output for nested types has been made more consistent, fields are no longer enclosed in " without any escaping, and the same null sentinel is used as elsewhere.

Invalid timezones will result in an error, as opposed to rendering the error to the output string.

Date64 prints out the encoded time, previously it only printed the date which was incorrect.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Feb 1, 2023
macro_rules! timestamp_display {
($($t:ty),+) => {
$(impl<'a> DisplayIndexState for &'a PrimitiveArray<$t> {
type State = Option<Tz>;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the motivation for the "deviation" from the proposal in #3638, we can parse the timezone once and use this for formatting

@@ -1067,7 +1067,7 @@ impl<T: ArrowPrimitiveType> From<ArrayData> for PrimitiveArray<T> {
}
}

impl<T: DecimalType + ArrowPrimitiveType> PrimitiveArray<T> {
impl<T: DecimalType> PrimitiveArray<T> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by cleanup

}
}

macro_rules! primitive_display {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the lack of trait specialization means we are forced to use macros here

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 3, 2023
@tustvold tustvold added the api-change Changes to the arrow API label Feb 3, 2023
@tustvold tustvold marked this pull request as ready for review February 6, 2023 17:26
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this PR looks quite good and make the code cleaner and easier to understand. Thank you so much @tustvold

I didn't quite get through the entire display.rs file yet but what I did get through is quite neat and looks good so far.

I believe this is related to #3638

Date64 prints out the encoded time, previously it only printed the date which was incorrect.

While I think I understand the rationale for this, I think it is worth double checking and maybe considering offering an escape valve formatting option if people want the old behavior.

| DataType::Time32(_)
| DataType::Time64(_)
| DataType::Duration(_) => {
// TODO: Set options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TODO for this PR or for some future PR? It seems like the default options work well enough for this?

@@ -937,7 +892,7 @@ mod tests {

assert_json_eq(
&buf,
r#"{"date32":"2018-11-13","date64":"2018-11-13","name":"a"}
r#"{"date32":"2018-11-13","date64":"2018-11-13T17:11:10.011","name":"a"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the idea for this change that since Date64 can repreent milliseconds it should be formatted like a timestamp even thought that may be surprising?

https://github.com/apache/arrow/blob/39bad5442c6447bf07594b09e4b29118b3211460/format/Schema.fbs#L214-L215

I admit the docs aren't very helpful in this respect: https://docs.rs/arrow/32.0.0/arrow/array/type.Date64Array.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps Date64 represents a DateTIme? It is a rather strange data type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3678 <-- I tried to clarify a little

@@ -290,7 +290,7 @@ pub enum IntervalUnit {
}

// Sparse or Dense union layouts
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord, Copy)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"+---------------------+",
"| f |",
"+---------------------+",
"| 2005-03-18T01:58:20 |",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same change as in the json code -- I realize Date64 has precision greater than a day, but I wonder if people will be surprised by this change 🤔

| {"list": [{"leaf_a": 6, "leaf_b": null}, {"leaf_a": 7, "leaf_b": null}, {"leaf_a": 8, "leaf_b": null}, {"leaf_a": 9, "leaf_b": 1}]} |
| {"list": [{"leaf_a": 10, "leaf_b": null}]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
+-------------------------------------------------------------------------------------------------------+
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change certainly looks nicer to me -- I wonder if someone wants the old behavior (which looks like JSON), they should use the JSON writer, correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if someone wants actual JSON, they should use the JSON writer, what this previously produced was quasi-JSON that wouldn't illegal escape characters

}

impl<'a> FormatOptions<'a> {
/// If set to `true` any formatting errors will be written to the output
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

/// Implements [`Display`] for a specific array value
pub struct ValueFormatter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite clever 👏

impl<'a> ValueFormatter<'a> {
/// Writes this value to the provided [`Write`]
///
/// Note: this ignores [`FormatOptions::with_display_error`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Note: this ignores [`FormatOptions::with_display_error`]
/// Note: this ignores [`FormatOptions::with_display_error`] and
/// will return an error on formatting issue

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I finished reading this PR -- it is really nicely done @tustvold 🏅


macro_rules! decimal_display {
($($t:ty),+) => {
$(impl<'a> DisplayIndexState<'a> for &'a PrimitiveArray<$t> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am impressed that you got the types to work out with references within macros 🤯

/// Get the value at the given row in an array as a String.
///
/// Note this function is quite inefficient and is unlikely to be
/// suitable for converting large arrays or record batches.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// suitable for converting large arrays or record batches.
/// suitable for converting large arrays or record batches.
///
/// Please see [`ArrayFormatter`] for a better interface

}

/// A string formatter for an [`Array`]
pub struct ArrayFormatter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand this is the main new interface. I think if we added an example to the docs here it would be great and help this usability a lot. A follow on PR would be fine (and I would be happy to help write it as well)

///
#[derive(Debug, Clone)]
pub struct FormatOptions<'a> {
safe: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no comment explaining safe

@tustvold tustvold merged commit a3b344d into apache:master Feb 8, 2023
@ursabot
Copy link

ursabot commented Feb 8, 2023

Benchmark runs are scheduled for baseline = 9b48f34 and contender = a3b344d. a3b344d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants