
Adaptive Row Block Size (#4812) #4818

Merged: 5 commits into apache:master on Sep 17, 2023

Conversation

tustvold (Contributor)

Which issue does this PR close?

Closes #4812

Rationale for this change

Reduces the space amplification for small strings, reducing memory usage and potentially yielding faster comparisons

convert_columns 4096 string(10, 0)
                        time:   [68.766 µs 68.782 µs 68.802 µs]
                        change: [-0.2111% +0.1136% +0.3981%] (p = 0.56 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string(10, 0)
                        time:   [68.588 µs 68.599 µs 68.610 µs]
                        change: [+0.0308% +0.3071% +0.5887%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

convert_rows 4096 string(10, 0)
                        time:   [74.786 µs 74.810 µs 74.838 µs]
                        change: [+18.688% +18.835% +19.078%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

convert_columns 4096 string(30, 0)
                        time:   [73.300 µs 73.469 µs 73.619 µs]
                        change: [+6.1730% +6.5624% +6.9695%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

convert_columns_prepared 4096 string(30, 0)
                        time:   [72.918 µs 73.091 µs 73.262 µs]
                        change: [+6.5088% +6.8782% +7.2161%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

convert_rows 4096 string(30, 0)
                        time:   [87.594 µs 87.620 µs 87.649 µs]
                        change: [+39.299% +39.758% +40.202%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns 4096 string(100, 0)
                        time:   [82.717 µs 82.758 µs 82.802 µs]
                        change: [-7.4082% -7.2740% -7.0869%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

convert_columns_prepared 4096 string(100, 0)
                        time:   [83.181 µs 83.212 µs 83.247 µs]
                        change: [-5.2730% -5.1499% -4.9101%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe

convert_rows 4096 string(100, 0)
                        time:   [126.84 µs 126.90 µs 126.97 µs]
                        change: [+21.245% +21.428% +21.575%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

convert_columns 4096 string(100, 0.5)
                        time:   [95.168 µs 95.188 µs 95.214 µs]
                        change: [-6.6299% -6.3326% -6.0687%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  5 (5.00%) high severe

convert_columns_prepared 4096 string(100, 0.5)
                        time:   [95.093 µs 95.118 µs 95.147 µs]
                        change: [-6.2227% -6.1799% -6.1403%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild

convert_rows 4096 string(100, 0.5)
                        time:   [117.71 µs 117.73 µs 117.76 µs]
                        change: [+4.9569% +5.2699% +5.5499%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

convert_columns 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0)
                        time:   [293.24 µs 293.40 µs 293.56 µs]
                        change: [-2.6997% -2.6362% -2.5681%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

convert_columns_prepared 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0)
                        time:   [290.93 µs 291.09 µs 291.26 µs]
                        change: [-3.3535% -3.2713% -3.1558%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

convert_rows 4096 string(20, 0.5), string(30, 0), string(100, 0), i64(0)
                        time:   [315.67 µs 315.75 µs 315.84 µs]
                        change: [+17.810% +17.862% +17.915%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

This does noticeably regress the performance of converting rows back into arrow arrays, as a result of having to parse additional blocks. I am inclined to think this isn't a major concern, as I have only seen this functionality used after expensive reducing operations like grouping, which will dominate execution time.

What changes are included in this PR?

Are there any user-facing changes?

github-actions bot added the arrow (Changes to the arrow crate) label on Sep 14, 2023
@@ -2221,7 +2227,7 @@ mod tests {
        }

        for r in r2.iter() {
-           assert_eq!(r.data.len(), 34);
+           assert_eq!(r.data.len(), 10);
tustvold (Contributor Author):

This is a nice demonstration of how this will reduce the impact of #4811. Whilst 10 is still worse than the 3 for dictionaries, it is a lot better than 34 😅

alamb (Contributor) left a comment:

I haven't had a chance to read this PR in detail yet. In terms of the performance regression, is there some way to dynamically pick the number of blocks row by row (based on their contents) rather than hard coding it to always use a few small blocks up front? That way we could avoid the performance regression for larger strings.

 /// to the output, followed by `0xFF_u8`. The final block is padded to 32-bytes
 /// with `0_u8` and written to the output, followed by the un-padded length in bytes
-/// of this final block as a `u8`.
+/// of this final block as a `u8`. The first 4 blocks have a length of 8, with subsequent
+/// blocks using a length of 32.
alamb (Contributor):

I think it would help to explain the rationale for using smaller blocks up front (to avoid space wastage for smaller strings).

tustvold (Contributor Author):

> dynamically pick the number of blocks row by row

I can't think of a scheme that would allow this whilst also allowing shorter strings to compare correctly against longer ones, e.g. "abc" < "bcde" < "cde".
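
To illustrate the point, a minimal hand-written sketch (not code from this PR): the byte layouts below follow the doc comment quoted above (full blocks followed by a 0xFF continuation token, final block zero-padded and followed by its un-padded length), the leading sentinel byte is omitted, and the example strings are made up.

```rust
// Hedged sketch: hand-encoded byte layouts, not the actual arrow-rs encoder.
fn main() {
    // "abcdefghi" encoded with 8-byte blocks: one full block + continuation,
    // then a final block holding "i" with un-padded length 1
    let with_small_blocks: &[u8] = &[
        b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h', 0xFF,
        b'i', 0, 0, 0, 0, 0, 0, 0, 1,
    ];

    // "abcdefghz" encoded with a single 32-byte block: value, zero padding,
    // then its un-padded length (9)
    let mut with_large_block = vec![b'a', b'b', b'c', b'd', b'e', b'f', b'g', b'h', b'z'];
    with_large_block.resize(32, 0);
    with_large_block.push(9);

    // True string order: "abcdefghi" < "abcdefghz"
    assert!(b"abcdefghi" < b"abcdefghz");
    // Encoded order disagrees: the continuation token (0xFF) at offset 8 of the
    // first row sorts it *after* the second, so per-row block sizes are unsound
    assert!(with_small_blocks > with_large_block.as_slice());
}
```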

alamb (Contributor) left a comment:

Thank you @tustvold -- I went over this change carefully

Here is my summary of the expected benefits of this PR. I wonder if you agree with this assessment?

  1. GROUP BY and ORDER BY with low-cardinality multi-column string grouping / ordering with strings shorter than 32 characters: I would expect smaller keys to make comparisons faster when the Row values are equal much of the time -- which is what happens for GROUP BY on low-cardinality grouping operations in DataFusion. TPCH Q1 is an example (return_status is a string).

  2. Memory usage when doing high-cardinality multi-column grouping or sorting with strings shorter than 32 characters: since there is less memory overhead per value, the overall memory requirements are lower.

TLDR is I think this is the right tradeoff for IOx, but I can imagine workloads where one would rather have faster conversion performance at the cost of larger memory usage.

I wonder if anyone else in the community has an opinion?

@@ -1368,12 +1368,14 @@ pub(crate) mod bytes {
    }

    impl ByteArrayNativeType for [u8] {
        #[inline]
alamb (Contributor):

inline(always)? -- at some point I thought that saying 'always' was required to get cross crate inlining to work

tustvold (Contributor Author):

That hasn't been my experience. I believe `#[inline]` makes the data available to be inlined should LLVM think it a good idea, with `#[inline(always)]` only necessary when LLVM doesn't think it a good idea (for whatever reason).
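
For reference, a small sketch of the distinction (illustrative functions only, not code from this PR):

```rust
// Both attributes expose the function body in the crate metadata so it can be
// inlined across crates; `#[inline]` leaves the decision to LLVM's heuristics,
// while `#[inline(always)]` asks for inlining even when the heuristics decline.
#[inline]
pub fn mix(x: u32) -> u32 {
    x.rotate_left(5) ^ 0x9E37_79B9
}

#[inline(always)]
pub fn mix_always(x: u32) -> u32 {
    x.rotate_left(5) ^ 0x9E37_79B9
}
```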

@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
 /// The block size of the variable length encoding
 pub const BLOCK_SIZE: usize = 32;

+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
alamb (Contributor):

Perhaps this would be a good place to add comments about the rationale for mini-blocks.

    Some(a) if a <= BLOCK_SIZE => {
        1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
    }
    Some(a) => MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1),
alamb (Contributor):

I don't understand why the padded length gets a MINI_BLOCK_COUNT on it. Is it because each miniblock contains 1 byte continuation?

Maybe a comment would help
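
One way to read the arithmetic, as a sketch: it assumes MINI_BLOCK_SIZE is BLOCK_SIZE / MINI_BLOCK_COUNT = 8 and that the leading `1 +` is the per-value prefix byte, neither of which is shown in this hunk.

```rust
// Sketch of the length accounting above; the mini-block length of 8 comes from
// the surrounding doc comments, not from this hunk.
const BLOCK_SIZE: usize = 32;
const MINI_BLOCK_COUNT: usize = 4;
const MINI_BLOCK_SIZE: usize = BLOCK_SIZE / MINI_BLOCK_COUNT; // 8

fn ceil(a: usize, b: usize) -> usize {
    (a + b - 1) / b
}

fn padded_length(a: usize) -> usize {
    if a <= BLOCK_SIZE {
        // 1 prefix byte + up to 4 mini-blocks of (8 data + 1 trailing) bytes
        1 + ceil(a, MINI_BLOCK_SIZE) * (MINI_BLOCK_SIZE + 1)
    } else {
        // The first 32 bytes are stored as 4 mini-blocks, i.e. 4 trailing bytes
        // instead of 1, so the usual `1 +` grows by MINI_BLOCK_COUNT - 1 = 3
        MINI_BLOCK_COUNT + ceil(a, BLOCK_SIZE) * (BLOCK_SIZE + 1)
    }
}

fn main() {
    assert_eq!(padded_length(3), 10);    // was 1 + 33 = 34 with a single 32-byte block
    assert_eq!(padded_length(33), 70);   // (1 + 4 * 9) + (32 + 1)
    assert_eq!(padded_length(100), 136); // (1 + 4 * 9) + 3 * (32 + 1)
}
```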

@@ -26,6 +26,12 @@ use arrow_schema::{DataType, SortOptions};
 /// The block size of the variable length encoding
 pub const BLOCK_SIZE: usize = 32;

+/// The first block is split into `MINI_BLOCK_COUNT` mini-blocks
+pub const MINI_BLOCK_COUNT: usize = 4;
alamb (Contributor):

I wonder if using a MINI_BLOCK count of 2 might get most of the memory savings but have lower CPU overhead 🤔

tustvold (Contributor Author):

I didn't find this to yield a meaningful performance delta

alamb (Contributor):

Thank you for trying

/// Returns the number of bytes of encoded data
fn decoded_len(row: &[u8], options: SortOptions) -> usize {
#[inline]
fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
alamb (Contributor):

perhaps we could add a comment here explaining what this writes (specifically continuation tokens vs BLOCK_CONTINUATION)
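
For context, a hedged approximation of what such a block encoder writes (this is not the PR's implementation; the 0xFF continuation value comes from the doc comment quoted earlier, and null/empty handling is omitted):

```rust
// Sketch only: write `val` into `out` in SIZE-byte blocks, each full block
// followed by a continuation token (0xFF; presumably the crate's BLOCK_CONTINUATION),
// the final block zero-padded and followed by its un-padded length.
// Returns the number of bytes written.
fn encode_blocks<const SIZE: usize>(out: &mut [u8], val: &[u8]) -> usize {
    let block_count = (val.len() + SIZE - 1) / SIZE;
    let mut offset = 0;
    for (i, chunk) in val.chunks(SIZE).enumerate() {
        out[offset..offset + chunk.len()].copy_from_slice(chunk);
        out[offset + chunk.len()..offset + SIZE].fill(0); // zero padding (last block)
        offset += SIZE;
        let last = i + 1 == block_count;
        out[offset] = if last { chunk.len() as u8 } else { 0xFF };
        offset += 1;
    }
    offset
}

fn main() {
    let mut out = vec![0u8; 4 * (8 + 1)]; // room for 4 mini-blocks of 8 + 1 bytes
    let written = encode_blocks::<8>(&mut out, b"hello world");
    assert_eq!(written, 18);  // two mini-blocks, each with a trailing byte
    assert_eq!(out[8], 0xFF); // first block is full: continuation token
    assert_eq!(out[17], 3);   // final block holds "rld": un-padded length 3
}
```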

tustvold (Contributor Author):

> faster conversion performance but larger memory usage.

The only regression is converting back from rows to arrays, so for DF I would not expect this to regress performance, as this is only done for grouping, where other overheads will dominate.

tustvold merged commit b64e362 into apache:master on Sep 17, 2023
25 checks passed
Labels: arrow (Changes to the arrow crate)
Successfully merging this pull request may close these issues: Row Format Adaptive Block Size