Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add utf-8 validation checking for substring #1577

Merged

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Apr 17, 2022

Which issue does this PR close?

Closes #1575

What changes are included in this PR?

  1. Check if the offsets of substrings are at valid utf-8 char boundary.
  2. Add another substring benchmark.

Are there any user-facing changes?

No.

Benchmark

substring (start = 0, length = None)                                                                            
                        time:   [19.795 ms 19.810 ms 19.828 ms]
                        change: [-1.4682% -1.3201% -1.1834%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

substring (start = 1, length = str_len - 1)                                                                            
                        time:   [21.777 ms 21.798 ms 21.820 ms]
                        change: [+9.0489% +9.2069% +9.3504%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

We only check the char boundary validation when start != 0 or length != None.

update doc
add a test for invalid array type

Signed-off-by: remzi <13716567376yh@gmail.com>
clean up

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 17, 2022
Signed-off-by: remzi <13716567376yh@gmail.com>
@codecov-commenter
Copy link

codecov-commenter commented Apr 17, 2022

Codecov Report

Merging #1577 (5a8dfc3) into master (db4cb8f) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1577      +/-   ##
==========================================
+ Coverage   82.93%   82.95%   +0.01%     
==========================================
  Files         193      193              
  Lines       55350    55373      +23     
==========================================
+ Hits        45907    45934      +27     
+ Misses       9443     9439       -4     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 100.00% <100.00%> (+1.68%) ⬆️
arrow/src/datatypes/field.rs 54.32% <0.00%> (-0.30%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️
parquet/src/encodings/encoding.rs 93.56% <0.00%> (+0.18%) ⬆️
parquet_derive/src/parquet_field.rs 66.43% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db4cb8f...5a8dfc3. Read the comment docs.

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left some comments

Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
// check the `idx` is a valid char boundary or not
// If yes, return `idx`, else return error
let check_char_boundary = |idx: OffsetSize| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could instead interpret data as &str with std::str::from_utf8_unchecked, which should be valid if this is a valid string array, and then use the standard library is_char_boundary instead of implementing our own

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

// If yes, return `idx`, else return error
let check_char_boundary = |idx: OffsetSize| {
let idx_usize = idx.to_usize().unwrap();
if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
// A valid code-point iff it does not start with 0b10xxxxxx
// Bit-magic taken from `std::str::is_char_boundary`
if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 {

There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

#[test]
fn check_invalid_array_type() {
let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
assert_eq!(substring(&array, 0, None).is_err(), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like

let err = substring(&array, 0, None).unwrap_err().to_string();
assert!(err.contains("foo"), "{}", err);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
} else {
Box::new(|start: OffsetSize, end: OffsetSize| end - start)
// function for calculating the start index of each string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// function for calculating the start index of each string
// Function for calculating the start index of each string
//
// Takes the start and end offsets of the original string, and returns the byte offset
// to start copying data from the source values array
//
// Returns an error if this offset is not at a valid UTF-8 char boundary

Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Ordering::Greater => Box::new(|old_start, end| {
check_char_boundary((old_start + start).min(end))
}),
Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

let new_length = cal_new_length(new_start, pair[1]);
len_so_far += new_length;
new_starts_ends.push((new_start, new_start + new_length));
offsets.windows(2).try_for_each(|pair| -> Result<()> {
Copy link
Contributor

@tustvold tustvold Apr 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought eliminating the dyn-dispatch would be both clearer and faster

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasons that I use dyn Fn are:

  1. We should do the comparison for start and the destruction for length outside the for loop. And using closure is a convenient way, otherwise there would be lots of duplicate code.
  2. Closures have different sizes because they capture different environment.
  3. The dynamic dispatching doesn't impact the performance in this function because they are not hotspots.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that modern optimising compilers will automatically lift conditionals out of loops, in a process called loop unswitching, and that letting it do this can be faster not just from eliminating dynamic dispatch, but also letting it further optimise the code.

I don't feel strongly about it, but thought I'd mention that you might be able to make the code clearer and faster by letting the compiler handle this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your explanation. Let me have a try!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

}),
};

// function for calculating the end index of each string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some documentation of this functions inputs and output would be helpful I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
let str_len = 1000;

let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
let start = 0;
let start = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could benchmark the various permutations of no start offset and no length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! And I have updated the benchmark result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation checking is only enabled when start !=0 or length != None. So there is no performance cost for the first micro-bench

/// ```
///
/// # Error
/// this function errors when the passed array is not a \[Large\]String array.
/// - The function errors when the passed array is not a \[Large\]String array.
/// - The function errors when you try to create a substring in the middle of a multibyte character.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, multibyte character sounds a bit vague, maybe just say invalid utf-8 format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
Ok(idx)
} else {
Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message may not be understood by end-users. Maybe just say "The substring begins at index {} is an invalid utf8 char because it is not the first byte in a utf8 code point sequence"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
update doc

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@HaoYang670
Copy link
Contributor Author

@viirya @tustvold please review. Thank you very much!

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me - thank you 👍

I'll leave this open for a bit longer to give others the chance to review, otherwise I'll merge this in this afternoon (GMT)

arrow/src/compute/kernels/substring.rs Show resolved Hide resolved
Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks!

@tustvold tustvold merged commit 43f4e16 into apache:master Apr 19, 2022
@HaoYang670 HaoYang670 deleted the issue1575_add_utf8_validation_checking branch June 22, 2022 07:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add utf-8 validation checking to substring kernel
4 participants