Add utf-8 validation checking for `substring` #1577

HaoYang670 · 2022-04-17T09:39:30Z

Which issue does this PR close?

What changes are included in this PR?

Check if the offsets of substrings are at valid utf-8 char boundary.
Add another substring benchmark.

Are there any user-facing changes?

No.

Benchmark

substring (start = 0, length = None)                                                                            
                        time:   [19.795 ms 19.810 ms 19.828 ms]
                        change: [-1.4682% -1.3201% -1.1834%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

substring (start = 1, length = str_len - 1)                                                                            
                        time:   [21.777 ms 21.798 ms 21.820 ms]
                        change: [+9.0489% +9.2069% +9.3504%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

We only check the char boundary validation when start != 0 or length != None.

update doc add a test for invalid array type Signed-off-by: remzi <13716567376yh@gmail.com>

clean up Signed-off-by: remzi <13716567376yh@gmail.com>

Signed-off-by: remzi <13716567376yh@gmail.com>

codecov-commenter · 2022-04-17T10:03:37Z

Codecov Report

Merging #1577 (5a8dfc3) into master (db4cb8f) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1577      +/-   ##
==========================================
+ Coverage   82.93%   82.95%   +0.01%     
==========================================
  Files         193      193              
  Lines       55350    55373      +23     
==========================================
+ Hits        45907    45934      +27     
+ Misses       9443     9439       -4

Impacted Files	Coverage Δ
arrow/src/compute/kernels/substring.rs	`100.00% <100.00%> (+1.68%)`	⬆️
arrow/src/datatypes/field.rs	`54.32% <0.00%> (-0.30%)`	⬇️
arrow/src/array/transform/mod.rs	`86.46% <0.00%> (+0.11%)`	⬆️
parquet/src/encodings/encoding.rs	`93.56% <0.00%> (+0.18%)`	⬆️
parquet_derive/src/parquet_field.rs	`66.43% <0.00%> (+0.22%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update db4cb8f...5a8dfc3. Read the comment docs.

tustvold

Looking good, left some comments

tustvold · 2022-04-17T10:18:52Z

arrow/src/compute/kernels/substring.rs

-        Box::new(|old_start: OffsetSize, end: OffsetSize| (end + start).max(old_start))
+    // check the `idx` is a valid char boundary or not
+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {


I wonder if we could instead interpret data as &str with std::str::from_utf8_unchecked, which should be valid if this is a valid string array, and then use the standard library is_char_boundary instead of implementing our own

tustvold · 2022-04-17T10:20:25Z

arrow/src/compute/kernels/substring.rs

+    // If yes, return `idx`, else return error
+    let check_char_boundary = |idx: OffsetSize| {
+        let idx_usize = idx.to_usize().unwrap();
+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {


Suggested change

if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {

// A valid code-point iff it does not start with 0b10xxxxxx

// Bit-magic taken from `std::str::is_char_boundary`

if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 {

There is a bit-magic trick to save an operator, although it is possible LLVM is smart enough to notice this. I also think it would be better to just use the standard library version directly if possible.

tustvold · 2022-04-17T10:22:37Z

arrow/src/compute/kernels/substring.rs

+    #[test]
+    fn check_invalid_array_type() {
+        let array = Int32Array::from(vec![Some(1), Some(2), Some(3)]);
+        assert_eq!(substring(&array, 0, None).is_err(), true);


I think it would make these tests easier to read, and also potentially less fragile if they checked the error message, e.g. something like

let err = substring(&array, 0, None).unwrap_err().to_string(); assert!(err.contains("foo"), "{}", err);

tustvold · 2022-04-17T10:37:10Z

arrow/src/compute/kernels/substring.rs

-            Box::new(|start: OffsetSize, end: OffsetSize| (*length).min(end - start))
-        } else {
-            Box::new(|start: OffsetSize, end: OffsetSize| end - start)
+    // function for calculating the start index of each string


Suggested change

// function for calculating the start index of each string

// Function for calculating the start index of each string

//

// Takes the start and end offsets of the original string, and returns the byte offset

// to start copying data from the source values array

//

// Returns an error if this offset is not at a valid UTF-8 char boundary

Or something to that effect, it wasn't clear to me whether this was the offset in the new array or the old

tustvold · 2022-04-17T10:37:17Z

arrow/src/compute/kernels/substring.rs

+            Ordering::Greater => Box::new(|old_start, end| {
+                check_char_boundary((old_start + start).min(end))
+            }),
+            Ordering::Equal => Box::new(|old_start, _| Ok(old_start)),


tustvold · 2022-04-17T10:43:08Z

arrow/src/compute/kernels/substring.rs

-        let new_length = cal_new_length(new_start, pair[1]);
-        len_so_far += new_length;
-        new_starts_ends.push((new_start, new_start + new_length));
+    offsets.windows(2).try_for_each(|pair| -> Result<()> {


I'm curious if the dyn dispatch functions perform better than letting LLVM perform loop unswitching, I would have thought eliminating the dyn-dispatch would be both clearer and faster

The reasons that I use dyn Fn are:

We should do the comparison for start and the destruction for length outside the for loop. And using closure is a convenient way, otherwise there would be lots of duplicate code.

Closures have different sizes because they capture different environment.

The dynamic dispatching doesn't impact the performance in this function because they are not hotspots.

My point was that modern optimising compilers will automatically lift conditionals out of loops, in a process called loop unswitching, and that letting it do this can be faster not just from eliminating dynamic dispatch, but also letting it further optimise the code.

I don't feel strongly about it, but thought I'd mention that you might be able to make the code clearer and faster by letting the compiler handle this

Thank you for your explanation. Let me have a try!

tustvold · 2022-04-17T10:43:37Z

arrow/src/compute/kernels/substring.rs

+            }),
+        };
+
+    // function for calculating the end index of each string


Some documentation of this functions inputs and output would be helpful I think

tustvold · 2022-04-17T10:45:18Z

arrow/benches/string_kernels.rs

@@ -39,10 +39,10 @@ fn add_benchmark(c: &mut Criterion) {
    let str_len = 1000;

    let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
-    let start = 0;
+    let start = 1;


Perhaps we could benchmark the various permutations of no start offset and no length?

Done! And I have updated the benchmark result.

The validation checking is only enabled when start !=0 or length != None. So there is no performance cost for the first micro-bench

viirya · 2022-04-17T20:05:44Z

arrow/src/compute/kernels/substring.rs

 /// ```
 ///
 /// # Error
-/// this function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when the passed array is not a \[Large\]String array.
+/// - The function errors when you try to create a substring in the middle of a multibyte character.


Hm, multibyte character sounds a bit vague, maybe just say invalid utf-8 format?

viirya · 2022-04-17T20:22:36Z

arrow/src/compute/kernels/substring.rs

+        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+            Ok(idx)
+        } else {
+            Err(ArrowError::ComputeError(format!("The index {} is invalid because {:#010b} is not the head byte of a char.", idx_usize, data[idx_usize])))


The error message may not be understood by end-users. Maybe just say "The substring begins at index {} is an invalid utf8 char because it is not the first byte in a utf8 code point sequence"?

Signed-off-by: remzi <13716567376yh@gmail.com>

update doc Signed-off-by: remzi <13716567376yh@gmail.com>

Signed-off-by: remzi <13716567376yh@gmail.com>

HaoYang670 · 2022-04-19T01:05:18Z

@viirya @tustvold please review. Thank you very much!

tustvold

This looks good to me - thank you 👍

I'll leave this open for a bit longer to give others the chance to review, otherwise I'll merge this in this afternoon (GMT)

arrow/src/compute/kernels/substring.rs

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

viirya

lgtm, thanks!

HaoYang670 added 3 commits April 17, 2022 17:08

add utf-8 validation checking

56f7459

update doc add a test for invalid array type Signed-off-by: remzi <13716567376yh@gmail.com>

add tests

2ac4647

clean up Signed-off-by: remzi <13716567376yh@gmail.com>

merge master

a74dbea

Signed-off-by: remzi <13716567376yh@gmail.com>

github-actions bot added the arrow Changes to the arrow crate label Apr 17, 2022

test the worst case

fce157b

Signed-off-by: remzi <13716567376yh@gmail.com>

tustvold reviewed Apr 17, 2022

View reviewed changes

viirya reviewed Apr 17, 2022

View reviewed changes

HaoYang670 added 6 commits April 18, 2022 11:06

update doc and tests

7237316

Signed-off-by: remzi <13716567376yh@gmail.com>

update doc

cce4339

Signed-off-by: remzi <13716567376yh@gmail.com>

use std method is_char_boundary

cc94518

update doc Signed-off-by: remzi <13716567376yh@gmail.com>

merge master

a84f454

Signed-off-by: remzi <13716567376yh@gmail.com>

add 2 substring benches

5a8dfc3

Signed-off-by: remzi <13716567376yh@gmail.com>

replace dyn Fn with loop unswitching

746e2b2

Signed-off-by: remzi <13716567376yh@gmail.com>

tustvold approved these changes Apr 19, 2022

View reviewed changes

arrow/src/compute/kernels/substring.rs Show resolved Hide resolved

Update arrow/src/compute/kernels/substring.rs

e81285f

Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>

HaoYang670 mentioned this pull request Apr 19, 2022

Epic: enhance the substring kernel #1531

Open

12 tasks

viirya approved these changes Apr 19, 2022

View reviewed changes

tustvold merged commit 43f4e16 into apache:master Apr 19, 2022

HaoYang670 deleted the issue1575_add_utf8_validation_checking branch June 22, 2022 07:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add utf-8 validation checking for `substring` #1577

Add utf-8 validation checking for `substring` #1577

HaoYang670 commented Apr 17, 2022 •

edited

Loading

codecov-commenter commented Apr 17, 2022 •

edited

Loading

tustvold left a comment

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

tustvold Apr 17, 2022

tustvold Apr 17, 2022 •

edited

Loading

HaoYang670 Apr 18, 2022

tustvold Apr 18, 2022

HaoYang670 Apr 18, 2022

HaoYang670 Apr 19, 2022

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

tustvold Apr 17, 2022

HaoYang670 Apr 18, 2022

HaoYang670 Apr 18, 2022

viirya Apr 17, 2022

HaoYang670 Apr 18, 2022

viirya Apr 17, 2022

HaoYang670 Apr 18, 2022

HaoYang670 commented Apr 19, 2022

tustvold left a comment

viirya left a comment

-        if idx_usize == data.len() || data[idx_usize] >> 6 != 0b10_u8 {
+        // A valid code-point iff it does not start with 0b10xxxxxx
+        // Bit-magic taken from `std::str::is_char_boundary`
+        if idx_usize == data.len() || (data[idx_usize] as i8) < -0x40 {

-    // function for calculating the start index of each string
+    // Function for calculating the start index of each string
+    //
+    // Takes the start and end offsets of the original string, and returns the byte offset
+    // to start copying data from the source values array
+    //
+    // Returns an error if this offset is not at a valid UTF-8 char boundary

Add utf-8 validation checking for substring #1577

Add utf-8 validation checking for substring #1577

Conversation

HaoYang670 commented Apr 17, 2022 • edited Loading

Which issue does this PR close?

What changes are included in this PR?

Are there any user-facing changes?

Benchmark

codecov-commenter commented Apr 17, 2022 • edited Loading

Codecov Report

tustvold left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tustvold Apr 17, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HaoYang670 commented Apr 19, 2022

tustvold left a comment

Choose a reason for hiding this comment

viirya left a comment

Choose a reason for hiding this comment

Add utf-8 validation checking for `substring` #1577

Add utf-8 validation checking for `substring` #1577

HaoYang670 commented Apr 17, 2022 •

edited

Loading

codecov-commenter commented Apr 17, 2022 •

edited

Loading

tustvold Apr 17, 2022 •

edited

Loading