This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved performance of utf8 check for ascii-only (-40% parquet reading ascii-only columns) #541

Closed
wants to merge 1 commit into from
5 changes: 5 additions & 0 deletions src/array/specification.rs
@@ -75,6 +75,11 @@ pub fn check_offsets_minimal<O: Offset>(offsets: &[O], values_len: usize) -> usi
/// * any slice of `values` between two consecutive pairs from `offsets` is invalid `utf8`, or
/// * any offset is larger or equal to `values_len`.
pub fn check_offsets_and_utf8<O: Offset>(offsets: &[O], values: &[u8]) {
+    if values.iter().all(|x| *x <= 127) {
+        // all values are ASCII => each element is valid utf8 (we only need to check offsets)
+        return check_offsets(offsets, values.len());
+    }
+
jorgecarleitao marked this conversation as resolved.
    offsets.windows(2).for_each(|window| {
        let start = window[0].to_usize();
        let end = window[1].to_usize();
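
The fast path in this change relies on the fact that every ASCII byte is a complete UTF-8 code point, so any sub-slice of an all-ASCII buffer is valid UTF-8 and only the offsets still need checking. A minimal standalone sketch of that idea follows. It is not the crate's code: `offsets_are_valid` and `offsets_and_utf8_are_valid` are hypothetical names, plain `usize` offsets stand in for the `Offset` trait, the functions return `bool` instead of panicking, and the bounds rule (last offset `<= values.len()`) is an assumption. The standard library's `<[u8]>::is_ascii` is used here as an equivalent of `values.iter().all(|x| *x <= 127)`.

```rust
/// Hypothetical stand-in for the crate's offset check (assumption):
/// offsets must be non-decreasing and the last one must fit in the buffer.
fn offsets_are_valid(offsets: &[usize], values_len: usize) -> bool {
    let monotonic = offsets.windows(2).all(|w| w[0] <= w[1]);
    let in_bounds = offsets.last().map_or(true, |&last| last <= values_len);
    monotonic && in_bounds
}

/// Returns true when the offsets are valid and every slice of `values`
/// between two consecutive offsets is valid UTF-8.
fn offsets_and_utf8_are_valid(offsets: &[usize], values: &[u8]) -> bool {
    if !offsets_are_valid(offsets, values.len()) {
        return false;
    }
    // Fast path: a single pass over the whole buffer. If every byte is
    // ASCII, no slice can split a multi-byte sequence, so all are valid.
    if values.is_ascii() {
        return true;
    }
    // Slow path: validate each slice separately. Indexing cannot panic
    // because the offsets were checked above.
    offsets
        .windows(2)
        .all(|w| std::str::from_utf8(&values[w[0]..w[1]]).is_ok())
}
```

With the buffer `b"h\xC3\xA9llo"` (the bytes of "héllo"), offsets `[0, 3, 6]` pass because both slices end on character boundaries, while `[0, 2, 6]` fail because the first slice cuts the two-byte `é` in half. An all-ASCII buffer such as `b"hello"` never reaches the per-slice loop.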