Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Introduce utils::hash for StructArray #8552

Merged
merged 7 commits into from
Jan 3, 2024

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #.
We will need this for #7893

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thanks @jayzhan211

datafusion/common/src/hash_utils.rs Outdated Show resolved Hide resolved
datafusion/common/src/hash_utils.rs Show resolved Hide resolved
};

for i in valid_indices {
let mut values_hashes = vec![0u64; array.len()];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you create a new vec for each iteration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create each values_hashes for each column. And combine those hashed element in column to one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The values_hashes can be created upfront, outside of the loop.

@alamb alamb requested a review from Dandandan December 15, 2023 19:14
let column_values = values[i].clone();
// create hashes for each column
create_hashes(&[column_values], random_state, &mut values_hashes)?;
let hash = &mut hashes_buffer[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not correct, as i here to a column in the struct and hashes_buffer to a hash of one value?

Copy link
Contributor Author

@jayzhan211 jayzhan211 Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan. Not fully got your point. I do consider each column as one value (iterate each row in one column with get one final hash hashes_buffer[i]).

i is the column index in the struct array and is also the hashes_buffer index.
For example, column1, null, column3, column4. We iterate 0, 2, and 3. And hash each of them to hashes_buffer[0], hashes_buffer[2], and hashes_buffer[3] respectively. A column-wise hashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I mistake how the null buffer works in StructArray. It seems to be row-wise, but I consider it as column-wise...

Copy link
Contributor Author

@jayzhan211 jayzhan211 Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks pretty much better now :)

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
let valid_indices: Vec<usize> = if let Some(nulls) = nulls {
nulls.valid_indices().collect()
} else {
(0..num_columns).collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't yet seem correct, valid_indices contains the rows that are valid, not the columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hash each row now, so we iterate the valid row and hash each column at that row to one hash value.

};

let mut values_hashes = vec![0u64; array.len()];
create_hashes(array.columns(), random_state, &mut values_hashes)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to use hashes_buffer here directly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, we should be able to remove the code related to valid_indices, as this already takes care of nulls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think so.

values_hashes: [9258723240401068087, 9258723240401068087, 8502738074356456021, 8502738074356456021, 4222447303697976283, 9753707356376286577]
hashes_buffer: [9258723240401091360, 9258723240401091360, 0, 8502738074356479294, 0, 0]

We can see that create_hashes does not consider if the row is valid or not. Therefore, we need to iterate valid_indices once .

@@ -327,12 +354,16 @@ pub fn create_hashes<'a>(
array => hash_dictionary(array, random_state, hashes_buffer, rehash)?,
_ => unreachable!()
}
DataType::Struct(_) => {
let array = as_struct_array(array)?;
hash_struct_array(array, random_state, hashes_buffer)?;
Copy link
Contributor

@Dandandan Dandandan Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work?

Suggested change
hash_struct_array(array, random_state, hashes_buffer)?;
create_hashes(array.columns(), random_state, hashes_buffer)?;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we consider nulls, this doesn't work

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
@alamb
Copy link
Contributor

alamb commented Dec 20, 2023

@Dandandan do you think this PR is ready to merge?

@jayzhan211
Copy link
Contributor Author

@alamb @Dandandan Any update on this?

@alamb alamb merged commit 1179a76 into apache:main Jan 3, 2024
22 checks passed
@alamb
Copy link
Contributor

alamb commented Jan 3, 2024

Thank you @jayzhan211 @waynexia and @Dandandan

@jayzhan211
Copy link
Contributor Author

I found the current implementation failed this

    #[test]
    // Tests actual values of hashes, which are different if forcing collisions
    #[cfg(not(feature = "force_hash_collisions"))]
    fn create_hashes_for_struct_arrays_more_column_than_row() {
        let struct_array = StructArray::from(
            vec![
                (
                    Arc::new(Field::new("bool", DataType::Boolean, false)),
                    Arc::new(BooleanArray::from(vec![
                        false, false
                    ])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-1", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-2", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
                (
                    Arc::new(Field::new("i32-3", DataType::Int32, false)),
                    Arc::new(Int32Array::from(vec![10, 10])) as ArrayRef,
                ),
            ],
        );

        assert!(struct_array.is_valid(0));
        assert!(struct_array.is_valid(1));

        let array = Arc::new(struct_array) as ArrayRef;

        let random_state = RandomState::with_seeds(0, 0, 0, 0);
        let mut hashes = vec![0; array.len()];
        create_hashes(&[array], &random_state, &mut hashes).unwrap();
        assert_eq!(hashes[0], hashes[1]);
    }

Let me checkout the reason

thinkharderdev pushed a commit to coralogix/arrow-datafusion that referenced this pull request Feb 26, 2024
* hash struct

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* row-wise hash

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* fmt

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* create hashes once

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* cleanup

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

* add comment

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>

---------

Signed-off-by: jayzhan211 <jayzhan211@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants