Add minhash support for MurmurHash3_x64_128 #13796
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -36,24 +36,22 @@ namespace nvtext { | |
* | ||
* Any null row entries result in corresponding null output rows. | ||
* | ||
* This function uses MurmurHash3_x86_32 for the hash algorithm. | ||
* | ||
* @throw std::invalid_argument if the width < 2 | ||
* @throw std::invalid_argument if hash_function is not HASH_MURMUR3 | ||
* | ||
* @param input Strings column to compute minhash | ||
* @param seed Seed value used for the MurmurHash3_x86_32 algorithm | ||
* @param seed Seed value used for the hash algorithm | ||
* @param width The character width used for apply substrings; | ||
* Default is 4 characters. | ||
* @param hash_function Hash algorithm to use; | ||
* Only HASH_MURMUR3 is currently supported. | ||
* @param mr Device memory resource used to allocate the returned column's device memory | ||
* @return Minhash values for each string in input | ||
*/ | ||
std::unique_ptr<cudf::column> minhash( | ||
cudf::strings_column_view const& input, | ||
cudf::numeric_scalar<cudf::hash_value_type> seed = cudf::numeric_scalar(cudf::DEFAULT_HASH_SEED), | ||
cudf::size_type width = 4, | ||
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
cudf::numeric_scalar<uint32_t> seed = 0, | ||
cudf::size_type width = 4, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** | ||
* @brief Returns the minhash values for each string per seed | ||
|
@@ -64,28 +62,84 @@ std::unique_ptr<cudf::column> minhash( | |
* string. The order of the elements in each row match the order of | ||
* the seeds provided in the `seeds` parameter. | ||
* | ||
* This function uses MurmurHash3_x86_32 for the hash algorithm. | ||
* | ||
* Any null row entries result in corresponding null output rows. | ||
* | ||
* @throw std::invalid_argument if the width < 2 | ||
* @throw std::invalid_argument if hash_function is not HASH_MURMUR3 | ||
* @throw std::invalid_argument if seeds is empty | ||
* @throw std::overflow_error if `seeds * input.size()` exceeds the column size limit | ||
* | ||
* @param input Strings column to compute minhash | ||
* @param seeds Seed values used for the MurmurHash3_x86_32 algorithm | ||
* @param seeds Seed values used for the hash algorithm | ||
* @param width The character width used for apply substrings; | ||
* Default is 4 characters. | ||
* @param hash_function Hash algorithm to use; | ||
* Only HASH_MURMUR3 is currently supported. | ||
* @param mr Device memory resource used to allocate the returned column's device memory | ||
* @return List column of minhash values for each string per seed | ||
* or a hash_value_type column if only a single seed is specified | ||
* or a UINT32 type column if only a single seed is specified | ||
*/ | ||
std::unique_ptr<cudf::column> minhash( | ||
cudf::strings_column_view const& input, | ||
cudf::device_span<cudf::hash_value_type const> seeds, | ||
cudf::device_span<uint32_t const> seeds, | ||
cudf::size_type width = 4, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** | ||
* @brief Returns the minhash value for each string | ||
* | ||
* Hash values are computed from substrings of each string and the | ||
* minimum hash value is returned for each string. | ||
* | ||
* Any null row entries result in corresponding null output rows. | ||
* | ||
* This function uses MurmurHash3_x64_128 for the hash algorithm. | ||
|
||
* The hash function returns 2 uint64 values but only the first value | ||
* is used with the minhash calculation. | ||
* | ||
* @throw std::invalid_argument if the width < 2 | ||
* | ||
* @param input Strings column to compute minhash | ||
* @param seed Seed value used for the hash algorithm | ||
* @param width The character width used for apply substrings; | ||
* Default is 4 characters. | ||
* @param mr Device memory resource used to allocate the returned column's device memory | ||
* @return Minhash values as UINT64 for each string in input | ||
*/ | ||
std::unique_ptr<cudf::column> minhash64( | ||
cudf::strings_column_view const& input, | ||
cudf::numeric_scalar<uint64_t> seed = 0, | ||
Same question as before, why the different seed choice? Is it because of the potential for unsafe casts depending on the type of
Yes, the type is different here and I'd rather the 2 functions be consistent over any need to use the constant def.
||
cudf::size_type width = 4, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** | ||
* @brief Returns the minhash values for each string per seed | ||
* | ||
* Hash values are computed from substrings of each string and the | ||
* minimum hash value is returned for each string for each seed. | ||
* Each row of the list column are seed results for the corresponding | ||
* string. The order of the elements in each row match the order of | ||
* the seeds provided in the `seeds` parameter. | ||
* | ||
* This function uses MurmurHash3_x64_128 for the hash algorithm. | ||
* | ||
* Any null row entries result in corresponding null output rows. | ||
* | ||
* @throw std::invalid_argument if the width < 2 | ||
* @throw std::invalid_argument if seeds is empty | ||
* @throw std::overflow_error if `seeds * input.size()` exceeds the column size limit | ||
* | ||
* @param input Strings column to compute minhash | ||
* @param seeds Seed values used for the hash algorithm | ||
* @param width The character width used for apply substrings; | ||
* Default is 4 characters. | ||
* @param mr Device memory resource used to allocate the returned column's device memory | ||
* @return List column of minhash values for each string per seed | ||
* or a UINT64 type column if only a single seed is specified | ||
*/ | ||
std::unique_ptr<cudf::column> minhash64( | ||
cudf::strings_column_view const& input, | ||
cudf::device_span<uint64_t const> seeds, | ||
cudf::size_type width = 4, | ||
cudf::hash_id hash_function = cudf::hash_id::HASH_MURMUR3, | ||
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()); | ||
|
||
/** @} */ // end of group | ||
|
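The doc comments in this diff describe minhash as hashing the substrings of each string and keeping the minimum hash value, once per seed. A minimal host-side sketch of that idea follows. It is not the libcudf implementation: it assumes byte-wise rather than character-wise substrings, uses a seeded FNV-1a as a stand-in for MurmurHash3_x64_128, and hashes strings shorter than `width` as a whole. All function names here are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in hash: seeded 64-bit FNV-1a over s[pos, pos + len).
// Assumption: this is NOT MurmurHash3; it only keeps the sketch self-contained.
inline std::uint64_t seeded_fnv1a(std::string const& s,
                                  std::size_t pos,
                                  std::size_t len,
                                  std::uint64_t seed)
{
  std::uint64_t h = 14695981039346656037ULL ^ seed;
  for (std::size_t i = pos; i < pos + len; ++i) {
    h ^= static_cast<unsigned char>(s[i]);
    h *= 1099511628211ULL;
  }
  return h;
}

// Minimum hash over every window of `width` bytes for a single seed.
// Assumption: a string no longer than `width` is hashed as one window.
inline std::uint64_t minhash_one(std::string const& s, std::uint64_t seed, std::size_t width)
{
  if (s.size() <= width) { return seeded_fnv1a(s, 0, s.size(), seed); }
  auto result = std::numeric_limits<std::uint64_t>::max();
  for (std::size_t pos = 0; pos + width <= s.size(); ++pos) {
    result = std::min(result, seeded_fnv1a(s, pos, width, seed));
  }
  return result;
}

// One minhash value per seed, in the same order as `seeds`.
inline std::vector<std::uint64_t> minhash_per_seed(std::string const& s,
                                                   std::vector<std::uint64_t> const& seeds,
                                                   std::size_t width = 4)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (seeds.empty()) { throw std::invalid_argument("seeds must not be empty"); }
  std::vector<std::uint64_t> out;
  out.reserve(seeds.size());
  for (auto const seed : seeds) {
    out.push_back(minhash_one(s, seed, width));
  }
  return out;
}
```

Calling `minhash_per_seed("hello world", {0, 1, 2})` yields three values, one per seed, which mirrors the per-row list layout the doc comment describes for the multi-seed overloads.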
Why switch to hardcoding the value here instead of using the constant?
Because the type may be different. I'd rather be clear that the default seed is actually 0 and would not want to change that if the rest of libcudf decided on a different default. Hopefully that is ok.
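The point about the seed type can be illustrated with a small sketch. The names `scalar_sketch`, `DEFAULT_SEED_SKETCH`, `default_seed32`, and `default_seed64` are hypothetical stand-ins, not libcudf types or functions. The sketch only shows, under those assumptions, that a default built from a shared constant takes on that constant's type, whereas a literal `0` adapts to whichever seed type each function declares.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a typed scalar wrapper.
template <typename T>
struct scalar_sketch {
  T value;
  constexpr scalar_sketch(T v) : value(v) {}
};

// Hypothetical library-wide default seed with a fixed 32-bit type.
constexpr std::uint32_t DEFAULT_SEED_SKETCH = 0;

// Class template argument deduction follows the constant's type,
// so this default would always be a 32-bit scalar.
inline auto seed_from_constant() { return scalar_sketch(DEFAULT_SEED_SKETCH); }

// A literal 0 converts to whichever seed type the parameter declares,
// so both functions share the same default value regardless of width.
inline std::uint32_t default_seed32(scalar_sketch<std::uint32_t> seed = 0) { return seed.value; }
inline std::uint64_t default_seed64(scalar_sketch<std::uint64_t> seed = 0) { return seed.value; }

static_assert(std::is_same_v<decltype(seed_from_constant()), scalar_sketch<std::uint32_t>>,
              "the deduced scalar type tracks the constant, not the function");
```

With the literal default, changing the hypothetical shared constant later would not silently change what `default_seed32()` or `default_seed64()` return.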