Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StructuredOptions: hhl_row_hashing #841

Merged
merged 17 commits into from
Jun 6, 2023

Conversation

micdavis
Copy link
Contributor

  • added hhl_row_hashing option
    Default is original row hashing method: hhl_row_hashing=False

@micdavis
Copy link
Contributor Author

interested to know if hll_row_hashing_enabled makes more sense, or something different that what I have

@taylorfturner taylorfturner enabled auto-merge (squash) May 24, 2023 15:41
@taylorfturner taylorfturner added Refactor Code that is being modified to improve the library Mend: license policy violation License policy violation detected by WhiteSource and removed Mend: license policy violation License policy violation detected by WhiteSource labels May 24, 2023
Comment on lines 1931 to 1940
if not self.options.hll_row_hashing:
try:
self.hashed_row_dict.update(
dict.fromkeys(pd.util.hash_pandas_object(data, index=False), True)
)
except TypeError:
self.hashed_row_dict.update(
dict.fromkeys(
pd.util.hash_pandas_object(data.astype(str), index=False), True
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not change the profile at all when pr the option set. Any logical usage in the profile can be after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready for re-review

@@ -1228,15 +1229,17 @@ def __init__(
# Non-Option variables
self.null_values = null_values
self.column_null_values = column_null_values
self.hll_row_hashing = hll_row_hashing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be under row_statistics since that's the hierarchy in the code. We will have to make a class for row statistics instead of using booleanoptions which will be similar to correlations like on https://github.com/capitalone/DataProfiler/pull/841/files#diff-00da4b3383cf04d1ea70e7d179d78433f0bfde3ac6909c791eb92363c5a34f3bR1225

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready for re-review

@@ -982,6 +982,64 @@ def _validate_helper(self, variable_path: str = "CorrelationOptions") -> list[st
return errors


class HyperLogLogOptions(BooleanOption):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a nested HLL options class because seed is only pertinent when HLL is enabled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename to UniqueCountOptions. Here we should have a str choice to use full or hll_row_hashing in addition we can have the hll_options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ready for re-review

@@ -0,0 +1,71 @@
from dataprofiler.profilers.profiler_options import RowStatisticsOptions
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure about these tests, not sure how to handle nested option testing.

:vartype seed: int
"""
BooleanOption.__init__(self, is_enabled=is_enabled)
self.seed = seed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should I also include "p" which represents the 2^p registers in the HLL table

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes per conversation with the team we should implement and option for this

Copy link
Contributor

@taylorfturner taylorfturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking for 0.9.0 release

:vartype is_enabled: bool
"""
BooleanOption.__init__(self, is_enabled=is_enabled)
self.hll_row_hashing = HyperLogLogOptions(is_enabled=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of hll_row_hashing, this is an option wrt the hashes for unique_count. Hence, we should do more like:
self.unique_count so we can enable or disable that. Then, under unique_count we can have the choice between using hll_row_hashing or full_hashing w/ the ability to set the parameters for hll.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed, lmk your thoughts

@@ -982,6 +982,74 @@ def _validate_helper(self, variable_path: str = "CorrelationOptions") -> list[st
return errors


class HyperLogLogOptions(BooleanOption):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree -- let's make this a UniqueCountOptions with more granular toggles of full or hll_row_hashing options beneath that.

@micdavis micdavis added the Work In Progress Solution is being developed label Jun 6, 2023
optpth = self.get_options_path()

# Default configuration
option = self.get_options(seed=0, register_count=10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to test default, I suggest not overriding. i.e. just do option = self.get_options()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@micdavis micdavis removed the Work In Progress Solution is being developed label Jun 6, 2023
return errors


class UniqueCountOptions(BooleanOption):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious if RowStatisticsOptions, UniqueCountOptions, or HyperLogLogOptions should inherit BooleanOption or BaseInspectorOptions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@taylorfturner taylorfturner Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it should be a BaseInspectorOption (e.g. TextProfilerOptions in UnstructuredOptions)

:vartype is_enabled: bool
"""
BooleanOption.__init__(self, is_enabled=is_enabled)
self.full_hashing = BooleanOption(is_enabled=full_hashing)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should full_hashing be a BooleanOption?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a boolean since there are no properties for full_hashing ... its just True / False binary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good idea, but the issue is what happens if someone specifies true for both? Instead, we might want a string input that selects one or the other. good call below in the validate. Or we need a way to toggle. Open to thoughts.

@taylorfturner taylorfturner merged commit 0f2c916 into capitalone:main Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Refactor Code that is being modified to improve the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants