-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
StructuredOptions: hhl_row_hashing
#841
StructuredOptions: hhl_row_hashing
#841
Conversation
interested to know if |
if not self.options.hll_row_hashing: | ||
try: | ||
self.hashed_row_dict.update( | ||
dict.fromkeys(pd.util.hash_pandas_object(data, index=False), True) | ||
) | ||
except TypeError: | ||
self.hashed_row_dict.update( | ||
dict.fromkeys( | ||
pd.util.hash_pandas_object(data.astype(str), index=False), True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not change the profile at all when pr the option set. Any logical usage in the profile can be after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready for re-review
@@ -1228,15 +1229,17 @@ def __init__( | |||
# Non-Option variables | |||
self.null_values = null_values | |||
self.column_null_values = column_null_values | |||
self.hll_row_hashing = hll_row_hashing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be under row_statistics since that's the hierarchy in the code. We will have to make a class for row statistics instead of using booleanoptions which will be similar to correlations like on https://github.com/capitalone/DataProfiler/pull/841/files#diff-00da4b3383cf04d1ea70e7d179d78433f0bfde3ac6909c791eb92363c5a34f3bR1225
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready for re-review
@@ -982,6 +982,64 @@ def _validate_helper(self, variable_path: str = "CorrelationOptions") -> list[st | |||
return errors | |||
|
|||
|
|||
class HyperLogLogOptions(BooleanOption): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created a nested HLL options class because seed is only pertinent when HLL is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename to UniqueCountOptions
. Here we should have a str choice to use full
or hll_row_hashing
in addition we can have the hll_options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ready for re-review
@@ -0,0 +1,71 @@ | |||
from dataprofiler.profilers.profiler_options import RowStatisticsOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unsure about these tests, not sure how to handle nested option testing.
:vartype seed: int | ||
""" | ||
BooleanOption.__init__(self, is_enabled=is_enabled) | ||
self.seed = seed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should I also include "p" which represents the 2^p registers in the HLL table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes per conversation with the team we should implement and option for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking for 0.9.0 release
:vartype is_enabled: bool | ||
""" | ||
BooleanOption.__init__(self, is_enabled=is_enabled) | ||
self.hll_row_hashing = HyperLogLogOptions(is_enabled=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of hll_row_hashing
, this is an option wrt the hashes for unique_count
. Hence, we should do more like:
self.unique_count
so we can enable or disable that. Then, under unique_count
we can have the choice between using hll_row_hashing
or full_hashing
w/ the ability to set the parameters for hll.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed, lmk your thoughts
@@ -982,6 +982,74 @@ def _validate_helper(self, variable_path: str = "CorrelationOptions") -> list[st | |||
return errors | |||
|
|||
|
|||
class HyperLogLogOptions(BooleanOption): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree -- let's make this a UniqueCountOptions
with more granular toggles of full
or hll_row_hashing
options beneath that.
optpth = self.get_options_path() | ||
|
||
# Default configuration | ||
option = self.get_options(seed=0, register_count=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to test default, I suggest not overriding. i.e. just do option = self.get_options()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
…is/DataProfiler into hashed_row_object_options
return errors | ||
|
||
|
||
class UniqueCountOptions(BooleanOption): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious if RowStatisticsOptions
, UniqueCountOptions
, or HyperLogLogOptions
should inherit BooleanOption
or BaseInspectorOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it should be a BaseInspectorOption
(e.g. TextProfilerOptions
in UnstructuredOptions
)
:vartype is_enabled: bool | ||
""" | ||
BooleanOption.__init__(self, is_enabled=is_enabled) | ||
self.full_hashing = BooleanOption(is_enabled=full_hashing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should full_hashing
be a BooleanOption
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a boolean since there are no properties for full_hashing
... its just True
/ False
binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea, but the issue is what happens if someone specifies true for both? Instead, we might want a string input that selects one or the other. good call below in the validate. Or we need a way to toggle. Open to thoughts.
hhl_row_hashing
optionDefault is original row hashing method:
hhl_row_hashing=False