Decode categorical #786
Conversation
```python
@@ -26,7 +26,7 @@ def default(self, to_serialize):
            numerical_column_stats.NumericStatsMixin,
```
Thinking on this: `numerical_column_stats.NumericStatsMixin` would be included with any profile. Do we need it here specifically? Does that cause an issue? Not saying it is a problem, just considering the value of it here.
This is needed only because I have a test case that serializes a NumericStatsMixin (`test_numeric_stats_mixin_profile.TestNumericStatsMixin.test_json_encode`). Otherwise, it's not really necessary.
I might suggest splitting this into two PRs: that way there is one intent per PR, and it's easier to avoid context switching within the PR itself.
```python
@@ -0,0 +1,65 @@
"""Contains methods to decode components of a Profiler."""
```
Edit: nvm, I think I could be overthinking this; the current API is fine.

However, I'm wondering how we will handle the higher-level profile cases, e.g. `BaseCompiler`, which contains a list of profiles, or `StructuredColProfiler`. We might need to handle that in the for loop that sets the attributes, by looking at the values. I think we might want to consider a slightly separate API that implements the functionality in the profiles themselves, e.g.:
```python
import dataprofiler as dp

# presuming I have this file already saved as JSON
filepath = <LOCATION_OF_DATA>

profiler = dp.Profiler.load(filepath, type="json")
# or
profiler = dp.Profiler.load_json(filepath)
```
Within each profile we can handle the profiler load checks; for categorical it could be something like:
```python
class Cat...Profiler(...):
    .
    .
    .

    @classmethod
    def load_json_file(cls, filepath):
        ...

    @classmethod
    def load_json(cls, json_data):
        if json_data.get("class", None) != cls.__name__:
            raise ValueError(...)
        profile = cls(...)
        ...
        return profile
```
Moreover, ^^ might be able to just exist in `BaseColumnProfiler` if it is abstract enough. The `...` part in the load might be able to be private functions in each class that the base calls, e.g.:

```python
profile = cls(...)
profile._set_properties(json_data.get("data"))
```

Not keen on the `_set_properties` naming, but it would have to exist in each class loading data.
Or, if the code generalizes well enough, we don't need `_set_properties` at all and can just use:

```python
for attr, value in json_data["data"].items():
    setattr(column_profiler, attr, value)
```
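Putting the pieces above together, a rough standalone sketch of what that could look like (the class and method names here are illustrative stand-ins only, and the `{"class": ..., "data": ...}` layout is taken from this PR's encoder changes):

```python
import json


class BaseColumnProfiler:
    """Simplified stand-in for the real base class, for illustration only."""

    def __init__(self, name=None):
        self.name = name

    @classmethod
    def load_json(cls, json_data):
        # Reject payloads that were serialized from a different profiler class.
        if json_data.get("class") != cls.__name__:
            raise ValueError(
                f"Expected {cls.__name__}, got {json_data.get('class')}"
            )
        profile = cls(None)
        # Restore attributes directly from the serialized "data" mapping.
        for attr, value in json_data.get("data", {}).items():
            setattr(profile, attr, value)
        return profile

    @classmethod
    def load_json_file(cls, filepath):
        with open(filepath) as fp:
            return cls.load_json(json.load(fp))


class CategoricalColumn(BaseColumnProfiler):
    """Example subclass: inherits the generic load behavior unchanged."""


profile = CategoricalColumn.load_json(
    {"class": "CategoricalColumn", "data": {"name": "cat_col", "sample_size": 4}}
)
```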
I was thinking the same thing. Something like a separate `StructuredColProfiler` decode function that calls my `decode_column_profiler` whenever it sees a column profiler in its attributes.
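As a rough sketch of that delegation (all function names and module paths here are assumptions pieced together from this thread, not settled API; the decoder is assumed to accept a serialized string, per the PR description below):

```python
import json

# Assumed locations: StructuredColProfiler lives in profile_builder in the repo,
# while the json_decoder module and decode_column_profiler name are placeholders
# taken from this discussion.
from dataprofiler.profilers.profile_builder import StructuredColProfiler
from dataprofiler.profilers.json_decoder import decode_column_profiler


def decode_structured_col_profiler(serialized: str) -> StructuredColProfiler:
    """Rebuild a StructuredColProfiler, delegating nested column profilers."""
    json_data = json.loads(serialized)
    profiler = StructuredColProfiler(None)
    for attr, value in json_data.get("data", {}).items():
        # A nested {"class": ..., "data": ...} dict is itself a serialized
        # column profiler, so decode it recursively instead of assigning raw data.
        if isinstance(value, dict) and "class" in value and "data" in value:
            value = decode_column_profiler(json.dumps(value))
        setattr(profiler, attr, value)
    return profiler
```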
```python
    def test_json_decode_after_update(self):
        # Actual deserialization
        serialized = json.dumps(
```
By not calculating `serialized` from `df_categorical`, I'm concerned they could become disconnected. It might be better to do this more like the other saves:

```python
serialized = json.dumps(profile)
```
then we shouldn't have to reset naming as well.
Do you mean something like:

```python
expected_profile = CategoricalColumn(df_categorical.name)
expected_profile.update(df_categorical)

serialized = json.dumps(expected_profile)
deserialized = decode_column_profile(serialized)

test_utils.assert_profiles_equal(deserialized, expected_profile)
```
yes, I think this is a better methodology
yes
"c", | ||
] | ||
) | ||
profile = CategoricalColumn(df_categorical.name) |
Instead of naming it by the df, let's make something up that we know is unique, to ensure it calculates it:

```python
profile = CategoricalColumn("Fake Profile Name")
```

However, your current test covers the case of `None`, which is a good edge case.
```python
    :type class_name: str representing name of class
    :return: subclass of BaseColumnProfiler object
    """
    if class_name == CategoricalColumn.__name__:
```
I see, we are being declarative here to allow any class to load in the future
It also avoids circular imports, since it's not in the base.
I think to be optimal for future code we can use a dict:

```python
profiles = {
    CategoricalColumn.__name__: CategoricalColumn,
}

if class_name not in profiles:
    raise ValueError(...)
return profiles.get(class_name)(None)
```
Sounds good, I'll make this change
```python
import numbers

import numpy as np

# Imports needed by this helper (module paths may differ in tests/utils.py).
from dataprofiler.profilers.base_column_profilers import BaseColumnProfiler
from dataprofiler.profilers.profile_builder import BaseProfiler


def assert_profiles_equal(profile1, profile2):
    """
    Checks if two profile objects are equal.

    Profiles are instances of BaseProfiler or BaseColumnProfiler. Throws an
    exception if they are not equal.

    :param profile1: profile to compare to profile2
    :type profile1: instance of BaseProfiler or BaseColumnProfiler
    :param profile2: profile to compare to profile1
    :type profile2: instance of BaseProfiler or BaseColumnProfiler
    """
    profile1_dict = profile1.__dict__
    profile2_dict = profile2.__dict__

    if len(profile1_dict) != len(profile2_dict):
        raise ValueError(
            f"number of attributes on profile1 ({len(profile1_dict)}) "
            f"!= profile2 ({len(profile2_dict)})"
        )

    for attr1, value1 in profile1_dict.items():
        if attr1 not in profile2_dict:
            raise ValueError(f"Profile attributes unmatched {attr1}")

        value2 = profile2_dict[attr1]
        if not (isinstance(value2, type(value1)) or isinstance(value1, type(value2))):
            raise ValueError(f"Profile value types unmatched: {value1} != {value2}")

        if isinstance(value1, (BaseProfiler, BaseColumnProfiler)):
            assert_profiles_equal(value1, value2)
        elif isinstance(value1, numbers.Number):
            np.testing.assert_equal(value1, value2)
        elif value1 != value2:
            raise ValueError(f"Profile values unmatched: {value1} != {value2}")
```
Assuming other types of profiles have nesting in the profile "data", this appears to only be built to work with the expected `CategoricalColumn` unit tests that you wrote... not bad, but this would need to be developed further to account for nesting, potentially.
For example, when we go to test the "top level" profile encode/decode process.
I think this is fine to iterate on when we need to approach that in the future.
```python
    else:
        return profile_class(None)
```
Technically there's no need for the `else`, as it will error otherwise:

```python
if profile_class is None:
    raise ValueError(f"Invalid profiler class {class_name} failed to load.")
return profile_class(None)
```
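Combining the registry suggestion above with this check, a hypothetical helper might look like the following sketch (the function name and import path are assumptions for illustration, not the final code):

```python
from dataprofiler.profilers.categorical_column_profile import CategoricalColumn


def get_column_profiler_class(class_name: str):
    """Look up a serialized class name and return a blank profiler instance."""
    # Registry of decodable profilers; supporting a new class is a one-line change.
    profiles = {
        CategoricalColumn.__name__: CategoricalColumn,
    }

    profile_class = profiles.get(class_name)
    if profile_class is None:
        raise ValueError(f"Invalid profiler class {class_name} failed to load.")
    # Column profilers take a name at construction; None lets decode fill it in.
    return profile_class(None)
```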
This PR depends on #789.

Since this is a big PR, I list the changes I made below:

1. Update `ProfileEncoder` (in `json_encode`) to serialize with the class name. This makes it easier to decode, e.g. the serialized output takes the form `{"class": <class name>, "data": {...}}`.
2. Create a decode function (in `json_decode`) to handle deserialization. This first calls `json.loads` on the serialized string and then creates an object by going through the attributes in the resulting `dict`. This currently only works for `CategoricalColumn` (see the round-trip sketch after this list for an illustration).
3. Update all encode tests to recognize the "class", "data" structure from step 1.
4. Add a `json_decode` test for `CategoricalColumn`. Create an `assert_profiles_equal` method in `tests/utils.py` to compare two profiles by class attributes.

Questions:

- Is `assert_profiles_equal` the best way to compare profiles? Should this method be tested on its own?
- Deserialization turns `defaultdict` objects into plain `dict` objects. Will this cause any bugs?
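For illustration, a minimal round-trip along the lines described above might look like the following sketch. The encoder usage via `cls=ProfileEncoder`, the `decode_column_profile` entry point, and the module paths are assumptions pieced together from this thread, not the confirmed final API:

```python
import json

import pandas as pd

from dataprofiler.profilers.categorical_column_profile import CategoricalColumn
# Hypothetical module paths and names, inferred from this PR discussion.
from dataprofiler.profilers.json_encoder import ProfileEncoder
from dataprofiler.profilers.json_decoder import decode_column_profile

# Build a profile from a small categorical sample.
df_categorical = pd.Series(["a", "b", "c", "c"], name="categorical_col")
expected_profile = CategoricalColumn(df_categorical.name)
expected_profile.update(df_categorical)

# Encode: the payload carries the class name alongside the attribute data,
# i.e. {"class": "CategoricalColumn", "data": {...}}.
serialized = json.dumps(expected_profile, cls=ProfileEncoder)

# Decode: json.loads the string, look up the class, and restore attributes.
deserialized = decode_column_profile(serialized)
```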