Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decode categorical #786

Merged
merged 15 commits into from
Apr 27, 2023
Merged

Conversation

kshitijavis
Copy link
Contributor

@kshitijavis kshitijavis commented Apr 19, 2023

This PR depends on #789

Since this is a big PR, I list the changes I made below

  1. Update ProfileEncoder (in json_encode) to serialize with the class name. This makes it easier to decode. E.g:
{
  "class": <str name of class that was serialized>
  "data": {
      <attr1>: <value1>
      <attr2>: <value2>
      ...
  }
}
  1. Create decoder function (in json_decode) to handle deserialization. This first calls json.loads on the serialized string and then creates an object by going through the attributes in the resulting dict. This currently only works for CategoricalColumn
  2. Update all encode tests to recognize the "class", "data" structure from step 1
  3. Write json_decode test for CateogoricalColumn. Create assert_profiles_equal method in tests/utils.py to compare two profiles by class attributes

Questions

  1. Is the assert_profiles_equal the best way to compare profiles? Should this method be tested on its own
  2. The serialization/deserialization process converts defaultdict objects into plain dict objects. Will this cause any bugs?

@@ -26,7 +26,7 @@ def default(self, to_serialize):
numerical_column_stats.NumericStatsMixin,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking on this: numerical_column_stats.NumericStatsMixin would be included with any profile. Do we need it here specifically? does that cause issue?

Not saying it is a problem, just considering the value of it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed only because I have a test case that serializes a NumericStatsMin test_numeric_stats_mixin_profile.TestNumericStatsMixin.test_json_encode. Otherwise, it's not really necessary

@JGSweets
Copy link
Contributor

I might suggest splitting this into two PRs:

  1. The update to encode.
  2. The new decode.

That way 1 intent per PR and easier to not need to context switch within the PR itself.

@@ -0,0 +1,65 @@
"""Contains methods to decode components of a Profiler."""
Copy link
Contributor

@JGSweets JGSweets Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

edit: nvm, I think I could be over thinking this. current API is fine.

However, I'm wondering how we will handle the higher level profile cases, e.g. BaseCompiler which contains a list of profiles. Or StructuredColProfiler. We might need to handle that in the for loop setting attr by looking at the values.

I think we might want to consider a slightly separate API that implements functionality in the profiles themselves.

e.g.

import dataprofiler as dp


# presuming i have this file already saved as JSON.
filepath = <LOCATION_OF_DATA>
profiler = dp.Profiler.load(filepath, type="json")

# or 
profiler = dp.Profiler.load_json(filepath)

Within each profile we can handle the profiler load checks,
for categorical it could be something like:

class Cat...Profiler(...):
    .
    .
    .
    @classmethod
    def load_json_file(cls, filepath):
        ...

    @classmethod
    def load_json(cls, json_data):
        if json_data.get("class", None) is not cls.__name__:
            raise ValueError(...)
        profile = cls(...)
        ...
        return profile

moreover, ^^ might be able to just exist in BaseColumnProfiler if it is abstract enough.
the ... part in the load might be able to be private functions in each class the base calls e.g.

   profile = cls(...)
   profile._set_properties(json_data.get("data"))

Not keen on _set_properties naming, but it would have to exist in each class loading data.~~

Copy link
Contributor

@JGSweets JGSweets Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or if the code generalizes well enough, we don't need set_properties and just use:

for attr, value in json_data["data"].items():
    column_profiler.__setattr__(attr, value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking the same thing. Something like a separate StructuredColProfile decode function that calls my decode_column_profiler whenever it sees a Column Profiler in its attributes.


def test_json_decode_after_update(self):
# Actual deserialization
serialized = json.dumps(
Copy link
Contributor

@JGSweets JGSweets Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by not calc'ing serialized from df_categorical, i'm concerned they could become disconnected.

it might be better todo this more similar to the other saves like:

serialized  = json.dumps(profile)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we shouldn't have to reset naming as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean something like

expected_profile = CategoricalColumn(df_categorical.name)
expected_profile.update(df_categorical)

serialized = json.dumps(expected_profile)
deserialized = decode_column_profile(serialized)

test_utils.assert_profiles_equal(deserialized, expected_profile)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think this is a better methodology

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

"c",
]
)
profile = CategoricalColumn(df_categorical.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of naming it by the df, let's make something up that we know is unique to ensure it calc's it.

profile = CategoricalColumn("Fake Profile Name")

however, your current tests the case of None which is a good edge case

:type class_name: str representing name of class
:return: subclass of BaseColumnProfiler object
"""
if class_name == CategoricalColumn.__name__:
Copy link
Contributor

@JGSweets JGSweets Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we are being declarative here to allow any class to load in the future

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also avoids circular imports since not in the base.

Copy link
Contributor

@JGSweets JGSweets Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to be optimum for future code we can use a dict:

profiles = {
    CategoricalColumn.__name__: CategoricalColumn
}
if class_name not in profiles:
    raise ValueError(...)
return profiles.get(class_name)(None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll make this change

@taylorfturner taylorfturner enabled auto-merge (squash) April 20, 2023 13:18
@kshitijavis kshitijavis changed the title Decode categorical (WIP) Decode categorical Apr 21, 2023
auto-merge was automatically disabled April 21, 2023 18:40

Head branch was pushed to by a user without write access

@kshitijavis kshitijavis mentioned this pull request Apr 21, 2023
@taylorfturner taylorfturner enabled auto-merge (squash) April 26, 2023 18:25
@kshitijavis kshitijavis changed the title (WIP) Decode categorical Decode categorical Apr 26, 2023
@taylorfturner taylorfturner assigned kshitijavis and unassigned ksneab7 Apr 26, 2023
@taylorfturner taylorfturner added the New Feature A feature addition not currently in the library label Apr 26, 2023

def test_json_decode_after_update(self):
# Actual deserialization
serialized = json.dumps(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I think this is a better methodology

Comment on lines +169 to +202
def assert_profiles_equal(profile1, profile2):
"""
Checks if two profile objects are equal.

profiles are instances of BaseProfiler or BaseColumnProfiler. Throws
exception if not equal

:param profile_1: profile to compare to profile2
:type profile_1: instance of BaseProfiler or BaseColumnProfiler
:param profile_2: profile to compare to profile1
:type profile_2: instance of BaseProfiler or BaseColumnProfiler
"""
profile1_dict = profile1.__dict__
profile2_dict = profile2.__dict__

if len(profile1_dict) != len(profile2_dict):
raise ValueError(
f"number of attributes on profile1 ({len(profile1_dict)}) != profile2 ({len(profile2_dict)})"
)

for attr1, value1 in profile1_dict.items():
if attr1 not in profile2_dict:
raise ValueError(f"Profile attributes unmatched {attr1}")

value2 = profile2_dict[attr1]
if not (isinstance(value2, type(value1)) or isinstance(value1, type(value2))):
raise ValueError(f"Profile value types unmatched: {value1} != {value2}")

if isinstance(value1, (BaseProfiler, BaseColumnProfiler)):
assert_profiles_equal(value1, value2)
elif isinstance(value1, numbers.Number):
np.testing.assert_equal(value1, value2)
elif value1 != value2:
raise ValueError(f"Profile values unmatched: {value1} != {value2}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming other types of profiles have nesting in the profile "data", this appears to only be built to work with the expected CategoricalColumn unit tests that you wrote... not bad but would need to be further develop this to account for nesting potentially

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for example, when we go to test the "top level" profile encode / decode process

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine to iterate on when we need to approach that in the future.

Comment on lines +26 to +27
else:
return profile_class(None)
Copy link
Contributor

@JGSweets JGSweets Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically no need for the else as it will error otherwise.

if profile_class is None:
    raise ValueError(f"Invalid profiler class {class_name} " f"failed to load.")
return profile_class(None)

@taylorfturner taylorfturner merged commit f206af2 into capitalone:main Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
New Feature A feature addition not currently in the library
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants