Decode categorical #786

kshitijavis · 2023-04-19T19:57:47Z

This PR depends on #789

Since this is a big PR, I list the changes I made below

~~Update ProfileEncoder (in json_encode) to serialize with the class name. This makes it easier to decode. E.g:~~

{
  "class": <str name of class that was serialized>
  "data": {
      <attr1>: <value1>
      <attr2>: <value2>
      ...
  }
}

Create decoder function (in json_decode) to handle deserialization. This first calls json.loads on the serialized string and then creates an object by going through the attributes in the resulting dict. This currently only works for CategoricalColumn
~~Update all encode tests to recognize the "class", "data" structure from step 1~~
Write json_decode test for CateogoricalColumn. Create assert_profiles_equal method in tests/utils.py to compare two profiles by class attributes

Questions

Is the assert_profiles_equal the best way to compare profiles? Should this method be tested on its own
The serialization/deserialization process converts defaultdict objects into plain dict objects. Will this cause any bugs?

JGSweets · 2023-04-19T22:40:27Z

dataprofiler/profilers/json_encoder.py

@@ -26,7 +26,7 @@ def default(self, to_serialize):
                numerical_column_stats.NumericStatsMixin,


Thinking on this: numerical_column_stats.NumericStatsMixin would be included with any profile. Do we need it here specifically? does that cause issue?

Not saying it is a problem, just considering the value of it here.

This is needed only because I have a test case that serializes a NumericStatsMin test_numeric_stats_mixin_profile.TestNumericStatsMixin.test_json_encode. Otherwise, it's not really necessary

JGSweets · 2023-04-19T22:42:15Z

I might suggest splitting this into two PRs:

The update to encode.
The new decode.

That way 1 intent per PR and easier to not need to context switch within the PR itself.

JGSweets · 2023-04-19T22:59:12Z

dataprofiler/profilers/json_decoder.py

@@ -0,0 +1,65 @@
+"""Contains methods to decode components of a Profiler."""


edit: nvm, I think I could be over thinking this. current API is fine.

However, I'm wondering how we will handle the higher level profile cases, e.g. BaseCompiler which contains a list of profiles. Or StructuredColProfiler. We might need to handle that in the for loop setting attr by looking at the values.

~~I think we might want to consider a slightly separate API that implements functionality in the profiles themselves.~~

~~e.g.~~

import dataprofiler as dp # presuming i have this file already saved as JSON. filepath = <LOCATION_OF_DATA> profiler = dp.Profiler.load(filepath, type="json") # or profiler = dp.Profiler.load_json(filepath)

Within each profile we can handle the profiler load checks,
for categorical it could be something like:

class Cat...Profiler(...): . . . @classmethod def load_json_file(cls, filepath): ... @classmethod def load_json(cls, json_data): if json_data.get("class", None) is not cls.__name__: raise ValueError(...) profile = cls(...) ... return profile

moreover, ^^ might be able to just exist in BaseColumnProfiler if it is abstract enough.
the ... part in the load might be able to be private functions in each class the base calls e.g.

profile = cls(...) profile._set_properties(json_data.get("data"))

~~Not keen on _set_properties naming, but it would have to exist in each class loading data.~~~~

or if the code generalizes well enough, we don't need set_properties and just use:

for attr, value in json_data["data"].items(): column_profiler.__setattr__(attr, value)

I was thinking the same thing. Something like a separate StructuredColProfile decode function that calls my decode_column_profiler whenever it sees a Column Profiler in its attributes.

JGSweets · 2023-04-19T23:05:18Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+
+    def test_json_decode_after_update(self):
+        # Actual deserialization
+        serialized = json.dumps(


by not calc'ing serialized from df_categorical, i'm concerned they could become disconnected.

it might be better todo this more similar to the other saves like:

serialized = json.dumps(profile)

then we shouldn't have to reset naming as well.

Do you mean something like

expected_profile = CategoricalColumn(df_categorical.name) expected_profile.update(df_categorical) serialized = json.dumps(expected_profile) deserialized = decode_column_profile(serialized) test_utils.assert_profiles_equal(deserialized, expected_profile)

yes, I think this is a better methodology

JGSweets · 2023-04-19T23:07:24Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+                "c",
+            ]
+        )
+        profile = CategoricalColumn(df_categorical.name)


Instead of naming it by the df, let's make something up that we know is unique to ensure it calc's it.

profile = CategoricalColumn("Fake Profile Name")

however, your current tests the case of None which is a good edge case

JGSweets · 2023-04-19T23:09:59Z

dataprofiler/profilers/json_decoder.py

+    :type class_name: str representing name of class
+    :return: subclass of BaseColumnProfiler object
+    """
+    if class_name == CategoricalColumn.__name__:


I see, we are being declarative here to allow any class to load in the future

also avoids circular imports since not in the base.

I think to be optimum for future code we can use a dict:

profiles = { CategoricalColumn.__name__: CategoricalColumn } if class_name not in profiles: raise ValueError(...) return profiles.get(class_name)(None)

Sounds good, I'll make this change

taylorfturner · 2023-04-26T20:05:40Z

dataprofiler/tests/profilers/test_categorical_column_profile.py

+
+    def test_json_decode_after_update(self):
+        # Actual deserialization
+        serialized = json.dumps(


yes, I think this is a better methodology

taylorfturner · 2023-04-26T20:14:50Z

dataprofiler/tests/profilers/utils.py

+def assert_profiles_equal(profile1, profile2):
+    """
+    Checks if two profile objects are equal.
+
+    profiles are instances of BaseProfiler or BaseColumnProfiler. Throws
+        exception if not equal
+
+    :param profile_1: profile to compare to profile2
+    :type profile_1: instance of BaseProfiler or BaseColumnProfiler
+    :param profile_2: profile to compare to profile1
+    :type profile_2: instance of BaseProfiler or BaseColumnProfiler
+    """
+    profile1_dict = profile1.__dict__
+    profile2_dict = profile2.__dict__
+
+    if len(profile1_dict) != len(profile2_dict):
+        raise ValueError(
+            f"number of attributes on profile1 ({len(profile1_dict)}) != profile2 ({len(profile2_dict)})"
+        )
+
+    for attr1, value1 in profile1_dict.items():
+        if attr1 not in profile2_dict:
+            raise ValueError(f"Profile attributes unmatched {attr1}")
+
+        value2 = profile2_dict[attr1]
+        if not (isinstance(value2, type(value1)) or isinstance(value1, type(value2))):
+            raise ValueError(f"Profile value types unmatched: {value1} != {value2}")
+
+        if isinstance(value1, (BaseProfiler, BaseColumnProfiler)):
+            assert_profiles_equal(value1, value2)
+        elif isinstance(value1, numbers.Number):
+            np.testing.assert_equal(value1, value2)
+        elif value1 != value2:
+            raise ValueError(f"Profile values unmatched: {value1} != {value2}")


assuming other types of profiles have nesting in the profile "data", this appears to only be built to work with the expected CategoricalColumn unit tests that you wrote... not bad but would need to be further develop this to account for nesting potentially

for example, when we go to test the "top level" profile encode / decode process

I think this is fine to iterate on when we need to approach that in the future.

JGSweets · 2023-04-27T16:38:28Z

dataprofiler/profilers/json_decoder.py

+    else:
+        return profile_class(None)


technically no need for the else as it will error otherwise.

if profile_class is None: raise ValueError(f"Invalid profiler class {class_name} " f"failed to load.") return profile_class(None)

kshitijavis requested review from JGSweets, ksneab7, taylorfturner, micdavis and tyfarnan as code owners April 19, 2023 19:57

JGSweets reviewed Apr 19, 2023

View reviewed changes

taylorfturner enabled auto-merge (squash) April 20, 2023 13:18

kshitijavis changed the title ~~Decode categorical~~ (WIP) Decode categorical Apr 21, 2023

auto-merge was automatically disabled April 21, 2023 18:40
Head branch was pushed to by a user without write access

kshitijavis force-pushed the decode-categorical branch from 3e25839 to 51916df Compare April 21, 2023 18:40

kshitijavis mentioned this pull request Apr 21, 2023

Encode update format #789

Merged

taylorfturner assigned ksneab7 Apr 25, 2023

kshitijavis added 13 commits April 26, 2023 12:47

Create method to serialize NumerStatsMixin and functions

744bacb

Remove maxDiff

5ab3584

Add CategoricalColumn decoder skeleton

014a70e

Create method to get col profiler class by classname

156720e

Update load_column_profiler function

cbcaffd

Add decode_column_profiler function

8de5e33

Add incomplete json_decode test

8a5585e

test plain CategoricalColumn json_decode

213f85e

Test encode CategoricalColumn after update

0c31dc1

Improve documentation

686e95c

Revert encoder changes

b746f92

Move profiler class matching to dictionary

67dc699

Fix test naming and clearer profile names

ba6dbbb

Re-add encode update format

07b1f8c

kshitijavis force-pushed the decode-categorical branch from 46bcdc8 to 07b1f8c Compare April 26, 2023 18:23

taylorfturner enabled auto-merge (squash) April 26, 2023 18:25

kshitijavis changed the title ~~(WIP) Decode categorical~~ Decode categorical Apr 26, 2023

taylorfturner assigned kshitijavis and unassigned ksneab7 Apr 26, 2023

taylorfturner added the New Feature A feature addition not currently in the library label Apr 26, 2023

taylorfturner suggested changes Apr 26, 2023

View reviewed changes

taylorfturner reviewed Apr 26, 2023

View reviewed changes

JGSweets reviewed Apr 27, 2023

View reviewed changes

JGSweets approved these changes Apr 27, 2023

View reviewed changes

taylorfturner approved these changes Apr 27, 2023

View reviewed changes

Merge branch 'main' into decode-categorical

67a9462

taylorfturner merged commit f206af2 into capitalone:main Apr 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decode categorical #786

Decode categorical #786

kshitijavis commented Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023

kshitijavis Apr 21, 2023

JGSweets commented Apr 19, 2023

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

kshitijavis Apr 21, 2023

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023

kshitijavis Apr 21, 2023

taylorfturner Apr 26, 2023

JGSweets Apr 27, 2023

JGSweets Apr 19, 2023

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023

JGSweets Apr 19, 2023 •

edited

Loading

kshitijavis Apr 21, 2023

taylorfturner Apr 26, 2023

taylorfturner Apr 26, 2023

taylorfturner Apr 26, 2023

JGSweets Apr 27, 2023

JGSweets Apr 27, 2023 •

edited

Loading

		@@ -26,7 +26,7 @@ def default(self, to_serialize):
		numerical_column_stats.NumericStatsMixin,

		@@ -0,0 +1,65 @@
		"""Contains methods to decode components of a Profiler."""

Decode categorical #786

Decode categorical #786

Conversation

kshitijavis commented Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGSweets commented Apr 19, 2023

JGSweets Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

JGSweets Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGSweets Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGSweets Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGSweets Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JGSweets Apr 27, 2023 • edited Loading

Choose a reason for hiding this comment

kshitijavis commented Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 19, 2023 •

edited

Loading

JGSweets Apr 27, 2023 •

edited

Loading