ColumnDataLabelerCompiler: serialize / deserialize #888
@@ -87,77 +87,6 @@ def test_add_profilers(self):
        self.assertEqual(3, merged_compiler._profiles["test"])
        self.assertEqual("compiler1", merged_compiler.name)

    @mock.patch("dataprofiler.profilers.data_labeler_column_profile.DataLabeler")
moving to its own `TestColumnDataLabelerCompiler` test class
nice
@@ -532,7 +461,7 @@ def test_json_decode_after_update(self):


 class TestColumnStatsProfileCompiler(unittest.TestCase):
-    def test_primitive_compiler_report(self):
+    def test_column_stats_profile_compiler_report(self):
fixing naming issues from PR #887
nice
@@ -553,7 +482,7 @@ def test_primitive_compiler_report(self):
         self.assertIn("categorical", report)
         self.assertNotIn("order", report)

-    def test_compiler_stats_diff(self):
+    def test_column_stats_profile_compiler_stats_diff(self):
nice
45cb9a5 to 1b79b06
        compiler1 = col_pro_compilers.ColumnDataLabelerCompiler(
            data1, structured_options
        )
We should maybe use mocks on the class; I assume these tests don't specifically need to run a labeler.
added mocks to class
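For readers unfamiliar with class-level mocks, here is a minimal sketch of the idea using only `unittest.mock` from the standard library. `FakeLabeler`, `FakeCompiler`, and `run_tests` are hypothetical stand-ins invented for illustration; they are not the DataProfiler classes or the code added in this PR.

```python
import unittest
from unittest import mock


class FakeLabeler:
    """Hypothetical stand-in for an expensive labeling model."""

    def predict(self, values):
        raise RuntimeError("the real model should not run in a unit test")


class FakeCompiler:
    """Hypothetical stand-in for a compiler that asks a labeler for labels."""

    def __init__(self, values):
        self.labels = FakeLabeler().predict(values)


# Decorating the TestCase class patches every method whose name starts with
# "test" and passes the mock to each one as an extra argument.
@mock.patch.object(FakeLabeler, "predict", return_value=["a", "b"])
class TestFakeCompiler(unittest.TestCase):
    def test_labels_come_from_mock(self, mock_predict):
        compiler = FakeCompiler(["1", "2"])
        self.assertEqual(["a", "b"], compiler.labels)
        mock_predict.assert_called_once_with(["1", "2"])

    def test_each_test_gets_a_fresh_mock(self, mock_predict):
        # The patch is re-applied per test, so no calls leak between tests.
        mock_predict.assert_not_called()


def run_tests():
    """Run the sketch's tests and return the unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFakeCompiler)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Patching at class level keeps the per-test bodies free of nested `with mock.patch(...)` blocks, which is presumably the motivation for the suggestion above.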
        with mock.patch.object(
            compiler._profiles["order"], "__dict__", {"an": "order"}
        ):
            with mock.patch.object(
                compiler._profiles["category"], "__dict__", {"this": "category"}
            ):
                serialized = json.dumps(compiler, cls=ProfileEncoder)
Rework for our case.
yep
updated
        assert deserialized.report().get("order", None) == "ascending"
        assert deserialized.report().get("categorical", None) == True
rework for our case
yep
updated
        df_float = pd.Series(
            list(range(100))  # make order random and not categorical
        ).apply(str)
maybe a data case more specific to what we want to check in data labeling
working this now
added
        test_utils.assert_profiles_equal(expected_compiler, deserialized)

    def test_json_decode_after_update(self):
some mocks in the DataLabelerColumn tests might be helpful for this wrt mocking out a data labeler and prediction capabilities to use "real" results
yep
0a69181 to 833d582
…ialize_data_labeler_compiler
        new_data = pd.Series(list(range(100))).apply(str)

        # validating update after deserialization with a few small tests
        deserialized.update_profile(new_data)
        assert deserialized.report().get("data_label", None) == "a|b"
        assert (
            sum(
                [
                    v
                    for k, v in deserialized.report()
                    .get("statistics", None)
                    .get("avg_predictions", None)
                    .items()
                ]
            )
            == 1.0
        )
I think we need to fix this part of the test. The problem is that `update_profile` could be doing absolutely nothing when it should have an impact. Right now, the test doesn't validate any difference after the update itself, which means it could pass when it should fail.
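A minimal sketch of the kind of check being asked for: capture the report before the update, apply a small update, and assert that the report actually changed. `TinyProfile` and `check_update_after_round_trip` are hypothetical stand-ins written for illustration, not the DataProfiler compiler or its encoder.

```python
import json


class TinyProfile:
    """Hypothetical stand-in for a profile that tracks label frequencies."""

    def __init__(self):
        self.sample_size = 0
        self.label_counts = {}

    def update_profile(self, labels):
        for label in labels:
            self.label_counts[label] = self.label_counts.get(label, 0) + 1
        self.sample_size += len(labels)

    def report(self):
        total = self.sample_size or 1  # avoid dividing by zero when empty
        return {
            "sample_size": self.sample_size,
            "avg_predictions": {
                label: count / total
                for label, count in self.label_counts.items()
            },
        }

    def to_json(self):
        return json.dumps(self.__dict__)

    @classmethod
    def from_json(cls, serialized):
        profile = cls()
        profile.__dict__.update(json.loads(serialized))
        return profile


def check_update_after_round_trip():
    original = TinyProfile()
    original.update_profile(["a", "a", "b"])
    deserialized = TinyProfile.from_json(original.to_json())

    before = deserialized.report()
    deserialized.update_profile(["b"])  # one new value is enough
    after = deserialized.report()

    # The update must have a visible effect; a no-op would fail here.
    assert after != before
    assert after["sample_size"] == before["sample_size"] + 1

    # The deserialized object should behave like the original one.
    original.update_profile(["b"])
    assert after == original.report()
    return before, after
```

An assertion such as "the averages sum to 1.0" holds both before and after the update, so on its own it cannot detect an `update_profile` that silently does nothing; comparing the two reports can.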
            == 1.0
        )

        new_data = pd.Series(list(range(100))).apply(str)
no need to do a huge update, just add 1 value
* initial changes to categoricalColumn decoder (#818) * Implemented decoding for numerical stats mixin and integer profiles (#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852) * Float column profiler encode decode (#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868) * added decode text_column_profiler functionality and tests (#870) * Created encoder for the datalabelercolumn (#869) * feat: add test and compiler serialization (#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats 
deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (#879) * Deserialization of datalabeler (#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. * Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (#893) * update * string in list * formatting * Decode options (#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 
<ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * 
small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (#915) * Reuse passed labeler (#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for consistency (#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (#925) * quick hot fix for input validation on save() save_metho (#931) * BaseProfiler: `load_method` hotfix (#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (#933) * Notebook Example save/load Profile (#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>
* feat: add dev to workfow for testing (#897) * Reservoir sampling (#826) * add code for reservoir sampling and insert sample_nrows options * pre commit fix * add tests for reservoir sampling * fixed mypy issues * fix import to relative path --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> * [WIP] staging/dev/options (#909) * New preset implementation and test (#867) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * Wrote catch for invalid presets, wrote test for catch for invalid presets, debugged new optimization preset * Forgot to run pre-commit, fixed those issues * black doing weird things * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * RowStatisticsOptions: Add option (#865) * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * RowStatisticsOptions: Add null row count Added null_row_count as an option in RowStatisticsOptions. It toggles the functionality for row_has_null_ratio and row_is_null_ratio in _update_row_statistics. * Unit test for RowStatisticOptions: * Black formatting * added a unit test for RowStatisticsOptions * Deleted test cases that were written in the wrong file * updated testing for null_count toggle in _update_row_statistics * removed the RowStatisticsOptions from test_profiler_options imports * add line * Created toggle option for null_count * RowStatisticsOptions: Add implementation * Revert "RowStatisticsOptions: Add implementation" This reverts commit 2da6a93. 
* RowStatsticsOptions: Create option * fixed pre-commit error * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * fixed documentation --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Preset test updated w new names and different toggles (#880) * memory optimization preset ttrying again ttrying again 3 ttrying again 4 accidentally pushed my updated makefile * trying * trying * black doing weird things * trying * made preset validation more maintainable by moving it to the constructor and getting rid of preset list * Update to open-source in prep for wrapper changes for mem op preset * updated preset toggles and preset name (mem op -> large data) * updated tests to match * continued name and test and toggle updates * fix comments * RowStatisticsOptions: Implementing option (#871) * Implementing option * Implementing option * took out redundant if statement. added test case for when null_count is disabled. * attempt to check for conflicts between profile merges * added test to check if two profilers have null_count enabled before merging them together * fixed typo and added a trycatch to prevent failing test * No mocks needed. Fixed assertRaisesRegex error * Changed variables names and added a new test to check for check the null_count when null_count is disabled. * Changed name of test, moved tests to TestStructuredProfilerRowStatistics. Fixed position of if statement to prevent unnecessary code from running. 
* added null_count test cases * fixed indentation mistake * fixed typo * removed a useless commented a line * Updated test name * update --------- Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> * Cms for categorical (#892) * WIP cms implementation * add heavy hitters implementation * add heavy hitters implementation * WIP: mypy issue * WIP: mypy issue * add cms bool and refactor options handler * WIP: testing for CMS * WIP: testing for CMS * use new heavy_hitters_threshold, add test for it * Reservoir sampling refactor (#910) * refactored all but tests * removed some superfluous tests * moved variables around * Staging/dev/profile serialization (#940) * initial changes to categoricalColumn decoder (#818) * Implemented decoding for numerical stats mixin and integer profiles (#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (#852) * Float column profiler encode decode (#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (#868) * added decode text_column_profiler functionality 
and tests (#870) * Created encoder for the datalabelercolumn (#869) * feat: add test and compiler serialization (#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (#879) * Deserialization of datalabeler (#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. 
* Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (#893) * update * string in list * formatting * Decode options (#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured 
profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (#915) * Reuse passed labeler (#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for consistency (#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (#925) * 
quick hot fix for input validation on save() save_metho (#931) * BaseProfiler: `load_method` hotfix (#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (#933) * Notebook Example save/load Profile (#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * Hotfix: fix post feature serialization merge (#942) * fix: to use config instead of options * fix: comment * fix: maxdiff * version bump (#944) --------- Co-authored-by: JGSweets <JGSweets@users.noreply.github.com> Co-authored-by: Rushabh Vinchhi <rushabhuvinchhi@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>
Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (capitalone#915) * Reuse passed labeler (capitalone#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (capitalone#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for 
consistency (capitalone#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (capitalone#925) * quick hot fix for input validation on save() save_metho (capitalone#931) * BaseProfiler: `load_method` hotfix (capitalone#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (capitalone#933) * Notebook Example save/load Profile (capitalone#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (capitalone#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * Hotfix: fix post feature serialization merge (capitalone#942) * fix: to use config instead of options * fix: comment * fix: maxdiff * version bump (capitalone#944) --------- Co-authored-by: JGSweets <JGSweets@users.noreply.github.com> Co-authored-by: Rushabh Vinchhi <rushabhuvinchhi@gmail.com> Co-authored-by: Richard Bann <richard@bann.com> Co-authored-by: Liz Smith <liz.smith@richmond.edu> Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>
* initial changes to categoricalColumn decoder (capitalone#818) * Implemented decoding for numerical stats mixin and integer profiles (capitalone#844) * hot fixes for encode and decode of numeric stats mixin and intcol profiler (capitalone#852) * Float column profiler encode decode (capitalone#854) * hot fixes for encode and decode of numeric stats mixin and intcol profiler * cleaned up type checking and updated numericstatsmixin readin helper to give type conversions to more attributes * Added docstring to the _load_stats_helper function * Update dataprofiler/profilers/numerical_column_stats.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/numerical_column_stats.py * fix for nan values issue in pytesting * Implementation of float profiler encode and decode process --------- Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Json decode date time column (capitalone#861) * more verbose error log with types for easy debug * add load_from_dict to handle tiimestamps * add json decode tests * include DateTimeColumn class * Added decoding for encoding of ordered column profiles (capitalone#864) * Added ordered col test to ensure correct response to update when different ordering of values is introduced (capitalone#868) * added decode text_column_profiler functionality and tests (capitalone#870) * Created encoder for the datalabelercolumn (capitalone#869) * feat: add test and compiler serialization (capitalone#884) * [WIP] Adds tests validating serialization with Primitive type for compiler (capitalone#885) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: float serializers asserts * Adds deserialization for compilers and validates tests for Primitive; fixes numerical deserialization (capitalone#886) * feat: add test and compiler serialization * fix: move primitive tests to own class * feat: add primitive col compiler save tests * fix: 
float serializers asserts * feat: add tests and allow primitive compiler to deserialize * fix: bug in numeric stats deserial * fix: missing `)` after conflict resolution * Add Serialization and Deserialization Tests for Stats Compiler, plus refactors for order Typing (capitalone#887) * fix: organize categorical and add get function * refactor: reorganize tests and add stats test * feat: order typing * feat: add serial and deserial for stats compiler * fix: bug when sample_size == 0 * ready datalabeler for deserialization and improvement on serialization for datalabeler (capitalone#879) * Deserialization of datalabeler (capitalone#891) * Added initial profiler decoding for datalabeler column (WIP) * Intialial implementation for deserialization of datalabelercolumn * Fix LSP violations (capitalone#840) * Make profiler superclasses generic Makes the superclasses BaseColumnProfiler, NumericStatsMixin, and BaseCompiler generic, to avoid casting in subclass diff() methods and violating LSP in principle. 
* Add needed cast import --------- Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> * Encode Options (capitalone#875) * encode testing * encode dataLabeler testing * encode structuredOptions testing * cleaned up datalabeler test * added text options * [WIP] ColumnDataLabelerCompiler: serialize / deserialize (capitalone#888) * formatting * update formatting * setting up full test suite for DataLabelerCompiler * update isort * updates to test -- still failing * update * Quick Test update (capitalone#893) * update * string in list * formatting * Decode options (capitalone#894) * refactored options encode testing * updated test name * updated class names * fixing test * initial base option decode * inital tests * refactor: allow options to go through all (capitalone#902) * refactor: allow options to go through all * fix: bug * StructuredColProfiler Encode / Decode (capitalone#901) * refactor: allow options to go through all * fix: bug * update * update * update * updates * update * Fixes for taylors StructuredCol Issue * update * update * remove try/except --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> * fix: bug and add tests for structuredcolprofiler (capitalone#904) * fix: bug and add tests * fix: limit scipy requirements till problem understood and fixed * Stuctured profiler encode decode (capitalone#903) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> * [WIP] Added NoImplementationError for UnstructuredProfiler (capitalone#907) * refactor: allow options to go through all * fix: bug in loading options * update * update * Fixes for taylors StructuredCol Issue * 
Created load and save code from structuredprofiler * intermidiate commit for fixing structured profile * test fix * mypy fixes for typing issues * fix for none case of the datalabler in options * Added mock of datalabeler to structured profile test * Added tests for encoding of the Structured profiler * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/profilers/profiler_options.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Pr fixes * Fixed typo in test * Update dataprofiler/profilers/json_decoder.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Update dataprofiler/tests/profilers/utils.py Co-authored-by: Taylor Turner <taylorfturner@gmail.com> * Update dataprofiler/profilers/profile_builder.py Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> * Fixes for unneeeded callout for _profile check * small change --------- Co-authored-by: Jeremy Goodsitt <jeremy.goodsitt@gmail.com> Co-authored-by: taylorfturner <taylorfturner@gmail.com> Co-authored-by: ksneab7 <ksneab7@gmail.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> * Added testing for values for test_json_decode_after_update (capitalone#915) * Reuse passed labeler (capitalone#924) * refactor: loading labeler for reuse and abstract loading * refactor: use for DataLabelerColumn as well * fix: don't error if doesn't exist * refactor: allow for config dict to be passed entire way * fix: compiler tests * fix: structCol tests * fix: test * BaseProfiler save() for json (capitalone#923) * added save for top level and tests * small refactor * small fix * refactor: use seed for sample for 
consistency (capitalone#927) * refactor: use seed for sample for consistency * fix: formatting and variables * WIP top level load (capitalone#925) * quick hot fix for input validation on save() save_metho (capitalone#931) * BaseProfiler: `load_method` hotfix (capitalone#932) * added load_method * updated tests * fix: null_rep mat should calculate even if datetime (capitalone#933) * Notebook Example save/load Profile (capitalone#930) * update example data profiler demo save/load * update notebook cells * Update examples/data_profiler_demo.ipynb * Update examples/data_profiler_demo.ipynb * fix: order bug (capitalone#939) * fix: typo on rebase * fix: typing and bugs from rebase * fix: options tests due to merge and loading new options --------- Co-authored-by: Michael Davis <36012613+micdavis@users.noreply.github.com> Co-authored-by: ksneab7 <91956551+ksneab7@users.noreply.github.com> Co-authored-by: Taylor Turner <taylorfturner@gmail.com> Co-authored-by: Tyler <tfarnan@ucsd.edu> Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com> Co-authored-by: ksneab7 <ksneab7@gmail.com>
Adding serialize / deserialize for `ColumnDataLabelerCompiler`.
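For reviewers who want a picture of the round trip this PR is after, here is a minimal sketch of the general encode / decode pattern. It is not the DataProfiler implementation: `ToyCompiler`, `KNOWN_CLASSES`, `serialize`, and `deserialize` are hypothetical names made up for illustration. Only the idea of a `load_from_dict` classmethod is taken from the commit messages in this thread, and its signature here is an assumption.

```python
import json


class ToyCompiler:
    """Hypothetical stand-in for a profile compiler; not DataProfiler code."""

    def __init__(self, name=None):
        self.name = name
        self._profiles = {}  # maps profile type name -> plain-dict state

    def to_dict(self):
        # Record the class name so the decoder can pick the right class later.
        return {
            "class": type(self).__name__,
            "data": {"name": self.name, "_profiles": self._profiles},
        }

    @classmethod
    def load_from_dict(cls, data):
        # Rebuild the object attribute by attribute from the decoded dict.
        compiler = cls(data["name"])
        compiler._profiles = dict(data["_profiles"])
        return compiler


# Registry of classes the decoder is allowed to reconstruct (assumed design).
KNOWN_CLASSES = {"ToyCompiler": ToyCompiler}


def serialize(obj):
    """Encode an object as a JSON string tagged with its class name."""
    return json.dumps(obj.to_dict())


def deserialize(text):
    """Decode a JSON string produced by serialize() back into an object."""
    payload = json.loads(text)
    cls = KNOWN_CLASSES.get(payload["class"])
    if cls is None:
        raise ValueError(f"Unknown class: {payload['class']}")
    return cls.load_from_dict(payload["data"])


# Round trip: the restored object should carry the same state as the original.
original = ToyCompiler("compiler1")
original._profiles["data_label"] = {"sample_size": 3}
restored = deserialize(serialize(original))
```

Tagging the payload with a class name and looking it up in an explicit registry keeps the decoder from instantiating arbitrary types. Whether the real decoder works this way is not confirmed by this thread.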