Drastic TestStructuredProfiler Speed Improvement #953

Merged (1 commit) on Jul 6, 2023

@lizlouise1335 lizlouise1335 commented Jul 6, 2023

Disabled multiprocessing for the TestStructuredProfiler class, which cut its total runtime roughly in half.

TestStructuredProfiler call-time improvements (seconds)

| Test | Before (s) | After (s) |
| --- | --- | --- |
| Total runtime | 248.38 (roughly 4 minutes) | 116.05 (roughly 2 minutes) |
| test_correlation_update | 34.72 | 7.82 |
| test_chi2 | 28.88 | 7.16 |
| test_report_remove_disabled_flag | 24.92 | 9.37 |
| test_update_chi2 | 46.82 | 0.54 |
| test_merge_chi2 | 22.68 | 6.47 |
| test_correlation | 43.08 | 1.01 |
| test_save_and_load_pkl_file (*) | 21.05 | 20.57 |
| test_unique_col_permutation | 18.41 | 9.78 |
| test_save_and_load_no_labeler | 16.67 | 7.99 |
| test_data_label_assigned | 13.98 | 6.66 |
| test_duplicate_columns | 12.38 | 6.27 |

(*) Turning off multiprocessing hardly improves this test at all, so I may come back to it later.
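
The speedup implied by the timings reported above can be worked out directly. The following is a minimal sketch that uses only those figures; the `timings` dictionary and the `speedup` helper are illustrative names, not part of this PR.

```python
# Before/after call times in seconds, copied from the timings reported above.
timings = {
    "total": (248.38, 116.05),
    "test_correlation_update": (34.72, 7.82),
    "test_chi2": (28.88, 7.16),
    "test_report_remove_disabled_flag": (24.92, 9.37),
    "test_update_chi2": (46.82, 0.54),
    "test_merge_chi2": (22.68, 6.47),
    "test_correlation": (43.08, 1.01),
    "test_save_and_load_pkl_file": (21.05, 20.57),
    "test_unique_col_permutation": (18.41, 9.78),
    "test_save_and_load_no_labeler": (16.67, 7.99),
    "test_data_label_assigned": (13.98, 6.66),
    "test_duplicate_columns": (12.38, 6.27),
}


def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


# Rank entries from largest to smallest speedup.
ranked = sorted(timings.items(), key=lambda kv: speedup(*kv[1]), reverse=True)

for name, (before, after) in ranked:
    factor = speedup(before, after)
    print(f"{name:34s} {before:7.2f}s -> {after:7.2f}s  ({factor:5.1f}x)")
```

On these numbers the whole class runs about 2.1x faster, test_update_chi2 gains the most (roughly 87x), and test_save_and_load_pkl_file is essentially unchanged.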

@JGSweets JGSweets merged commit b63b25d into capitalone:dev Jul 6, 2023
taylorfturner pushed a commit that referenced this pull request Aug 1, 2023
taylorfturner pushed a commit that referenced this pull request Aug 4, 2023
micdavis pushed a commit that referenced this pull request Aug 7, 2023
* let's try this again (#953)

* Fix/f1 score path fix import (#952)

* Fixed F1Score Import

* Linted example file with Black Linter

* Scipy bug fix (#951)

* update

* renamed var and removed from for loops

* refactored var

* Make BaseDataProcessor.process() compatible with all argument sets (#954)

A method signature that uses *args: Any, **kwargs: Any is compatible
with any set of arguments in mypy, despite being an LSP violation. This
lets us assert that subclasses of BaseDataProcessor should have some
process() method with an arbitrary signature.

We also add to the return type of BaseDataPreprocessor so that it is
inclusive of all of its subclasses.

Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
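
The signature pattern described in #954 can be sketched as follows. This is an illustration only: `BaseProcessor`, `PreProcessor`, and `PostProcessor` are hypothetical stand-ins, not the actual DataProfiler classes, and the method bodies are invented for the example. Per the commit message, mypy treats the `*args: Any, **kwargs: Any` base signature as compatible with the subclass overrides.

```python
from typing import Any


class BaseProcessor:
    """Hypothetical base class; only asserts that process() exists."""

    def process(self, *args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError


class PreProcessor(BaseProcessor):
    # Narrower signature than the base; allowed because the base
    # method is declared with *args: Any, **kwargs: Any.
    def process(self, data: list[str], batch_size: int = 2) -> list[list[str]]:
        # Split the input into consecutive batches.
        return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


class PostProcessor(BaseProcessor):
    # A different argument set again, still an override of process().
    def process(self, data: list[str], results: dict[str, Any]) -> dict[str, Any]:
        # Attach the number of processed items to the results.
        return {**results, "n": len(data)}


print(PreProcessor().process(["a", "b", "c"]))
print(PostProcessor().process(["a"], {"x": 1}))
```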

* Fix name mangling and typevar errors (#955)

Inside the BaseDataProcessor class definition, references to
__subclasses are automatically replaced with
_BaseDataProcessor__subclasses. This remains the case even in static
methods _register_subclass() and get_class(). Same with BaseModel and
its __subclasses field. So we do not have to write out the full name
mangled identifiers inside the class definitions.

Also, mypy doesn't seem to be able to handle the return type of
BaseDataProcessor.get_class() being a typevar, so that was changed to
type[BaseDataProcessor]. This does not affect the functionality of
get_class() since it always returns a subclass of BaseDataProcessor.
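
The name-mangling behavior described in #955 can be checked with a small example. `Base` and `Child` below are hypothetical stand-ins for BaseDataProcessor and one of its subclasses; only the mangling rule itself is standard Python.

```python
class Base:
    # A double-underscore class attribute is stored as _Base__subclasses.
    __subclasses: dict = {}

    @staticmethod
    def _register_subclass(name: str, cls: type) -> None:
        # Written as Base.__subclasses, compiled as Base._Base__subclasses,
        # even though this is a static method.
        Base.__subclasses[name.lower()] = cls

    @staticmethod
    def get_class(name: str):
        # Same mangling applies here; returns None for unknown names.
        return Base.__subclasses.get(name.lower())


class Child(Base):
    pass


Base._register_subclass("Child", Child)

print(Base.get_class("child") is Child)
print("_Base__subclasses" in vars(Base))
```

Because the mangling happens for any `__name` identifier written lexically inside the class body, the short form works in every method of the class, and the fully mangled name is only needed from outside the class definition.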

* None-check labels dependants (#964)

The mypy errors addressed here occur because variables label_mapping
(in CharPreprocessor), unstructured_labels, and unstructured_label_set
(in StructCharPreprocessor.process()) have optional types when they're
used. This is fixed by checking that they are not None prior to the
operation, which mypy recognizes as removing the None type from them.

This should have no effect on functionality because we are already
checking that labels is not None, and the variables above all depend on
labels such that they are None only if labels is None.
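
The narrowing described in #964 looks roughly like this. The function `build_label_set` and its body are hypothetical; only the pattern (an explicit `is not None` check on a variable that depends on `labels`) mirrors the commit message.

```python
from typing import Optional


def build_label_set(labels: Optional[list[str]]) -> set[str]:
    # label_mapping depends on labels: it is None only if labels is None.
    label_mapping: Optional[dict[str, int]] = (
        {label: i for i, label in enumerate(labels)} if labels is not None else None
    )

    if labels is not None:
        # A type checker cannot infer from the check on labels alone that
        # label_mapping is set, so the explicit check below narrows
        # Optional[dict[str, int]] to dict[str, int].
        if label_mapping is not None:
            return set(label_mapping.keys())
    return set()


print(build_label_set(["ADDRESS", "NAME"]))
print(build_label_set(None))
```

The extra check never changes the result at runtime, since the inner condition is true whenever the outer one is.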

* Changed `publish-python-package.yml` to include only release branches. (#965)

* Changed release option to only release branches named 'release/<version-tag>'.

* Reverted types

* Updated DATAPROFILER_SEED setting in utils.py; abstracted RNG creation (#959) (#966)

* abstracted rng creation 23/07/11 14:32

* updated profile_builder random number generation

* renamed dp_rng() to get_random_number_generator()

* updated data_utils random number generation, added warning back to get_random_number_generator()

* removed erroneous print statement

* added tests of get_random_number_generator() to test_data_utils and test_utils

* removed unnecessary int dtype conversion

* edited seed declaration statement

* added setUp function to get_random_number_generator() testing

* fixed duplicate variable declaration in test_data_utils.py and test_utils.py

* moved generator function to root of dataprofiler dir; added test_generator.py; reverted test_data_utils and test_utils

* moved and renamed utils_global; cleaned up unused imports

* additional tests of get_random_number_generator()

* added test of utils_global for DATAPROFILER_SEED not in os.environ and settings._seed==None

* added the last four unit tests in Taylor's requested changes to test_utils_global.py

* removed unneeded tests and declarations; changed to relative imports; updated assertWarnsRegex in test_utils_global

* changed two more imports to relative imports

* updated rng.integers call

* removed unnecessary slicing/indexing

* removed unnecessary slicing/indexing

* cleaned up os.environ mocks in test_utils_global

* mocked expected values in unit tests

* simplified mocks

* removed unnecessary test

* added more descriptive mock names; ensured that rng generator uses proper seed

* cleaned up mock names; improved docstrings

* removed unnecessary clear=True clauses; removed duplicate assert statement

* made clear=True statements consistent

* removed one variable declaration; added clear=True to one mock

* removed clear=True statement

* removed unused imports and variable declarations

* renamed utils_global -> rng_utils and corresponding test; renamed utils.py -> profiler_utils.py and corresponding test

* fixed import error

* renamed utils.py and utils_global.py

* replaced imports of profilers.utils with profilers.profiler_utils

Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>

* Staging: into dev feature/num-quantiles (#990)

* fix scipy mend issue (#988)

* HistogramAndQuantilesOption sync with dev branch (#987)

* Changes to HistogramAndQuantilesOption now sync with concurrent updates to dev branch.

* Changes to scipy version, fixing comments

* Slight docstrings change

* revert back -- other PR to fix

* empty

* fix

* Staging multiprocess automation into dev (#997) (#998)

* Fix ProfilerOptions() documentation (#1002)

* fixed hyperlinks to documentation about ProfilerOptions()

* relative path add

* update with proper link

* update unstruct with link

* update version

* retain

* revert

---------

Co-authored-by: Liz Smith <liz.smith@richmond.edu>
Co-authored-by: Navid Nafiuzzaman <mxn4459@rit.edu>
Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Co-authored-by: clee1152 <chrislee011502@gmail.com>
Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>
junholee6a added a commit to junholee6a/DataProfiler that referenced this pull request Sep 17, 2023
3 participants