Staging: main 0.10.3 (capitalone#1004)
* let's try this again (capitalone#953)

* Fix/f1 score path fix import (capitalone#952)

* Fixed F1Score Import

* Linted example file with the Black linter

* Scipy bug fix (capitalone#951)

* update

* renamed var and removed from for loops

* refactored var

* Make BaseDataProcessor.process() compatible with all argument sets (capitalone#954)

A method signature that uses *args: Any, **kwargs: Any is compatible
with any set of arguments in mypy, despite being an LSP violation. This
lets us assert that subclasses of BaseDataProcessor should have some
process() method with an arbitrary signature.

We also add to the return type of BaseDataPreprocessor so that it is
inclusive of all of its subclasses.
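
A minimal sketch of the pattern (the base class name follows the PR description, but the subclass and both method bodies are illustrative assumptions, not the repo's code):

    from __future__ import annotations

    from abc import abstractmethod
    from typing import Any


    class BaseDataProcessor:
        @abstractmethod
        def process(self, *args: Any, **kwargs: Any) -> Any:
            """Subclasses may override with any concrete signature."""
            raise NotImplementedError


    class ExampleProcessor(BaseDataProcessor):
        # Narrowing an inherited signature normally violates LSP, but mypy
        # treats (*args: Any, **kwargs: Any) as compatible with any
        # override, so this passes type checking.
        def process(self, data: str, labels: list | None = None) -> str:
            return data.lower()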

Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>

* Fix name mangling and typevar errors (capitalone#955)

Inside the BaseDataProcessor class definition, references to
__subclasses are automatically replaced with
_BaseDataProcessor__subclasses. This remains the case even in static
methods _register_subclass() and get_class(). Same with BaseModel and
its __subclasses field. So we do not have to write out the full name
mangled identifiers inside the class definitions.

Also, mypy doesn't seem to be able to handle the return type of
BaseDataProcessor.get_class() being a typevar, so that was changed to
type[BaseDataProcessor]. This does not affect the functionality of
get_class() since it always returns a subclass of BaseDataProcessor.
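
A toy reproduction of the mangling behavior (the registry logic is simplified for illustration, not copied from the repo):

    class BaseDataProcessor:
        __subclasses: dict = {}  # stored as _BaseDataProcessor__subclasses

        @staticmethod
        def _register_subclass(subclass: type) -> None:
            # Any identifier of the form __name inside the class body is
            # rewritten at compile time, so this attribute access resolves
            # to BaseDataProcessor._BaseDataProcessor__subclasses.
            BaseDataProcessor.__subclasses[subclass.__name__.lower()] = subclass

        @staticmethod
        def get_class(class_name: str) -> type["BaseDataProcessor"]:
            # Annotated as type[BaseDataProcessor] rather than a TypeVar,
            # since every registered class is a subclass anyway.
            return BaseDataProcessor.__subclasses[class_name]


    # The mangled name is what actually lives on the class.
    assert "_BaseDataProcessor__subclasses" in vars(BaseDataProcessor)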

* None-check labels dependants (capitalone#964)

The mypy errors addressed here occur because variables label_mapping
(in CharPreprocessor), unstructured_labels, and unstructured_label_set
(in StructCharPreprocessor.process()) have optional types when they're
used. This is fixed by checking that they are not None prior to the
operation, which mypy recognizes as removing the None type from them.

This should have no effect on functionality because we are already
checking that labels is not None, and the variables above all depend on
labels such that they are None only if labels is None.
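
In isolation, the narrowing looks like this (variable names follow the description; the surrounding logic is an invented stand-in):

    from typing import Optional


    def process(labels: Optional[list] = None) -> None:
        # label_mapping is None exactly when labels is None.
        label_mapping: Optional[dict] = (
            {label: i for i, label in enumerate(labels)}
            if labels is not None
            else None
        )

        if labels is not None:
            # mypy still types label_mapping as Optional[dict] here, so an
            # explicit None-check is added to narrow it to dict before use.
            if label_mapping is not None:
                print(len(label_mapping))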

* Changed `publish-python-package.yml` to include only release branches. (capitalone#965)

* Changed release trigger to run only on branches named 'release/<version-tag>'.

* Reverted types

* Updated DATAPROFILER_SEED setting in utils.py; abstracted RNG creation (capitalone#959) (capitalone#966)

* abstracted rng creation 23/07/11 14:32

* updated profile_builder random number generation

* renamed dp_rng() to get_random_number_generator()

* updated data_utils random number generation, added warning back to get_random_number_generator()

* removed erroneous print statement

* added tests of get_random_number_generator() to test_data_utils and test_utils

* removed unnecessary int dtype conversion

* edited seed declaration statement

* added setUp function to get_random_number_generator() testing

* fixed duplicate variable declaration in test_data_utils.py and test_utils.py

* moved generator function to root of dataprofiler dir; added test_generator.py; reverted test_data_utils and test_utils

* moved and renamed utils_global; cleaned up unused imports

* additional tests of get_random_number_generator()

* added test of utils_global for DATAPROFILER_SEED not in os.environ and settings._seed==None

* added the last four unit tests in Taylor's requested changes to test_utils_global.py

* removed unneeded tests and declarations; changed to relative imports; updated assertWarnsRegex in test_utils_global

* changed two more imports to relative imports

* updated rng.integers call

* removed unnecessary slicing/indexing

* removed unnecessary slicing/indexing

* cleaned up os.environ mocks in test_utils_global

* mocked expected values in unit tests

* simplified mocks

* removed unnecessary test

* added more descriptive mock names; ensured that rng generator uses proper seed

* cleaned up mock names; improved docstrings

* removed unnecessary clear=True clauses; removed duplicate assert statement

* made clear=True statements consistent

* removed one variable declaration; added clear=True to one mock

* removed clear=True statement

* removed unused imports and variable declarations

* renamed utils_global -> rng_utils and corresponding test; renamed utils.py -> profiler_utils.py and corresponding test

* fixed import error

* renamed utils.py and utils_global.py

* replaced imports of profilers.utils with profilers.profiler_utils
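
Pieced together from the reservoir() code it replaces (see the data_utils.py diff further below) and the bullets above, a plausible sketch of the new helper; the real rng_utils implementation may differ in details such as the warning text:

    import os
    import warnings

    import numpy as np

    from dataprofiler import settings  # assumed home of the shared _seed


    def get_random_number_generator() -> np.random.Generator:
        """Return a numpy Generator seeded from settings._seed or DATAPROFILER_SEED."""
        seed = settings._seed
        if seed is None and "DATAPROFILER_SEED" in os.environ:
            env_seed = os.environ["DATAPROFILER_SEED"]
            try:
                seed = int(env_seed)
            except ValueError:
                warnings.warn("Seed should be an integer", RuntimeWarning)
        # default_rng(None) falls back to fresh OS entropy.
        return np.random.default_rng(seed)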

Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>

* Staging: into dev feature/num-quantiles (capitalone#990)

* fix scipy mend issue (capitalone#988)

* HistogramAndQuantilesOption sync with dev branch (capitalone#987)

* Changes to HistogramAndQuantilesOption now sync with concurrent updates to the dev branch.

* Changes to scipy version, fixing comments

* Slight docstrings change

* revert back -- other PR to fix

* empty

* fix

* Staging multiprocess automation into dev (capitalone#997) (capitalone#998)

* Fix ProfilerOptions() documentation (capitalone#1002)

* fixed hyperlinks to documentation about ProfilerOptions()

* relative path add

* update with proper link

* update unstruct with link

* update version

* retain

* revert

---------

Co-authored-by: Liz Smith <liz.smith@richmond.edu>
Co-authored-by: Navid Nafiuzzaman <mxn4459@rit.edu>
Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Co-authored-by: clee1152 <chrislee011502@gmail.com>
Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>
8 people committed Sep 17, 2023
1 parent cf4d651 commit 0620bf2
Showing 42 changed files with 761 additions and 419 deletions.
13 changes: 3 additions & 10 deletions dataprofiler/data_readers/data_utils.py

@@ -1,7 +1,5 @@
 """Contains functions for data readers."""
 import json
-import os
-import random
 import re
 import urllib
 from collections import OrderedDict
@@ -28,7 +26,7 @@
 from chardet.universaldetector import UniversalDetector
 from typing_extensions import TypeGuard

-from .. import dp_logging, settings
+from .. import dp_logging, rng_utils
 from .._typing import JSONType, Url
 from .filepath_or_buffer import FileOrBufferHandler, is_stream_buffer  # NOQA

@@ -315,11 +313,7 @@ def reservoir(file: TextIOWrapper, sample_nrows: int) -> list:

     kinv = 1 / sample_nrows
     W = 1.0
-    rng = random.Random(x=settings._seed)
-    if "DATAPROFILER_SEED" in os.environ and settings._seed is None:
-        seed = os.environ.get("DATAPROFILER_SEED")
-        if seed:
-            rng = random.Random(int(seed))
+    rng = rng_utils.get_random_number_generator()

     while True:
         W *= rng.random() ** kinv
@@ -334,7 +328,7 @@ def reservoir(file: TextIOWrapper, sample_nrows: int) -> list:
         except StopIteration:
             break
         # Append new, replace old with dummy, and keep track of order
-        remove_index = rng.randrange(sample_nrows)
+        remove_index = rng.integers(0, sample_nrows)
         values[indices[remove_index]] = str(None)
         indices[remove_index] = len(values)
         values.append(newval)
@@ -824,7 +818,6 @@ def url_to_bytes(url_as_string: Url, options: Dict) -> BytesIO:
         "Content-length" in url.headers
         and int(url.headers["Content-length"]) >= 1024**3
     ):
-
         raise ValueError(
             "The downloaded file from the url may not be " "larger than 1GB"
         )

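One detail worth verifying in the hunk above: the replacement preserves the sampled range, since random.Random.randrange(n) and np.random.Generator.integers(0, n) both draw uniformly from {0, ..., n-1} (numpy's high bound is exclusive by default). A quick standalone check:

    import random

    import numpy as np

    n = 5
    stdlib_rng = random.Random(0)
    numpy_rng = np.random.default_rng(0)

    for _ in range(1000):
        assert 0 <= stdlib_rng.randrange(n) < n   # high bound n excluded
        assert 0 <= numpy_rng.integers(0, n) < n  # high bound n excluded
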
5 changes: 3 additions & 2 deletions dataprofiler/profilers/__init__.py

@@ -28,7 +28,7 @@
     DataLabelerOptions,
     DateTimeOptions,
     FloatOptions,
-    HistogramOption,
+    HistogramAndQuantilesOption,
     HyperLogLogOptions,
     IntOptions,
     ModeOption,
@@ -66,7 +66,8 @@

 json_decoder._options = {
     BooleanOption.__name__: BooleanOption,
-    HistogramOption.__name__: HistogramOption,
+    "HistogramOption": HistogramAndQuantilesOption,
+    HistogramAndQuantilesOption.__name__: HistogramAndQuantilesOption,
     ModeOption.__name__: ModeOption,
     BaseInspectorOptions.__name__: BaseInspectorOptions,
     NumericalOptions.__name__: NumericalOptions,

6 changes: 3 additions & 3 deletions dataprofiler/profilers/base_column_profilers.py

@@ -11,7 +11,7 @@
 import numpy as np
 import pandas as pd

-from . import utils
+from . import profiler_utils
 from .profiler_options import BaseInspectorOptions, BaseOption

 BaseColumnProfilerT = TypeVar("BaseColumnProfilerT", bound="BaseColumnProfiler")
@@ -76,7 +76,7 @@ def _timeit(method: Callable = None, name: str = None) -> Callable:
         :param name: key argument for the times dictionary
         :type name: str
         """
-        return utils.method_timeit(method, name)
+        return profiler_utils.method_timeit(method, name)

     @staticmethod
     def _filter_properties_w_options(
@@ -173,7 +173,7 @@ def _add_helper(
         else:
             raise ValueError(f"Column names unmatched: {other1.name} != {other2.name}")

-        self.times = utils.add_nested_dictionaries(other1.times, other2.times)
+        self.times = profiler_utils.add_nested_dictionaries(other1.times, other2.times)

         self.sample_size = other1.sample_size + other2.sample_size

34 changes: 20 additions & 14 deletions dataprofiler/profilers/categorical_column_profile.py

@@ -8,7 +8,7 @@
 import datasketches
 from pandas import DataFrame, Series

-from . import utils
+from . import profiler_utils
 from .base_column_profilers import BaseColumnProfiler
 from .profiler_options import CategoricalOptions

@@ -131,7 +131,7 @@ def __add__(self, other: CategoricalColumn) -> CategoricalColumn:
         elif not self.cms and not other.cms:
             # If both profiles have not met stop condition
             if not (self._stop_condition_is_met or other._stop_condition_is_met):
-                merged_profile._categories = utils.add_nested_dictionaries(
+                merged_profile._categories = profiler_utils.add_nested_dictionaries(
                     self._categories, other._categories
                 )

@@ -250,21 +250,21 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
         # Make sure other_profile's type matches this class
         differences: dict = super().diff(other_profile, options)

-        differences["categorical"] = utils.find_diff_of_strings_and_bools(
+        differences["categorical"] = profiler_utils.find_diff_of_strings_and_bools(
             self.is_match, other_profile.is_match
         )

         differences["statistics"] = dict(
             [
                 (
                     "unique_count",
-                    utils.find_diff_of_numbers(
+                    profiler_utils.find_diff_of_numbers(
                         self.unique_count, other_profile.unique_count
                     ),
                 ),
                 (
                     "unique_ratio",
-                    utils.find_diff_of_numbers(
+                    profiler_utils.find_diff_of_numbers(
                         self.unique_ratio, other_profile.unique_ratio
                     ),
                 ),
@@ -275,19 +275,25 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
         if self.is_match and other_profile.is_match:
             differences["statistics"][
                 "chi2-test"
-            ] = utils.perform_chi_squared_test_for_homogeneity(
+            ] = profiler_utils.perform_chi_squared_test_for_homogeneity(
                 self._categories,
                 self.sample_size,
                 other_profile._categories,
                 other_profile.sample_size,
             )
-        differences["statistics"]["categories"] = utils.find_diff_of_lists_and_sets(
+        differences["statistics"][
+            "categories"
+        ] = profiler_utils.find_diff_of_lists_and_sets(
             self.categories, other_profile.categories
         )
-        differences["statistics"]["gini_impurity"] = utils.find_diff_of_numbers(
+        differences["statistics"][
+            "gini_impurity"
+        ] = profiler_utils.find_diff_of_numbers(
             self.gini_impurity, other_profile.gini_impurity
         )
-        differences["statistics"]["unalikeability"] = utils.find_diff_of_numbers(
+        differences["statistics"][
+            "unalikeability"
+        ] = profiler_utils.find_diff_of_numbers(
             self.unalikeability, other_profile.unalikeability
         )
         cat_count1 = dict(
@@ -299,9 +305,9 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
             )
         )

-        differences["statistics"]["categorical_count"] = utils.find_diff_of_dicts(
-            cat_count1, cat_count2
-        )
+        differences["statistics"][
+            "categorical_count"
+        ] = profiler_utils.find_diff_of_dicts(cat_count1, cat_count2)

         return differences

@@ -532,7 +538,7 @@ def _merge_categories_cms(
         for k in (x for x in heavy_hitter_dict2 if x not in heavy_hitter_dict1):
             heavy_hitter_dict1[k] = cms1.get_estimate(k)

-        categories = utils.add_nested_dictionaries(
+        categories = profiler_utils.add_nested_dictionaries(
             heavy_hitter_dict2, heavy_hitter_dict1
         )

@@ -604,7 +610,7 @@ def _update_categories(
             )
         else:
             category_count = self._get_categories_full(df_series)
-            self._categories = utils.add_nested_dictionaries(
+            self._categories = profiler_utils.add_nested_dictionaries(
                 self._categories, category_count
             )
             self._update_stop_condition(df_series)

8 changes: 4 additions & 4 deletions dataprofiler/profilers/column_profile_compilers.py

@@ -8,7 +8,7 @@

 from pandas import Series

-from . import utils
+from . import profiler_utils
 from .categorical_column_profile import CategoricalColumn
 from .data_labeler_column_profile import DataLabelerColumn
 from .datetime_column_profile import DateTimeColumn
@@ -106,7 +106,7 @@ def _create_profile(
                     df_series.name, options=profiler_options
                 )
             except Exception as e:
-                utils.warn_on_profile(profiler.type, e)
+                profiler_utils.warn_on_profile(profiler.type, e)

         # Update profile after creation
         self.update_profile(df_series, pool)
@@ -338,7 +338,7 @@ def diff(
         if all_profiles:
             for key in all_profiles:
                 if key in self._profiles and key in other._profiles:
-                    diff = utils.find_diff_of_numbers(
+                    diff = profiler_utils.find_diff_of_numbers(
                         self._profiles[key].data_type_ratio,
                         other._profiles[key].data_type_ratio,
                     )
@@ -352,7 +352,7 @@ def diff(
         data_type1 = self.selected_data_type
         data_type2 = other.selected_data_type
         if data_type1 is not None or data_type2 is not None:
-            diff_profile["data_type"] = utils.find_diff_of_strings_and_bools(
+            diff_profile["data_type"] = profiler_utils.find_diff_of_strings_and_bools(
                 data_type1, data_type2
             )
         # Find diff of matching profile statistics

14 changes: 9 additions & 5 deletions dataprofiler/profilers/data_labeler_column_profile.py

@@ -9,7 +9,7 @@

 from ..labelers.base_data_labeler import BaseDataLabeler
 from ..labelers.data_labelers import DataLabeler
-from . import utils
+from . import profiler_utils
 from .base_column_profilers import BaseColumnProfiler
 from .profiler_options import DataLabelerOptions

@@ -325,7 +325,7 @@ def load_from_dict(cls, data, config: dict | None = None) -> DataLabelerColumn:

         data_labeler_load_attr = data.pop("data_labeler")
         if data_labeler_load_attr:
-            data_labeler_object = utils.reload_labeler_from_options_or_get_new(
+            data_labeler_object = profiler_utils.reload_labeler_from_options_or_get_new(
                 data_labeler_load_attr, config
             )
             if data_labeler_object is not None:
@@ -379,9 +379,13 @@ def diff(self, other_profile: DataLabelerColumn, options: dict = None) -> dict:
         other_label_rep = other_profile.label_representation

         differences = {
-            "data_label": utils.find_diff_of_lists_and_sets(self_labels, other_labels),
-            "avg_predictions": utils.find_diff_of_dicts(avg_preds, other_avg_preds),
-            "label_representation": utils.find_diff_of_dicts(
+            "data_label": profiler_utils.find_diff_of_lists_and_sets(
+                self_labels, other_labels
+            ),
+            "avg_predictions": profiler_utils.find_diff_of_dicts(
+                avg_preds, other_avg_preds
+            ),
+            "label_representation": profiler_utils.find_diff_of_dicts(
                 label_rep, other_label_rep
             ),
         }

10 changes: 5 additions & 5 deletions dataprofiler/profilers/datetime_column_profile.py

@@ -8,7 +8,7 @@
 import numpy as np
 import pandas as pd

-from . import utils
+from . import profiler_utils
 from .base_column_profilers import BaseColumnPrimitiveTypeProfiler, BaseColumnProfiler
 from .profiler_options import DateTimeOptions

@@ -114,7 +114,7 @@ def __add__(self, other: DateTimeColumn) -> DateTimeColumn:
             merged_profile.max = other.max
             merged_profile._dt_obj_max = other._dt_obj_max

-        merged_profile.date_formats = utils._combine_unique_sets(
+        merged_profile.date_formats = profiler_utils._combine_unique_sets(
             self.date_formats, other.date_formats
         )
         return merged_profile
@@ -192,13 +192,13 @@ def diff(self, other_profile: DateTimeColumn, options: dict = None) -> dict:
         super().diff(other_profile, options)

         differences = {
-            "min": utils.find_diff_of_dates(
+            "min": profiler_utils.find_diff_of_dates(
                 self._dt_obj_min, other_profile._dt_obj_min
             ),
-            "max": utils.find_diff_of_dates(
+            "max": profiler_utils.find_diff_of_dates(
                 self._dt_obj_max, other_profile._dt_obj_max
             ),
-            "format": utils.find_diff_of_lists_and_sets(
+            "format": profiler_utils.find_diff_of_lists_and_sets(
                 self.date_formats, other_profile.date_formats
             ),
         }

4 changes: 2 additions & 2 deletions dataprofiler/profilers/float_column_profile.py

@@ -7,7 +7,7 @@
 import numpy as np
 import pandas as pd

-from . import utils
+from . import profiler_utils
 from .base_column_profilers import BaseColumnPrimitiveTypeProfiler, BaseColumnProfiler
 from .numerical_column_stats import NumericStatsMixin
 from .profiler_options import FloatOptions
@@ -137,7 +137,7 @@ def diff(self, other_profile: FloatColumn, options: dict = None) -> dict:
         other_precision = other_profile.profile["precision"]
         precision_diff = dict()
         for key in self.profile["precision"].keys():
-            precision_diff[key] = utils.find_diff_of_numbers(
+            precision_diff[key] = profiler_utils.find_diff_of_numbers(
                 self.profile["precision"][key], other_precision[key]
             )
         precision_diff.pop("confidence_level")

20 changes: 10 additions & 10 deletions dataprofiler/profilers/graph_profiler.py

@@ -14,7 +14,7 @@
 from packaging import version

 from ..data_readers.graph_data import GraphData
-from . import utils
+from . import profiler_utils
 from .base_column_profilers import BaseColumnProfiler
 from .profiler_options import ProfilerOptions

@@ -118,34 +118,34 @@ def diff(self, other_profile: GraphProfiler, options: dict = None) -> dict:
         )

         diff_profile = {
-            "num_nodes": utils.find_diff_of_numbers(
+            "num_nodes": profiler_utils.find_diff_of_numbers(
                 self._num_nodes, other_profile._num_nodes
             ),
-            "num_edges": utils.find_diff_of_numbers(
+            "num_edges": profiler_utils.find_diff_of_numbers(
                 self._num_edges, other_profile._num_edges
             ),
-            "categorical_attributes": utils.find_diff_of_lists_and_sets(
+            "categorical_attributes": profiler_utils.find_diff_of_lists_and_sets(
                 self._categorical_attributes, other_profile._categorical_attributes
             ),
-            "continuous_attributes": utils.find_diff_of_lists_and_sets(
+            "continuous_attributes": profiler_utils.find_diff_of_lists_and_sets(
                 self._continuous_attributes, other_profile._continuous_attributes
             ),
-            "avg_node_degree": utils.find_diff_of_numbers(
+            "avg_node_degree": profiler_utils.find_diff_of_numbers(
                 self._avg_node_degree, other_profile._avg_node_degree
             ),
-            "global_max_component_size": utils.find_diff_of_numbers(
+            "global_max_component_size": profiler_utils.find_diff_of_numbers(
                 self._global_max_component_size,
                 other_profile._global_max_component_size,
             ),
-            "continuous_distribution": utils.find_diff_of_dicts_with_diff_keys(
+            "continuous_distribution": profiler_utils.find_diff_of_dicts_with_diff_keys(
                 self._continuous_distribution,
                 other_profile._continuous_distribution,
            ),
-            "categorical_distribution": utils.find_diff_of_dicts_with_diff_keys(
+            "categorical_distribution": profiler_utils.find_diff_of_dicts_with_diff_keys(  # noqa: E501
                 self._categorical_distribution,
                 other_profile._categorical_distribution,
             ),
-            "times": utils.find_diff_of_dicts(self.times, other_profile.times),
+            "times": profiler_utils.find_diff_of_dicts(self.times, other_profile.times),
         }

         return diff_profile

9 changes: 9 additions & 0 deletions dataprofiler/profilers/json_decoder.py

@@ -1,6 +1,7 @@
 """Contains methods to decode components of a Profiler."""
 from __future__ import annotations

+import warnings
 from typing import TYPE_CHECKING

 if TYPE_CHECKING:
@@ -72,6 +73,14 @@ def get_option_class(class_name: str) -> type[BaseOption]:
     options_class: type[BaseOption] | None = _options.get(class_name)
     if options_class is None:
         raise ValueError(f"Invalid option class {class_name} " f"failed to load.")
+
+    if class_name == "HistogramOption":
+        warnings.warn(
+            f"{class_name} will be deprecated in the future. During the JSON encode"
+            " process, HistogramOption is mapped to HistogramAndQuantilesOption. "
+            "Please begin utilizing the new HistogramAndQuantilesOption class.",
+            DeprecationWarning,
+        )
     return options_class

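A small usage check of the new fallback above (assuming get_option_class is importable at module level, which may not match the repo's exact layout): decoding the old name should resolve to HistogramAndQuantilesOption and emit the DeprecationWarning.

    import warnings

    from dataprofiler.profilers import json_decoder

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        option_class = json_decoder.get_option_class("HistogramOption")

    # The legacy key maps to the renamed class and warns about the rename.
    assert option_class.__name__ == "HistogramAndQuantilesOption"
    assert any(issubclass(w.category, DeprecationWarning) for w in caught)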