Staging: main 0.10.3 (#1004)
* let's try this again (#953)

* Fix/f1 score path fix import (#952)

* Fixed F1Score Import

* Linted example file with Black Linter

* Scipy bug fix (#951)

* update

* renamed var and removed from for loops

* refactored var

* Make BaseDataProcessor.process() compatible with all argument sets (#954)

A method signature that uses *args: Any, **kwargs: Any is compatible
with any set of arguments in mypy, despite being an LSP violation. This
lets us assert that subclasses of BaseDataProcessor should have some
process() method with an arbitrary signature.

We also add to the return type of BaseDataPreprocessor so that it is
inclusive of all of its subclasses.
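A minimal sketch of the signature trick described above (class names echo the PR, but the bodies are hypothetical): a `*args: Any, **kwargs: Any` abstract method lets each subclass declare its own concrete parameter list without a mypy override error.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseDataProcessor(ABC):
    @abstractmethod
    def process(self, *args: Any, **kwargs: Any) -> Any:
        """Subclasses may implement any concrete signature."""
        raise NotImplementedError()


class CharPreprocessor(BaseDataProcessor):
    # mypy accepts this narrower override because (*args, **kwargs)
    # is treated as compatible with any parameter list.
    def process(self, data: list, batch_size: int = 32) -> list:
        return data[:batch_size]


assert CharPreprocessor().process([1, 2, 3], batch_size=2) == [1, 2]
```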

Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>

* Fix name mangling and typevar errors (#955)

Inside the BaseDataProcessor class definition, references to
__subclasses are automatically replaced with
_BaseDataProcessor__subclasses. This remains the case even in static
methods _register_subclass() and get_class(). Same with BaseModel and
its __subclasses field. So we do not have to write out the full name
mangled identifiers inside the class definitions.

Also, mypy doesn't seem to be able to handle the return type of
BaseDataProcessor.get_class() being a typevar, so that was changed to
type[BaseDataProcessor]. This does not affect the functionality of
get_class() since it always returns a subclass of BaseDataProcessor.
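The name-mangling behavior can be seen in a small standalone example (simplified from the classes named above): inside the class body, `__subclasses` is rewritten by the compiler to `_BaseModel__subclasses`, including in classmethods and staticmethods.

```python
class BaseModel:
    __subclasses: dict = {}  # stored as _BaseModel__subclasses

    @classmethod
    def _register_subclass(cls) -> None:
        # This reference is compiled as cls._BaseModel__subclasses,
        # so the fully mangled name never needs to be written out.
        cls.__subclasses[cls.__name__.lower()] = cls


class RegexModel(BaseModel):
    pass


RegexModel._register_subclass()
# Outside the class body, only the mangled name works:
assert BaseModel._BaseModel__subclasses["regexmodel"] is RegexModel
```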

* None-check labels dependants (#964)

The mypy errors addressed here occur because variables label_mapping
(in CharPreprocessor), unstructured_labels, and unstructured_label_set
(in StructCharPreprocessor.process()) have optional types when they're
used. This is fixed by checking that they are not None prior to the
operation, which mypy recognizes as removing the None type from them.

This should have no effect on functionality because we are already
checking that labels is not None, and the variables above all depend on
labels such that they are None only if labels is None.
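The narrowing pattern is illustrated below (the function body is a hypothetical stand-in, not the actual preprocessor code): an explicit `is not None` guard removes `None` from each variable's type, so the subsequent operations type-check.

```python
from typing import Optional


def num_classes(
    labels: Optional[list], label_mapping: Optional[dict]
) -> int:
    if labels is not None and label_mapping is not None:
        # mypy narrows both parameters to non-Optional types here,
        # so calling .values() needs no type: ignore.
        return max(label_mapping.values()) + 1
    return 0


assert num_classes(["a"], {"a": 3}) == 4
assert num_classes(None, None) == 0
```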

* Changed `publish-python-package.yml` to include only release branches. (#965)

* Changed release option to include only branches named 'release/<version-tag>'.

* Reverted types

* Updated DATAPROFILER_SEED setting in utils.py; abstracted RNG creation (#959) (#966)

* abstracted rng creation 23/07/11 14:32

* updated profile_builder random number generation

* renamed dp_rng() to get_random_number_generator()

* updated data_utils random number generation, added warning back to get_random_number_generator()

* removed erroneous print statement

* added tests of get_random_number_generator() to test_data_utils and test_utils

* removed unnecessary int dtype conversion

* edited seed declaration statement

* added setUp function to get_random_number_generator() testing

* fixed duplicate variable declaration in test_data_utils.py and test_utils.py

* moved generator function to root of dataprofiler dir; added test_generator.py; reverted test_data_utils and test_utils

* moved and renamed utils_global; cleaned up unused imports

* additional tests of get_random_number_generator()

* added test of utils_global for DATAPROFILER_SEED not in os.environ and settings._seed==None

* added the last four unit tests in Taylor's requested changes to test_utils_global.py

* removed unneeded tests and declarations; changed to relative imports; updated assertWarnsRegex in test_utils_global

* changed two more imports to relative imports

* updated rng.integers call

* removed unnecessary slicing/indexing

* removed unnecessary slicing/indexing

* cleaned up os.environ mocks in test_utils_global

* mocked expected values in unit tests

* simplified mocks

* removed unnecessary test

* added more descriptive mock names; ensured that rng generator uses proper seed

* cleaned up mock names; improved docstrings

* removed unnecessary clear=True clauses; removed duplicate assert statement

* made clear=True statements consistent

* removed one variable declaration; added clear=True to one mock

* removed clear=True statement

* removed unused imports and variable declarations

* renamed utils_global -> rng_utils and corresponding test; renamed utils.py -> profiler_utils.py and corresponding test

* fixed import error

* renamed utils.py and utils_global.py

* replaced imports of profilers.utils with profilers.profiler_utils
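The abstracted RNG helper described in the bullets above might look roughly like this; the sketch follows the seeding behavior the commits describe (read `DATAPROFILER_SEED` from the environment when no seed is set, warn on a bad value), but the real `rng_utils.py` may differ in details.

```python
import os
import warnings

import numpy as np

_seed = None  # stand-in for settings._seed


def get_random_number_generator() -> np.random.Generator:
    """Create a numpy Generator, seeded from DATAPROFILER_SEED if set."""
    seed = _seed
    if seed is None and "DATAPROFILER_SEED" in os.environ:
        env_seed = os.environ["DATAPROFILER_SEED"]
        try:
            seed = int(env_seed)
        except ValueError:
            warnings.warn("Seed should be an integer", RuntimeWarning)
            seed = None
    return np.random.default_rng(seed)
```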

Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>

* Staging: into dev feature/num-quantiles (#990)

* fix scipy mend issue (#988)

* HistogramAndQuantilesOption sync with dev branch (#987)

* Changes to HistogramAndQuantilesOption now sync with concurrent updates to dev branch.

* Changes to scipy version, fixing comments

* Slight docstrings change

* revert back -- other PR to fix

* empty

* fix

* Staging multiprocess automation into dev (#997) (#998)

* Fix ProfilerOptions() documentation (#1002)

* fixed hyperlinks to documentation about ProfilerOptions()

* relative path add

* update with proper link

* update unstruct with link

* update version

* retain

* revert

---------

Co-authored-by: Liz Smith <liz.smith@richmond.edu>
Co-authored-by: Navid Nafiuzzaman <mxn4459@rit.edu>
Co-authored-by: Richard Bann <87214439+drahc1R@users.noreply.github.com>
Co-authored-by: Junho Lee <53921230+junholee6a@users.noreply.github.com>
Co-authored-by: JGSweets <JGSweets@users.noreply.github.com>
Co-authored-by: clee1152 <chrislee011502@gmail.com>
Co-authored-by: jacob-buehler <86370501+jacob-buehler@users.noreply.github.com>
8 people authored Aug 7, 2023
1 parent ec47d45 commit b0b8510
Showing 46 changed files with 1,027 additions and 560 deletions.
2 changes: 2 additions & 0 deletions .github/workflows/publish-python-package.yml
@@ -7,6 +7,8 @@ name: Publish Python Package
on:
release:
types: [created]
branches:
- 'release/*'

jobs:
deploy:
13 changes: 3 additions & 10 deletions dataprofiler/data_readers/data_utils.py
@@ -1,7 +1,5 @@
"""Contains functions for data readers."""
import json
import os
import random
import re
import urllib
from collections import OrderedDict
@@ -28,7 +26,7 @@
from chardet.universaldetector import UniversalDetector
from typing_extensions import TypeGuard

from .. import dp_logging, settings
from .. import dp_logging, rng_utils
from .._typing import JSONType, Url
from .filepath_or_buffer import FileOrBufferHandler, is_stream_buffer # NOQA

@@ -315,11 +313,7 @@ def reservoir(file: TextIOWrapper, sample_nrows: int) -> list:

kinv = 1 / sample_nrows
W = 1.0
rng = random.Random(x=settings._seed)
if "DATAPROFILER_SEED" in os.environ and settings._seed is None:
seed = os.environ.get("DATAPROFILER_SEED")
if seed:
rng = random.Random(int(seed))
rng = rng_utils.get_random_number_generator()

while True:
W *= rng.random() ** kinv
@@ -334,7 +328,7 @@ def reservoir(file: TextIOWrapper, sample_nrows: int) -> list:
except StopIteration:
break
# Append new, replace old with dummy, and keep track of order
remove_index = rng.randrange(sample_nrows)
remove_index = rng.integers(0, sample_nrows)
values[indices[remove_index]] = str(None)
indices[remove_index] = len(values)
values.append(newval)
@@ -824,7 +818,6 @@ def url_to_bytes(url_as_string: Url, options: Dict) -> BytesIO:
"Content-length" in url.headers
and int(url.headers["Content-length"]) >= 1024**3
):

raise ValueError(
"The downloaded file from the url may not be " "larger than 1GB"
)
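The hunks above swap the stdlib `random.Random` for the shared numpy `Generator`; `rng.integers(0, n)` replaces `rng.randrange(n)` (both exclude the upper bound). A simplified sketch of the reservoir-sampling loop ("Algorithm L") with the new RNG, leaving out the index/order bookkeeping the real `reservoir()` does for file replay:

```python
import numpy as np


def reservoir_sample(stream, sample_nrows, rng=None):
    """Keep a uniform random sample of sample_nrows items from a stream."""
    if rng is None:
        rng = np.random.default_rng()
    it = iter(stream)
    values = []
    # Fill the reservoir with the first sample_nrows items.
    for _ in range(sample_nrows):
        try:
            values.append(next(it))
        except StopIteration:
            return values
    kinv = 1.0 / sample_nrows
    W = 1.0
    while True:
        W *= rng.random() ** kinv
        # Number of items to skip before the next replacement.
        skip = int(np.log(rng.random()) / np.log(1 - W))
        try:
            for _ in range(skip):
                next(it)
            newval = next(it)
        except StopIteration:
            break
        # rng.integers(0, n) is the Generator analogue of randrange(n).
        values[rng.integers(0, sample_nrows)] = newval
    return values
```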
6 changes: 3 additions & 3 deletions dataprofiler/labelers/base_model.py
@@ -32,7 +32,7 @@ def __new__(
class BaseModel(metaclass=abc.ABCMeta):
"""For labeling data."""

_BaseModel__subclasses: dict[str, type[BaseModel]] = {}
__subclasses: dict[str, type[BaseModel]] = {}
__metaclass__ = abc.ABCMeta

# boolean if the label mapping requires the mapping for index 0 reserved
@@ -90,7 +90,7 @@ def __eq__(self, other: object) -> bool:
def _register_subclass(cls) -> None:
"""Register a subclass for the class factory."""
if not inspect.isabstract(cls):
cls._BaseModel__subclasses[cls.__name__.lower()] = cls
cls.__subclasses[cls.__name__.lower()] = cls

@property
def label_mapping(self) -> dict[str, int]:
@@ -156,7 +156,7 @@ def get_class(cls, class_name: str) -> type[BaseModel] | None:
from .column_name_model import ColumnNameModel # NOQA
from .regex_model import RegexModel # NOQA

return cls._BaseModel__subclasses.get(class_name.lower(), None)
return cls.__subclasses.get(class_name.lower(), None)

def get_parameters(self, param_list: list[str] | None = None) -> dict:
"""
52 changes: 28 additions & 24 deletions dataprofiler/labelers/data_processing.py
@@ -49,16 +49,14 @@ def __init__(self, **parameters: Any) -> None:
def _register_subclass(cls) -> None:
"""Register a subclass for the class factory."""
if not inspect.isabstract(cls):
cls._BaseDataProcessor__subclasses[ # type: ignore
cls.__name__.lower()
] = cls
cls.__subclasses[cls.__name__.lower()] = cls

@classmethod
def get_class(cls: type[Processor], class_name: str) -> type[Processor] | None:
def get_class(
cls: type[BaseDataProcessor], class_name: str
) -> type[BaseDataProcessor] | None:
"""Get class of BaseDataProcessor object."""
return cls._BaseDataProcessor__subclasses.get( # type: ignore
class_name.lower(), None
)
return cls.__subclasses.get(class_name.lower(), None)

def __eq__(self, other: object) -> bool:
"""
@@ -129,7 +127,7 @@ def set_params(self, **kwargs: Any) -> None:
self._parameters[param] = kwargs[param]

@abc.abstractmethod
def process(self, *args: Any) -> Any:
def process(self, *args: Any, **kwargs: Any) -> Any:
"""Process data."""
raise NotImplementedError()

@@ -169,13 +167,15 @@ def __init__(self, **parameters: Any) -> None:
super().__init__(**parameters)

@abc.abstractmethod
def process( # type: ignore
def process(
self,
data: np.ndarray,
labels: np.ndarray | None = None,
label_mapping: dict[str, int] | None = None,
batch_size: int = 32,
) -> Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None]:
) -> Generator[tuple[np.ndarray, np.ndarray] | np.ndarray, None, None] | tuple[
np.ndarray, np.ndarray
] | np.ndarray:
"""Preprocess data."""
raise NotImplementedError()

@@ -191,7 +191,7 @@ def __init__(self, **parameters: Any) -> None:
super().__init__(**parameters)

@abc.abstractmethod
def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
@@ -240,7 +240,7 @@ def help(cls) -> None:
)
print(help_str)

def process( # type: ignore
def process(
self,
data: np.ndarray,
labels: np.ndarray | None = None,
@@ -668,7 +668,7 @@ def gen_none() -> Generator[None, None, None]:
if batch_data["samples"]:
yield batch_data

def process( # type: ignore
def process(
self,
data: np.ndarray,
labels: np.ndarray | None = None,
@@ -735,8 +735,8 @@ def process( # type: ignore
X_train = np.array(
[[sentence] for sentence in batch_data["samples"]], dtype=object
)
if labels is not None:
num_classes = max(label_mapping.values()) + 1 # type: ignore
if labels is not None and label_mapping is not None:
num_classes = max(label_mapping.values()) + 1

Y_train = tf.keras.utils.to_categorical(
batch_data["labels"], num_classes
@@ -836,7 +836,7 @@ def _validate_parameters(self, parameters: dict) -> None:
if errors:
raise ValueError("\n".join(errors))

def process( # type: ignore
def process(
self,
data: np.ndarray,
labels: np.ndarray | None = None,
@@ -1269,7 +1269,7 @@ def match_sentence_lengths(

return results

def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
@@ -1439,7 +1439,7 @@ def convert_to_unstructured_format(

return text, entities

def process( # type: ignore
def process(
self,
data: np.ndarray,
labels: np.ndarray | None = None,
@@ -1503,8 +1503,12 @@ def process( # type: ignore
unstructured_label_set,
) = self.convert_to_unstructured_format(batch_data, batch_labels)
unstructured_data[ind] = unstructured_text
if labels is not None:
unstructured_labels[ind] = unstructured_label_set # type: ignore
if (
labels is not None
and unstructured_labels is not None
and unstructured_label_set is not None
):
unstructured_labels[ind] = unstructured_label_set

if labels is not None:
np_unstruct_labels = np.array(unstructured_labels, dtype="object")
@@ -1800,7 +1804,7 @@ def convert_to_structured_analysis(

return results

def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
@@ -2022,7 +2026,7 @@ def split_prediction(results: dict) -> None:
pred, axis=1, ord=1, keepdims=True
)

def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
@@ -2160,7 +2164,7 @@ def _save_processor(self, dirpath: str) -> None:
) as fp:
json.dump(params, fp)

def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
@@ -2253,7 +2257,7 @@ def help(cls) -> None:
)
print(help_str)

def process( # type: ignore
def process(
self,
data: np.ndarray,
results: dict,
5 changes: 3 additions & 2 deletions dataprofiler/profilers/__init__.py
@@ -28,7 +28,7 @@
DataLabelerOptions,
DateTimeOptions,
FloatOptions,
HistogramOption,
HistogramAndQuantilesOption,
HyperLogLogOptions,
IntOptions,
ModeOption,
@@ -66,7 +66,8 @@

json_decoder._options = {
BooleanOption.__name__: BooleanOption,
HistogramOption.__name__: HistogramOption,
"HistogramOption": HistogramAndQuantilesOption,
HistogramAndQuantilesOption.__name__: HistogramAndQuantilesOption,
ModeOption.__name__: ModeOption,
BaseInspectorOptions.__name__: BaseInspectorOptions,
NumericalOptions.__name__: NumericalOptions,
6 changes: 3 additions & 3 deletions dataprofiler/profilers/base_column_profilers.py
@@ -11,7 +11,7 @@
import numpy as np
import pandas as pd

from . import utils
from . import profiler_utils
from .profiler_options import BaseInspectorOptions, BaseOption

BaseColumnProfilerT = TypeVar("BaseColumnProfilerT", bound="BaseColumnProfiler")
@@ -76,7 +76,7 @@ def _timeit(method: Callable = None, name: str = None) -> Callable:
:param name: key argument for the times dictionary
:type name: str
"""
return utils.method_timeit(method, name)
return profiler_utils.method_timeit(method, name)

@staticmethod
def _filter_properties_w_options(
@@ -173,7 +173,7 @@ def _add_helper(
else:
raise ValueError(f"Column names unmatched: {other1.name} != {other2.name}")

self.times = utils.add_nested_dictionaries(other1.times, other2.times)
self.times = profiler_utils.add_nested_dictionaries(other1.times, other2.times)

self.sample_size = other1.sample_size + other2.sample_size

34 changes: 20 additions & 14 deletions dataprofiler/profilers/categorical_column_profile.py
@@ -8,7 +8,7 @@
import datasketches
from pandas import DataFrame, Series

from . import utils
from . import profiler_utils
from .base_column_profilers import BaseColumnProfiler
from .profiler_options import CategoricalOptions

@@ -131,7 +131,7 @@ def __add__(self, other: CategoricalColumn) -> CategoricalColumn:
elif not self.cms and not other.cms:
# If both profiles have not met stop condition
if not (self._stop_condition_is_met or other._stop_condition_is_met):
merged_profile._categories = utils.add_nested_dictionaries(
merged_profile._categories = profiler_utils.add_nested_dictionaries(
self._categories, other._categories
)

@@ -250,21 +250,21 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
# Make sure other_profile's type matches this class
differences: dict = super().diff(other_profile, options)

differences["categorical"] = utils.find_diff_of_strings_and_bools(
differences["categorical"] = profiler_utils.find_diff_of_strings_and_bools(
self.is_match, other_profile.is_match
)

differences["statistics"] = dict(
[
(
"unique_count",
utils.find_diff_of_numbers(
profiler_utils.find_diff_of_numbers(
self.unique_count, other_profile.unique_count
),
),
(
"unique_ratio",
utils.find_diff_of_numbers(
profiler_utils.find_diff_of_numbers(
self.unique_ratio, other_profile.unique_ratio
),
),
@@ -275,19 +275,25 @@ def diff(self, other_profile: CategoricalColumn, options: dict = None) -> dict:
if self.is_match and other_profile.is_match:
differences["statistics"][
"chi2-test"
] = utils.perform_chi_squared_test_for_homogeneity(
] = profiler_utils.perform_chi_squared_test_for_homogeneity(
self._categories,
self.sample_size,
other_profile._categories,
other_profile.sample_size,
)
differences["statistics"]["categories"] = utils.find_diff_of_lists_and_sets(
differences["statistics"][
"categories"
] = profiler_utils.find_diff_of_lists_and_sets(
self.categories, other_profile.categories
)
differences["statistics"]["gini_impurity"] = utils.find_diff_of_numbers(
differences["statistics"][
"gini_impurity"
] = profiler_utils.find_diff_of_numbers(
self.gini_impurity, other_profile.gini_impurity
)
differences["statistics"]["unalikeability"] = utils.find_diff_of_numbers(
differences["statistics"][
"unalikeability"
] = profiler_utils.find_diff_of_numbers(
self.unalikeability, other_profile.unalikeability
)
cat_count1 = dict(
@@ -299,9 +305,9 @@
)
)

differences["statistics"]["categorical_count"] = utils.find_diff_of_dicts(
cat_count1, cat_count2
)
differences["statistics"][
"categorical_count"
] = profiler_utils.find_diff_of_dicts(cat_count1, cat_count2)

return differences

@@ -532,7 +538,7 @@ def _merge_categories_cms(
for k in (x for x in heavy_hitter_dict2 if x not in heavy_hitter_dict1):
heavy_hitter_dict1[k] = cms1.get_estimate(k)

categories = utils.add_nested_dictionaries(
categories = profiler_utils.add_nested_dictionaries(
heavy_hitter_dict2, heavy_hitter_dict1
)

@@ -604,7 +610,7 @@ def _update_categories(
)
else:
category_count = self._get_categories_full(df_series)
self._categories = utils.add_nested_dictionaries(
self._categories = profiler_utils.add_nested_dictionaries(
self._categories, category_count
)
self._update_stop_condition(df_series)