Pandas data type 'string' not understood #93

nikml · 2022-07-12T19:12:23Z

Describe the bug
Running the Quickstart results in an error

To Reproduce
Steps to reproduce the behavior:
Running:

import pandas as pd
import nannyml as nml
from IPython.display import display

# Load synthetic data
reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
display(reference.head())
display(analysis.head())

# Choose a chunker or set a chunk size
chunk_size = 5000

# initialize, specify required data columns, fit estimator and estimate
estimator = nml.CBPE(
   y_pred_proba='y_pred_proba',
   y_pred='y_pred',
   y_true='work_home_actual',
   timestamp_column_name='timestamp',
   metrics=['roc_auc'],
   chunk_size=chunk_size,
)
estimator = estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Show results
figure = estimated_performance.plot(kind='performance', metric='roc_auc', plot_reference=True)
figure.show()

# Define feature columns
feature_column_names = [
    col for col in reference.columns if col not in [
        'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
    ]]

# Let's initialize the object that will perform the Univariate Drift calculations
univariate_calculator = nml.UnivariateStatisticalDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
)
univariate_calculator = univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)
# Plot drift results for all model inputs
for feature in univariate_calculator.feature_column_names:
    figure = univariate_results.plot(
        kind='feature_drift',
        metric='statistic',
        feature_column_name=feature,
        plot_reference=True
    )
    figure.show()

# Rank features based on number of alerts
ranker = nml.Ranker.by('alert_count')
ranked_features = ranker.rank(univariate_results, only_drifting = False)
display(ranked_features)

calc = nml.StatisticalOutputDriftCalculator(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    timestamp_column_name='timestamp'
)
calc.fit(reference)
results = calc.calculate(analysis)

figure = results.plot(kind='prediction_drift', plot_reference=True)
figure.show()

# Let's initialize the object that will perform Data Reconstruction with PCA
rcerror_calculator = nml.DataReconstructionDriftCalculator(feature_column_names=feature_column_names, timestamp_column_name='timestamp', chunk_size=chunk_size).fit(reference_data=reference)
# let's see Reconstruction error statistics for all available data
rcerror_results = rcerror_calculator.calculate(analysis)
figure = rcerror_results.plot(kind='drift', plot_reference=True)
figure.show()

Gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\anaconda3\lib\site-packages\nannyml\base.py in fit(self, reference_data, *args, **kwargs)
     94             self._logger.debug(f"fitting {str(self)}")
---> 95             return self._fit(reference_data, *args, **kwargs)
     96         except InvalidArgumentsException:

~\anaconda3\lib\site-packages\nannyml\drift\model_inputs\univariate\statistical\calculator.py in _fit(self, reference_data, *args, **kwargs)
    105         self.previous_reference_data = reference_data.copy()
--> 106         self.previous_reference_results = self._calculate(self.previous_reference_data).data
    107 

~\anaconda3\lib\site-packages\nannyml\drift\model_inputs\univariate\statistical\calculator.py in _calculate(self, data, *args, **kwargs)
    116 
--> 117         self.continuous_column_names, self.categorical_column_names = _split_features_by_type(
    118             data, self.feature_column_names

~\anaconda3\lib\site-packages\nannyml\base.py in _split_features_by_type(data, feature_column_names)
    229 
--> 230     categorical_column_names = [col for col in feature_column_names if _column_is_categorical(data[col])]
    231 

~\anaconda3\lib\site-packages\nannyml\base.py in <listcomp>(.0)
    229 
--> 230     categorical_column_names = [col for col in feature_column_names if _column_is_categorical(data[col])]
    231 

~\anaconda3\lib\site-packages\nannyml\base.py in _column_is_categorical(column)
    235 def _column_is_categorical(column: pd.Series) -> bool:
--> 236     return column.dtype in ['object', 'string', 'category', 'bool']
    237 

TypeError: data type 'string' not understood

During handling of the above exception, another exception occurred:

CalculatorException                       Traceback (most recent call last)
<ipython-input-1-9ae82d7fa4d4> in <module>
     39     chunk_size=chunk_size
     40 )
---> 41 univariate_calculator = univariate_calculator.fit(reference)
     42 univariate_results = univariate_calculator.calculate(analysis)
     43 # Plot drift results for all model inputs

~\anaconda3\lib\site-packages\nannyml\base.py in fit(self, reference_data, *args, **kwargs)
     99             raise
    100         except Exception as exc:
--> 101             raise CalculatorException(f"failed while fitting {str(self)}.\n{exc}")
    102 
    103     def calculate(self, data: pd.DataFrame, *args, **kwargs) -> AbstractCalculatorResult:

CalculatorException: failed while fitting <nannyml.drift.model_inputs.univariate.statistical.calculator.UnivariateStatisticalDriftCalculator object at 0x0000022BBF196A30>.
data type 'string' not understood

Expected behavior
The quickstart code runs without a problem.

Additional context

The user who had that issue was running Python 3.8 on Windows through a PyCharm environment.

I couldn't reproduce the error when I tried on my machine. Moreover, when I guided the user to set up a new conda environment, the error went away.

However, maybe the way the string type is defined here could be changed, along the lines of suggestions such as these, to cover more cases? I'd hold off on that until we see more users having the issue, since in this case a misconfigured environment is more likely the problem than a library compatibility issue.
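
One way to make the check independent of how numpy compares a dtype with an arbitrary string would be to go through the helpers in pandas.api.types. The sketch below is only an illustration, not the fix that went into nannyml. The function name column_is_categorical is hypothetical, and it assumes the intent is to treat object, string, category and bool columns as categorical, as the list in the traceback suggests.

```python
import pandas as pd
from pandas.api import types as ptypes


def column_is_categorical(column: pd.Series) -> bool:
    """Hypothetical replacement for the dtype-in-list check from the traceback."""
    dtype = column.dtype
    # Each helper returns a plain bool and never compares the dtype
    # against an arbitrary string such as 'string'.
    return (
        ptypes.is_object_dtype(dtype)
        or ptypes.is_string_dtype(dtype)
        or isinstance(dtype, pd.CategoricalDtype)
        or ptypes.is_bool_dtype(dtype)
    )


if __name__ == "__main__":
    # Made-up frame, only to exercise the function.
    frame = pd.DataFrame(
        {
            "distance": [1.5, 2.0, 3.25],
            "weekday": ["Mon", "Tue", "Wed"],
            "remote": [True, False, True],
        }
    )
    for name in frame.columns:
        print(name, column_is_categorical(frame[name]))
```

Numeric and datetime columns fall through all four checks, so they would still be treated as continuous.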


bernhardbarker · 2022-07-29T12:35:37Z

Looks like the required numpy version (>=1.14.0) might be wrong. It seems like it should be closer to >=1.21.0 (at least based on the current nannyml code).

I reproduced it as follows:

conda create --name nannyml python=3.8
conda activate nannyml
pip install numpy==1.19.5
pip install pandas==1.4.3
pip install nannyml==0.5.0
pip install IPython==8.4.0
# quickstart.py contains the code above
python quickstart.py  # data type 'string' not understood

These libraries all seem to be compatible with one another (otherwise pip would show a warning or automatically update libraries to compatible versions, neither of which happened).

I fixed it by updating numpy:

pip install numpy==1.21.0
python quickstart.py  # works
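
Until the dependency pin is corrected, a script could check the installed numpy version up front and fail with a clearer message. The following is a rough sketch using only the standard library. The 1.21.0 threshold is an assumption taken from the experiment above, not a documented nannyml requirement, and the helper names are made up for illustration.

```python
import re
from importlib import metadata

# Assumption: 1.21.0 is the first version that worked in the test above.
MIN_NUMPY = (1, 21, 0)


def parse_version(text):
    """Turn '1.22.0rc1' into (1, 22, 0); pre-release suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '1.21'
    return tuple(parts)


def numpy_is_recent_enough(installed, minimum=MIN_NUMPY):
    """Return True when the installed version string is at least `minimum`."""
    return parse_version(installed) >= minimum


def describe_numpy():
    """Report on the numpy found in the current environment, if any."""
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return "numpy is not installed"
    if numpy_is_recent_enough(installed):
        return f"numpy {installed} looks recent enough"
    wanted = ".".join(str(part) for part in MIN_NUMPY)
    return f"numpy {installed} is older than {wanted}; consider upgrading"


if __name__ == "__main__":
    print(describe_numpy())
```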

I can also fix the issue by just explicitly converting each column to string:

...
display(ranked_features)

for c in reference.columns:
    reference[c] = reference[c].astype("string")

for c in analysis.columns:
    analysis[c] = analysis[c].astype("string")

calc = nml.StatisticalOutputDriftCalculator(
...

I'm running Ubuntu 22, in case it matters.

gabrieltardochi · 2022-08-21T14:48:57Z

Commenting to keep the issue alive, since I just faced the same problem while running the Univariate Drift Detection tutorial. Manually updating numpy worked for me as well.

nnansters · 2022-08-21T17:20:07Z

Thanks for the extra information, all. We'll try to include a fix in the next release, coming next week!

nnansters · 2022-08-21T17:21:53Z

Sorry for missing this Bernhard, I was out on holiday. Thanks for the report and the investigation!

nikml · 2022-09-01T09:27:45Z

Nice dig down!

On the latest version, 0.5.3, with this code sample:

import pandas as pd
import nannyml as nml
from IPython.display import display

# Load synthetic data
reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
display(reference.head())
display(analysis.head())

# Choose a chunker or set a chunk size
chunk_size = 5000

# initialize, specify required data columns, fit estimator and estimate
estimator = nml.CBPE(
   y_pred_proba='y_pred_proba',
   y_pred='y_pred',
   y_true='work_home_actual',
   timestamp_column_name='timestamp',
   metrics=['roc_auc'],
   chunk_size=chunk_size,
   # problem_type='classification_binary',
)
estimator = estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Show results
figure = estimated_performance.plot(kind='performance', metric='roc_auc', plot_reference=True)
figure.show()

# Define feature columns
feature_column_names = [
    col for col in reference.columns if col not in [
        'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
    ]]

# Let's initialize the object that will perform the Univariate Drift calculations
univariate_calculator = nml.UnivariateStatisticalDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
)
univariate_calculator = univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)
# Plot drift results for all model inputs
for feature in univariate_calculator.feature_column_names:
    figure = univariate_results.plot(
        kind='feature_drift',
        metric='statistic',
        feature_column_name=feature,
        plot_reference=True
    )
    figure.show()

# Rank features based on number of alerts
ranker = nml.Ranker.by('alert_count')
ranked_features = ranker.rank(univariate_results, only_drifting = False)
display(ranked_features)

calc = nml.StatisticalOutputDriftCalculator(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    timestamp_column_name='timestamp',
    # problem_type='classification_binary'
)
calc.fit(reference)
results = calc.calculate(analysis)

figure = results.plot(kind='prediction_drift', plot_reference=True)
figure.show()

# Let's initialize the object that will perform Data Reconstruction with PCA
rcerror_calculator = nml.DataReconstructionDriftCalculator(feature_column_names=feature_column_names, timestamp_column_name='timestamp', chunk_size=chunk_size).fit(reference_data=reference)
# let's see Reconstruction error statistics for all available data
rcerror_results = rcerror_calculator.calculate(analysis)
figure = rcerror_results.plot(kind='drift', plot_reference=True)
figure.show()

the issue is not reproducible on my end with numpy 1.22.4 and pandas 1.4.3.

stale · 2022-10-31T13:43:09Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

nikml added the bug (Something isn't working) and triage (Needs to be assessed) labels Jul 12, 2022

nnansters pushed a commit that referenced this issue Aug 25, 2022

Updated dependencies + explicitly added Numpy (#93)

6fef598

nnansters removed the triage (Needs to be assessed) label Aug 26, 2022

stale bot added the stale label Oct 31, 2022

stale bot closed this as completed Nov 7, 2022
