Pandas data type 'string' not understood #93

Closed
nikml opened this issue Jul 12, 2022 · 6 comments
Labels
bug Something isn't working stale

Comments

@nikml
Contributor

nikml commented Jul 12, 2022

Describe the bug
Running the Quickstart results in an error

To Reproduce
Steps to reproduce the behavior:
Running:

import pandas as pd
import nannyml as nml
from IPython.display import display

# Load synthetic data
reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
display(reference.head())
display(analysis.head())

# Choose a chunker or set a chunk size
chunk_size = 5000

# initialize, specify required data columns, fit estimator and estimate
estimator = nml.CBPE(
   y_pred_proba='y_pred_proba',
   y_pred='y_pred',
   y_true='work_home_actual',
   timestamp_column_name='timestamp',
   metrics=['roc_auc'],
   chunk_size=chunk_size,
)
estimator = estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Show results
figure = estimated_performance.plot(kind='performance', metric='roc_auc', plot_reference=True)
figure.show()

# Define feature columns
feature_column_names = [
    col for col in reference.columns if col not in [
        'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
    ]]

# Let's initialize the object that will perform the Univariate Drift calculations
univariate_calculator = nml.UnivariateStatisticalDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
)
univariate_calculator = univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)
# Plot drift results for all model inputs
for feature in univariate_calculator.feature_column_names:
    figure = univariate_results.plot(
        kind='feature_drift',
        metric='statistic',
        feature_column_name=feature,
        plot_reference=True
    )
    figure.show()

# Rank features based on number of alerts
ranker = nml.Ranker.by('alert_count')
ranked_features = ranker.rank(univariate_results, only_drifting = False)
display(ranked_features)

calc = nml.StatisticalOutputDriftCalculator(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    timestamp_column_name='timestamp'
)
calc.fit(reference)
results = calc.calculate(analysis)

figure = results.plot(kind='prediction_drift', plot_reference=True)
figure.show()

# Let's initialize the object that will perform Data Reconstruction with PCA
rcerror_calculator = nml.DataReconstructionDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
).fit(reference_data=reference)
# let's see Reconstruction error statistics for all available data
rcerror_results = rcerror_calculator.calculate(analysis)
figure = rcerror_results.plot(kind='drift', plot_reference=True)
figure.show()

Gives the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\anaconda3\lib\site-packages\nannyml\base.py in fit(self, reference_data, *args, **kwargs)
     94             self._logger.debug(f"fitting {str(self)}")
---> 95             return self._fit(reference_data, *args, **kwargs)
     96         except InvalidArgumentsException:

~\anaconda3\lib\site-packages\nannyml\drift\model_inputs\univariate\statistical\calculator.py in _fit(self, reference_data, *args, **kwargs)
    105         self.previous_reference_data = reference_data.copy()
--> 106         self.previous_reference_results = self._calculate(self.previous_reference_data).data
    107 

~\anaconda3\lib\site-packages\nannyml\drift\model_inputs\univariate\statistical\calculator.py in _calculate(self, data, *args, **kwargs)
    116 
--> 117         self.continuous_column_names, self.categorical_column_names = _split_features_by_type(
    118             data, self.feature_column_names

~\anaconda3\lib\site-packages\nannyml\base.py in _split_features_by_type(data, feature_column_names)
    229 
--> 230     categorical_column_names = [col for col in feature_column_names if _column_is_categorical(data[col])]
    231 

~\anaconda3\lib\site-packages\nannyml\base.py in <listcomp>(.0)
    229 
--> 230     categorical_column_names = [col for col in feature_column_names if _column_is_categorical(data[col])]
    231 

~\anaconda3\lib\site-packages\nannyml\base.py in _column_is_categorical(column)
    235 def _column_is_categorical(column: pd.Series) -> bool:
--> 236     return column.dtype in ['object', 'string', 'category', 'bool']
    237 

TypeError: data type 'string' not understood

During handling of the above exception, another exception occurred:

CalculatorException                       Traceback (most recent call last)
<ipython-input-1-9ae82d7fa4d4> in <module>
     39     chunk_size=chunk_size
     40 )
---> 41 univariate_calculator = univariate_calculator.fit(reference)
     42 univariate_results = univariate_calculator.calculate(analysis)
     43 # Plot drift results for all model inputs

~\anaconda3\lib\site-packages\nannyml\base.py in fit(self, reference_data, *args, **kwargs)
     99             raise
    100         except Exception as exc:
--> 101             raise CalculatorException(f"failed while fitting {str(self)}.\n{exc}")
    102 
    103     def calculate(self, data: pd.DataFrame, *args, **kwargs) -> AbstractCalculatorResult:

CalculatorException: failed while fitting <nannyml.drift.model_inputs.univariate.statistical.calculator.UnivariateStatisticalDriftCalculator object at 0x0000022BBF196A30>.
data type 'string' not understood
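
The failing frame can be isolated from the rest of the quickstart. Below is a minimal sketch, assuming only that numpy and pandas are installed. `try_membership` is a hypothetical helper, not part of nannyml. It runs the same `dtype in [...]` test that appears in `_column_is_categorical` and returns the error text instead of raising, so the result can be compared across numpy versions.

```python
import numpy as np
import pandas as pd

# Same list of names as in the traceback above.
CATEGORICAL_NAMES = ['object', 'string', 'category', 'bool']


def try_membership(column: pd.Series):
    """Run the membership test from the traceback; return its result or the error text."""
    try:
        return column.dtype in CATEGORICAL_NAMES
    except TypeError as exc:
        return f"TypeError: {exc}"


numeric = pd.Series([1.0, 2.0])               # numpy float64 dtype
text = pd.Series(["a", "b"], dtype=object)    # numpy object dtype

print("numpy", np.__version__)
print("numeric ->", try_membership(numeric))
print("text    ->", try_membership(text))
```

The object column matches on the first list entry, so the comparison against `'string'` is only reached for columns with another numpy dtype, such as the numeric one.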

Expected behavior
The quickstart code runs without a problem.

Additional context

The user who had that issue was running Python 3.8 on Windows through a PyCharm environment.

I couldn't reproduce the error when I tried on my machine. Moreover, when I guided the user to set up a new conda environment, the error went away.

However, maybe the way the string type is defined here could be changed, similar to suggestions such as these, to cover more cases? I'd hold off on that until we see more users having the issue, since in this case a misconfigured environment is more likely the problem than a library compatibility issue.
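
One possible shape for such a change, as a sketch only: `column_is_categorical` below is a hypothetical stand-in, not the nannyml implementation. It assumes pandas >= 1.0 (for `pd.StringDtype`) and uses the `pandas.api.types` helpers plus `isinstance` checks, so no dtype is ever compared against a string name.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_object_dtype


def column_is_categorical(column: pd.Series) -> bool:
    """Hypothetical check that avoids comparing a dtype against string names."""
    dtype = column.dtype
    return (
        is_object_dtype(dtype)
        or is_bool_dtype(dtype)
        or isinstance(dtype, (pd.CategoricalDtype, pd.StringDtype))
    )


# Small illustrative frame; the column names are made up for this example.
frame = pd.DataFrame({
    "distance": [1.5, 2.0, 3.2],
    "count": [1, 2, 3],
    "label": pd.Series(["a", "b", "a"], dtype=object),
    "text": pd.Series(["x", "y", "z"], dtype="string"),
    "cat": pd.Series(["p", "q", "p"], dtype="category"),
    "flag": [True, False, True],
})

categorical = [col for col in frame.columns if column_is_categorical(frame[col])]
print(categorical)
```

Whether this matches every case the original list was meant to cover is untested against nannyml itself.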

@nikml nikml added bug Something isn't working triage Needs to be assessed labels Jul 12, 2022
@bernhardbarker

bernhardbarker commented Jul 29, 2022

Looks like the required numpy version (>=1.14.0) might be wrong. It seems like it should be closer to >=1.21.0 (at least based on the current nannyml code).

I reproduced it as follows:

conda create --name nannyml python=3.8
conda activate nannyml
pip install numpy==1.19.5
pip install pandas==1.4.3
pip install nannyml==0.5.0
pip install IPython==8.4.0
# quickstart.py contains the code above
python quickstart.py  # data type 'string' not understood

These libraries all seem to be compatible with one another (otherwise pip would show a warning or automatically update libraries to compatible versions, neither of which happened).

I fixed it by updating numpy:

pip install numpy==1.21.0
python quickstart.py  # works
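
Until the requirement is corrected, a script could guard against an old numpy at startup. A rough sketch, with hypothetical helper names: the `1.21.0` threshold is only the version that worked in the test above, not a verified minimum.

```python
import re


def version_tuple(version: str) -> tuple:
    """Leading numeric part of a version string, padded to three components."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.21.0") -> bool:
    """True if `installed` is at least `minimum`; pre-release suffixes are ignored."""
    return version_tuple(installed) >= version_tuple(minimum)


# Versions mentioned in this thread.
for candidate in ("1.19.5", "1.21.0", "1.22.4"):
    print(candidate, meets_minimum(candidate))
```

In a real script the argument would be `numpy.__version__`, with a warning or an early exit when the check fails.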

I can also fix the issue by just explicitly converting each column to string:

...
display(ranked_features)

for c in reference.columns:
    reference[c] = reference[c].astype("string")

for c in analysis.columns:
    analysis[c] = analysis[c].astype("string")

calc = nml.StatisticalOutputDriftCalculator(
...

I'm running Ubuntu 22, in case it matters.

@gabrieltardochi

Commenting to keep the issue alive, since I just faced the same problem while running the Univariate Drift Detection tutorial. Manually updating numpy worked for me as well.

@nnansters
Contributor

Thanks for the extra information all. We'll try to include a fix in the next release, coming next week!

@nnansters
Contributor

Sorry for missing this Bernhard, I was out on holiday. Thanks for the report and the investigation!

@nnansters nnansters removed the triage Needs to be assessed label Aug 26, 2022
@nikml
Contributor Author

nikml commented Sep 1, 2022

Nice dig down!

On the latest version, 0.5.3, with this code sample:

import pandas as pd
import nannyml as nml
from IPython.display import display

# Load synthetic data
reference, analysis, analysis_target = nml.load_synthetic_binary_classification_dataset()
display(reference.head())
display(analysis.head())

# Choose a chunker or set a chunk size
chunk_size = 5000

# initialize, specify required data columns, fit estimator and estimate
estimator = nml.CBPE(
   y_pred_proba='y_pred_proba',
   y_pred='y_pred',
   y_true='work_home_actual',
   timestamp_column_name='timestamp',
   metrics=['roc_auc'],
   chunk_size=chunk_size,
   # problem_type='classification_binary',
)
estimator = estimator.fit(reference)
estimated_performance = estimator.estimate(analysis)

# Show results
figure = estimated_performance.plot(kind='performance', metric='roc_auc', plot_reference=True)
figure.show()

# Define feature columns
feature_column_names = [
    col for col in reference.columns if col not in [
        'timestamp', 'y_pred_proba', 'period', 'y_pred', 'work_home_actual', 'identifier'
    ]]

# Let's initialize the object that will perform the Univariate Drift calculations
univariate_calculator = nml.UnivariateStatisticalDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
)
univariate_calculator = univariate_calculator.fit(reference)
univariate_results = univariate_calculator.calculate(analysis)
# Plot drift results for all model inputs
for feature in univariate_calculator.feature_column_names:
    figure = univariate_results.plot(
        kind='feature_drift',
        metric='statistic',
        feature_column_name=feature,
        plot_reference=True
    )
    figure.show()

# Rank features based on number of alerts
ranker = nml.Ranker.by('alert_count')
ranked_features = ranker.rank(univariate_results, only_drifting = False)
display(ranked_features)

calc = nml.StatisticalOutputDriftCalculator(
    y_pred='y_pred',
    y_pred_proba='y_pred_proba',
    timestamp_column_name='timestamp',
    # problem_type='classification_binary'
)
calc.fit(reference)
results = calc.calculate(analysis)

figure = results.plot(kind='prediction_drift', plot_reference=True)
figure.show()

# Let's initialize the object that will perform Data Reconstruction with PCA
rcerror_calculator = nml.DataReconstructionDriftCalculator(
    feature_column_names=feature_column_names,
    timestamp_column_name='timestamp',
    chunk_size=chunk_size
).fit(reference_data=reference)
# let's see Reconstruction error statistics for all available data
rcerror_results = rcerror_calculator.calculate(analysis)
figure = rcerror_results.plot(kind='drift', plot_reference=True)
figure.show()

the issue is not reproducible on my end with numpy 1.22.4 and pandas 1.4.3.

@stale

stale bot commented Oct 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Oct 31, 2022
@stale stale bot closed this as completed Nov 7, 2022