Regression with v0.5.13 introducing StringArray #172

Closed
he-rvb opened this issue Apr 25, 2023 · 10 comments
Labels
bug Something isn't working

Comments


he-rvb commented Apr 25, 2023

Describe the bug

Starting with 0.5.13, query_df returns string columns as pandas StringArray, which is still experimental and not well supported.
As a result, exporting such a pandas dataframe with to_hdf leads to the following error:
TypeError: objects of type ``StringArray`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or bytes

Steps to reproduce

  1. Install clickhouse-connect 0.5.13 or a more recent version
  2. Execute code example below

Code example

import clickhouse_connect

with clickhouse_connect.get_client(
    host="play.clickhouse.com", port=443, username="play"
) as client:
    df = client.query_df(query="SELECT 'TEST' as test")

print(df.dtypes)

df.to_hdf("./test.hdf", "df")

Expected behaviour

The export should not raise an exception, and df.dtypes should probably return

test    object
dtype: object

instead of

test    string
dtype: object

Configuration

Environment

  • Python version: 3.10.10
  • clickhouse-connect version: 0.5.13
  • pandas version: 1.5.3
  • tables version: 3.8.0
  • Operating system: Linux
@he-rvb he-rvb added the bug Something isn't working label Apr 25, 2023
Collaborator

genzgd commented Apr 25, 2023

It seems like this should perhaps be an option. Also, is there an easy workaround, such as updating the dtype before calling to_hdf?

Collaborator

genzgd commented Apr 25, 2023

There's also a query option designed for disabling "advanced" pandas types which solves the problem. Please try:

df = client.query_df(query="SELECT 'TEST' as test", use_na_values=False)

@genzgd genzgd added enhancement New feature or request and removed bug Something isn't working labels Apr 25, 2023
Author

he-rvb commented Apr 25, 2023

Yes, changing the types of some columns before calling to_hdf was my first thought, but I felt it was important to let you know in case other users are impacted by this change.
Thanks for the quick answer and for the workaround; using use_na_values=False seems to solve the issue in a cleaner way.
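
For reference, the manual workaround discussed above (changing column types before calling to_hdf) could look roughly like the sketch below. It uses only pandas; the helper name strings_to_object is hypothetical, and the sample frame is built by hand to stand in for the result of query_df, so it does not need a ClickHouse connection.

```python
import pandas as pd


def strings_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with pandas "string" columns cast to object.

    Hypothetical helper for illustration; it is not part of clickhouse-connect.
    """
    string_cols = [
        col for col in df.columns if isinstance(df[col].dtype, pd.StringDtype)
    ]
    # astype with a column -> dtype mapping leaves all other columns untouched.
    return df.astype({col: object for col in string_cols})


# Hand-built stand-in for the frame returned by query_df in the report.
df = pd.DataFrame({"test": pd.array(["TEST"], dtype="string")})
converted = strings_to_object(df)
print(converted.dtypes)
```

The converted frame can then be passed to to_hdf as before. Only string columns are handled here; other extension dtypes would need their own cast.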

Collaborator

genzgd commented Apr 25, 2023

Glad to hear it. I think that option should reduce the dtypes used to the basic numpy types (plus pandas Timestamp), so it probably should have been named "use_advanced_dtypes" or something along those lines.

Author

he-rvb commented Apr 25, 2023

I agree that an option to disable or enable experimental dtypes might be useful.
However, that is not exactly what use_na_values does. For example, it is still possible to get a similar error for the experimental IntegerArray, even with use_na_values=False, using the following example:

import clickhouse_connect

with clickhouse_connect.get_client(
    host="play.clickhouse.com", port=443, username="play"
) as client:
    df = client.query_df(query="SELECT 1 as test UNION ALL SELECT NULL", use_na_values=False)

print(df.dtypes)

df.to_hdf("./test.hdf", "df")
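
A manual cast is also possible for this case. The sketch below, again pandas-only and with a hypothetical helper name, converts nullable integer columns to float64 so that missing values become NaN. It assumes the integer values are small enough to be represented exactly as floats; the sample frame is a hand-built stand-in for the query result.

```python
import numpy as np
import pandas as pd


def nullable_ints_to_float(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with nullable integer columns cast to float64.

    Hypothetical helper for illustration; pd.NA values become NaN.
    """
    int_cols = [
        col
        for col in df.columns
        if pd.api.types.is_extension_array_dtype(df[col].dtype)
        and pd.api.types.is_integer_dtype(df[col].dtype)
    ]
    return df.astype({col: "float64" for col in int_cols})


# Hand-built stand-in for "SELECT 1 as test UNION ALL SELECT NULL".
df = pd.DataFrame({"test": pd.array([1, None], dtype="Int64")})
converted = nullable_ints_to_float(df)
print(converted.dtypes)
```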

Collaborator

genzgd commented Apr 25, 2023

I'll take a look at that. It's probably fairly easy to make the same option return an object array for numeric columns that contain NULLs.

@genzgd genzgd added bug Something isn't working and removed enhancement New feature or request labels Apr 25, 2023
Collaborator

genzgd commented Apr 25, 2023

Yes, it's an easy change and I think it's more consistent to avoid all non-numpy dtypes if that flag is set. It will be fixed in the next release (tentatively scheduled for next week.)

Collaborator

genzgd commented Apr 26, 2023

Renamed the flag in the new release 0.5.21 to use_extended_dtypes. Setting this to False on query_df should work to return "basic" dataframes.

@genzgd genzgd closed this as completed Apr 26, 2023
Author

he-rvb commented Apr 28, 2023

I tried this version, and calling query_df with use_extended_dtypes=False does indeed seem to work as expected.
Thank you.
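
To summarize the resolution, a small wrapper along the following lines could be used to always request "basic" dataframes. The function name query_basic_df is hypothetical; the only library call it relies on is query_df with the use_extended_dtypes flag named in this thread, and client is assumed to be a client obtained from clickhouse_connect.get_client as in the examples above.

```python
def query_basic_df(client, query):
    """Run query and request a dataframe without pandas extension dtypes.

    Hypothetical wrapper; `client` is assumed to be a clickhouse-connect
    client on a release that supports use_extended_dtypes.
    """
    return client.query_df(query=query, use_extended_dtypes=False)
```

With the client from the earlier examples, df = query_basic_df(client, "SELECT 1 as test UNION ALL SELECT NULL") followed by df.to_hdf("./test.hdf", "df") corresponds to the case reported as working here.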

Collaborator

genzgd commented Apr 28, 2023

Thanks for testing and for reporting the result. Feedback is always much appreciated!
