Query with column_formats set on FixedString columns should convert to pandas.StringDtype #356

Closed
0liu opened this issue Jun 4, 2024 · 5 comments · Fixed by #357
Labels
bug Something isn't working

0liu commented Jun 4, 2024

Describe the bug

When a table has FixedString columns, the query_df method accepts a read format (the column_formats parameter) so that those columns are returned as strings in the resulting DataFrame. However, unlike String columns, which are returned with the pandas StringDtype, FixedString columns are returned with object dtype. As a result, the string columns in the returned DataFrame have inconsistent dtypes and behave differently.

Reference: https://pandas.pydata.org/docs/user_guide/text.html
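To illustrate why the dtype mismatch matters, here is a minimal pandas-only sketch of two behaviors that differ between object and StringDtype columns. It does not touch ClickHouse; the series are hand-built stand-ins for the query result, and the behavior described assumes pandas 2.x.

```python
import pandas as pd

# Stand-ins for the two kinds of columns in the report:
# object dtype (how the FixedString column came back) and
# StringDtype (how the String columns came back).
as_object = pd.Series(["M", None], dtype=object)
as_string = pd.Series(["M", None], dtype="string")

# Missing values are represented differently.
missing_object = as_object[1]  # a plain Python None
missing_string = as_string[1]  # pd.NA

# Results of the .str accessor also come back with different dtypes.
len_object = pd.Series(["M", "F"], dtype=object).str.len()
len_string = pd.Series(["M", "F"], dtype="string").str.len()
print(len_object.dtype, len_string.dtype)
```

Code that filters on pd.NA or relies on nullable integer results will therefore treat the two kinds of columns differently.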

Steps to reproduce

  1. Create a test table:
CREATE TABLE test_string
(
    `id` UInt32,
    `name` String,
    `sex` FixedString(1),
    `city` String
)
ENGINE = Memory
  2. Insert test values:
INSERT INTO test_string VALUES (1, 'Jacob', 'M', 'Chicago'), (2, 'Annie', 'F', 'Los Angeles');
  3. Query from client and check dtypes:
df = client.query_df("SELECT * FROM test_string", column_formats={"sex": 'string'})
In [8]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      2 non-null      uint32
 1   name    2 non-null      string
 2   sex     2 non-null      object
 3   city    2 non-null      string
dtypes: object(1), string(2), uint32(1)
memory usage: 188.0+ bytes

In [10]: df.name.dtype
Out[10]: string[python]

In [11]: df.sex.dtype
Out[11]: dtype('O')

In [12]: df.city.dtype
Out[12]: string[python]

Expected behaviour

As shown above, the returned String columns name and city have the pandas StringDtype, while the FixedString column sex has object dtype. The sex column is expected to be converted to pandas StringDtype as well.
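Until the driver returns a consistent dtype, one possible client-side workaround is to cast the affected columns after the query. The sketch below is pandas-only: to_string_dtype is a hypothetical helper name, and the DataFrame is a hand-built stand-in for the query_df result, assuming the sex values are already Python str (as they are with the 'string' read format).

```python
import pandas as pd


def to_string_dtype(df, columns):
    """Return a copy of df with the given columns cast to pandas StringDtype.

    Hypothetical helper for illustration; not part of clickhouse-connect.
    """
    return df.astype({col: "string" for col in columns})


# Stand-in for the DataFrame returned by query_df in the report above.
df = pd.DataFrame({
    "id": pd.Series([1, 2], dtype="uint32"),
    "name": pd.Series(["Jacob", "Annie"], dtype="string"),
    "sex": pd.Series(["M", "F"], dtype=object),
    "city": pd.Series(["Chicago", "Los Angeles"], dtype="string"),
})

fixed = to_string_dtype(df, ["sex"])
print(fixed.dtypes)
```

The cast returns a new DataFrame and leaves the original untouched, so it can be applied right after the query without side effects.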

Configuration

Environment

  • clickhouse-connect version: 0.7.11
  • Python version: 3.12.3
  • Operating system: Linux 6.8.11-1-MANJARO

ClickHouse server

  • ClickHouse Server version: 24.4 official docker image
@0liu 0liu added the bug Something isn't working label Jun 4, 2024
Collaborator

genzgd commented Jun 4, 2024

Have you tried setting the read format for the column to string? https://clickhouse.com/docs/en/integrations/python#read-formats

I'm not sure that will work without a code change but it's worth testing.

Author

0liu commented Jun 4, 2024

Have you tried setting the read format for the column to string?

The example in the description already sets a read format through the column_formats={"sex": 'string'} parameter. If you mean setting it globally, like

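import clickhouse_connect
from clickhouse_connect.datatypes.format import set_read_format

# Return every FixedString column as a Python string
set_read_format('FixedString', 'string')

# Connection parameters are omitted here; the defaults are used
client = clickhouse_connect.get_client()
df = client.query_df("select * from test_string")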

this gives the same object type for the FixedString column.

Collaborator

genzgd commented Jun 4, 2024

Ah, I missed that, the column level should be good enough. I'll try to push a fix in the next day or two.

Collaborator

genzgd commented Jun 4, 2024

I actually can't reproduce this. What Pandas version are you using?

Collaborator

genzgd commented Jun 4, 2024

This should be fixed in 0.7.12 if you want to validate. I did manage to recreate the issue (I think), although pandas was reporting weird data types in my tests.
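For anyone validating the fix, a small check along these lines may help. It is only a sketch: columns_not_string_dtype is a hypothetical helper, and the two DataFrames are hand-built stand-ins for the query result before and after the fix, not actual driver output.

```python
import pandas as pd


def columns_not_string_dtype(df, expected_string_columns):
    """List the columns that were expected to be StringDtype but are not.

    Hypothetical helper for illustration; not part of clickhouse-connect.
    """
    return [
        col for col in expected_string_columns
        if not isinstance(df[col].dtype, pd.StringDtype)
    ]


# Stand-in for the result reported in this issue (sex comes back as object).
before = pd.DataFrame({
    "name": pd.Series(["Jacob", "Annie"], dtype="string"),
    "sex": pd.Series(["M", "F"], dtype=object),
    "city": pd.Series(["Chicago", "Los Angeles"], dtype="string"),
})

# Stand-in for the expected result (all three columns are StringDtype).
after = before.astype({"sex": "string"})

print(columns_not_string_dtype(before, ["name", "sex", "city"]))
print(columns_not_string_dtype(after, ["name", "sex", "city"]))
```

Running the same helper on the DataFrame returned by query_df with column_formats={"sex": 'string'} should give an empty list once the fix is in place.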
