Column dtypes are converted to object if query is unlucky #132

Closed
georgipeev opened this issue Feb 15, 2023 · 18 comments · Fixed by #137
Labels
bug Something isn't working

@georgipeev

If the query_df method receives a packet of table data where a column has only NULL values, it will set that column to None in the partial DataFrame that it later concatenates. The result is that the DataFrame returned by the method will have a dtype of object and contain both None values and pd.NaT/pd.NA/np.NaN values, as corresponding to the type of the column.
If none of the parts read by query_df happens to contain all NULL values for that column, the resulting DataFrame will have the correct and expected dtype.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

Do you have any idea in the code where "it will set that column to None" happens? It seems like it might be an issue with pd.DataFrame.from_dict. Is the underlying column a numpy datetime type, a float, or something else?

genzgd added the bug label Feb 16, 2023
@georgipeev
Author

This happens with datetimes, ints, and floats. The dtype of the resulting DataFrame column will be:

  1. If no NULLs are selected: np.datetime64/int/float
  2. If some NULLs are selected and the query happens to not receive a single piece where all values are NULLs: np.datetime64/float (containing np.NaN instead of pd.NA)/float
  3. If some NULLs are selected and the query happens to receive a piece where all values are NULLs: object (column values contain a mixture of None and pd.NaT/np.NaN)
  4. If only NULLs are selected (corollary of 3): object, all values set to None

I realize you probably won't be willing to do anything about case 4, but cases 2 and 3 are very confusing, because the ultimate dtype of the column depends purely on luck - how the query gets chunked and whether any of the pieces that query_df will pd.concat happens to contain all NULLs or not. This kind of non-determinism is less than ideal.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

I don't fully understand why this is happening at a low level (especially for floats), because I would think that the Pandas dataframe would still use the numpy dtype for the source column, but that is apparently not true.

I think it might be fixable by explicitly setting the dtypes when calling pd.DataFrame.from_dict, but that will require some additional tests to make sure it doesn't break anything. There's nothing explicit in the actual clickhouse-connect code which treats NaN types differently (or sets them to None) if the whole column in the block contains nothing but those values, so it's numpy or Pandas that's doing it somewhere.
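A sketch of that idea, assuming the target dtype can be derived from the ClickHouse column type (the mapping step here is hypothetical, not actual clickhouse-connect code):

```python
import pandas as pd

# An all-NULL block for a DateTime column arrives as a dict of Nones.
data = {"ts": [None, None]}

# Without a hint, pandas infers object:
naive = pd.DataFrame.from_dict(data)
print(naive["ts"].dtype)  # object

# Coercing with the dtype derived from the ClickHouse type keeps it stable;
# the None values become pd.NaT:
fixed = naive.astype({"ts": "datetime64[ns]"})
print(fixed["ts"].dtype)  # datetime64[ns]
```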

@georgipeev
Author

I'm not exactly sure how it happens because I haven't tried debugging clickhouse_connect's innards, but I know for a fact that the dtype of a datetime/int/float column will always be object and contain None values if only NULLs are selected, and that it will sometimes, depending on the query, be object with mixed Nones and pd.NaT/pd.NaN if many NULLs are selected. I can and will code around this non-determinism, but it was quite hard to figure out and reproduce, so I thought you might not want other users going through the same painful process.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

I agree that the behavior is bad (especially the non-deterministic way that it happens based on incoming blocks) and I expect to be able to fix it. I just want to clarify something that can help me debug -- are you setting use_none=True when calling query_df? I can see some potential issues with that code path.

@georgipeev
Author

I do set use_none=True when calling query_df.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

Okay, I think I understand what's happening then. Pandas can't infer the datatype from a column that's completely None, so I'll have to preserve it based on the ClickHouse datatype returned by the query (which is what the numpy code does).

I don't claim to be a Pandas expert, but in numpy at least you can't have Nones in any non-object column. Are Pandas dtypes different? Are Nones allowed in columns with a dtype other than object?

@georgipeev
Author

I don't think pandas allows None values in columns of dtype other than object, and I've noticed it eagerly converts the dtype to object whenever it can't easily maintain it as int/float/datetime.
I think it'd be entirely reasonable for the column dtype to be object if only NULL values were selected from the database (as opposed to doing something much more involved to make it match the ClickHouse column definition). But if the column contains at least one non-NULL value, it should be possible to get pandas to infer the correct dtype, even if multiple pieces are being concatenated together, some of which may have only None values in that column (and, therefore, dtype object).
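This matches observable pandas behavior, which can be checked directly:

```python
import pandas as pd

# With at least one real value, pandas coerces None to NaN and stays numeric:
s = pd.Series([1.0, None])
print(s.dtype)            # float64
print(s.isna().tolist())  # [False, True]

# With only Nones there is nothing to infer from, so it falls back to object:
print(pd.Series([None, None]).dtype)  # object
```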

@genzgd
Collaborator

genzgd commented Feb 16, 2023

I guess I'm confused (and not familiar with low level pandas types) -- if you set use_none=True for the query_df (meaning ClickHouse NULLs are converted to None), wouldn't any column with any None/NULL be a dtype object column? Or are some dtypes okay with pd.NaT without being an object dtype? (I know float is weird, but would you expect a ClickHouse NULL to be converted to a pd.NaN in that case?)

@georgipeev
Author

Getting a Clickhouse NULL as np.nan makes perfect sense to me for a Float column. However, clickhouse_connect will also use np.nan for NULLs in Int columns, which causes pandas to set the dtype of columns that correspond to Nullable(Int) to float. I think it would be preferable to get such columns as dtype pd.Int64Dtype(), which has its own not-a-value, pd.NA. That, however, is a separate issue, albeit related.
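The difference is easy to see in plain pandas (illustrative only):

```python
import pandas as pd

# numpy-backed ints cannot represent a missing value, so pandas upcasts:
print(pd.Series([1, 2, None]).dtype)  # float64 -- the NULL became NaN

# The nullable extension dtype keeps integers and uses pd.NA for missing:
s = pd.Series([1, 2, None], dtype="Int64")
print(s.dtype)        # Int64
print(s[2] is pd.NA)  # True
```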

I think the issue at hand should be reproducible by creating a table with a Nullable(Float64) column and inserting a lot of NULL values and one non-NULL value into it.
Querying only NULLs from that table will show how query_df ends up constructing a DataFrame with a column dtype of object which contains only None values.
Querying the non-NULL and only a few NULLs (i.e., one piece received) will probably cause query_df to return a DataFrame with a numpy.float64 dtype and np.nan values for the NULLs.
Querying the non-NULL and a lot of NULLs (i.e., multiple pieces received and concatenated) will probably cause query_df to return a DataFrame with an object dtype and both None and np.nan values for the NULLs.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

Got it. It sounds like we need a fix for float columns when use_none=True, where we would expect all ClickHouse NULLs to be np.nan, and if possible a fix for int columns that would convert the pandas dtype to an Int*Dtype() with pd.NA for ClickHouse NULLs. (I'm thinking the second might be tricky depending on the actual Pandas version installed.)

Should the driver handle the case where the user actually wants an object column for floats where ClickHouse has a NULL value, or do you think it's safe to always convert NULLs to np.nan?

@georgipeev
Author

georgipeev commented Feb 16, 2023

I think adding logic to make sure nullable int columns are returned as pd.Int64Dtype() with pd.NA values instead of the current np.float64 with np.nan values would be beneficial, but is not imperative.

I don't really understand your second paragraph - why would the user ever want an object column for what is a Float in Clickhouse?
The problem is that such Floats can currently be returned as either np.float64 (if none of the pieces constructed from http responses end up with only NULL values), or as object (if at least one such piece ends up with only NULL values). A further complication is that in the latter case the column will contain None values, possibly mixed with np.nans.
This is an issue mostly because it's non-deterministic. I don't think it ultimately matters if the implementation does either of these:

  1. If the Clickhouse column type is Float, the resulting DataFrame column dtype is always np.float64 with np.nans
  2. If the select statement returns only NULL values, the DataFrame column will contain None values and be of dtype object. Otherwise it will be of dtype np.float64 and contain np.nan values where there was a NULL in Clickhouse.
  3. If the select statement returns any NULL value, the DataFrame column type will be object and it will contain regular floats and None values. Otherwise it will be of dtype np.float64 and contain only regular floats.

Option 1 is the cleanest, but also hardest to implement. Thus, although I prefer it, I don't advocate it.
Option 3 would be a bit annoying to users.
However, all 3 options above are self-consistent and deterministic, so users can reliably code to a query_df method that implements any of them. Which one you choose is entirely up to you, I'm only saying that as a user I'd like to have deterministic and properly documented behavior - even if it is not ideal, it should at least be workable. The current behavior is not very workable.

@genzgd
Collaborator

genzgd commented Feb 16, 2023

I think adding logic to make sure nullable int columns are returned as pd.Int64Dtype() with pd.NA values instead of the current np.float64 with np.nan values would be beneficial, but is not imperative.

I'll see what version of Pandas has that support and if it's recent enough, I think this would be good.

I don't really understand your second paragraph - why would the user ever want an object column for what is a Float in Clickhouse?

If this is correct, why make such a column Nullable in the first place? You could get the same result by just using a regular Float32 or Float64 ClickHouse column. ClickHouse can store a float NaN in such columns.

CREATE TABLE test_floats
(
    `f1` Float64
)
ENGINE = MergeTree
ORDER BY f1

INSERT INTO test_floats VALUES (NaN)

SELECT * FROM test_floats

┌──f1─┐
│ nan │
└─────┘

My assumption is that if you are making such a column Nullable, you want to distinguish somehow between NULL and NaN for some reason.

  1. If the Clickhouse column type is Float, the resulting DataFrame column dtype is always np.float64 with np.nans

Again this is easy to achieve just by using a regular Float32 or Float64 column in ClickHouse and not adding the Nullable wrapper. I can make clickhouse-connect do the same thing, but then someone who actually wants to distinguish between NULL and NaN can't do so.

  2. If the select statement returns only NULL values, the DataFrame column will contain None values and be of dtype object. Otherwise it will be of dtype np.float64 and contain np.nan values where there was a NULL in Clickhouse.
  3. If the select statement returns any NULL value, the DataFrame column type will be object and it will contain regular floats and None values. Otherwise it will be of dtype np.float64 and contain only regular floats.

I'm assuming that if someone actually creates a Nullable Float* column (and they set use_none=True), they care about the difference between NULL and NaN, so they always want an object dtype if there are any Nones. So maybe extra logic like this becomes unnecessarily complicated?
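For reference, a float64 column cannot preserve the NULL/NaN distinction at all; only an object column can. A quick pandas check:

```python
import numpy as np
import pandas as pd

# In a float64 column, None is silently converted to NaN -- the
# NULL/NaN distinction is lost:
f = pd.Series([1.0, np.nan, None])
print(f.dtype)            # float64
print(f.isna().tolist())  # [False, True, True] -- NaN and None look identical

# Only an object column keeps them distinct:
o = pd.Series([1.0, np.nan, None], dtype=object)
print(o[1] is None, o[2] is None)  # False True
```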

One possible improvement for Nullable(Float*) columns is if they set use_none=False (the default for Pandas), clickhouse-connect uses np.nan instead of None for actual nulls in a nullable column. However, it might have to be more fine-grained if they want None for NULL strings and np.nan for NULL floats.

@georgipeev
Author

My assumption is that if you are making such a column Nullable, you want to distinguish somehow between NULL and NaN for some reason.

This sounds very reasonable. I think NaN values on the Python side get converted to NULL on the Clickhouse side when writing via the clickhouse-client (the standard Clickhouse CLI tool), which is how we end up using NULL values in Clickhouse to represent NaNs. We use the clickhouse-client because writing large amounts of data from Python is much faster when done by saving to a flat file and passing that file to the CLI tool in a separate process. I have a feeling that other methods of writing Pandas DataFrames to Clickhouse may also convert the NaNs to NULLs, although I haven't tested that.

@genzgd
Collaborator

genzgd commented Feb 22, 2023

I'm not surprised that using the clickhouse-client CLI from a separate process, even with the need to write to a separate file, is more efficient than writing directly from clickhouse-connect (although that should become less true as we improve the Python insert performance).

Do you know what format you use to create the flat file? Since you want NaN and not None/NULL when consuming the data, ideally it would be best (both simpler and more performant) to avoid creating a Nullable(Float) column for ingest as well, so there might be a cleaner way to do that.

In any case, I'm thinking of adding another flag, use_na_values_for_null, to the query (and insert) methods. If true, all ClickHouse NULLs would be set to np.nan for floats, pd.NaT for pandas Timestamps, and pd.NA (if available) for other applicable pandas types.
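A sketch of that mapping as described (the flag name and the type keys come from this discussion; this is not actual clickhouse-connect code):

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from ClickHouse type name to the pandas missing-value
# sentinel that would replace a NULL when use_na_values_for_null=True:
NULL_SENTINELS = {
    "Float32": np.nan,
    "Float64": np.nan,
    "DateTime": pd.NaT,
    "Int64": pd.NA,
    "String": pd.NA,
}

def null_for(ch_type: str):
    """Return the sentinel to substitute for a ClickHouse NULL, or None."""
    return NULL_SENTINELS.get(ch_type)

print(null_for("Float64"))   # nan
print(null_for("DateTime"))  # NaT
```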

@georgipeev
Author

georgipeev commented Feb 22, 2023

For writing to Clickhouse we use Arrow and CSV. CSV is slower, but Arrow has problems with columns that have only Nones. For reading we use ArrowStream and then convert the Arrow types to pandas types (e.g., strings are byte arrays in Arrow, dates are # of days since 1970/1/1, etc.). Even with the type conversion overhead, for large queries that's still faster than query_df.
I'll test what happens if I use a non-Nullable Float column and try to insert a NaN into it. If that works, I agree with you that it would be preferable and cleaner.
I'm not exactly sure what you propose for the semantics of the use_na_values_for_null, but if you come up with a clean definition I'd certainly be interested in using that functionality. Thanks!

@genzgd
Collaborator

genzgd commented Feb 22, 2023

How do you hook ArrowStream up to ClickHouse? I think some of the reason clickhouse-connect might be slower is the conversion of columns with Nones as we're seeing here. Also the ArrowStream format is all C optimized, and of course, Arrow is quite efficient anyway.

@georgipeev
Author

The standard CLI tool clickhouse-client supports the ArrowStream format, which can then be piped into pyarrow.ipc.open_stream(process.stdout).read_pandas().

genzgd self-assigned this Feb 23, 2023
genzgd linked a pull request Feb 26, 2023 that will close this issue