Column dtypes are converted to object if query is unlucky #132
Comments
Do you have any idea in the code where "it will set that column to None" happens? It seems like it might be an issue with …
This happens with datetimes, ints, and floats. The dtype of the resulting DataFrame column will be: …
I realize you probably won't be willing to do anything about case 4, but cases 2 and 3 are very confusing, because the ultimate dtype of the column depends purely on luck: how the query gets chunked, and whether any of the pieces happen to contain only `NULL` values for that column.
I don't fully understand why this is happening at a low level (especially for floats), because I would think that the Pandas DataFrame would still use the numpy dtype for the source column, but that is apparently not true. I think it might be fixable by explicitly setting the dtypes when calling …
I'm not exactly sure how it happens because I haven't tried debugging clickhouse_connect's innards, but I know for a fact that the dtype of a datetime/int/float column will always be `object` in that case.
I agree that the behavior is bad (especially the non-deterministic way that it happens based on incoming blocks) and I expect to be able to fix it. I just want to clarify something that can help me debug: are you setting …
I do set …
Okay, I think I understand what's happening then. Pandas can't infer the datatype from a column that's completely None, so I'll have to preserve it based on the ClickHouse datatype returned by the query (which is what the numpy code does). I don't claim to be a Pandas expert, but in numpy at least you can't have Nones in any non-object column. Are Pandas dtypes different? Are Nones allowed in columns with a dtype other than `object`?
I don't think pandas allows …
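For reference, the underlying pandas behavior (a general sketch, not specific to clickhouse-connect): numpy-backed columns represent missing values as NaN rather than None, and the only way to keep an integer identity alongside missing values is one of pandas' extension dtypes such as `Int64`:

```python
import pandas as pd

# numpy-backed float columns: None is converted to NaN at inference time.
s_float = pd.Series([1.5, None])
print(s_float.dtype)               # float64

# numpy int columns cannot hold a missing value, so pandas promotes to float.
s_int = pd.Series([1, None])
print(s_int.dtype)                 # float64

# pandas extension dtypes (note the capitalized "Int64") use pd.NA and keep
# their integer identity even with missing values.
s_nullable = pd.Series([1, None], dtype="Int64")
print(s_nullable.dtype)            # Int64
print(s_nullable.isna().tolist())  # [False, True]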
I guess I'm confused (and not familiar with low level pandas types) -- if you set …
Getting a ClickHouse … I think the issue at hand here should be reproducible by creating a table with a …
Got it. It sounds like we need a fix for float columns when … Should the driver handle the case where the user actually wants an …?
I think adding logic to make sure nullable int columns are returned as … I don't really understand your second paragraph: why would the user ever want an …?
Option 1 is the cleanest, but also hardest to implement. Thus, although I prefer it, I don't advocate it.
I'll see what version of Pandas added that support; if it's recent enough, I think this would be good.
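A sketch of what a fix along those lines could look like, assuming the driver knows the ClickHouse type of each column. The mapping and function names here are illustrative, not clickhouse-connect internals:

```python
import pandas as pd

# Illustrative mapping from ClickHouse column types to pandas dtypes; the
# real driver would derive this from the query's column metadata.
CH_TO_PANDAS = {
    "Nullable(Int64)": "Int64",              # pandas nullable integer
    "Nullable(Float64)": "Float64",          # pandas nullable float
    "Nullable(DateTime)": "datetime64[ns]",  # NaT for missing values
}

def coerce_all_null(col: pd.Series, ch_type: str) -> pd.Series:
    """Cast an all-NULL object column to the dtype implied by its ClickHouse type."""
    target = CH_TO_PANDAS.get(ch_type)
    if target is not None and col.dtype == object:
        return col.astype(target)
    return col

col = pd.Series([None, None])                    # dtype: object
fixed = coerce_all_null(col, "Nullable(Int64)")
print(fixed.dtype)                               # Int64
```

With this approach an all-`NULL` chunk keeps the same dtype as the other chunks, so concatenation no longer degrades the column.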
If this is correct, why make such a column Nullable in the first place? You could get the same result by just using a regular Float32 or Float64 ClickHouse column. ClickHouse can store a float NaN in such columns:

```sql
CREATE TABLE test_floats
(
    `f1` Float64
)
ENGINE = MergeTree
ORDER BY f1;

INSERT INTO test_floats VALUES (NaN);

SELECT * FROM test_floats;

┌──f1─┐
│ nan │
└─────┘
```

My assumption is that if you are making such a column Nullable, you want to distinguish somehow between NULL and NaN for some reason.
Again this is easy to achieve just by using a regular Float32 or Float64 column in ClickHouse and not adding the Nullable wrapper. I can make clickhouse-connect do the same thing, but then someone who actually wants to distinguish between NULL and NaN can't do so.
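The same point holds on the pandas/numpy side: a plain `float64` column stores NaN natively, mirroring a non-Nullable ClickHouse Float64 column. A small illustration:

```python
import numpy as np
import pandas as pd

# NaN fits in an ordinary float64 column; no object dtype (and no Nullable
# wrapper on the ClickHouse side) is needed to represent it.
s = pd.Series([np.nan, 1.0])
print(s.dtype)               # float64
print(np.isnan(s.iloc[0]))   # True
```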
I'm assuming that if someone actually creates a Nullable(Float*) column (and they set …), … One possible improvement for Nullable(Float*) columns is if they set …
This sounds very reasonable. I think …
I'm not surprised that using the clickhouse-client CLI from a separate process, even with the need to write to a separate file, is more efficient than writing directly from clickhouse-connect (although that should be less true as we improve the Python insert performance). Do you know what format you use to create the flat file? Since you want NaN and not None/NULL when consuming the data, ideally it would be best (both simpler and more performant) to avoid creating a Nullable(Float) column for ingest as well, so there might be a cleaner way to do that. In any case, I'm thinking of adding another flag to the query (and insert) methods of …
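As a rough illustration of what a "give me NaN, not None" preference could do for float columns, here is a post-processing sketch; the function name and flag semantics are hypothetical, not clickhouse-connect's actual API:

```python
import pandas as pd

# Hypothetical post-processing for a NaN-instead-of-None preference on
# float columns; a sketch only, not clickhouse-connect internals.
def prefer_nan(df: pd.DataFrame, float_columns: list) -> pd.DataFrame:
    for name in float_columns:
        # astype("float64") turns None into NaN and restores the numpy
        # dtype if the column had degraded to object.
        df[name] = df[name].astype("float64")
    return df

# Simulate a column that came back as object because of None values.
df = pd.DataFrame({"f1": pd.Series([1.0, None, 2.5], dtype=object)})
print(df["f1"].dtype)   # object
df = prefer_nan(df, ["f1"])
print(df["f1"].dtype)   # float64
```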
For writing to ClickHouse we use Arrow and CSV. CSV is slower, but Arrow has problems with columns that have only …
How do you hook ArrowStream up to ClickHouse? I think some of the reason clickhouse-connect might be slower is the conversion of columns with Nones, as we're seeing here. Also, the ArrowStream format is all C optimized, and of course Arrow is quite efficient anyway.
The standard CLI tool …
If the `query_df` method receives a packet of table data where a column has only `NULL` values, it will set that column to `None` in the partial `DataFrame` that it later concatenates. The result is that the `DataFrame` returned by the method will have a dtype of `object` and contain both `None` values and `pd.NaT`/`pd.NA`/`np.NaN` values corresponding to the type of the column. If none of the parts read by `query_df` happen to contain all `NULL` values for that column, the resulting `DataFrame` will have the correct and expected dtype.