Schema mismatch error when writing both partitioned and non-partitioned Parquet datasets #3046
Comments
Hi @diegoxfx, this is by design. I strongly advise against mixing partitioned and non-partitioned data under the same prefix.
Hi @kukushking, I apologize if my initial example was unclear. The issue isn’t that I’m mixing partitioned and non-partitioned files in the same prefix. In my actual use case, I'm writing two separate outputs to entirely different S3 locations:
s3_adapter is just a wrapper for the awswrangler functions, so these two writes target different buckets/prefixes and should not interfere with each other. The first writes to "s3://bucket1/model_result_path/file.parquet", and the second writes to the table "payouts_model_table", which is located at "s3://bucket2/database_path/table/". My expectation:
Thanks @diegoxfx. From what you shared above, it looks like you are overriding the schema object for the partitioned write, which makes it expect columns that are not supposed to be there. By the way, the code in "How to Reproduce" runs successfully.
Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
Describe the bug
When I attempt to write both a partitioned Parquet dataset and a non-partitioned Parquet file from the same data schema, I encounter a schema mismatch error. This occurs because partitioned writes exclude the partition columns from the Parquet file schema, while non-partitioned writes include them. Attempting one after the other leads to:
How to Reproduce
The second call fails with a schema mismatch error. Reversing the order of the calls (partitioned first, then non-partitioned) also fails.
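For readers following along, here is a minimal sketch of the schema difference described under "Describe the bug". It is not the original reproduction code and not awswrangler's implementation. It is a simplified pure-Python model, and the helper name file_schema and the column names are hypothetical, chosen only for illustration. It assumes Hive-style partitioning, where partition values live in the object path rather than inside the file.

```python
# Simplified model of the schema difference; NOT awswrangler's actual code.
# Assumption: with Hive-style partitioning, partition values are encoded in
# the path (e.g. .../dt=2024-01-01/), so they are left out of the file itself.

def file_schema(columns, partition_cols=None):
    """Return the column -> type mapping that ends up inside the Parquet file."""
    partition_cols = partition_cols or []
    return {
        name: dtype
        for name, dtype in columns.items()
        if name not in partition_cols
    }

# Hypothetical columns, for illustration only.
columns = {"id": "bigint", "amount": "double", "dt": "string"}

plain = file_schema(columns)                               # non-partitioned write
partitioned = file_schema(columns, partition_cols=["dt"])  # partitioned write

print(list(plain))           # all three columns are stored in the file
print(list(partitioned))     # "dt" is missing from the file schema
print(plain == partitioned)  # the two file schemas differ
```

If a single schema object is reused for both writes, one of the two necessarily sees a column it does not expect, which is consistent with the mismatch reported here.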
Expected behavior
The second call should write the data successfully without a schema mismatch error.
Your project
No response
Screenshots
No response
OS
Docker Container
Python version
3.11.8
AWS SDK for pandas version
3.10.1
Additional context
ChatGPT o1 says here's probably the cause of the bug: