deltalake.PyDeltaTableError: Invalid JSON in log record: missing field `minValues` at line 1 column 16 #582
Comments
It sounds like Spark wrote bad statistics in the Delta log. Two possible (and not mutually exclusive) solutions: (1) if there is an error processing statistics, we could ignore the stats (maybe with an option to error on bad stats, if people want that). Do you have information on the version of Spark and the Delta writer this table was written with? |
Thanks! https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-3-runtime should have all the details on Spark and Delta. Let me know if you need anything else from me! |
The weird thing is that Spark can read the table, so I'm not sure whether there are still stats errors. |
Thanks @xiaoyongzhu for the report :) I think this is probably caused by an ambiguity in the Delta spec. In https://github.com/delta-io/delta/blob/master/PROTOCOL.md#per-file-statistics, there is no mention of whether these stats fields are optional or not. Right now all the stats fields are modeled as required (line 126 in 346f51a).
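To illustrate the ambiguity: a minimal Python sketch (hypothetical, not the actual delta-rs code) of what "all stats fields are required" means for a record like the one Spark wrote here, which omits `minValues`:

```python
import json

# Hypothetical sketch: parse a per-file statistics record the way a
# struct with all-required fields would, rejecting any record that
# omits one of the stats fields named in the protocol.
REQUIRED_FIELDS = ("numRecords", "minValues", "maxValues", "nullCount")

def parse_stats_strict(raw):
    stats = json.loads(raw)
    for field in REQUIRED_FIELDS:
        if field not in stats:
            raise ValueError(f"missing field {field}")
    return stats

# A record carrying only numRecords, as reported in this issue:
try:
    parse_stats_strict('{"numRecords": 3}')
except ValueError as err:
    print(err)  # missing field minValues
```

This mirrors the reported failure mode: the record is valid JSON, but strict field requirements turn a partially-populated stats object into a hard error.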
We should probably also send a PR to https://github.com/delta-io to clarify the behavior too. |
Thanks @houqp for the response! Yeah, marking those as optional should be good. Maybe I can expect a fix (and a release) soon? |
Let's see what @wjones127 has to say about this approach first. You are welcome to submit a quick fix PR as well if you want to get your hands rusty ;) |
Yes, adding the option seems like it should solve this issue. To be defensive, I'm inclined to also add parsing logic to ignore the stats if they come back with this error. What do you think of that @houqp? We should go over and make a clarification to the delta protocol. |
My main concern about this is what happens if it contains other stats like null count and max val, but only missing min value? In this case, we still want to load the other stats right? |
Yes, agreed. My main point is that parsing these stats could return an Option, not a Result, so that even if the stats were completely corrupt (say invalid JSON, or used a custom format we didn't recognize), we could still load the table. |
Ha I see, that sounds reasonable. But I think it would be good to log the parsing error as well so we are not silently ignoring the error. |
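The approach discussed above can be sketched as follows; the names are hypothetical, not the delta-rs API. Malformed stats yield `None` (an Option-like result) instead of aborting the table load, and the parse error is logged rather than silently dropped:

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("delta.stats")

def parse_stats_lenient(raw: str) -> Optional[dict]:
    """Parse a per-file stats record; return None instead of raising."""
    try:
        stats = json.loads(raw)
    except json.JSONDecodeError as err:
        # Log so the bad record is visible, but keep loading the table.
        logger.warning("ignoring unparseable file statistics: %s", err)
        return None
    if not isinstance(stats, dict):
        logger.warning("ignoring file statistics of unexpected shape")
        return None
    # Individual fields (minValues, maxValues, nullCount) stay optional.
    return stats
```

With this, `parse_stats_lenient("not json")` returns `None` while the rest of the log record still loads, and a warning records what was skipped.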
Thanks @houqp and @wjones127 ! I'm not a rust developer so might not be able to help here, but definitely open to help testing it out! |
The goal of this PR is to provide support for Delta Lake. Currently one of the tests will fail (reading the Delta Lake result from Azure Synapse) due to an error in the underlying library (tracked here: delta-io/delta-rs#582).
This problem currently prevents use of delta-rs with "Microsoft Fabric Data Warehouse" ( https://twitter.com/mim_djo/status/1692855422364041428 ). I'm also told that disabling column statistics in Spark results in a log entry which has numRecords but not the other members. |
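A hypothetical illustration of that report: with column statistics disabled in Spark, the log entry reportedly carries only `numRecords`, and treating every other stats field as optional lets such a record load:

```python
import json

# Example stats record with column statistics disabled (per the report
# above): numRecords is present, the other members are absent.
record = '{"numRecords": 42}'
stats = json.loads(record)

num_records = stats["numRecords"]    # present in the reported entry
min_values = stats.get("minValues")  # None when column stats are disabled
max_values = stats.get("maxValues")
null_count = stats.get("nullCount")
print(num_records, min_values)  # 42 None
```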
Environment
Local Python + Azure Spark
Delta-rs version:
unknown
Binding:
Python (latest installation from pip)
Environment:
Bug
What happened:
After the above lines are executed, the error below is raised:
What you expected to happen:
The table should load correctly.
How to reproduce it:
It should be reproduced easily. Account name and key can be provided at request.
More details:
The result is generated by a Spark job and can be loaded in Spark correctly. However, delta-rs and its Python binding fail to load it.