-
Notifications
You must be signed in to change notification settings - Fork 362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Delta Exporter action: Handle vacuumed objects correctly #8409
Delta Exporter action: Handle vacuumed objects correctly #8409
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
Thanks
if entry.remove ~= nil then | ||
-- If the object is not found, and the entry is a remove entry, we can assume it was vacuumed | ||
print(string.format("Object with path '%s' of a `remove` entry wasn't found. Assuming vacuum.", unescaped_path)) | ||
unfound_paths[unescaped_path] = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it was an add
entry (meaning the clause beneath) and a stat object failed, there might still be a remove
entry with the same path further ahead which will cause a vacuum to delete it, and in that case we don't want to fail the process, thus we add it to the unfound_paths
table.
This line indicates that if a remove occurred and a stat object failed, we should treat any previously added entries as vacuumed objects.
The test on line 155 checks if there were any failed add
entry path stat object requests (meaning that they weren't removed by a remove
entry), and if that's the case throw an error.
When traversing a delta lake log, we might come across entries of objects that were vacuumed. Prior to this change, we would stat all of the entry paths to get their physical address, even if those paths were vacuumed. This would cause a 404 "not found" status code to be returned, and the hook would error.
This fix handles such cases.
The resulting exported delta log will include logical paths of vacuumed objects but that doesn't matter for either reading a table, nor future vacuums.
How was it tested?
Manually: created a table, added data, altered the table such that the delta log included a path under an
add
entry (in the0000...0.json
file) and the same path under aremove
entry (in the0000...1.json
file). I've deleted the path manually, thus mimicking vacuum, and committed which made the hook run.Finally, verified that the resulting log is as expected, and read the table using Spark.