Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement filesystem check #1103

Merged
merged 13 commits into from
Feb 3, 2023
Merged

Implement filesystem check #1103

merged 13 commits into from
Feb 3, 2023

Conversation

Blajda
Copy link
Collaborator

@Blajda Blajda commented Jan 26, 2023

Description

Implementation of the filesystem check operation.

The implementation is fairly straight forward with a HEAD call being made for each active file to check if it exists.
A remove action is then made for each file that is orphaned.

An alternative solution is instead to maintain a hashset with all active files and then recursively list all files. If the file exists then remove from the set. All remaining files in the set are then considered orphaned.

Looking for feedback and if the second approach is preferred I can make the changes

Related Issue(s)

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Jan 26, 2023
@houqp
Copy link
Member

houqp commented Jan 27, 2023

Instead of heading each individual file, wouldn't it be more efficient to issue an prefixed list call to gather all current files in the table? This way, we can reduce API call count by about 1000x.

@Blajda Blajda marked this pull request as ready for review January 28, 2023 20:56
@Blajda
Copy link
Collaborator Author

Blajda commented Jan 28, 2023

@houqp I modified the process to instead use a list call but I'm not quite sure on how I would take advantage of prefixes in this case. For partitioned tables it could perform a list per partition path but I think non-partitioned tables would still require a full list on the root.

For non-partitioned tables would it be possible to perform a list without including the files in _delta_log?

houqp
houqp previously approved these changes Jan 29, 2023
@houqp
Copy link
Member

houqp commented Jan 29, 2023

@Blajda the prefix filter can only be applied with the table root unfortunately, so it will always include the _detal_log folder unfortunately. The main benefit it provides is we can restrict the listing to the table we are running the operation on, not all other tables stored in the same bucket.

@houqp
Copy link
Member

houqp commented Jan 29, 2023

@roeap @wjones127 want to take a final look?

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this! I've been thinking having an operation like this would let me sleep better at night :)

I think we should address the case where data files are outside the root, either explicitly not supporting or using HEAD for those ones. Also noticed a few minor issues.

rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved
rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved
rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved
rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved
rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved
rust/src/operations/filesystem_check.rs Outdated Show resolved Hide resolved

impl FileSystemCheckBuilder {
/// Create a new [`FileSystemCheckBuilder`]
pub fn new(store: Arc<DeltaObjectStore>, state: DeltaTableState) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate that the provided state is the latest version of the table? Since I don't think this is a valid operation for earlier states? or maybe that is an overall issue with the operations module?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dry run of this operation would be helpful for diagnosing issues that arise from a combination of vacuum + time travel. If executed on a older version I do expect it to fail during the commit operation.
I can add a test to demonstrate that.

rust/tests/command_filesystem_check.rs Show resolved Hide resolved
rust/tests/command_filesystem_check.rs Show resolved Hide resolved
files_to_check.insert(active.path.to_owned(), active);
});

let mut files = self.store.list(None).await?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delta Tables are allowed to have data files outside the table root. Currently, if that is the case, this operation will remove them all from the log. I see two options:

  1. Call HEAD on each file outside the root to check if they exist.
  2. Error if we detect any files outside the root.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Do any examples of tables that use absolute paths?
I just checked the protocol and noticed absolute paths are encoded as defined in rfc2396

It states the following

An absolute URI contains the name of the scheme being used (<scheme>)
   followed by a colon (":") and then a string (the <scheme-specific-
   part>) whose interpretation depends on the scheme.

So I can segment paths based of it the scheme is set on the path and then use the correct api call for each.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a good point! Checking for the scheme at the beginning makes sense.

We don't have any example tables right now. We don't generally support them in most operations, but we should try to keep them in mind. Hard to have example tables without setting up a public S3 bucket or something.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not comfortable with supporting absolute paths without having some integration tests. Currently the operation will fail when an absolute path is encountered.

Add wjones127 suggestions

Co-authored-by: Will Jones <willjones127@gmail.com>
@Blajda Blajda marked this pull request as ready for review February 2, 2023 01:39
@Blajda
Copy link
Collaborator Author

Blajda commented Feb 2, 2023

@wjones127 I updated the PR to fail when an absolute path exists in the Delta log. Also added an additional test to demonstrate the operation will only work on the latest table version.

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Blajda - one small question along the lines of hoping to eventually support absolute paths in the delta log. Otherwise looks great!

Comment on lines +61 to +70
fn is_absolute_path(path: &str) -> DeltaResult<bool> {
match Url::parse(path) {
Ok(_) => Ok(true),
Err(ParseError::RelativeUrlWithoutBase) => Ok(false),
Err(_) => Err(DeltaTableError::Generic(format!(
"Unable to parse path: {}",
&path
))),
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will not work for local paths. One pattern we apply in object_store and here is to try and cannonicalize the path and if it fails proceed with url parsing. e.g.

if let Ok(path) = std::fs::canonicalize(table_uri) {
return Url::from_directory_path(path)
.map_err(|_| DeltaTableError::InvalidTableLocation(table_uri.to_string()));
}

Copy link
Collaborator Author

@Blajda Blajda Feb 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't follow on why it would not work on local paths. Paths in the Delta log must follow rfc2396 which states absolute paths must have a scheme.
Do you have a example that would cause the unit test for this function to fail?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well .. in that case I am wrong! If that paths are required to have a scheme, then url parsing should always work for valid paths ... The windows case is also covered in your tests, so all is well :).

@wjones127 wjones127 merged commit ddf4f43 into delta-io:main Feb 3, 2023
@Blajda Blajda deleted the fsck branch February 3, 2023 23:09
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description
Implementation of the filesystem check operation.

The implementation is fairly straight forward with a HEAD call being
made for each active file to check if it exists.
A remove action is then made for each file that is orphaned.

An alternative solution is instead to maintain a hashset with all active
files and then recursively list all files. If the file exists then
remove from the set. All remaining files in the set are then considered
orphaned.
 
Looking for feedback and if the second approach is preferred I can make
the changes

# Related Issue(s)
- closes delta-io#1092

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support FSCK REPAIR TABLE Operation
4 participants