Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the key management tools API for encryption #426

Merged
merged 25 commits into from
Mar 4, 2024

Conversation

adamreeve
Copy link
Contributor

Fixes #200

This adds bindings to the Parquet key management tools API, also known as the high level encryption API. The existing low-level encryption API supports setting encryption/decryption keys directly, whereas this API uses a Key Management System (KMS) to implement envelope encryption, where data encryption keys are randomly generated and then stored encrypted in the file metadata.

Comment on lines +54 to +57
/// This CryptoFactory instance must remain alive and not disposed until after any files using these
/// decryption properties have been read, as internally the FileDecryptionProperties contains references to
/// data in the CryptoFactory that cannot be managed by ParquetSharp.
/// Failure to do so may result in native memory access violations and crashes that cannot be caught as exceptions.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward. I couldn't find a way to handle this nicely without exposing users to possible segfaults, as the FileDecryptionProperties class in C++ has a private constructor and can't be subclassed, so we can't do anything like combine it with a reference to the CryptoFactory. I'm hoping this can be fixed in Arrow so we can fix this limitation for the next version though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a PR in Arrow to address this at apache/arrow#40329

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's been merged, so we should be able to remove this warning in the 16.0.0 release

@adamreeve adamreeve force-pushed the high_level_encryption branch from 62b367d to 7101437 Compare February 29, 2024 02:58
@adamreeve
Copy link
Contributor Author

I'm not sure what to do about the Windows build failure. It worked for me with MSVC 19.35.32216.1, and with 19.39.33521.0 after upgrading. The GitHub runner has version 19.38.33135.0 though.

It looks like this is triggered by my changes pulling in a dependency on the Arrow FileSystem API, but this seems to be a bug in MSVC. I found a similar issue reported at https://developercommunity.visualstudio.com/t/Latest-Visual-Studio-version-1791-ca/10598722

We might need to wait until the runners get a newer version of MSVC, or disable this feature on Windows for now.

@adamreeve
Copy link
Contributor Author

Similar build failure issue reported at actions/runner-images#9398, but that was apparently caused by linking to a different STL version than what the compiler expected. Maybe there's something similar going on with our build.

@adamreeve
Copy link
Contributor Author

The build failure appears to be because Arrow is built with a different MSVC than ParquetSharp, which isn't good. I'm not sure why different versions are being selected:

From the Arrow build logs:

-- The CXX compiler identification is MSVC 19.39.33520.0
...
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.39.33519/bin/Hostx64/x64/cl.exe - skipped

And from the ParquetSharp build:

-- The CXX compiler identification is MSVC 19.38.33135.0
...
-- Check for working CXX compiler: C:/Program Files/Microsoft Visual Studio/2022/Enterprise/VC/Tools/MSVC/14.38.33130/bin/HostX64/x64/cl.exe - skipped

@adamreeve adamreeve force-pushed the high_level_encryption branch 2 times, most recently from fff7f25 to 8f6c8dd Compare March 1, 2024 00:59
@adamreeve adamreeve force-pushed the high_level_encryption branch from 8f6c8dd to b57fc62 Compare March 1, 2024 02:00
@adamreeve adamreeve force-pushed the high_level_encryption branch from b57fc62 to 634b69a Compare March 1, 2024 07:28
Copy link
Contributor

@marcin-krystianc marcin-krystianc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work, well done!

set => ExceptionInfo.Check(EncryptionConfiguration_SetDataKeyLengthBits(Handle.IntPtr, value));
}

private static string EncodeColumnKeys(IReadOnlyDictionary<string, IReadOnlyList<string>> columnKeys)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It can be implemented in a single line:
return string.Join(";", columnKeys.Select(kvp => $"{kvp.Key}:{string.Join(",", kvp.Value)}"));
Nonetheless, the current version is also fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, that's much more readable, I've changed this now

@adamreeve adamreeve merged commit e767413 into G-Research:master Mar 4, 2024
30 checks passed
@adamreeve adamreeve deleted the high_level_encryption branch March 4, 2024 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate high level encryption API
2 participants