Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Allocator type param to MutableBuffer #6336

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

waynexia
Copy link
Member

Which issue does this PR close?

Part of #3960

Rationale for this change

I find #3960 for the same reason -- tracking memory consumption with Arrays. The most straightforward way is to use the unstable Allocator API.

By adding a decoration layer to Allocator we can keep track of the real-time memory consumption of each container instance. There is a tiny tool that implements this https://github.com/waynexia/unkai

What changes are included in this PR?

The most important changes in this PR are

  • add a feature gate allocator_api, same name as the unstable rust feature
  • add an optional type parameter A: Allocator = Global to MutableBuffer like Vec

Since this requires an unstable rust feature, allocator_api can only be used in the nightly toolchain. To keep this library still working in the stable toolchain when the feature gate is disabled, this PR defines many dummy structs like Global and Allocator to substitute those from std lib as they are not referencable without unstable feature.

When allocator_api is enabled, users can either appoint their Allocator by new APIs new_in and with_capacity_in or invoking From<Vec<T, A>> which will inherit the allocator from vec.

This PR is tested with MIRI:

cargo +nightly miri nextest run -p arrow-buffer -F allocator_api
cargo +nightly miri nextest run -p arrow-buffer

Are there any user-facing changes?

New APIs as described above

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 30, 2024
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
arrow-buffer/src/buffer/mutable.rs Outdated Show resolved Hide resolved
return;
}

#[cfg(feature = "allocator_api")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire MutableBuffer has been gated under allocator_api. Do we still need it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may misread this. Only those new public APIs are limited under the feature gate.

@@ -28,6 +31,23 @@ use crate::{

use super::Buffer;

#[cfg(not(feature = "allocator_api"))]
pub trait Allocator: private::Sealed {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance, it's somewhat confusing to define a sealed trait here. Maybe add some comments here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, please take a look

arrow-buffer/src/buffer/mutable.rs Show resolved Hide resolved
@@ -28,6 +31,23 @@ use crate::{

use super::Buffer;

#[cfg(not(feature = "allocator_api"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's a good idea to introduce allocator-api2 as a compatibility layer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your information. It looks like allocator-api2 provides many things we don't need. And I'm hesitant to add a new dep for the normal case. We can add it in the future if it turns out that allocator-api2 suits our use case well. What we need here is

  • Placeholders to standard library things that are not available without unstable features (Allocator and Global)
  • Wrapped alloc and dealloc methods

As for now we can define them by ourselves in few lines.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Copy link
Member

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your work. Overall, it looks good to me.

By the way, I still believe introducing allocator_api2 is a good idea, as it eliminates the need for a feature that requires nightly Rust.

The decision is up to you.


This PR also makes me wonder if we can introduce an allocator for all similar APIs.

@waynexia
Copy link
Member Author

Thanks for your review @Xuanwo

This PR also makes me wonder if we can introduce an allocator for all similar APIs.

I plan to add this for Buffer as well for the next step

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @waynexia and @Xuanwo for the reviews

I think this idea is quite cool 🙏 However, in order to get it ready to merge a few more things are needed:

  1. Documentation (specifically document the new feature here https://crates.io/crates/arrow)
  2. Tests

In terms of testing, what I think would be the most useful would be an "end to end" test / example of how to use this feature. For example to create a MutableBuffer with a custom allocator and report on the allocations performed or something.

Perhaps we could add such an example to the examples directory https://github.com/apache/arrow-rs/tree/master/arrow/examples and add a pointer to that that to the documentation?

@@ -61,8 +61,8 @@ jobs:
submodules: true
- name: Setup Rust toolchain
uses: ./.github/actions/setup-builder
- name: Test arrow-buffer with all features
run: cargo test -p arrow-buffer --all-features
- name: Test arrow-buffer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it confusing at first why this line doesn't follow the model of the rest of the tests in this workflow which run with --all-features

I am also concerned this may remove coverage for some of the other features such as prettyprint, ffi and pyarrow

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

arrow-buffer only has one feature gate allocator_api from this PR. This specific change won't affect other tests like prettyprint as the command -p arrow-buffer only runs on this sub-crate. But I agree it's easy to forget when the new feature comes in the future... Sadly, cargo seems not to support opt-out one feature in CLI 😢

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, the same problem comes to the main arrow lib because I added a delegate feature allocator_api = ["arrow-buffer/allocator_api"]. Looks like we have to enumerate all available features except allocator_api in CI?

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia
Copy link
Member Author

waynexia commented Sep 2, 2024

Thanks for reviewing @alamb. I've created a new example allocator_api.rs under examples/, and a paragraph in the doc comment of MutableBuffer that describes the usage and links to the example.

For example to create a MutableBuffer with a custom allocator and report on the allocations performed or something.

I tried to use unkai at first to demo the ability to track memory usage but then I realized I can't add an optional dev-dependence. An unconditional dep to unkai will break other examples without a nightly toolchain. So the example only uses Global. (maybe I should add a similar feature gate to unkai?)

#[cfg(not(feature = "allocator_api"))]
allocator: Global,
#[cfg(feature = "allocator_api")]
allocator: *value.allocator(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The From impl would benefit from having a type parameter for the allocator, otherwise the allocator here would always implicitly be the global one. That probably requires two separate impl blocks depending on the feature flag.

The same would be useful for Buffer::from_vec(Vec).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will default to Global without allocator_api, and inherit from the vector via Vec::allocator() API. Is this behavior ideal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got what you mean https://github.com/apache/arrow-rs/pull/6336/files#diff-371342744df1b634b0bd9d90f4fe38c1eb0096df322fd3cc2fbc513f3428046cR692-R696

We should have two impl blocks indeed for different type parameters

@jhorstmann
Copy link
Contributor

An unconditional dep to unkai will break other examples without a nightly toolchain. So the example only uses Global. (maybe I should add a similar feature gate to unkai?)

A minimal allocator implementation that tracks the memory usage shouldn't be too big to include directly in the example, without dependencies.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@alamb
Copy link
Contributor

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

@jhorstmann
Copy link
Contributor

I plan to look into this in more detail as it might be useful in our codebase, which makes good use of a custom allocator to track memory usage.

A unit test with a custom allocator, to be run with miri, would be very nice. I have a concern that after turning the MutableBuffer into a Buffer it would not be freed with the correct allocator. Adding a generic parameter to Buffer would affect a lot more apis, maybe it could use the existing Deallocation::Custom strategy somehow.

Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
Signed-off-by: Ruihang Xia <waynestxia@gmail.com>
@waynexia
Copy link
Member Author

I have a concern that after turning the MutableBuffer into a Buffer it would not be freed with the correct allocator. Adding a generic parameter to Buffer would affect a lot more apis, maybe it could use the existing Deallocation::Custom strategy somehow.

Good catch 👍 I've limited those freeze() APIs to Global alloc in 6b28b10. Now it should not be possible to construct a Buffer with memory from other allocators. I'll add this functionalities back when implementing allocator param for Buffer in the follow up PR.

A unit test with a custom allocator, to be run with miri, would be very nice.

New test mutable_buffer_with_custom_allocator in 02d73ad. There is a script .github/workflows/miri.sh to run it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants