near-vm: use a recycling pool of shared code memories instead of a in-memory cache of loaded artifacts #9244
Conversation
As I was preparing this PR, I remembered there is another kind of memory that would benefit from the same sort of treatment – the wasm linear memory. These are allocated for each instance, but are also a little more tricky in that they have strict zeroing requirements, and possibly other changes around trapping on OOB accesses would be good to make too.
Tricky code, and in parts of the code base I never really looked at...
I've done a review now and left some questions. But I'd be more comfortable if @Ekleog-NEAR would also take a look.
LGTM! Though there are quite a few comments to handle, so I’ll have another look once the discussions calm down :)
@@ -2,170 +2,254 @@
// Attributions: https://github.com/wasmerio/wasmer/blob/master/ATTRIBUTIONS.md

//! Memory management for executable code.
use near_vm_compiler::{CustomSectionRef, FunctionBodyRef};
use near_vm_vm::{Mmap, VMFunctionBody};
As you say, it’d be nice, later, to handle wasm memory in the same way, as it’ll allow us to remove ~300 LoC of custom mmap implementation :)
LGTM! I think the only comment that needs addressing before landing is the one on alignment computation, though it’d also be nice to make sure the kernel doesn’t just mmap overcommitted memory, by writing one byte every page_size bytes of the code memories upon creation :)
super::LimitedMemoryPool::new(1, 10 * rustix::param::page_size()).unwrap()
});
let pool =
    self.pool.unwrap_or_else(|| panic!("Universal::code_memory_pool was not set up!"));
Why not .expect("...")?
It doesn’t read right! The text would need to be negated, but that then doesn’t make sense to the person seeing this message.
executable_size: usize,
/// Addresses `0..executable_end` contain executable memory.
///
/// This is always page-aligned.
Nit: the data section must start page-aligned; we don’t have any code that makes executable_end itself be page-aligned.
// plausibly be. This is not necessarily true – there may be WebAssembly
// instructions that expand by more than 4 times in terms of instruction size after
// a conversion to x86_64. In that case a re-allocation will occur and executing
// that particular function call will be slower. Not to mention there isn't a
// strong guarantee on the upper bound of the memory that the contract runtime may
// require.
LimitedMemoryPool::new(32, 16 * 1024 * 1024).unwrap_or_else(|e| {
LimitedMemoryPool::new(8, 64 * 1024 * 1024).unwrap_or_else(|e| {
Especially with 64MiB mmaps, it might make sense to poke each page of each mapping to make sure the kernel didn’t overcommit them. It’d make initial startup a tad slower, but we’d avoid the variance of pagefaulting and needing to reallocate
Faulting in maps like this is somewhat of a bogus thing to do. The kernel can freely swap out pages from memory even after they have been faulted in, so in memory-starved situations the same cliff still exists. The work-around to this behaviour is to use mlock; however, using that opens its own can of worms – systems aren’t typically set up for a huge amount of locked memory. For instance, on my system 8MiB is the maximum you can mlock without further (rlimit) changes.
Good point, I had thought of the kernel just not allocating the memory, not of the kernel swapping it out.
.fold(0, |acc, data| round_up(acc + data.bytes.len(), DATA_SECTION_ALIGNMENT.into()));
let total_len = 0;
let total_len = function_bodies.iter().fold(total_len, |acc, func| {
    round_up(acc + function_allocation_size(*func), ARCH_FUNCTION_ALIGNMENT.into())
I don’t think this is correct. Here you’re first adding the allocation and then rounding up to ARCH, while ARCH is the alignment for the allocation itself and not for the next allocation. In practice it’ll work because we only have a single alignment type before aligning to page and then switching to another alignment type, but I think the proper way of doing this (more future-proof should we add other alignments later on) would be:
round_up(acc, ARCH_FUNCTION_ALIGNMENT.into()) + function_allocation_size(*func)
Hm, so this really depends on what one considers the padding bytes to belong to.
In assembly due to how labels work the padding bytes will usually appear as-if they belong to the preceding function:
func1:
    add
    div
    ret
    ud2 ; padding
    ud2 ; padding
func2:
    ...
Which sort-of matches the current code a little better. But I don’t have a strong opinion either way.
My issue with the initial writing is: if we ever were to have first items with an alignment of 8 and then items with an alignment of 16, then the alignment for the first item of the second type would be wrong, because we align after func1 with the alignment of func1 and not of func2, to reuse your example.
Creating new memory maps is expensive, and so is changing their permissions. Doing this every time a function is called adds up, and we don't really _need_ to do it: a fairly basic pool of reusable resources suffices (so long as they are fairly meticulously managed).
Taking out `source_pool` before a failure means that the memory won't be returned to the pool even though it still may be usable?
This possibly mitigates the RocksDB slowness of loading contracts from storage after the in-memory cache fronting RocksDB was removed in #9244. As part of the 1.35 release[^1] it turned out that this had caused a significant bimodality in chunk application times – mostly attributed to the storage layer. This, we suspected, had started causing increased validator kick-out numbers. The original LRU cache fronting RocksDB stored 128 units of loaded contracts; at a rough estimate of 4MiB of machine code per contract, this comes out to 512MiB. Of course, the two caches are still very different – this one stores uncompressed blocks containing compiled machine code, while the previous cache stored contracts loaded into memory. I'm confident that the overhead of loading the contract code into memory is not particularly notable at this point, though, so the difference is something we can live with. [^1]: https://near.zulipchat.com/#narrow/stream/297873-pagoda.2Fnode/topic/release.201.2E36/near/379169243
…d of a in-memory cache of loaded artifacts (near#9244)" (near#10788) This reverts commit bb02713.
This gives us the potential to manage our memory consumption and worst-case function call expense in a much more controlled way.
This may, however, show up as a decrease in continuous param estimator outputs for function call costs, depending on how and what workload it tests.