
runtime: temporarily workaround the LimitedMemoryPool limiting the concurrency of the contract runtime too much #10733

Merged
merged 5 commits into near:master from nagisa/temporary-workaround-for-ooms on Mar 8, 2024

Conversation

@nagisa (Collaborator) commented Mar 8, 2024

See individual commits and their descriptions for an easier review experience.

This PR is intended to be easily backportable with minimal risk, with perhaps only the 2nd commit being superfluous in that context. There are definitely things we could do to make this area much more reliable and sensible, but they would result in a much more invasive change. My belief is that we should land both this workaround and fix #10732 as a follow-up.

cc #10732
cc @posvyatokum: we will want this backported to 1.37.1; let me know if you'd like me to prepare a backport and/or where you want it to end up.


runtime: hotfix to allow up to 256 concurrent runtime invocations

This is not a proper fix, as certain configuration settings can still cause this number to be exceeded, but the changes needed there are a little more involved than what I'd be comfortable making in a rush, especially for a backport to the 1.37 release.

In order to keep the initial memory use reasonably bounded, I have chosen to reduce the initial size of each mapping to a smaller number. Since the runtime reuses the most recently returned memories first, in use cases with few concurrent accesses only that many memories will get resized to fit all the required contracts over time, thus keeping the memory use reasonably limited. This will result in slightly less predictable function call performance for the first few contract invocations that require more memory, but this seems like a necessary tradeoff to avoid preallocating dozens of GB of mappings that would later go mostly unused.
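
For a rough sense of the numbers involved, here is a back-of-the-envelope sketch. The constants mirror the one-line change shown in the diff further down; this is illustrative arithmetic only, not the pool implementation:

```rust
fn main() {
    const MIB: u64 = 1024 * 1024;

    // 256 memories kept at the old 64 MiB initial size would reserve a lot of
    // address space up front...
    let kept_large = 256 * 64 * MIB;
    // ...whereas 256 memories starting at 1 MiB each stay modest initially and
    // only grow on demand as contracts actually need more space.
    let start_small = 256 * MIB;

    println!("256 x 64 MiB = {} GiB of initial mappings", kept_large / (1024 * MIB));
    println!("256 x  1 MiB = {} MiB of initial mappings", start_small / MIB);
}
```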


runtime: avoid losing memories when failing to resize them

This is not something that I am aware of actually happening, but a fresh look at the code immediately spotted this as an icky part of it. With the new ordering, a failed resize will still most likely lead to the termination of the node, but if the caller chooses to do something else about the error, the previous memory will now be correctly returned to the memory pool.
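
As a minimal sketch of the ordering being described (hypothetical `Pool` and `Memory` types with made-up method names; the real code lives in runtime/near-vm/engine/src/universal/code_memory.rs):

```rust
// Hypothetical pool and memory types, for illustration only.
struct Memory {
    size: usize,
}

impl Memory {
    /// Stand-in for the fallible grow operation (e.g. an mmap failure).
    fn resize(&mut self, new_size: usize) -> Result<(), String> {
        self.size = new_size;
        Ok(())
    }
}

struct Pool {
    free: std::sync::Mutex<Vec<Memory>>,
}

impl Pool {
    fn get(&self) -> Option<Memory> {
        self.free.lock().unwrap().pop()
    }

    fn put(&self, m: Memory) {
        self.free.lock().unwrap().push(m)
    }

    /// Take a memory and grow it. If growing fails, hand the memory back to
    /// the pool *before* reporting the error, instead of silently dropping it.
    fn get_resized(&self, new_size: usize) -> Result<Memory, String> {
        let mut m = self.get().ok_or_else(|| "pool exhausted".to_string())?;
        match m.resize(new_size) {
            Ok(()) => Ok(m),
            Err(e) => {
                self.put(m); // the memory survives even though resizing failed
                Err(e)
            }
        }
    }
}
```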


runtime: switch LimitedMemoryPool to a LIFO queue

crossbeam's implementation is a FIFO queue, which repeatedly touches all of the memories regardless of the concurrency, and that can result in more memory use than strictly necessary in low-concurrency scenarios.
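
A minimal sketch of what a LIFO pool can look like, assuming a mutex-guarded `Vec` (illustrative only; the actual LimitedMemoryPool in the near-vm engine has a different interface):

```rust
use std::sync::Mutex;

/// LIFO pool sketch: the most recently returned item is handed out first, so
/// under low concurrency the same few (already grown) memories keep getting
/// reused instead of cycling through the whole pool the way a FIFO queue does.
struct LifoPool<T> {
    slots: Mutex<Vec<T>>,
}

impl<T> LifoPool<T> {
    fn new(items: Vec<T>) -> Self {
        Self { slots: Mutex::new(items) }
    }

    fn acquire(&self) -> Option<T> {
        self.slots.lock().unwrap().pop() // take from the back: LIFO
    }

    fn release(&self, item: T) {
        self.slots.lock().unwrap().push(item) // returned items go to the back
    }
}
```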

@nagisa requested a review from a team as a code owner March 8, 2024 10:58
@@ -258,7 +258,7 @@ impl NearVM {
// that particular function call will be slower. Not to mention there isn't a
// strong guarantee on the upper bound of the memory that the contract runtime may
// require.
- LimitedMemoryPool::new(8, 64 * 1024 * 1024).unwrap_or_else(|e| {
+ LimitedMemoryPool::new(256, 1 * 1024 * 1024).unwrap_or_else(|e| {
@nagisa (Collaborator, Author) commented on the diff, Mar 8, 2024

The size of each individual memory in the pool grows as required (see CodeMemory::resize). This grow operation is fallible, unfortunately, but at the very least with the change to the LIFO queue for the memory pool, in most "regular" applications only the required number of memories should actually grow. For a validator that tracks 4 shards, that would mean just 4 of these memories.
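
To make the "4 shards, roughly 4 grown memories" point concrete, here is a toy simulation of LIFO reuse (hedged: a plain `Vec` stands in for the pool, and the sizes are illustrative, not the runtime's real behaviour):

```rust
fn main() {
    // A plain Vec used LIFO stands in for the pool; entries are sizes in MiB,
    // and the pool starts with 256 entries of 1 MiB each, as in the diff above.
    let mut pool: Vec<u64> = vec![1; 256];

    // Simulate many blocks on a node that never runs more than 4 contracts at once.
    for _ in 0..1_000 {
        let mut in_use: Vec<u64> = (0..4).filter_map(|_| pool.pop()).collect();
        for size in in_use.iter_mut() {
            *size = (*size).max(64); // grow on demand to fit the contract
        }
        pool.extend(in_use); // return them; they are the first to be reused next time
    }

    // Only the four most recently used entries ever grew; the rest stayed at 1 MiB.
    let grown = pool.iter().filter(|&&size| size > 1).count();
    println!("grown memories: {grown} / {}", pool.len());
}
```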

@akhi3030 (Collaborator) left a comment

At first glance this looks good, so I am approving to unblock. Still, a thorough review from Leo would be great.

@Ekleog-NEAR (Collaborator) left a comment

LGTM! Just a few comments around mutex handling :)

4 review comments on runtime/near-vm/engine/src/universal/code_memory.rs (outdated, resolved)
@nagisa enabled auto-merge March 8, 2024 12:53
@nagisa disabled auto-merge March 8, 2024 12:59
@nagisa enabled auto-merge March 8, 2024 13:00
@nagisa added this pull request to the merge queue Mar 8, 2024
Merged via the queue into near:master with commit f430756 Mar 8, 2024
23 of 24 checks passed
@nagisa deleted the nagisa/temporary-workaround-for-ooms branch March 8, 2024 13:21
nagisa added a commit to nagisa/nearcore that referenced this pull request Mar 8, 2024
runtime: temporarily workaround the LimitedMemoryPool limiting the concurrency of the contract runtime too much (near#10733)

posvyatokum pushed a commit that referenced this pull request Mar 8, 2024
runtime: temporarily workaround the LimitedMemoryPool limiting the concurrency of the contract runtime too much (#10736)

This is a backport of #10733 for inclusion in the 1.37.x series.
Development

Successfully merging this pull request may close these issues.

near-vm: the LimitedMemoryPool should allocate an appropriate number of memories for the maximum concurrency
3 participants