tokio-epoll-uring: "Cannot allocate memory" in staging on 2024-02-07 #6667

Closed
10 of 13 tasks
Tracked by #6665
problame opened this issue Feb 7, 2024 · 0 comments
problame commented Feb 7, 2024

Sentry-captured stack trace

OS Version: Linux 5.10.0-18-cloud-amd64 (None)
Report Version: 104


Application Specific Information:
called `Result::unwrap()` on an `Err` value: IoUringBuild(Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" })

Thread 0 Crashed:
0   std                             0x55b2df65de36      std::sys_common::backtrace::__rust_end_short_backtrace (backtrace.rs:170)
1   <unknown>                       0x55b2df65f3a2      rust_begin_unwind (panicking.rs:645)
2   core                            0x55b2df68e7c5      core::panicking::panic_fmt (panicking.rs:72)
3   core                            0x55b2df68ed03      core::result::unwrap_failed (result.rs:1653)
4   core                            0x55b2de490643      [inlined] core::result::Result<T>::unwrap (result.rs:1077)
5   tokio_epoll_uring               0x55b2de490643      [inlined] tokio_epoll_uring::system::lifecycle::thread_local::thread_local_system::{{closure}}::{{closure}}::{{closure}} (thread_local.rs:16)
6   tokio                           0x55b2de490643      [inlined] tokio::sync::once_cell::OnceCell<T>::get_or_init::{{closure}} (once_cell.rs:368)
7   tokio_epoll_uring               0x55b2de490643      tokio_epoll_uring::system::lifecycle::thread_local::thread_local_system::{{closure}} (thread_local.rs:17)
8   pageserver                      0x55b2de4dc5e8      pageserver::virtual_file::open_options::OpenOptions::open::{{closure}} (open_options.rs:100)
9   pageserver                      0x55b2de4dd254      pageserver::virtual_file::VirtualFile::open_with_options::{{closure}} (virtual_file.rs:375)
10  pageserver                      0x55b2de4c96a3      [inlined] pageserver::virtual_file::VirtualFile::create::{{closure}} (virtual_file.rs:337)
11  pageserver                      0x55b2de4c96a3      [inlined] pageserver::tenant::storage_layer::delta_layer::DeltaLayerWriterInner::new::{{closure}} (delta_layer.rs:392)
12  pageserver                      0x55b2de4c96a3      [inlined] pageserver::tenant::storage_layer::delta_layer::DeltaLayerWriter::new::{{closure}} (delta_layer.rs:572)
13  pageserver                      0x55b2de4c96a3      pageserver::tenant::storage_layer::inmemory_layer::InMemoryLayer::write_to_disk::{{closure}} (inmemory_layer.rs:363)
14  tokio                           0x55b2de4b565c      [inlined] tokio::runtime::park::CachedParkThread::block_on::{{closure}} (park.rs:282)
15  tokio                           0x55b2de4b565c      [inlined] tokio::runtime::coop::with_budget (coop.rs:107)
16  tokio                           0x55b2de4b565c      [inlined] tokio::runtime::coop::budget (coop.rs:73)
17  tokio                           0x55b2de4b565c      [inlined] tokio::runtime::park::CachedParkThread::block_on (park.rs:282)
18  tokio                           0x55b2de4b565c      tokio::runtime::context::blocking::BlockingRegionGuard::block_on (blocking.rs:66)
19  tokio                           0x55b2de4054c0      [inlined] tokio::runtime::handle::Handle::block_on::{{closure}} (handle.rs:310)
20  tokio                           0x55b2de4054c0      tokio::runtime::context::runtime::enter_runtime (runtime.rs:65)
21  tokio                           0x55b2de48bebc      [inlined] tokio::runtime::handle::Handle::block_on (handle.rs:309)
22  pageserver                      0x55b2de48bebc      [inlined] pageserver::tenant::timeline::Timeline::create_delta_layer::{{closure}}::{{closure}} (timeline.rs:3086)
23  tokio                           0x55b2de48bebc      tokio::runtime::blocking::task::BlockingTask<T>::poll (task.rs:42)
24  tokio                           0x55b2de7c29b6      [inlined] tokio::runtime::task::core::Core<T>::poll::{{closure}} (core.rs:328)
25  tokio                           0x55b2de7c29b6      [inlined] tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut (unsafe_cell.rs:16)
26  tokio                           0x55b2de7c29b6      tokio::runtime::task::core::Core<T>::poll (core.rs:317)
27  tokio                           0x55b2de3481fc      [inlined] tokio::runtime::task::harness::poll_future::{{closure}} (harness.rs:485)
28  core                            0x55b2de3481fc      [inlined] core::panic::unwind_safe::AssertUnwindSafe<T>::call_once (unwind_safe.rs:272)
29  std                             0x55b2de3481fc      [inlined] std::panicking::try::do_call (panicking.rs:552)
30  std                             0x55b2de3481fc      [inlined] std::panicking::try (panicking.rs:516)
31  std                             0x55b2de3481fc      [inlined] std::panic::catch_unwind (panic.rs:142)
32  tokio                           0x55b2de3481fc      [inlined] tokio::runtime::task::harness::poll_future (harness.rs:473)
33  tokio                           0x55b2de3481fc      [inlined] tokio::runtime::task::harness::Harness<T>::poll_inner (harness.rs:208)
34  tokio                           0x55b2de3481fc      tokio::runtime::task::harness::Harness<T>::poll (harness.rs:153)
35  tokio                           0x55b2df541fdb      tokio::runtime::blocking::pool::Inner::run
36  std                             0x55b2df53b597      std::sys_common::backtrace::__rust_begin_short_backtrace
37  core                            0x55b2df53bdb9      core::ops::function::FnOnce::call_once{{vtable.shim}}
38  alloc                           0x55b2df6664f5      [inlined] alloc::boxed::Box<T>::call_once (boxed.rs:2007)
39  alloc                           0x55b2df6664f5      [inlined] alloc::boxed::Box<T>::call_once (boxed.rs:2007)
40  std                             0x55b2df6664f5      std::sys::unix::thread::Thread::new::thread_start (thread.rs:108)
41  <unknown>                       0x7f0bc3c48ea7      start_thread
42  <unknown>                       0x7f0bc3a1ca2f      clone
43  <unknown>                       0x0                 <unknown>

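tl;dr: the panic happens under a `Handle::block_on` call inside `spawn_blocking`, where the thread-local tokio-epoll-uring `System` is created on a blocking-pool thread.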

Kernel version is 5.10.0-18-cloud-amd64

=> As per our findings in #6373 (comment), this means the process ran out of memlock rusage quota.
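
The panic itself comes from calling `unwrap()` on the result of building the ring. A minimal sketch of classifying that error instead of unwrapping it, assuming the failure surfaces as a `std::io::Error` carrying OS error code 12 (`ENOMEM`) as in the Sentry report. `RingSetupFailure` and `classify_ring_setup_error` are hypothetical names for illustration, not part of tokio-epoll-uring:

```rust
use std::io;

/// Linux errno value for ENOMEM; matches `code: 12` in the report above.
const ENOMEM: i32 = 12;

/// Hypothetical classification of a ring setup failure.
#[derive(Debug, PartialEq)]
enum RingSetupFailure {
    /// ENOMEM, which this issue interprets as memlock quota exhaustion.
    MemlockExhausted,
    /// Any other error, identified by its kind.
    Other(io::ErrorKind),
}

/// Map an I/O error from ring setup to a failure class the caller can act on
/// (for example: retry later or report a clear message) instead of panicking.
fn classify_ring_setup_error(err: &io::Error) -> RingSetupFailure {
    match err.raw_os_error() {
        Some(ENOMEM) => RingSetupFailure::MemlockExhausted,
        _ => RingSetupFailure::Other(err.kind()),
    }
}

fn main() {
    // Reconstruct the error from the report out of its raw OS error code.
    let enomem = io::Error::from_raw_os_error(ENOMEM);
    assert_eq!(
        classify_ring_setup_error(&enomem),
        RingSetupFailure::MemlockExhausted
    );

    // An unrelated error falls through to the generic case.
    let other = io::Error::new(io::ErrorKind::NotFound, "missing");
    assert_eq!(
        classify_ring_setup_error(&other),
        RingSetupFailure::Other(io::ErrorKind::NotFound)
    );
}
```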


Action Items

@problame problame self-assigned this Feb 7, 2024
problame added a commit that referenced this issue Feb 7, 2024
problame added a commit to neondatabase/tokio-epoll-uring that referenced this issue Feb 7, 2024
problame added a commit to neondatabase/tokio-epoll-uring that referenced this issue Feb 7, 2024
problame added a commit that referenced this issue Feb 14, 2024
… callers (#6731)

Some callers of `VirtualFile::crashsafe_overwrite` call it on the
executor thread, thereby potentially stalling it.

Others are more diligent and wrap it in `spawn_blocking(...,
Handle::block_on, ... )` to avoid stalling the executor thread.

However, because `crashsafe_overwrite` uses
VirtualFile::open_with_options internally, we spawn a new thread-local
`tokio-epoll-uring::System` in the blocking pool thread that's used for
the `spawn_blocking` call.
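
To illustrate the effect: a thread-local resource is instantiated once per thread that touches it, so every blocking-pool thread that reaches this code path gets its own instance. The sketch below simulates that with plain `std` threads. `FakeSystem`, `SYSTEMS_CREATED`, and the helper functions are hypothetical stand-ins, not tokio-epoll-uring types:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Counts how many fake systems have been created across all threads.
static SYSTEMS_CREATED: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a per-thread I/O system that holds kernel resources.
struct FakeSystem;

impl FakeSystem {
    fn launch() -> Self {
        SYSTEMS_CREATED.fetch_add(1, Ordering::SeqCst);
        FakeSystem
    }
}

thread_local! {
    // Lazily initialized on first access, once per thread.
    static THREAD_LOCAL_SYSTEM: FakeSystem = FakeSystem::launch();
}

/// Any I/O on this path touches the thread-local system.
fn do_io_on_current_thread() {
    THREAD_LOCAL_SYSTEM.with(|_system| {
        // A real implementation would submit an operation here.
    });
}

/// Run I/O on `num_threads` fresh threads; return how many systems that created.
fn systems_created_by(num_threads: usize) -> usize {
    let before = SYSTEMS_CREATED.load(Ordering::SeqCst);
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            thread::spawn(|| {
                // Two operations on the same thread share one system.
                do_io_on_current_thread();
                do_io_on_current_thread();
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    SYSTEMS_CREATED.load(Ordering::SeqCst) - before
}

fn main() {
    // One system per thread, regardless of how many operations each thread runs.
    assert_eq!(systems_created_by(8), 8);
}
```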

This PR refactors the situation such that we do the `spawn_blocking`
inside `VirtualFile::crashsafe_overwrite`. This unifies the situation
for the better:

1. Callers who didn't wrap in `spawn_blocking(..., Handle::block_on,
...)` before no longer stall the executor.
2. Callers who did it before now can avoid the `block_on`, resolving the
problem with the short-lived `tokio-epoll-uring::System`s in the
blocking pool threads.

A future PR will build on top of this and divert to tokio-epoll-uring if
it's configured as the IO engine.

Changes
-------

- Convert implementation to std::fs and move it into `crashsafe.rs`
  - Yes, I know, Safekeepers (cc @arssher) added `durable_rename` and
    `fsync_async_opt` recently. However, `crashsafe_overwrite` is different
    in the sense that it's higher level, i.e., it's more like
    `std::fs::write` and the Safekeeper team's code is more building block
    style.
  - The consequence is that we don't use the VirtualFile file descriptor
    cache anymore.
    - I don't think it's a big deal because we have plenty of slack wrt
      the production file descriptor rlimit (see [this
      dashboard](https://neonprod.grafana.net/d/e4a40325-9acf-4aa0-8fd9-f6322b3f30bd/pageserver-open-file-descriptors?orgId=1))

- Use `tokio::task::spawn_blocking` in
`VirtualFile::crashsafe_overwrite` to call the new
`crashsafe::overwrite` API.
- Inspect all callers to remove any double-`spawn_blocking`
- `spawn_blocking` requires the captured data to be `'static + Send`. So,
refactor the callers. We'll need this for future tokio-epoll-uring
support anyway, because tokio-epoll-uring requires owned buffers.
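
The changes listed above can be sketched with `std` only. This is a rough illustration under stated assumptions, not the code in `crashsafe.rs`: `overwrite` and `overwrite_on_blocking_thread` are hypothetical names, `std::thread::spawn` stands in for `tokio::task::spawn_blocking` (both require the closure to be `'static + Send`, hence the owned `PathBuf` and `Vec<u8>` arguments), and the write-temp-file, fsync, rename, fsync-parent sequence is the common crash-safe overwrite pattern on Unix-like systems, assumed here rather than taken from the PR.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};
use std::thread::{self, JoinHandle};

/// Hypothetical crash-safe overwrite: after a crash, `final_path` should hold
/// either the old content or the new content, never a partial write.
fn overwrite(final_path: &Path, tmp_path: &Path, content: &[u8]) -> io::Result<()> {
    let parent = match final_path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        Some(_) => Path::new("."), // bare file name: parent is the current directory
        None => {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "final_path has no parent directory",
            ))
        }
    };
    // Remove a stale temp file left behind by an earlier, interrupted attempt.
    match fs::remove_file(tmp_path) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::NotFound => {}
        Err(e) => return Err(e),
    }
    // Write the new content to the temp file and flush it to disk.
    let mut file = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(tmp_path)?;
    file.write_all(content)?;
    file.sync_all()?;
    drop(file);
    // Atomically replace the final file, then persist the directory entry.
    fs::rename(tmp_path, final_path)?;
    File::open(parent)?.sync_all()?;
    Ok(())
}

/// Owned arguments make the closure `'static + Send`, so it can move to another thread.
fn overwrite_on_blocking_thread(
    final_path: PathBuf,
    tmp_path: PathBuf,
    content: Vec<u8>,
) -> JoinHandle<io::Result<()>> {
    thread::spawn(move || overwrite(&final_path, &tmp_path, &content))
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("crashsafe-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let final_path = dir.join("metadata");
    let tmp_path = dir.join("metadata.tmp");
    fs::write(&final_path, b"old")?;

    overwrite_on_blocking_thread(final_path.clone(), tmp_path.clone(), b"new".to_vec())
        .join()
        .expect("blocking thread panicked")?;

    assert_eq!(fs::read(&final_path)?, b"new".to_vec());
    assert!(!tmp_path.exists());
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

Because the blocking work happens inside the helper, callers in this sketch only hand over owned data and wait for the result; they do not need their own `spawn_blocking` plus `Handle::block_on` wrapper.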

Related Issues
--------------

- overall epic to enable write path to tokio-epoll-uring: #6663
- this is also kind of relevant to the tokio-epoll-uring System creation
failures that we encountered in staging, investigation being tracked in
#6667
  - why is it relevant? Because this PR removes two uses of
    `spawn_blocking+Handle::block_on`