Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate tokio panic (assertion failed: self.ref_count() > 0) #7162

Closed
VladLazar opened this issue Mar 18, 2024 · 10 comments
Closed

Investigate tokio panic (assertion failed: self.ref_count() > 0) #7162

VladLazar opened this issue Mar 18, 2024 · 10 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged

Comments

@VladLazar
Copy link
Contributor

VladLazar commented Mar 18, 2024

Sentry link: https://neondatabase.sentry.io/issues/5076699787/?alert_rule_id=13489026&alert_type=issue&notification_uuid=821438c5-fe8b-4aef-94a9-4f932dd3c959&project=4504220031582208&referrer=slack

Slack thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1710750099615559

Current theories are:

  1. tokio bug
  2. leaky_bucket bug

Call site:

let poll = acquire.as_mut().poll(cx);

        std::future::poll_fn(|cx| {
            use std::future::Future;
            let poll = acquire.as_mut().poll(cx);
            did_throttle = did_throttle || poll.is_pending();
            poll
        })
        .await;

Backtrace:

panic{thread=compute request worker location=/home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/state.rs:592:9}: assertion failed: self.ref_count() > 0
Stack backtrace:
   0: utils::logging::tracing_panic_hook
             at /home/nonroot/libs/utils/src/logging.rs:206:21
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2029:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:783:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:649:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys_common/backtrace.rs:171:18
   5: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   6: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   7: core::panicking::panic
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:144:5
   8: tokio::runtime::task::state::State::transition_to_notified_by_val
   9: tokio::runtime::task::waker::wake_by_val
  10: leaky_bucket::AcquireState::drain_wait_queue
  11: leaky_bucket::AcquireState::drain_core
  12: <leaky_bucket::Acquire as core::future::future::Future>::poll
  13: <tokio::task::unconstrained::Unconstrained<F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/unconstrained.rs:25:57
  14: tokio::runtime::coop::with_budget
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:107:5
  15: tokio::runtime::coop::with_unconstrained
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/coop.rs:80:5
  16: <tokio::task::unconstrained::Unconstrained<F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/unconstrained.rs:25:13
  17: pageserver::tenant::throttle::Throttle<M>::throttle::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/tenant/throttle.rs:146:24
  18: <core::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/poll_fn.rs:151:9
  19: pageserver::tenant::throttle::Throttle<M>::throttle::{{closure}}
             at /home/nonroot/pageserver/src/tenant/throttle.rs:150:10
  20: pageserver::tenant::timeline::Timeline::get::{{closure}}
             at /home/nonroot/pageserver/src/tenant/timeline.rs:657:53
  21: pageserver::pgdatadir_mapping::Version::get::{{closure}}
             at /home/nonroot/pageserver/src/pgdatadir_mapping.rs:1654:63
  22: pageserver::pgdatadir_mapping::<impl pageserver::tenant::timeline::Timeline>::get_rel_page_at_lsn::{{closure}}
             at /home/nonroot/pageserver/src/pgdatadir_mapping.rs:199:37
  23: pageserver::page_service::PageServerHandler::handle_get_page_at_lsn_request::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1156:14
  24: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  25: pageserver::page_service::PageServerHandler::handle_get_page_at_lsn_request::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1108:5
  26: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  27: pageserver::page_service::PageServerHandler::handle_pagerequests::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:646:30
  28: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  29: pageserver::page_service::PageServerHandler::handle_pagerequests::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:548:5
  30: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1408:18
  31: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  32: <pageserver::page_service::PageServerHandler as postgres_backend::Handler<IO>>::process_query::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:1375:5
  33: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/future.rs:124:9
  34: postgres_backend::PostgresBackend<IO>::process_message::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:747:75
  35: postgres_backend::PostgresBackend<IO>::run_message_loop::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:485:80
  36: postgres_backend::PostgresBackend<IO>::run::{{closure}}
             at /home/nonroot/libs/postgres_backend/src/lib.rs:414:14
  37: pageserver::page_service::page_service_conn_main::{{closure}}::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:275:10
  38: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.37/src/instrument.rs:272:9
  39: pageserver::page_service::page_service_conn_main::{{closure}}
             at /home/nonroot/pageserver/src/page_service.rs:211:1
  40: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::future::future::Future>::poll
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panic/unwind_safe.rs:297:9
  41: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/future/catch_unwind.rs:36:42
  42: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panic/unwind_safe.rs:272:9
  43: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
  44: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
  45: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
  46: <futures_util::future::future::catch_unwind::CatchUnwind<Fut> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-util-0.3.30/src/future/future/catch_unwind.rs:36:9
  47: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:347:31
  48: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:217:19
  49: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:343:19
  50: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:347:31
  51: tokio::task::task_local::LocalKey<T>::scope_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:217:19
  52: <tokio::task::task_local::TaskLocalFuture<T,F> as core::future::future::Future>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/task/task_local.rs:343:19
  53: pageserver::task_mgr::task_wrapper::{{closure}}
             at /home/nonroot/pageserver/src/task_mgr.rs:398:10
  54: tokio::runtime::task::core::Core<T,S>::poll::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:328:17
  55: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/loom/std/unsafe_cell.rs:16:9
  56: tokio::runtime::task::core::Core<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/core.rs:317:13
  57: tokio::runtime::task::harness::poll_future::{{closure}}
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:485:19
  58: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panic/unwind_safe.rs:272:9
  59: std::panicking::try::do_call
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:552:40
  60: std::panicking::try
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:516:19
  61: std::panic::catch_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panic.rs:142:14
  62: tokio::runtime::task::harness::poll_future
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:473:18
  63: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:208:27
  64: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.36.0/src/runtime/task/harness.rs:153:15
  65: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
@VladLazar VladLazar added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Mar 18, 2024
@VladLazar VladLazar self-assigned this Mar 18, 2024
@jcsp jcsp added the triaged bugs that were already triaged label Mar 21, 2024
@jcsp
Copy link
Collaborator

jcsp commented Mar 21, 2024

90% confident that it's not a neon bug

@jcsp
Copy link
Collaborator

jcsp commented Mar 22, 2024

@jcsp jcsp changed the title Investigate throttling panic Investigate throttling panic (assertion failed: self.ref_count() > 0) Mar 22, 2024
@VladLazar
Copy link
Contributor Author

Turns out we are on the latest tokio version: e62baa9

@Noah-Kennedy
Copy link

@VladLazar I'm investigating this right now on the tokio side. We think this is a memory corruption issue unrelated to tokio which tokio is the first to yell about.

Have you folks done any other dependency updates recently that line up?

Also a possibility is that it's actually a recent neon issue, but I'm strongly leaning towards this being another common crate, likely in tokio's tree, given the number of reports.

@jcsp jcsp changed the title Investigate throttling panic (assertion failed: self.ref_count() > 0) Investigate tokio panic (assertion failed: self.ref_count() > 0) Mar 26, 2024
@VladLazar
Copy link
Contributor Author

Got another panic today (sentry link). The stack trace is identical to the other throttling panic. This makes me suspect that that something is wrong on our side (or in leaky bucket).

I'll try to throttle aggressively under pagebench. Perhaps it reproes.

@Noah-Kennedy
Copy link

Interesting!

Keep me in the loop here, if this ends up being an issue on your end that helps us narrow down what could be going on with our other reports, as our current theory is that there's a common issue

@VladLazar
Copy link
Contributor Author

VladLazar commented Apr 3, 2024

Interesting observation: Peter's benchmarks seem to repro this. We have seen the throttling rooted panic two times in staging in the past 7 days. Both times it was on Peter's tenants. Metric collection is broken during the benchmarks so, we are a bit blind, but CPU usage is high.

@VladLazar
Copy link
Contributor Author

Got a new panic in prod eu-central-1@pageserver-4

	2024-04-23 20:14:23.837	
2024-04-23T20:14:23.750193Z ERROR panic{thread=compute request worker location=/home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/state.rs:195:9}: current: 1, sub: 2
Stack backtrace:
   0: utils::logging::tracing_panic_hook
             at /home/nonroot/libs/utils/src/logging.rs:206:21
   1: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
   2: std::panicking::rust_panic_with_hook
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
   3: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
   4: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
   5: rust_begin_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
   6: core::panicking::panic_fmt
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
   7: tokio::runtime::task::state::State::transition_to_terminal
   8: tokio::runtime::task::harness::Harness<T,S>::complete
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:335:12
   9: tokio::runtime::task::harness::Harness<T,S>::poll
             at /home/nonroot/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.37.0/src/runtime/task/harness.rs:168:17
  10: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  11: tokio::runtime::scheduler::multi_thread::worker::Context::run
  12: tokio::runtime::context::runtime::enter_runtime
  13: tokio::runtime::scheduler::multi_thread::worker::run
  14: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  15: tokio::runtime::task::core::Core<T,S>::poll
  16: tokio::runtime::task::harness::Harness<T,S>::poll
  17: tokio::runtime::blocking::pool::Inner::run
  18: std::sys_common::backtrace::__rust_begin_short_backtrace
  19: core::ops::function::FnOnce::call_once{{vtable.shim}}
  20: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  21: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  22: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys/pal/unix/thread.rs:108:17
  23: start_thread
             at ./nptl/pthread_create.c:477:8
  24: clone
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@conradludgate
Copy link
Contributor

We think this is a memory corruption issue unrelated to tokio which tokio is the first to yell about.

This is my current understanding as well. I found a way to trigger both panics seen by memcpy-ing wakers and immediately dropping them (violating the Drop/!Copy contract).

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=5d2e870fb3e5bf36dfd11e0233c438d5

I cannot find any other way at the moment to trigger the same behaviour with any code in tokio 1.37.0. When investigating any dependencies, looking for potentially double-free bugs is what we should be looking for

@erikgrinaker
Copy link
Contributor

Haven't seen any recent instances of this. Can we close this issue out @VladLazar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged
Projects
None yet
Development

No branches or pull requests

5 participants