
[DocDB][Pg cron] Intermittent Failure of Cron Job Execution in Stress Test Runs Despite Latest Fixes on Master #24658

Closed
shishir2001-yb opened this issue Oct 28, 2024 · 1 comment
Labels
2024.2_blocker, 2024.2.0_blocker, 2024.2.1_blocker, area/docdb (YugabyteDB core features), kind/bug (This issue is a bug), priority/high (High Priority), qa_stress (Bugs identified via Stress automation), QA (QA filed bugs)

Comments


shishir2001-yb commented Oct 28, 2024

Jira Link: DB-13724

Description

Version: 2.25.0.0-b214
Logs: Added in Jira

 DB name: postgres_24 
Validation failed for Job Id: 389 between [2024-10-28 06:51:08.113757+00:00, 2024-10-28 07:01:08.113838+00:00],
 Actual: 0
 Expected: 10

Additionally, the count of executed jobs remains constant, suggesting that no new cron jobs are being triggered.

yugabyte=# select count(*) from cron.job_run_details ;
 count 
-------
 69826
(1 row)
yugabyte=# select count(*) from cron.job;
 count 
-------
  2840
(1 row)

yugabyte=# select count(*) from cron.job_run_details ;
 count 
-------
 69826
(1 row)

yugabyte=# select jobid, runid, job_pid, database, username, status, return_message, start_time, end_time from cron.job_run_details where status = 'running';
 jobid | runid | job_pid |  database   | username | status  | return_message |          start_time           | end_time 
-------+-------+---------+-------------+----------+---------+----------------+-------------------------------+----------
   379 | 69969 |  273161 | postgres_23 | yugabyte | running |                | 2024-10-28 06:40:09.115879+00 | 
   336 | 69967 |  273138 | postgres_19 | yugabyte | running |                | 2024-10-28 06:40:09.049794+00 | 
  1794 | 69971 |  273178 | postgres_41 | yugabyte | running |                | 2024-10-28 06:40:09.132326+00 | 
(3 rows)
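
The three rows above had been in the 'running' state since 06:40:09, before the validation window for job 389 even began. A small diagnostic query along these lines (a sketch against the cron.job_run_details columns shown above) surfaces how long each open run has been stuck:

-- Elapsed time of each job run still marked 'running'.
SELECT jobid, runid, database,
       now() - start_time AS running_for
FROM cron.job_run_details
WHERE status = 'running'
ORDER BY start_time;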

yugabyte=# SHOW cron.max_running_jobs;
 cron.max_running_jobs 
-----------------------
 5
(1 row)

yugabyte=# SHOW max_worker_processes;
 max_worker_processes 
----------------------
 8
(1 row)

yugabyte=# select count(*) from cron.job_run_details ;
 count 
-------
 69826
(1 row)
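
For context, the validation that failed counts runs of a given job inside a fixed time window. A minimal sketch of an equivalent check (assuming, per the failure message above, that a 1-minute job should produce about 10 runs in a 10-minute window):

-- Runs of job 389 inside the failed validation window; ~10 expected, 0 observed.
SELECT count(*)
FROM cron.job_run_details
WHERE jobid = 389
  AND start_time >= '2024-10-28 06:51:08+00'
  AND start_time <  '2024-10-28 07:01:08+00';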

Test description

"""
        1. Create a cluster with enable_pg_cron g-flag and other required g-flags.
        2. Create 100 databases (50 colocated + 50 non-colocated).
        3. Start SqlCrossDBLoadWithDDL workload this should create the below
             - 2 tables
             - A materialized view on each table
        4. Create an age table which will be used to INSERT,UPDATE AND DELETE ROWS
        5. Create PG cron extension on yugabyte keyspace
        6. Connect to 'yugabyte' database and schedule the following Cron Jobs:
          a. Refresh Materialized views (2-minute interval, every week, every month and
             every year)
          b. Insert data (1-minute, 2 minutes, 5 minutes, every week, every month and
             every year)
          c. Update data (1-minute, 2 minutes, 5 minutes, every week, every month and
             every year)
          d. Delete data (1-minute, 2 minutes, 5 minutes, every week, every month and
             every year)
          e. Explicit Transaction(INSERT,UPDATE AND DELETE) (1-minute)
          e. Validate these cron jobs are getting executed.
        7. Start a loop and execute the following for 4 hours:
           a. Start nemesis (Node restart, Network slowdown, Instance termination,
              Master/tserver stop/start, NetworkPartitionNemesis).
           b. Deactivate most cron jobs(Cron jobs from 98 DBs)
           c. Verify deactivated jobs are not running
           d. Update some active cron jobs(From remaining 2 DBs)
           e. Verify updated jobs are running correctly
           f. Reactivate deactivated jobs
           g. Verify reactivated jobs are running
           h. With a 10% chance:
              i.   Create a new database
              ii.  Change ysql_cron_database_name g-flag to the new database
              iii. Create pg cron extension on this new DB
              iv.  Verify old cron jobs are not running
               v.  Schedule new cron jobs via the new database
              vi.  Verify new cron jobs are running

        9. Clean up and end the test.
        """

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
shishir2001-yb added the area/ysql (Yugabyte SQL (YSQL)), priority/high (High Priority), QA (QA filed bugs), status/awaiting-triage (Issue awaiting triage), qa_stress (Bugs identified via Stress automation), and 2024.2_blocker labels Oct 28, 2024
yugabyte-ci added the kind/bug (This issue is a bug) label Oct 28, 2024
shishir2001-yb added the area/docdb (YugabyteDB core features) label and removed the area/ysql (Yugabyte SQL (YSQL)) label Oct 28, 2024
shishir2001-yb changed the title from "[YSQL][Pg cron] Intermittent Failure of Cron Job Execution in Stress Test Runs Despite Latest Fixes on Master" to "[DocDB][Pg cron] Intermittent Failure of Cron Job Execution in Stress Test Runs Despite Latest Fixes on Master" Oct 28, 2024
yugabyte-ci added the 2024.2.1_blocker label and removed the status/awaiting-triage (Issue awaiting triage) label Oct 28, 2024

hari90 commented Oct 29, 2024

$ yb-llvm-v19.1.0-yb-1-1726858152-a2a6b655-almalinux8-x86_64/bin/lldb /home/yugabyte/yb-software/yugabyte-2.25.0.0-b214-centos-x86_64/postgres/bin/postgres -c core_yb.1730150812.postgres.269406.859641 --batch -o "bt all"
(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.25.0.0-b214-centos-x86_64/postgres/bin/postgres" --core "core_yb.1730150812.postgres.269406.859641"
Core file '/mnt/d0/cores/core_yb.1730150812.postgres.269406.859641' (x86_64) was loaded.
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV
  * frame #0: 0x00007fa2709b7247 libc.so.6`epoll_wait + 87
    frame #1: 0x000056321ea90821 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x000005fa7fd26ed0, cur_timeout=<unavailable>, occurred_events=0x00007fff68fefc20, nevents=1) at latch.c:1503:7
    frame #2: 0x000056321ea90806 postgres`WaitEventSetWait(set=0x000005fa7fd26ed0, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1449:8
    frame #3: 0x000056321e9df8c3 postgres`WaitForBackgroundWorkerShutdown [inlined] WaitLatch(latch=<unavailable>, wakeEvents=17, timeout=0, wait_event_info=134217733) at latch.c:511:6
    frame #4: 0x000056321e9df88b postgres`WaitForBackgroundWorkerShutdown(handle=0x000005fa7f7aed98) at bgworker.c:1195:8
    frame #5: 0x00007fa26bbeedad pg_cron.so`PgCronLauncherMain [inlined] ManageCronTask(task=0x000005fa7f7aed38, currentTime=783412809163780) at pg_cron.c:1839:5
    frame #6: 0x00007fa26bbee250 pg_cron.so`PgCronLauncherMain [inlined] ManageCronTasks(taskList=0x000005fa7f90a188, currentTime=783412809163780) at pg_cron.c:1364:3
    frame #7: 0x00007fa26bbee1db pg_cron.so`PgCronLauncherMain(arg=<unavailable>) at pg_cron.c:721:3
    frame #8: 0x000056321e9de6a5 postgres`StartBackgroundWorker at bgworker.c:870:2
    frame #9: 0x000056321e9ea289 postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x000005fa7fddc000) at postmaster.c:6310:4
    frame #10: 0x000056321e9ea217 postgres`maybe_start_bgworkers at postmaster.c:6534:9
    frame #11: 0x000056321e9ea6d7 postgres`ServerLoop at postmaster.c:1955:4
    frame #12: 0x000056321e9e6824 postgres`PostmasterMain(argc=25, argv=0x000005fa7fd241a0) at postmaster.c:1538:11
    frame #13: 0x000056321e8e038c postgres`PostgresServerProcessMain(argc=25, argv=0x000005fa7fd241a0) at main.c:208:3
    frame #14: 0x000056321e529012 postgres`main + 34
    frame #15: 0x00007fa2708c2d85 libc.so.6`__libc_start_main + 229
    frame #16: 0x000056321e528f2e postgres`_start + 46
  thread #2, stop reason = signal 0
    frame #0: 0x00007fa270c5b45c libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 508
    frame #1: 0x00007fa26cf7a617 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] void boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock>(this=0x000005fa7fd6cb70, lock=0x00007fa260db1098) at posix_event.hpp:119:7
    frame #2: 0x00007fa26cf7a5fa libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::conditionally_enabled_event::wait(this=0x000005fa7fd6cb68, lock=0x00007fa260db1098) at conditionally_enabled_event.hpp:97:14
    frame #3: 0x00007fa26cf7a5f0 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler::do_run_one(this=0x000005fa7fd6cb00, lock=0x00007fa260db1098, this_thread=0x00007fa260db0fc0, ec=0x00007fa260db10f0) at scheduler.ipp:501:21
    frame #4: 0x00007fa26cf7a4f1 libyrpc.so`boost::asio::detail::scheduler::run(this=0x000005fa7fd6cb00, ec=0x00007fa260db10f0) at scheduler.ipp:210:10
    frame #5: 0x00007fa26cf79d47 libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x00007fa260db10f0) at io_context.ipp:71:16
    frame #6: 0x00007fa26cf79d3a libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #7: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd43c0) const at function.h:517:16
    frame #8: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000005fa7fdd43c0) const at function.h:1168:12
    frame #9: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4360) at thread.cc:884:3
    frame #10: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #11: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #3, stop reason = signal 0
    frame #0: 0x00007fa270c5b45c libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 508
    frame #1: 0x00007fa26cf7a617 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] void boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock>(this=0x000005fa7fd6cb70, lock=0x00007fa260f34098) at posix_event.hpp:119:7
    frame #2: 0x00007fa26cf7a5fa libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::conditionally_enabled_event::wait(this=0x000005fa7fd6cb68, lock=0x00007fa260f34098) at conditionally_enabled_event.hpp:97:14
    frame #3: 0x00007fa26cf7a5f0 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler::do_run_one(this=0x000005fa7fd6cb00, lock=0x00007fa260f34098, this_thread=0x00007fa260f33fc0, ec=0x00007fa260f340f0) at scheduler.ipp:501:21
    frame #4: 0x00007fa26cf7a4f1 libyrpc.so`boost::asio::detail::scheduler::run(this=0x000005fa7fd6cb00, ec=0x00007fa260f340f0) at scheduler.ipp:210:10
    frame #5: 0x00007fa26cf79d47 libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x00007fa260f340f0) at io_context.ipp:71:16
    frame #6: 0x00007fa26cf79d3a libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #7: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd4720) const at function.h:517:16
    frame #8: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000005fa7fdd4720) const at function.h:1168:12
    frame #9: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd46c0) at thread.cc:884:3
    frame #10: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #11: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #4, stop reason = signal 0
    frame #0: 0x00007fa2709b7247 libc.so.6`epoll_wait + 87
    frame #1: 0x00007fa26c634c42 libev.so.4`epoll_poll + 82
    frame #2: 0x00007fa26c637ed4 libev.so.4`ev_run + 1956
    frame #3: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd5020) const at function.h:517:16
    frame #4: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this= Function = yb::pggate::PgApiImpl::Interrupter::RunThread() ) const at function.h:1168:12
    frame #5: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4fc0) at thread.cc:884:3
    frame #6: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #7: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #5, stop reason = signal 0
    frame #0: 0x00007fa2709b7247 libc.so.6`epoll_wait + 87
    frame #1: 0x00007fa26cf7d31f libyrpc.so`boost::asio::detail::epoll_reactor::run(this=0x000005fa7fc4eb60, usec=<unavailable>, ops=0x00007fa260e32020) at epoll_reactor.ipp:501:20
    frame #2: 0x00007fa26cf7a696 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler::do_run_one(this=0x000005fa7fd6cb00, lock=0x00007fa260e32098, this_thread=0x00007fa260e31fc0, ec=0x00007fa260e320f0) at scheduler.ipp:476:16
    frame #3: 0x00007fa26cf7a4f1 libyrpc.so`boost::asio::detail::scheduler::run(this=0x000005fa7fd6cb00, ec=0x00007fa260e320f0) at scheduler.ipp:210:10
    frame #4: 0x00007fa26cf79d47 libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x00007fa260e320f0) at io_context.ipp:71:16
    frame #5: 0x00007fa26cf79d3a libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #6: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd4960) const at function.h:517:16
    frame #7: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000005fa7fdd4960) const at function.h:1168:12
    frame #8: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4900) at thread.cc:884:3
    frame #9: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #10: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #6, stop reason = signal 0
    frame #0: 0x00007fa270c5b45c libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 508
    frame #1: 0x00007fa26cf7a617 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] void boost::asio::detail::posix_event::wait<boost::asio::detail::conditionally_enabled_mutex::scoped_lock>(this=0x000005fa7fd6cb70, lock=0x00007fa260eb3098) at posix_event.hpp:119:7
    frame #2: 0x00007fa26cf7a5fa libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::conditionally_enabled_event::wait(this=0x000005fa7fd6cb68, lock=0x00007fa260eb3098) at conditionally_enabled_event.hpp:97:14
    frame #3: 0x00007fa26cf7a5f0 libyrpc.so`boost::asio::detail::scheduler::run(boost::system::error_code&) [inlined] boost::asio::detail::scheduler::do_run_one(this=0x000005fa7fd6cb00, lock=0x00007fa260eb3098, this_thread=0x00007fa260eb2fc0, ec=0x00007fa260eb30f0) at scheduler.ipp:501:21
    frame #4: 0x00007fa26cf7a4f1 libyrpc.so`boost::asio::detail::scheduler::run(this=0x000005fa7fd6cb00, ec=0x00007fa260eb30f0) at scheduler.ipp:210:10
    frame #5: 0x00007fa26cf79d47 libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute() [inlined] boost::asio::io_context::run(this=<unavailable>, ec=0x00007fa260eb30f0) at io_context.ipp:71:16
    frame #6: 0x00007fa26cf79d3a libyrpc.so`yb::rpc::IoThreadPool::Impl::Execute(this=<unavailable>) at io_thread_pool.cc:76:17
    frame #7: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd44e0) const at function.h:517:16
    frame #8: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this=0x000005fa7fdd44e0) const at function.h:1168:12
    frame #9: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4480) at thread.cc:884:3
    frame #10: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #11: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #7, stop reason = signal 0
    frame #0: 0x00007fa2709b7247 libc.so.6`epoll_wait + 87
    frame #1: 0x00007fa26c634c42 libev.so.4`epoll_poll + 82
    frame #2: 0x00007fa26c637ed4 libev.so.4`ev_run + 1956
    frame #3: 0x00007fa26cfa1c79 libyrpc.so`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000005fa7f875438, flags=0) at ev++.h:211:7
    frame #4: 0x00007fa26cfa1c6e libyrpc.so`yb::rpc::Reactor::RunThread(this=0x000005fa7f875400) at reactor.cc:733:9
    frame #5: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd4cc0) const at function.h:517:16
    frame #6: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this= Function = yb::rpc::Reactor::RunThread() ) const at function.h:1168:12
    frame #7: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4c60) at thread.cc:884:3
    frame #8: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #9: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #8, stop reason = signal 0
    frame #0: 0x00007fa2709b7247 libc.so.6`epoll_wait + 87
    frame #1: 0x00007fa26c634c42 libev.so.4`epoll_poll + 82
    frame #2: 0x00007fa26c637ed4 libev.so.4`ev_run + 1956
    frame #3: 0x00007fa26cfa1c79 libyrpc.so`yb::rpc::Reactor::RunThread() [inlined] ev::loop_ref::run(this=0x000005fa7f874f38, flags=0) at ev++.h:211:7
    frame #4: 0x00007fa26cfa1c6e libyrpc.so`yb::rpc::Reactor::RunThread(this=0x000005fa7f874f00) at reactor.cc:733:9
    frame #5: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd4a80) const at function.h:517:16
    frame #6: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this= Function = yb::rpc::Reactor::RunThread() ) const at function.h:1168:12
    frame #7: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd4a20) at thread.cc:884:3
    frame #8: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #9: 0x00007fa2708c1e73 libc.so.6`__clone + 67
  thread #9, stop reason = signal 0
    frame #0: 0x00007fa270c5b45c libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 508
    frame #1: 0x00007fa270ef1602 libc++.so.1`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #2: 0x00007fa26cfe4235 libyrpc.so`yb::rpc::(anonymous namespace)::Worker::Execute() [inlined] yb::rpc::(anonymous namespace)::Worker::PopTask(this=0x000005fa7fd2bdc0, task=0x00007fa260b990b8) at thread_pool.cc:144:13
    frame #3: 0x00007fa26cfe3db8 libyrpc.so`yb::rpc::(anonymous namespace)::Worker::Execute(this=0x000005fa7fd2bdc0) at thread_pool.cc:114:11
    frame #4: 0x00007fa26c9df2c0 libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::__function::__value_func<void ()>::operator()[abi:ue170006](this=0x000005fa7fdd5380) const at function.h:517:16
    frame #5: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(void*) [inlined] std::__1::function<void ()>::operator()(this= Function = yb::rpc::(anonymous namespace)::Worker::Execute() ) const at function.h:1168:12
    frame #6: 0x00007fa26c9df2aa libyb_util.so`yb::Thread::SuperviseThread(arg=0x000005fa7fdd5320) at thread.cc:884:3
    frame #7: 0x00007fa270c551ca libpthread.so.0`start_thread + 234
    frame #8: 0x00007fa2708c1e73 libc.so.6`__clone + 67


hari90 closed this as completed in d07daec Oct 30, 2024
yugabyte-ci reopened this Oct 31, 2024
hari90 added a commit that referenced this issue Oct 31, 2024
Summary:
Original commit: d07daec / D39552
Switch pg_cron to using the normal `SIGTERM` and `SIGQUIT` handlers.

In an attempt to fix stuck pg_cron workers, the `SIGTERM` handler was changed to `quickdie` in 4b4c201/D37591. But in `quickdie` mode postgres does not invoke `ReportBackgroundWorkerExit`, which signals the parent process when its child backend exits. This causes the pg_cron launcher to get stuck in `WaitForBackgroundWorkerShutdown`.

With 111b65d/D39207, the issues with pg shutdown have been addressed, and it is now safe to switch to the normal SIGTERM behavior, which makes the background workers `die` in a clean manner.

#24706 tracks the issue where, in yb, `ReportBackgroundWorkerExit` is not called when the backend exits non-gracefully.

Fixes #24658
Jira: DB-13724

Test Plan: PgCronTest, DeactivateRunningJob

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39617
hari90 added a commit that referenced this issue Oct 31, 2024
Summary:
Original commit: d07daec / D39552
Switch pg_cron to using the normal `SIGTERM` and `SIGQUIT` handlers.

In an attempt to fix stuck pg_cron workers, the `SIGTERM` handler was changed to `quickdie` in 4b4c201/D37591. But in `quickdie` mode postgres does not invoke `ReportBackgroundWorkerExit`, which signals the parent process when its child backend exits. This causes the pg_cron launcher to get stuck in `WaitForBackgroundWorkerShutdown`.

With 111b65d/D39207, the issues with pg shutdown have been addressed, and it is now safe to switch to the normal SIGTERM behavior, which makes the background workers `die` in a clean manner.

#24706 tracks the issue where, in yb, `ReportBackgroundWorkerExit` is not called when the backend exits non-gracefully.

Fixes #24658
Jira: DB-13724

Test Plan: PgCronTest, DeactivateRunningJob

Reviewers: telgersma

Reviewed By: telgersma

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39621