Add self-check pg_notify messages for silently dropped connections #14749

AlanCoding · 2024-01-09T16:15:09Z

SUMMARY

The problem this addresses is roughly established by this form of testing:

awx-manage run_dispatcher --status
strace -f -s 1024 -p 8491
lsof -p 8491 | grep 28
# run as root
iptables -I OUTPUT -p tcp --sport 34586 -j DROP

The dispatcher does not detect the dropped connection, which is bad. It is left in a state where it no longer receives messages and, worse, does not know that it cannot receive them. I have poked around several options for testing the connection some other way, and they all seem terrible. I think I've concluded that I would need threading to test the connection with SELECT 1, because the very scenario I want to diagnose causes that query to hang, which would make the dispatcher even less functional than if we had not added the check.
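To illustrate why a plain SELECT 1 check would need threading, here is a minimal sketch of running a probe that may hang in a daemon thread with a deadline. This is not what this PR implements; `check_with_timeout` and `probe` are hypothetical names, and the probe stands in for whatever callable would issue SELECT 1 on the listener connection.

```python
import threading


def check_with_timeout(probe, timeout=5.0):
    """Run probe() in a daemon thread (hypothetical helper, not AWX code).

    Returns True only if probe() finished without raising before the timeout.
    A hung probe is abandoned; the daemon thread does not block process exit.
    """
    result = {}

    def target():
        try:
            probe()
            result['ok'] = True
        except Exception:
            result['ok'] = False

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout)  # returns after `timeout` seconds even if probe() hangs
    return result.get('ok', False)
```

The caller gets an answer within roughly `timeout` seconds, but a stuck probe thread is leaked each time, which is part of why this approach is unattractive.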

That leads me to the nuclear option - this. This is how it looks from logs:

tools_awx_1 | 2024-01-09 15:56:38,547 ERROR    [-] awx.main.dispatch Error consuming new events from postgres, will retry for 40 s
tools_awx_1 | Traceback (most recent call last):
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 267, in run
tools_awx_1 |     conn.select_timeout = self.run_periodic_tasks(conn)
tools_awx_1 |   File "/awx_devel/awx/main/dispatch/worker/base.py", line 237, in run_periodic_tasks
tools_awx_1 |     raise db.DatabaseError(f'pg_notify self-check missing after {delta:.3f}s, did you drop the connection?')
tools_awx_1 | django.db.utils.DatabaseError: pg_notify self-check missing after 26.015s, did you drop the connection?
tools_awx_1 | 2024-01-09 15:56:39,576 INFO     [-] awx.main.dispatch Dispatcher listener connection established

The obvious drawback is that this is slow to detect problems. I've set the self-check to send every 60 seconds, and receiving has a tolerance of 20 seconds plus whatever task interval we encounter, hence the 26 seconds in the log above. These delays are additive, and the retry loop adds at least 1 second of delay, so call it 60 + 26 + 1 = 87 seconds to recover here. To articulate my feelings:

The advantages of this are that it detects this state when nothing else can, and that it goes into the error-handling loop we want, which reconnects but doesn't arbitrarily fail jobs in progress.
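The send-and-expect bookkeeping described above could be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the class `SelfCheckTracker`, its method names, and the use of `ConnectionError` are assumptions (the traceback above shows the real code raising `db.DatabaseError`). Only the 60-second send interval and 20-second tolerance come from the description.

```python
import time


class SelfCheckTracker:
    """Hypothetical bookkeeping for a pg_notify self-check (not AWX code)."""

    def __init__(self, send_interval=60.0, tolerance=20.0, clock=time.monotonic):
        self.send_interval = send_interval  # seconds between self-check messages
        self.tolerance = tolerance          # seconds allowed before a reply counts as missing
        self.clock = clock
        self.last_sent = None               # send time of the outstanding self-check, if any
        self.next_send = clock()

    def maybe_send(self, notify):
        """Call notify() if a self-check is due and none is outstanding."""
        now = self.clock()
        if self.last_sent is None and now >= self.next_send:
            notify()
            self.last_sent = now
            self.next_send = now + self.send_interval
            return True
        return False

    def mark_received(self):
        """Record that the listener saw its own self-check message."""
        self.last_sent = None

    def check(self):
        """Raise if the outstanding self-check is older than the tolerance."""
        if self.last_sent is None:
            return
        delta = self.clock() - self.last_sent
        if delta > self.tolerance:
            raise ConnectionError(f'pg_notify self-check missing after {delta:.3f}s')
```

Because `check()` would only run between periodic tasks in such a design, the observed delay can exceed the tolerance, which matches the 26 seconds in the log versus the 20-second setting.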

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

AlanCoding added 5 commits January 5, 2024 10:28

Assure dispatcher listener connection on re-establishing

6107560

Agressive connection checking

a10bad4

Add a self-check alive message for pg_notify drops

ced1803

Remove new arg not being used

0e08556

Log and startup adjustment

2649450

github-actions bot added the component:api label Jan 9, 2024

AlanCoding mentioned this pull request Jan 10, 2024

New setting for pg_notify listener DB settings, add keepalive #14755

Merged
