sqlcapture: Add a watchdog timeout for no data from backfill queries #1977

Open
willdonnelly opened this issue Sep 24, 2024 · 0 comments
Labels
change:unplanned Unplanned change, useful for things like doc updates

Comments

@willdonnelly
Member

Backfill queries typically take 1 to 30 seconds to complete. There are exceptions, which we'll get to, but in general if we've issued a backfill query and then don't see any data for multiple minutes (let's say 15 minutes, just for the sake of argument), then something is very wrong.

Usually what's wrong is that the database connection has silently dropped, or something along those lines. It doesn't happen often, but when it does there's no indication of what's going on: we just sit there indefinitely, waiting for backfill result rows which will never arrive.

We should add a basic watchdog timeout to the backfillStream() function so that if >15m has elapsed since we last saw any data we fail the capture.
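As a rough illustration of the idea (not the actual backfillStream code), here is a minimal sketch of an idle watchdog in Go. It assumes the result rows arrive on a channel, which is a stand-in for however the real function reads query results. The names consumeWithWatchdog, ErrBackfillStalled, and runExample are hypothetical and exist only for this example.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// ErrBackfillStalled is a hypothetical sentinel error for this sketch; the
// real connector may report the failure differently.
var ErrBackfillStalled = errors.New("backfill watchdog: no data received within the idle timeout")

// consumeWithWatchdog reads rows from a channel (an assumed stand-in for a
// backfill result stream) and fails if idleTimeout elapses without a row.
func consumeWithWatchdog(ctx context.Context, rows <-chan string, idleTimeout time.Duration, handle func(string) error) error {
	watchdog := time.NewTimer(idleTimeout)
	defer watchdog.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-watchdog.C:
			// No row arrived for a full idleTimeout: give up instead of hanging.
			return ErrBackfillStalled
		case row, ok := <-rows:
			if !ok {
				return nil // stream closed: the query finished normally
			}
			if err := handle(row); err != nil {
				return err
			}
			// We saw data, so restart the idle countdown. The non-blocking
			// drain keeps this safe on Go versions before 1.23 as well.
			if !watchdog.Stop() {
				select {
				case <-watchdog.C:
				default:
				}
			}
			watchdog.Reset(idleTimeout)
		}
	}
}

// runExample simulates a query that emits three rows, waiting `gap` before
// each one, and returns how many rows were handled plus the final error.
func runExample(gap, idleTimeout time.Duration) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // also stops the producer goroutine if we bail out early

	rows := make(chan string)
	go func() {
		defer close(rows)
		for _, r := range []string{"a", "b", "c"} {
			select {
			case <-time.After(gap):
			case <-ctx.Done():
				return
			}
			select {
			case rows <- r:
			case <-ctx.Done():
				return
			}
		}
	}()

	count := 0
	err := consumeWithWatchdog(ctx, rows, idleTimeout, func(string) error {
		count++
		return nil
	})
	return count, err
}
```

With a short gap and a long timeout, runExample handles all three rows and returns a nil error. With a gap longer than the timeout, it returns ErrBackfillStalled before any row is handled. In the real capture the timeout would be the 15 minutes discussed here, and the timer would be reset each time a result row is seen.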

There is one situation where this could break something that previously "worked", however, so we should exercise a bit of care and think this part through. Sometimes (in certain tables with certain backfill modes) a backfill query will require a full-table sort. If the table is small there's no problem: the sort takes a few seconds, we issue a few backfill queries each requiring such a sort, and then we're done. But if the table is sufficiently big then we have a quadratic problem: each sort takes many minutes and we have to issue a whole bunch of backfill queries. In this case the backfill is unlikely to ever complete anyway, since every 50k rows requires another full-table sort and the resulting average data rate is pathetic.

I'm going to arbitrarily say that 15 minutes is a reasonable dividing line between the two cases. This is probably overly generous, even. Just consider the size of a dataset that takes 15 minutes to sort in memory: at 50k rows per 15 minutes (roughly 3,300 rows/minute) it's unlikely that we're ever going to finish the backfill before someone gets tired of waiting. When this sort of thing happens we usually go and either manually change the backfill mode for the table or fix the connector bug which caused it anyway.
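To put rough numbers on that, here is a back-of-the-envelope sketch in Go. The 50k-row chunk size and the 15 minutes per query come from the discussion above. The 1-billion-row table size and the estimateBackfill helper are made up purely for illustration.

```go
package main

import "time"

// estimateBackfill is a hypothetical helper: it returns how many backfill
// queries are needed and the total time taken, assuming every query returns
// chunkRows rows and takes perQuery to produce them.
func estimateBackfill(tableRows, chunkRows int64, perQuery time.Duration) (int64, time.Duration) {
	queries := (tableRows + chunkRows - 1) / chunkRows // round up to whole queries
	return queries, time.Duration(queries) * perQuery
}
```

Calling estimateBackfill(1_000_000_000, 50_000, 15*time.Minute) gives 20,000 queries and 5,000 hours, or about 208 days. A table slow enough to sit right at the 15 minute line is not going to finish its backfill in any practical sense.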

TL;DR: We should add a watchdog timeout in backfillStream, but in order to not break marginally-slow cases it should be fairly generous. I think 15 minutes is probably a reasonable compromise between those concerns.

@willdonnelly willdonnelly added the change:unplanned Unplanned change, useful for things like doc updates label Sep 24, 2024
@willdonnelly willdonnelly self-assigned this Sep 24, 2024