Events based cache could lose events at startup #5151
The solution I am exploring is to do uncommitted reads for a few minutes after the first event comes in after startup. Using the below sequence of events:
Server 3 would miss event 20. On the next cache refresh it would do an uncommitted read, see event 20, and then wait for it to commit. If server 1 never writes the row, it wouldn't show up in the transaction buffer. The time between when the auto-increment number is requested and when the row gets written is going to be small, as both happen in the same statement, so we won't need to check for a missed event for very long.
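A minimal sketch of that startup grace window in Python (the class, method names, and the 3-minute window value are illustrative assumptions; the comment above only says "a few minutes"):

```python
import time

# Hypothetical window length; the design only says "a few minutes".
STARTUP_DIRTY_READ_WINDOW_SECONDS = 180


class EventPoller:
    """Tracks when the first post-startup event arrived, so the poller
    knows whether it is still inside the dirty-read grace window."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.first_event_at = None  # set when the first event after startup arrives

    def on_event(self):
        if self.first_event_at is None:
            self.first_event_at = self.clock()

    def should_read_uncommitted(self):
        # Dirty reads are only needed briefly: the gap between an
        # auto-increment ID being handed out and its row being written
        # is small, since both happen in the same statement.
        if self.first_event_at is None:
            return False
        return self.clock() - self.first_event_at < STARTUP_DIRTY_READ_WINDOW_SECONDS
```

This keeps the expensive READ_UNCOMMITTED path off except for the short period after startup where a gap could otherwise go undetected.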
Faisal asked me to write up his design. TL;DR: improve tracking of skipped database events by detecting them directly, instead of inferring them from skipped elements.

The changes in this effort include reading the database with a special READ_UNCOMMITTED option on the read transaction. This returns the committed and uncommitted data combined, with an uncommitted row hiding the older committed version of the same row. A second read is also performed, on the committed data only. By comparing the two, we can determine which EventIDs are not yet committed, and add those as tracked elements without potentially skipping any.

The prior solution could skip the first uncommitted elements if they had an EventID preceding the first committed EventID. It relies heavily on the "last seen EventID" as a marker from which the next scan for database events picks up. While the work in #5071 would notice skipped EventIDs and periodically rescan for them each polling loop, it could not detect items skipped before the "last seen EventID" was set.

By reading uncommitted EventIDs, the "skipped" list of EventIDs is no longer surmised, but read directly from the database. This presents a small issue, as the only means of determining which elements are uncommitted is to difference them against the same read without the READ_UNCOMMITTED option set. The overall algorithm to determine which elements need periodic polling is roughly:
Items that appear in the non-READ_UNCOMMITTED query will not see different processing, as they aren't in a long-lived transaction; the existing algorithm already handles such items properly.
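The comparison step described above amounts to a set difference between the two reads. A minimal sketch, assuming hypothetical `(event_id, payload)` row tuples (the function name and row shape are illustrative, not from the codebase):

```python
def find_uncommitted_ids(committed_rows, dirty_rows):
    """Diff the two reads: EventIDs visible only under READ_UNCOMMITTED
    belong to transactions that have not committed yet, so they are the
    ones that need periodic polling."""
    committed_ids = {event_id for event_id, _ in committed_rows}
    all_ids = {event_id for event_id, _ in dirty_rows}
    return sorted(all_ids - committed_ids)


# Example: events 18, 19 and 21 are committed; the READ_UNCOMMITTED view
# also shows event 20, which is still inside an open transaction.
committed = [(18, "a"), (19, "b"), (21, "c")]
dirty = [(18, "a"), (19, "b"), (20, "x"), (21, "c")]
print(find_uncommitted_ids(committed, dirty))  # → [20]
```

Because the uncommitted IDs come straight from the database rather than being inferred from gaps, an uncommitted event with an ID earlier than the first committed ID is no longer missed.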
For MySQL the transaction timeout is 8 hours; Postgres (since version 9.6) defaults to 24 hours; SQLite doesn't support a timeout (and is unsupported in a shared-database setup anyway). For this reason, the maximum effective scanning time should be 24 hours.
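Given the timeouts quoted above, a tracked gap could be aged out once it has been outstanding longer than the longest timeout any supported database allows. A hypothetical sketch (names and structure are assumptions, not from the codebase):

```python
# Per the quoted database timeouts, no transaction should stay open
# longer than 24 hours, so stop rescanning for a missed event after that.
MAX_SCAN_SECONDS = 24 * 60 * 60


def should_keep_scanning(first_seen_missing_at, now):
    """True while a missing EventID is still worth rescanning for;
    beyond the longest possible transaction lifetime, the owning
    transaction must have committed or rolled back."""
    return (now - first_seen_missing_at) < MAX_SCAN_SECONDS
```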
When the server starts up there could be a skipped event that hasn't resolved yet, so upon restart we have to get a list of all the events and look for any gaps. This leads to the below issue:
In that scenario, event 20 will never get processed by server 3.
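The startup gap scan described above amounts to finding holes in the sequence of EventIDs read from the database. A minimal sketch (the function name is illustrative):

```python
def find_gaps(event_ids):
    """Given the EventIDs seen at startup, return IDs missing from the
    sequence. Each gap is either an uncommitted event (wait for it) or
    an ID whose row was never written (give up after the timeout)."""
    ids = sorted(event_ids)
    missing = []
    for prev, cur in zip(ids, ids[1:]):
        missing.extend(range(prev + 1, cur))
    return missing


# Server 3 restarts and reads events 18, 19, 21, 22; event 20 is a gap.
print(find_gaps([18, 19, 21, 22]))  # → [20]
```

Gap scanning alone cannot tell an uncommitted event from one that will never arrive, which is why the dirty-read pass above is needed to resolve each gap.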