Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataShard: read iterators may have a broken external reference precharging #7769

Closed
snaury opened this issue Aug 14, 2024 · 0 comments · Fixed by #8592
Closed

DataShard: read iterators may have a broken external reference precharging #7769

snaury opened this issue Aug 14, 2024 · 0 comments · Fixed by #8592
Assignees
Labels
area/datashard Issues related to datashard tablets (relational table partitions) bug Something isn't working

Comments

@snaury
Copy link
Member

snaury commented Aug 14, 2024

When reviewing code for #7674 I noticed that LastProcessedKey value when precharging missing references may be incorrect. The problem is that @SammyVimes changed the LastProcessedKeyErased to LastProcessedKeyErasedOrMissing, which seemed ok at first glance. In reality this flag is not serialized in responses and may cause rows to be missing in the result.

Previously LastProcessedKeyErased was used as a hack/optimization when resuming internally, to force erased sub-ranges to overlap and have a side-effect of stitching them together. The result remains correct even when query is restarted externally with the last reported continuation token's key. When the query is restarted externally this flag is lost and effectively becomes false, which is totally fine because erased rows are skipped and the last erased row was already accounted for. Resuming with this flag set to false is correct (in a sense that the same row is not visited twice), though may produce sub-optimal caching behavior.

Note however that when we start "precharing" missing references we are positioned on the result row, which is not part of the result yet. When LastProcessedKey is set to that row's key and we already have some rows processed, we would produce a result that marks this row as "processed", when in reality it is not. After a temporary network failure the read actor will restart the query externally from this last known key, which would cause it to skip the row. This row, however, has never been part of the result, and now it never will.

Since iterator cannot travel in time and the last known key is already updated, we need to always page fault when encountering missing references. The LastProcessedKeyErased flag can only be used as an erased range stitching hack.

@snaury snaury added bug Something isn't working area/datashard Issues related to datashard tablets (relational table partitions) labels Aug 14, 2024
@snaury snaury self-assigned this Aug 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datashard Issues related to datashard tablets (relational table partitions) bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant