Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectored get image layer skipping #9012

Closed
jcsp opened this issue Sep 16, 2024 · 0 comments · Fixed by #9026 · May be fixed by #9025
Closed

Vectored get image layer skipping #9012

jcsp opened this issue Sep 16, 2024 · 0 comments · Fixed by #9026 · May be fixed by #9025
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@jcsp
Copy link
Collaborator

jcsp commented Sep 16, 2024

BLUF
Vectored read might skip image layers when faced with certain layer map layouts.
No risk of corruption: the bug doesn't impact the correctness of images returned,
but may cause us to do more work or return an error instead.

Root cause

This issue requires an image layer nested inside a delta layer and a vectored read on a keyspace that includes both keys not covered by the image and keys covered by the image (in that order). Like the picture below.

A              B
+--------------+ l1
|              |      ^ LSN
|    ===== l2  |      |
|    C E D     |      |
+--------------+ l3   +-----> Key

Let's assume that we are reading the [A, E) key range and the current search LSN (or cont LSN) for this range is greater than l1. We will perform a ranged search to see which layers to visit next. It will yield two results (both pointing to the same layer):

  • delta layer [A, B): key_range=[A, C) lsn_floor=l3
  • delta layer [A, B): key_range=[C, E) lsn_floor=l2

Now as Christian pointed out, the fringe makes the incorrect assumption that all reads within a layer will be at the same LSN. Note how we don't use the lsn_range argument if the layer was already in the fringe.
Hence, we use the first LSN range for both key ranges. This skips the image layer for the [C, E) range.

Risk Assesment
There is no risk of corruption since this bug doesn't make us return an incorrect page. Let's consider two cases:

  • The skipped image is the last image for the key range in question. This will result in the read failing with a missing key error.
  • The skipped image is not the last image. We will descend further and stop at the next image. This is extra work that we shouldn't be doing and we don't know how often we end up in this case, but we know that it cannot happen for get page requests.

Initial investigation was done by @problame here.
I verified it independently and reached the same conclusions.

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Sep 16, 2024
@VladLazar VladLazar changed the title Vectored get availability bug Vectored get image layer skipping Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment