-
Notifications
You must be signed in to change notification settings - Fork 510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proper handling for download failing to replace layer #3533
Comments
Because of compaction never downloading a remote layer and gc depending on the presence of image layers, I haven't yet figured out a way for this to be reproduced in a test. We certainly have the conditions for it. If we supported compaction doing on-demand download while not holding (The "both wait on same semaphore" issue has been overlooked on a number of on-demand related guesses for replacement failures.) |
Now suspected failure to replace layer happened in production while migrating a tenant, which caused a retry loop which killed the logging with:
This comes from the semaphore being closed, which will be looped forever after one or more threads discover this RemoteLayer still in LayerMap (commit unrelated): neon/pageserver/src/tenant/timeline.rs Line 3496 in a974602
I suspect it is similar to #3387 in that it happened via This was on a version before #3558 fixes (we don't have git commits for production right now, I thought closest would be 248563c). @LizardWizzard could you link this issue if you write something up about the production issue? |
Looking at the scrapes for the tenant and timeline, this problematic layer must have been the last remote layer. This is something not discovered at least on #3387. |
Add an AtomicBool per RemoteLayer, use it to mark together with closed semaphore that remotelayer is unusable until restart or ignore+load. #3533 (comment)
Discussed today with @problame: this shouldn't be an issue now that #3613 is merged. If a replacement fails:
|
So, I think we're good here, for a low-acceptance-bar version of 'proper'? IMO since we're not deleting the layer file on replace failure, it's correct to not decrement resident size. |
I think so as well, on both, I assume we are not decrementing the size either after successful download right now. Closing, feel free to reopen in case I've misunderstood. |
First discussed in #3387 (comment), made more robust in #3513, but still remains to be done:
The text was updated successfully, but these errors were encountered: