Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage-mon: Don't use random seeking when O_DIRECT is enabled #1968

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

kaneter
Copy link

@kaneter kaneter commented Aug 30, 2024

Hi all,

I propose modifying to set random seeks only when the O_DIRECT flag is disabled.

I understand that the purpose of random seeks is to avoid false negatives that arise from reading from the page cache.
With the O_DIRECT flag enabled, it allows accessing the file system without going through the page cache.
This means that even without setting random seeks, it is possible to bypass the cache and read from the device directly.

Copy link

knet-jenkins bot commented Aug 30, 2024

Can one of the admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-1968/1/input

@fabbione
Copy link
Member

the reason to do random seek is also to probe different part of the disk and not just sector 0.

What problem are we trying to solve with this change?

@chrissie-c
Copy link
Contributor

I'm not sure why we wouldn't want to do random checks around the disk in O_DIRECT mode; that seek was there mainly to run checks on different parts of the disk (not just the page cache), where random errors may be hiding.

Always checking sector 0 doesn't seem quite so useful, though I admit I'm no storage expert.

@wenningerk
Copy link

If using a random sector was just to not always read the same from cache not moving a potentially physical head around if not needed with O_DIRECT might make sense.
But loosing the possibility to scan a larger portion of the surface over time would be a pity as well.
Maybe adding an option to scan a fixed sector or within a range might keep both parties happy.
Depending on the use of the device in parallel 0 might not be the best guess anyway - given we're talking about performance reasons.

@kaneter
Copy link
Author

kaneter commented Sep 2, 2024

the reason to do random seek is also to probe different part of the disk and not just sector 0.

Thank you for your guidance.
I was not aware that random seeks were intended for probing bad sectors.
This is because storage_mon reads the device only once per call.

Assuming a sector size of 512 bytes, and assuming no duplicates in the randomly chosen seek positions,
we can express the number of years required to inspect all sectors using the following equation:

time_required [years] = ((disk_size [bytes]) / 512 [bytes per sector]) / ((24 [hours] * 60 [minutes] * 60 [seconds]) / (interval_to_test [seconds])) / 365 [days]

For example, let's assume the following:

  • disk_size = 16 [GiB] = 17,179,869,184 [bytes]
  • interval_to_test = 30 [seconds]
    • (30 seconds is the default value for storage_mon.)

Then, the time_required is approximately 31 years and 10 months.

While there is a possibility of random seeks randomly finding bad sectors,
I believe that the bad sector probing performed by storage_mon is too slow.

However, if you expect storage_mon to find bad sectors, I think it would be best to retain these random seeks.
I personally do not have high expectations, so I thought it would be better to simplify the implementation.

Maybe adding an option to scan a fixed sector or within a range might keep both parties happy.

I agree. This idea respects both sides of the argument.
If we were to add an option, I think it would be best to keep random seeks enabled by default, so as not to change the existing behavior.

@wenningerk
Copy link

Guess you have to find some kind of a trade-off between a full surface-check and something that has so little impact that it can be done during operation.
I'm not an expert in this field but when we're thinking of a classical rotating disk we could at least assume that a sector from each of the surfaces would be checked within a reasonable time as to see if all of the heads can read properly. Same for sectors from inner tracks and sectors from outer tracks to see if the heads are able to move freely.
For shingled disks there would as well be quite some chance that you would hit a sector within these packs that are erased together (don't know the correct term here - in case of an SSD it would be an erase-block) regularly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants