SSD with node id later detected as HDD #1970

Closed
scottyeager opened this issue May 22, 2023 · 5 comments

@scottyeager

A farmer reported that their node 3791 detected the cache SSD as an HDD after the node rebooted. There are two concerns here:

  1. The disk type should not change (I thought this was now stored somewhere after initial detection?)
  2. Maybe it makes sense to check for the presence of the node id on a disk, even if Zos thinks it's an HDD, to avoid generating a second node id in case there is some problem?
@muhamadazmy
Member

The disk type is stored in volatile storage (tmpfs) after the node boots, because the disk type should not change while the node is running. The reason is that after a runtime update the storage daemon restarts and needs to re-detect the disks; since the disks might be under heavy load at that point, detection can be off, so the daemon reuses the detection done during boot.

On a restart the stored disk types are wiped and detection has to be re-done, but this happens before any workload is restarted, so it should be safe.

We also need to rescan the disks on boot anyway, in case new disks were added or disks were swapped.

A failure to detect the disk as an SSD can also mean the disk is of low quality and is no longer performant enough to be considered an SSD. Note that we don't rely on the "announced" disk type; instead we run a speed test to make sure the disk's speed is actually SSD grade.

One last thing: using an HDD as a cache disk would really hurt node performance, hence zos ignores HDDs for cache (and therefore for id storage).
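
For context, a speed-based classification can be as simple as timing random reads against the raw block device. The sketch below is only an illustration of that idea in Go, not the zos implementation: the sample count, the 0.5 ms cutoff, and the device path are assumptions, and a real test would open the device with O_DIRECT to bypass the page cache, which is omitted here for brevity.

```go
// disktype.go: hypothetical seek-time classification sketch (not zos code).
package main

import (
	"fmt"
	"math/rand"
	"os"
	"time"
)

const (
	sampleCount = 64                     // number of random reads to time
	blockSize   = 512                    // bytes read per sample
	ssdCutoff   = 500 * time.Microsecond // assumed threshold: below this average latency we call it an SSD
)

// classify times random block-aligned reads across the device and returns
// "SSD" or "HDD" based on the average latency.
func classify(device string, deviceSize int64) (string, error) {
	f, err := os.Open(device) // a real test would use O_DIRECT to avoid the page cache
	if err != nil {
		return "", err
	}
	defer f.Close()

	buf := make([]byte, blockSize)
	var total time.Duration
	for i := 0; i < sampleCount; i++ {
		off := rand.Int63n(deviceSize/blockSize) * blockSize
		start := time.Now()
		if _, err := f.ReadAt(buf, off); err != nil {
			return "", err
		}
		total += time.Since(start)
	}

	if total/time.Duration(sampleCount) < ssdCutoff {
		return "SSD", nil
	}
	return "HDD", nil
}

func main() {
	// Device path and size are placeholders; reading a raw device needs root.
	kind, err := classify("/dev/sda", 100<<30)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("detected as:", kind)
}
```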

@scottyeager
Author

Ah, I do recall now that the disk type is fixed for as long as the node is up.

I wondered about something similar: a disk quality issue or performance degradation pushing its seek time into the HDD range. The node definitely should not use an HDD or a poorly performing SSD as its cache disk. But if the node id or some workload data is already stored on a disk originally detected as SSD, and that disk is later detected as HDD, then that data is essentially lost.

It seems the system would be a bit more resilient if something like the following could happen (a rough sketch follows the list):

  1. The node scans disks and detects that a node identity file or workload data is present on a disk labeled as HDD
  2. Assuming the node has another disk detected as SSD, the identity is copied to that disk, which is now used for cache
  3. In the case of workload data, it remains available to that workload. Performance will be bad, but at least the data can be copied elsewhere
  4. Don't store any more data on this disk and don't count it towards minting rewards. Disks which malfunction in this way are probably close to complete failure and the farmer should replace them anyway
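
To make steps 1 and 2 concrete, here is a rough Go sketch under assumed names: the Disk struct, the identity.seed file name, and the mount-point layout are illustrative, not the actual zos storage layout.

```go
// recovery.go: hypothetical sketch of moving a stranded node identity from an
// HDD-labeled disk to an available SSD (names and layout are assumptions).
package recovery

import (
	"errors"
	"io"
	"os"
	"path/filepath"
)

// Disk is a simplified, assumed view of a mounted disk.
type Disk struct {
	Mount string // mount point of the disk's filesystem
	IsSSD bool   // result of the speed test
	HasID bool   // node identity file present on this disk
}

// copyFile copies src to dst, creating dst if it does not exist.
func copyFile(src, dst string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}

// recoverIdentity copies the node identity off a disk now labeled HDD onto an
// available SSD, so a reclassified cache disk does not lead to a new node id.
func recoverIdentity(disks []Disk) error {
	var source, target *Disk
	for i := range disks {
		d := &disks[i]
		if !d.IsSSD && d.HasID && source == nil {
			source = d // identity stranded on an HDD-labeled disk
		}
		if d.IsSSD && target == nil {
			target = d // first available SSD becomes the new cache disk
		}
	}
	if source == nil {
		return nil // nothing to recover
	}
	if target == nil {
		return errors.New("identity found on HDD-labeled disk but no SSD available")
	}
	return copyFile(
		filepath.Join(source.Mount, "identity.seed"),
		filepath.Join(target.Mount, "identity.seed"),
	)
}
```

Step 4 (stop allocating to the degraded disk and exclude it from minting) would then be a separate policy decision layered on top of this scan.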

@xmonader xmonader added the type_bug Something isn't working label Aug 8, 2023
@xmonader xmonader added this to 3.13.x Aug 8, 2023
@xmonader xmonader added this to the 3.9.0 milestone Aug 8, 2023
@muhamadazmy
Member

I know this is an old issue, but we might be able to pick it up now. I like your suggestion, but I propose we do it as follows:

  • Right now, disk speed detection is done every time the node is booted and then kept for the entire uptime, but the next time the node boots, the disks are tested again. I suggest we instead persist the detected seektime to the disk itself, so once a disk is categorized as SSD or HDD this never changes (see the sketch after this list).
  • If, during boot, an HDD has keys on it, they are moved (as you suggested) to an available SSD. Another option, which I don't really like, is to force this disk to be handled as an SSD and keep using it. That would also make sure workload data is not lost, even if performance is not so great.
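
As a minimal sketch of the persistence idea, assuming a hypothetical marker file written to the disk's filesystem (zos may persist this differently, e.g. in filesystem metadata), the first detection result is written once and re-read on every later boot:

```go
// detection.go: hypothetical sketch of persisting the first speed-test result
// to the disk itself so the classification never changes afterwards.
package detection

import (
	"os"
	"path/filepath"
	"strings"
)

const markerFile = ".disk-type" // assumed marker file name, not what zos uses

// DiskType returns the persisted type if the marker exists; otherwise it runs
// the speed test once and writes the result so later boots skip re-testing.
func DiskType(mount string, speedTest func() string) (string, error) {
	marker := filepath.Join(mount, markerFile)
	if data, err := os.ReadFile(marker); err == nil {
		return strings.TrimSpace(string(data)), nil // detection already done on a previous boot
	}
	kind := speedTest() // "ssd" or "hdd", measured one last time
	if err := os.WriteFile(marker, []byte(kind+"\n"), 0o644); err != nil {
		return "", err
	}
	return kind, nil
}
```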

@muhamadazmy
Member

related to #2020

@muhamadazmy muhamadazmy moved this to Accepted in 3.13.x Aug 10, 2023
@rawdaGastan rawdaGastan moved this from Accepted to In Progress in 3.13.x Aug 14, 2023
@muhamadazmy
Member

The PR linked above makes sure disk detection is persisted across reboots, so even if a disk's performance degrades over time, it will never get re-detected as an HDD.

In the worst case the system will do one last speed check with seektime before persisting the result to the disk itself. This does not mean that slow disks will now show up as SSDs (unless they perform better during detection), but disks detected as SSDs will always remain SSDs.

This fix is now available on devnet and will be released to mainnet with version v3.9.x.

@github-project-automation github-project-automation bot moved this from In Progress to Done in 3.13.x Aug 21, 2023