
Slow HDD read. #390

Closed
lelik107 opened this issue Apr 8, 2024 · 8 comments

Comments

lelik107 commented Apr 8, 2024

Hello, BLAKE3 team! I found that every b3sum_windows_x64_bin.exe reads from an HDD slowly. What exactly I did: I hashed Windows 10 ISO images (5.3 GB and 5.8 GB files) on a WD RE 1 TB drive and timed the runs with a simple stopwatch. This HDD reads at 80-140 MB/s, and it's not about the disk cache, because I rebooted after every test.

With v1.5.1 and v1.3.3: 5.3 GB ~ 1 m 45 s, 5.8 GB ~ 1 m 57 s. v1.0.0 is a bit faster: 5.3 GB ~ 1 m 22 s, 5.8 GB ~ 1 m 31 s.
And I can hear the disk spinning.
With b2sum-bin_20130305.zip, or 7-Zip's h command with any algorithm: 5.3 GB ~ 45-48 s, 5.8 GB ~ 55 s-1 m.
And I don't hear the disk spinning.
So my simple question is: why is b3sum this much slower?

lulcat commented May 21, 2024

Yes. Internally I have been solving this with some ideas in the C code. I forget whether the Rust b3sum actually mmaps properly or is slow too, but IF you are on slow-IOPS media AND hashing a big file, b3 is sadly the slowest hasher of them all... until the file is cached; then it's the fastest (C or Rust). I notice that with the Rust b3sum the mmapping saturates my test spinner, which does about 210 MB/s, but the C version does 60 MB/s when the file is not cached. Using some custom mmapping in front is what I do here. I can't say anything about how things work on Windows. If you rerun b3sum immediately after doing it once, it should take 200-300 ms from what I see. Your times are what I would see too, given the file sizes, on a spinning disk. My guess is b3sum_windows is C?

I don't recall if I ever found out exactly why b3sum was slower than the others on a first, uncached read, but I suspected something to do with chunking and buffer sizes :D The authors obviously will have a better idea. Either way, this is the only case I am aware of where it suffers (sadly). Hence I DO wish, as they mentioned once, that they did the OpenMP implementation in C as well.

lelik107 (author) commented May 21, 2024

@lulcat Yes, reading from the disk cache is fine with the Windows b3sum. I don't know exactly which languages are used, but there might be assembly as well. Most third-party Windows software, for example RapidCRC Unicode (https://github.com/OV2/RapidCRC-Unicode) and Total Commander, uses the single-threaded reference C code, because there's no multi-threaded C code from the developers yet. As for raw reads from an HDD, RapidCRC Unicode and TC show the same speed as with any other algorithm: if your disk does 210 MB/s, you'll get 210 MB/s. For buffered (cached) reads, the single-threaded C code gives 1500-1700 MB/s with all SIMD extensions except AVX-512F (which I don't have). That's fine for any SATA-3 device but may be a bottleneck for M.2 drives. And yes, it would be nice to have multi-threaded C code, maybe with OpenMP.

lulcat commented May 25, 2024

Hi again. Let me correct myself; I just checked the actual Rust binary as well, which does mmap AND multithreading, and it runs at 60 MB/s in an example case I have.

I then evict the file from the cache and run ANY other hash tool, and it beats it: b2sum, sha*sum, any of them. All will run at the maximum ingress speed of the file! E.g.:

time b2sum same_example => 210 MB/s, over 3 times faster!

I meant that the C code won't do any better, because it isn't optimised like the Rust code (no mmapping and no multithreading, i.e. no OpenMP).

THIS is the issue you noticed and reported, I think, and it is an 'edge case' of b3 which I haven't yet figured out the WHY of.

b2sum, as you see, does NOT suffer from this. And I DO have AVX-512; I can run the C version with it and so on, but the problem persists.

It is slow spinners and big files that trigger it; fast I/O will not hit this case.

Now, again, as I said, I have solved this in my environment with something on top, but it requires heuristically determining whether we have that 'edge case'. I'd obviously prefer it to be solved in the b3 code, but I am not sure where to touch it, and I don't want to touch crypto with my edits anyway. :)

In all cases, once the file is in memory, both the C and Rust versions are the fastest (Rust faster still, due to multithreading). I am fairly sure I saw this case both with the reference/portable code and with the various machine-instruction versions (e.g. AVX2 or AVX-512). I can't quite figure out why it happens; my own guesstimate was something to do with how 'chunks' are fed from the disk, i.e. why it can't sustain the full read speed, but this is something the authors can (and hopefully will) address properly.

SO in your case, you are using a 1 TB spinner ("slow I/O") and a BIG file (on the order of GBs), which is why this is 'triggered'.

TO NOTE, and this is very important: it is a very common layout to keep archived (thus often large) files on slower but larger backup media, which renders b3 useless until this is addressed, IMO. "Fun fact": I switched the default hash in my systems from b3 to SHA3-224 several years ago precisely because of this uncached-slow-IO-big-file issue, but I am testing re-introducing it (with the heuristics I mentioned).

EDIT: OH LORD, I just tested my bsum -a:b3 example, which is in C, and it DOES read at 210 MB/s, so it's the Rust binary which is messing up :p OK, I care less then, but on other systems this will/can be an issue.

So more likely it's the mmapping. YUP, confirmed: passing --no-mmap makes b3sum run at 210 MB/s, which is 3.5x faster, give or take.

Damn, I never realised this, haha, although I figured it had to do with mmapping, since my 'solution' had to do with that.

But in the proper system b3sum won't be Rust anyway, so this problem will go away in my native environment (though not in guest ones, which use Rust's b3sum).

lelik107 (author) commented May 31, 2024

@lulcat I'm glad you've found a solution, but I'm neither a developer nor a coder myself, rather an end user on Windows; we just don't build software very often :)

lelik107 (author) commented Aug 20, 2024

@lulcat Yes, you are right, --no-mmap solves the issue, but as I see in the help text: "Disable memory mapping. Currently this also disables multithreading."
That's for the Rust version; the C code isn't multithreaded anyway. But what's the point, then, of using Rust and Rayon if at the end of the day we get exactly what we have in C, and, I should say, in a more understandable form?

oconnor663 (BLAKE3 team member) commented Aug 20, 2024

Apologies for not commenting here sooner. See this older issue: #31

The problem isn't really specific to BLAKE3 itself, or to mmapping in general, but to the "divide and conquer" multithreading strategy we currently use. When b3sum hashes a file, by default you get a worker thread for every core on your machine, and each of these threads starts working on a different part of the file. From the disk's perspective, this requires a lot of seeking, and for spinning disks / HDDs in particular, it's this seeking that's slow. See also the docs for Hasher::update_mmap_rayon.
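That divide-and-conquer access pattern can be sketched as follows. This is a hedged illustration, not BLAKE3's actual implementation: BLAKE2b stands in for the real hash, and the way slice digests are combined is invented for the example. The point is the I/O pattern, where each worker seeks to its own region of the file, which is cheap on an SSD but forces head movement on a spinning disk.

```python
# Sketch of "divide and conquer" file hashing: N workers, each seeking to
# its own slice of the file. Illustrative only; BLAKE2b is a stand-in hash,
# and combining slice digests like this is NOT a real tree-hash construction.
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def hash_slice(path, offset, length):
    """Hash one region of the file; every worker seeks independently."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        f.seek(offset)  # concurrent seeks like this are what hurt on an HDD
        remaining = length
        while remaining > 0:
            chunk = f.read(min(1 << 20, remaining))
            if not chunk:
                break
            h.update(chunk)
            remaining -= len(chunk)
    return h.digest()

def parallel_hash(path, workers=4):
    """Hash `workers` slices concurrently, then combine the slice digests."""
    size = os.path.getsize(path)
    if size == 0:
        return hashlib.blake2b().hexdigest()
    step = -(-size // workers)  # ceiling division
    slices = [(i, min(step, size - i)) for i in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        digests = list(pool.map(lambda a: hash_slice(path, *a), slices))
    root = hashlib.blake2b()
    for d in digests:
        root.update(d)
    return root.hexdigest()

# Demo on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4_000_000)
print(parallel_hash(tmp.name))
os.remove(tmp.name)
```

A single-reader version (one sequential loop over the whole file) reads the same bytes with no seeking, which is what `--num-threads=1` effectively restores.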

The next tricky part is that this performance downside doesn't apply to SSDs or to files cached in RAM. In those cases seeking is cheap or free, and it's common to see a 5-10x performance boost from multithreading. So if we changed the default, the majority of our users would run an order of magnitude slower, and they'd probably never know why. That 5-10x number depends entirely on the number of cores in your CPU of course, but it's going up over time.

In a perfect world there would be some function like File::is_seeking_going_to_be_slow, and we could just ask the OS to tell us what to do. Unfortunately there's no such function.

The --no-mmap flag (or equivalently, taking input with < in the shell) is a fine workaround, but since mmapping provides a slight performance boost even in single-threaded mode, I think the best workaround is currently --num-threads=1.
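For reference, the workarounds discussed in this thread as shell commands (a sketch; `big.iso` is a placeholder file name, and the flags are those quoted above from b3sum itself):

```shell
# Single sequential reader; keeps mmap's small single-threaded win:
b3sum --num-threads=1 big.iso

# Disable memory mapping; currently this also disables multithreading:
b3sum --no-mmap big.iso

# Reading from stdin behaves like --no-mmap:
b3sum < big.iso
```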

lulcat commented Aug 20, 2024

I guess I can confirm this. I solve this heuristically; however, so far I have only done that 'solution' in my old C code. I haven't yet bothered doing it in my MT code, which ironically also suffers from this edge case that I set out to fix initially. It is a shame Rust is being used in the first place, though; rather sub-optimal.

I have created a fat blakes binary which runs 2-3 times faster than your Rust implementation (~7 times faster than coreutils' single-threaded b2sum). I do suspect oconnor is the Rust fan of the team, as his projects seem to be in Rust. Such a shame too; I love bao, tao and lao, but alas, all in Rust. Again, rather sub-optimal, but it is what it is.

I will release the code publicly later, but before release I am only willing to share the binary. If you want to check that it does indeed run 2-3 times faster than your Rust version, oconnor: shout and I will pass it on. This is not down to my code, by the way, but down to your team's excellent assembly code (so not the Rust code).

> In a perfect world there would be some function like File::is_seeking_going_to_be_slow, and we could just ask the OS to tell us what to do. Unfortunately there's no such function.

Have a look at what vmtouch and its underlying functions do. I am not sure whether they can help in Rust, especially since, as you say, with multithreading the seek patterns come at a cost. However, with a bit of heuristics one can cut these times down a lot in the edge cases.
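One concrete form of that vmtouch-style heuristic, sketched here in Python for illustration (Linux-specific; `resident_fraction` is my own name, not anything from b3sum or vmtouch): mincore(2) reports which pages of a mapping are already in the page cache, so a large file with a low resident fraction is a candidate for single-threaded sequential reading.

```python
# Hedged sketch: estimate how much of a file is already in the page cache
# via mincore(2). Cold + large suggests preferring sequential single-threaded
# reads (e.g. b3sum --num-threads=1). Linux-specific.
import ctypes
import mmap
import os
import tempfile

# CDLL(None) exposes the running process's symbols, including libc's mincore.
_libc = ctypes.CDLL(None, use_errno=True)

def resident_fraction(path):
    """Fraction of the file's pages currently resident in the page cache."""
    size = os.path.getsize(path)
    if size == 0:
        return 1.0
    npages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY) as mm:
            vec = (ctypes.c_ubyte * npages)()
            view = ctypes.c_char.from_buffer(mm)  # borrow the mapping's address
            try:
                addr = ctypes.addressof(view)
                if _libc.mincore(ctypes.c_void_p(addr),
                                 ctypes.c_size_t(size), vec) != 0:
                    raise OSError(ctypes.get_errno(), "mincore failed")
            finally:
                del view  # release the exported buffer so mm can close
            return sum(b & 1 for b in vec) / npages

# Demo: a file we just wrote is usually (but not guaranteed to be) cached.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"z" * 1_000_000)
print(f"{resident_fraction(tmp.name):.0%} resident")
os.remove(tmp.name)
```

vmtouch itself does essentially this with a MAP_SHARED mapping; ACCESS_COPY is used here only because Python's ctypes needs a writable buffer to borrow the mapping's address.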

lelik107 (author) commented:

@oconnor663 I understand, and I'm closing the issue.
