[Bug] bladebit 3.0.0 ramplot crashes plotting c3 plots #16219

Open
jayhohoho2019 opened this issue Sep 1, 2023 · 30 comments
Assignees
Labels: 2.0.0, bug (Something isn't working), CLI, plotting

Comments

@jayhohoho2019

jayhohoho2019 commented Sep 1, 2023

What happened?

Please see Chia-Network/bladebit#389 for details.

This is ramplot only: no cuda, no diskplot. It happens for c3 as well as c5, and possibly other c levels.

Version

2.0.0

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

~ Half the time:

Finished Phase 2 in 26.67 seconds.
Running Phase 3
  Compressing tables 2 and 3...
STDERR:

STDERR: Fatal Error:

STDERR: Overran park buffer: 6358 / 6352

----
The other half of the time:
Running Phase 3
Compressing tables 2 and 3...
STDERR: *** Crashed! ***

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z12CrashHandleri+0xaa)[0x55a6e54e3eda]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f425d036520]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7f425d1a3bba]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x55a6e564f1f8]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x55a6e5659822]

STDERR: /home/jh/chia-blockchain/venv/bin/bladebit_cuda(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x55a6e54e4f90]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7f425d088b43]

STDERR: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7f425d11aa00]
@jayhohoho2019 jayhohoho2019 added the bug Something isn't working label Sep 1, 2023
@jayhohoho2019
Author

What should I do to get some movement on this bug? To plot a 14TB drive of c3 plots I've had to restart it dozens of times...

@jayhohoho2019
Author

Please note I'm running this with CLI. No GUI involved.

@jayhohoho2019
Author

Tried using bladebit directly rather than chia plotters:
bladebit -t '"$r"' -c '"$c"' -f '"$f"' -n '"$n"' -v -z '"$compress"' ramplot '"$dst"'
where -t 85, -n 20

It worked until n=12, and again crashed here:
Running Phase 3
Compressing tables 2 and 3...

@jayhohoho2019
Author

Tried the binary from bb 3.11-beta1 with the same arguments. Same result. Best case crashed when n=19, worst case n=3.

@jayhohoho2019
Author

With 3.11-beta1, had an instance where it crashed working on n=1. Always at this stage:
Running Phase 3
Compressing tables 2 and 3...

@wjblanke
Contributor

wjblanke commented Sep 6, 2023

Harold this seems bad

STDERR: Overran park buffer: 6358 / 6352

@harold-b
Contributor

harold-b commented Sep 6, 2023

Park overrun can happen in some instances. We increased the size from the minimum and created tons of plots until it wasn't happening, but there's no guarantee it cannot. Some park sizes chosen for certain levels might trigger it more than others. We crash on purpose when this happens since we don't know what memory might have been touched that shouldn't have.

We might be able to see if we can increase the buffer size used for park writing and then not crash, but ignore the plot, if it overran within the bounds of the buffer allocated for the parks.

@jayhohoho2019
Author

jayhohoho2019 commented Sep 6, 2023

Is this specific to compressed plots? I have never had this issue with bb1 or bb2, on this very same plotting computer/harvester. But now I'm hitting it every few plots.

@harold-b
Contributor

harold-b commented Sep 6, 2023

Yes, each compression level has new park sizes which are different than the park sizes for uncompressed plots.

Even though it could happen with classic (uncompressed) plots, the park sizes there are much more generous, so it nearly never happens.

@jayhohoho2019
Author

I've run this for c3 plots so far. It crashes way too often, sometimes at the very first plot, at best at the 19th plot. It would take me years to finish replotting my farm unless this gets fixed :-)

@jayhohoho2019
Author

It has nothing to do with the number of threads (-t) value, correct?

@harold-b
Contributor

harold-b commented Sep 6, 2023

Threads won't affect anything; park sizes are fixed. But as a workaround you might try a different compression level that might not be triggering overruns.

@jayhohoho2019
Author

jayhohoho2019 commented Sep 6, 2023

Most of my remote harvesters are RPi4s, and C3 is what I was told is the "right" C level for them. Each RPi4 is hooked up to 600TB. I'm not sure if its CPU can handle a higher C level at this size.

@jayhohoho2019
Author

I did a run for c5 n=21 and it completed without a crash. But I need c3 though for most of my harvesters that are RPi4s (about 6PiB). Actually I need to plot about 600TB of c3 plots for 1 RPi4 first to make sure it can handle this many c3 plots, before I replot any more.

@jayhohoho2019
Author

Doing another run for c5 n=23 (to fill an internal NVMe SSD) and it died at n=4. This time, however, there is no crash message, and there is no tmp file in the destination directory. This is now on chia 2.0.1, but bladebit --version still shows 3.0.0.

Finished forward propagating table 4 in 38.77 seconds.
Forward propagating to table 5...
Pairing L/R groups...
Finished pairing L/R groups in 10.3440 seconds. Created 4294233685 pairs.
Average of 236.1003 pairs per group.
Computing Fx...

@jayhohoho2019
Author

I've been plotting c5 with this and it crashes much less often than c3 but still does from time to time, much more so than plotting uncompressed plots using bb v1 or v2. Is this related to ramplot (cpu plot) only? Does it exist for cudaplot as well?

@jayhohoho2019
Author

I don't have a gpu in my plotter. With bb3 ramplot, plotting time for c3 is about the same as with v1/v2, and about 30 seconds faster for c5. So for ramplot, bb3 doesn't really offer any performance improvements, but it offers the ability to plot c plots. Is that the correct understanding? And then with cudaplot, gpu plotting?

@jayhohoho2019
Author

Here is some system info again. Let me know please what additional info you need.
Ubuntu 22.04.3 LTS (Server)
5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
2x Xeon(R) Gold 6230R
512GB system RAM
Write buffer is 2x7 INTEL SSDPE2KE076T8 raided 0 with mdadm
NO GPU

@jayhohoho2019
Author

command to run bb (3.0.0)
cd ~/chia-blockchain && . ./activate && bladebit -t '"$r"' -c '"$c"' -f '"$f"' -n '"$n"' -v -w -z '"$compress"' ramplot '"$dst"'

r=90
z=5 (or 3)
$dst is the INTEL SSDs, nowhere near full when the crashes happen

@jayhohoho2019
Author

Sometimes crash.log would contain the following:

bladebit(_Z12CrashHandleri+0xaa)[0x56368afbd91a]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fc8e7999520]
/lib/x86_64-linux-gnu/libc.so.6(+0x1afbba)[0x7fc8e7b06bba]
bladebit(_Z15WriteParkThreadP12WriteParkJob+0x218)[0x56368b1290c8]
bladebit(_ZN10ThreadPool17FixedThreadRunnerEPv+0x52)[0x56368b1336f2]
bladebit(_ZN6Thread17ThreadStarterUnixEPS_+0x80)[0x56368afbe9d0]
/lib/x86_64-linux-gnu/libc.so.6(+0x94b43)[0x7fc8e79ebb43]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a00)[0x7fc8e7a7da00]

Other times it would say park buffer overrun
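
For anyone collecting these logs, the two failure modes reported in this thread can be told apart with a small helper. This is only a sketch: `classify_failure` is a hypothetical name, and the match strings are taken from the log output quoted above ("Overran park buffer", "Crashed!", "CrashHandler"); other bladebit versions may print something different.

```shell
#!/usr/bin/env bash
# Sketch: classify a captured bladebit log by the failure messages seen in this issue.
# classify_failure is a hypothetical helper, not part of bladebit.
classify_failure() {
    local log="$1"
    if grep -q 'Overran park buffer' "$log"; then
        # Deliberate abort after a park buffer overrun.
        echo park-overrun
    elif grep -q -e 'Crashed!' -e 'CrashHandler' "$log"; then
        # Crash handler backtrace without the overrun message.
        echo crash
    else
        echo unknown
    fi
}
```

Calling `classify_failure crash.log` prints one of `park-overrun`, `crash`, or `unknown`.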

@harold-b
Contributor

> I don't have a gpu in my plotter. With bb3 ramplot, plotting time for c3 is about the same as with v1/v2, and about 30 seconds faster for c5. So for ramplot, bb3 doesn't really offer any performance improvements, but it offers the ability to plot c plots. Is that the correct understanding? And then with cudaplot, gpu plotting?

That's correct. Ramplot is exactly the same, the only difference is compressed plot support.

@jayhohoho2019
Author

jayhohoho2019 commented Sep 15, 2023 via email

@wjblanke
Contributor

Harold, didn't we increase some of these buffers?

@harold-b
Contributor

Those were the slice buffers (which hold temporary data during plotting). The park buffers are fixed per plot file version. We'd have to generate new estimates and bump the file version to support new park sizes. Or allow the park size to be defined by the plot file itself, without exceeding the default uncompressed park sizes.

@harold-b
Contributor

@jayhohoho2019 I think the best workaround here is to run bladebit CLI directly from a shell script to automatically retry and cleanup any unfinished plots when it exits with an error exit code. If you need help with this I can set you up w/ something for Linux
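
A minimal sketch of such a retry wrapper follows. It is not the script offered above, only one possible shape for it. The helper names `cleanup_tmp` and `plot_with_retry` are hypothetical. It assumes an aborted plot leaves a `.tmp` file in the destination directory (as reported elsewhere in this thread) and that bladebit exits with a non-zero code on failure.

```shell
#!/usr/bin/env bash
# Sketch only: retry a plot command until enough plots succeed.
# cleanup_tmp and plot_with_retry are hypothetical helper names, not bladebit features.

# Delete leftover *.tmp files in the destination directory.
# Assumption: an aborted plot leaves a .tmp file there, and no other
# plotter is writing to the same directory at the same time.
cleanup_tmp() {
    find "$1" -maxdepth 1 -type f -name '*.tmp' -delete
}

# Usage: plot_with_retry DST COUNT MAX_FAILURES COMMAND [ARGS...]
# Runs COMMAND until it has exited 0 COUNT times, cleaning DST after each
# non-zero exit, and gives up once MAX_FAILURES failures have accumulated.
plot_with_retry() {
    local dst="$1" count="$2" max_failures="$3"
    shift 3
    local completed=0 failures=0
    while [ "$completed" -lt "$count" ]; do
        if "$@"; then
            completed=$((completed + 1))
        else
            failures=$((failures + 1))
            echo "attempt failed ($failures so far), cleaning up $dst" >&2
            cleanup_tmp "$dst"
            if [ "$failures" -ge "$max_failures" ]; then
                echo "giving up after $failures failed attempts" >&2
                return 1
            fi
        fi
    done
    echo "completed $completed plots with $failures failed attempts"
}

# Example call (one plot per invocation, so a crash costs at most one plot):
#   plot_with_retry "$dst" 20 50 \
#       bladebit -t 90 -f "$f" -c "$c" -n 1 -v -z 3 ramplot "$dst"
```

Capping the number of failures keeps a persistent error (a full disk, a bad argument) from looping forever. The cap of 50 in the example call is arbitrary.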

@jayhohoho2019
Author

> @jayhohoho2019 I think the best workaround here is to run bladebit CLI directly from a shell script to automatically retry and cleanup any unfinished plots when it exits with an error exit code. If you need help with this I can set you up w/ something for Linux

Yes that'd be nice. Thank you.

How long do you think it will take to get the park buffer sizes increased? If not long I can wait too. Thanks.

@jayhohoho2019
Author

jayhohoho2019 commented Oct 11, 2023

FYI, I have been replotting c5 using ramplot since Sept 24th with a little script that resumes automatically. Here is a list of timestamps when bb crashed (and left an empty .tmp file); the later ones were with bb 3.1.0:
Sep 24 00:18
Sep 24 21:17
Sep 25 07:41
Sep 25 18:00
Sep 25 18:09
Sep 26 14:06
Sep 27 14:13
Sep 28 07:38
Sep 28 10:43
Sep 28 22:34
Sep 28 22:53
Sep 29 03:50
Sep 30 06:51
Oct 1 04:00
Oct 1 05:34
Oct 1 16:56
Oct 1 23:44
Oct 2 06:10
Oct 2 09:50
Oct 2 22:59
Oct 3 22:24
Oct 7 10:09
Oct 7 11:21
Oct 8 00:17
Oct 8 03:24
Oct 8 06:37
Oct 9 06:35
Oct 9 21:44
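
The 28 timestamps above span Sep 24 to Oct 9. To summarize a list like this as crashes per day, a small awk-based helper could be used. This is a sketch: `crashes_per_day` is a hypothetical name, and it assumes one `Mon DD HH:MM` entry per line on standard input.

```shell
#!/usr/bin/env bash
# Sketch: count crash timestamps per calendar day.
# crashes_per_day is a hypothetical helper; input lines look like "Sep 24 00:18".
crashes_per_day() {
    awk 'NF >= 3 {
             key = $1 " " $2                       # month and day
             if (!(key in n)) order[++k] = key     # remember first-seen order
             n[key]++
             total++
         }
         END {
             for (i = 1; i <= k; i++) printf "%s %d\n", order[i], n[order[i]]
             printf "total %d\n", total
         }'
}
```

Piping the list into `crashes_per_day` prints one line per day followed by a `total` line.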

@jayhohoho2019
Author

I connected a gpu to my plotter and am doing cudaplot now instead. It hasn't crashed yet. Do ramplot and cudaplot handle the park buffer differently?

@wjblanke
Contributor

Harold, are these handled differently?

@harold-b
Contributor

No, they shouldn't. But I can have a look.

4 participants