
[Bug] Increased CPU usage and Higher Partial Times #15836

Open
PhillWalker opened this issue Jul 22, 2023 · 39 comments · Fixed by #15845
Labels: 2.0.0 · bug (Something isn't working) · compression (Related to compressed plotting/farming)

Comments

@PhillWalker commented Jul 22, 2023

What happened?

Since the Chia compression 4.3 alpha through to the current 2.0 RC1, partial times have increased noticeably, causing lots of late partials on lower-end equipment.

LebanonJon has been able to verify that CPU usage has increased since then, and WallentX has also acknowledged this increased resource usage for non-compressed farming and wonders whether it has also impacted compressed farming.

Version

2.0 RC1

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

@PhillWalker added the bug (Something isn't working) label Jul 22, 2023
@PhillWalker (Author)

[Screenshot 2023-07-22 at 12:24:22] You can see the sudden rise in partial times.

The previous version running in the screenshot was the last compression_fc branch build.
The version where the jump happened was 2.0 Beta 5 from the main branch.

The issue still persists, based on early data from 2.0 RC1.

@Coyote-UK

I can confirm I have seen this same issue using my Windows GUI harvester.
Windows 10, 2,453 C5 K32 plots. i7-4500U processor, 16GB DDR3 memory, SH920 mSATA HDD for mainnet db.
Logfile for 2.0.0b5 attached covering startup of node/farmer/harvester & 6 hours of farming, including several lates.
debug.log

@PhillWalker (Author)

It is worth noting that this is not the config.yaml spelling issue.

These are freshly generated config.yaml files for RC1, using the same values as before.

@PhillWalker (Author)

[Screenshot 2023-07-24 at 12:36:28]

Lates have been occurring since the upgrade. These lookups were all 25-30 seconds previously, with no lates.

@Coyote-UK

I verified, in the Settings tab of the GUI, that the correct decompressor settings were being read.

@wallentx added the compression (Related to compressed plotting/farming) label Jul 24, 2023
@wallentx linked a pull request Jul 24, 2023 that will close this issue
@Coyote-UK

My setup had fully functional decompressor (previously decompresser) settings.

I do not believe this issue is resolved.

@PhillWalker (Author)

I concur; my settings are correct too, and the issue is far from resolved.

@harold-b (Contributor)

Apologies I did not see this sooner; my GitHub notifications were messed up due to a mail filter.

Thanks for the report. Any chance you can run some tests directly with the bladebit executable in simulate --power n mode, using the same --seed on both runs, from the tags where you first saw the regression?

@PhillWalker (Author)

Unfortunately, the hardware seeing the worst of the issues (with the lates) is a 2014 Synology, and it does not support the newer versions of simulate:

Fatal Error:
getrandom syscall failed with error 38.
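For context, error 38 on Linux is ENOSYS, meaning the running kernel does not implement the getrandom(2) syscall (it was added in kernel 3.17, which a 2014-era Synology kernel may well predate). A minimal Python sketch of the usual fallback check, assuming Python 3 is available on the box; the helper name is illustrative:

```python
import os
import platform

def read_entropy(n: int) -> bytes:
    """Fetch n random bytes, falling back to /dev/urandom on kernels
    that lack the getrandom(2) syscall (added in Linux 3.17)."""
    try:
        # os.getrandom() raises OSError (errno 38, ENOSYS) when the running
        # kernel does not implement the syscall.
        return os.getrandom(n)
    except (AttributeError, OSError):
        with open("/dev/urandom", "rb") as f:
            return f.read(n)

if __name__ == "__main__":
    print("kernel:", platform.release())
    print("sample:", read_entropy(16).hex())
```

If only the fallback path works, the kernel is simply too old for tools that call getrandom() unconditionally.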

@Coyote-UK

I plotted on a much more powerful Ubuntu machine than my power-efficient Windows box with its mobile CPU.

I will have to attach a suitable SSD to house a plot for a meaningful simulate, so I will get back to you as soon as I can.

@Coyote-UK

Sorry for the delay. Here are the stats.

Let me know if you need anything else.

I am using 2 threads and 2 parallel decompressors, as that was found to give the fewest stale partials on my system, so it's what I normally use for farming.

I chose --size 200TB to roughly match my farm.

bladebit_cuda alpha 4.3

bladebit_cuda-a4-3 -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot
[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count : 2
Thread per context instance : 2
Memory used : 167.0MiB ( 0.2GiB )
Proofs / Challenges : 146 / 76 ( 192.11% )
Fetches / Challenges : 92 / 76
Filter bits : 512
Effective partials : 1 ( 1.09% )
Total fetch time elapsed : 30.000 seconds
Average plot lookup time : 0.326 seconds
Worst plot lookup lookup time : 6.282 seconds
Average full proof lookup time: 6.282 seconds
Fastest full proof lookup time: 6.282 seconds

compression | plot count | size TB | size PB

C5 | 12561 | 1096 | 1.10

bladebit_cuda alpha 4.4

bladebit_cuda -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot
[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count : 2
Thread per context instance : 2
Memory used : 167.0MiB ( 0.2GiB )
Proofs / Challenges : 146 / 76 ( 192.11% )
Fetches / Challenges : 92 / 76
Filter bits : 512
Effective partials : 1 ( 1.09% )
Total fetch time elapsed : 30.865 seconds
Average plot lookup time : 0.335 seconds
Worst plot lookup lookup time : 5.477 seconds
Average full proof lookup time: 5.477 seconds
Fastest full proof lookup time: 5.477 seconds

compression | plot count | size TB | size PB

C5 | 12208 | 1065 | 1.07

@esaung added the 2.0.0 label Aug 1, 2023
@PhillWalker (Author)

I have tried working with empower to do some troubleshooting...

The issue persists when changing the version of chiapos used, on a clone of the 2.0 release branch.

Versions tried: chiapos 2.0.0rc1, 2.0.0b5, 2.0.0b4, 2.0.0b3

2.0.0b1 and 2.0.0b2 will not run; they display this error:

2023-08-04T08:04:07.159 harvester chia.plotting.manager: ERROR refresh callback raised: 'chiapos.DiskProver' object has no attribute 'get_compression_level' with the traceback: Traceback (most recent call last):
  File "/chia-blockchain/chia/plotting/manager.py", line 241, in refresh_task
    self._refresh_callback(PlotRefreshEvents.batch_processed, batch_result)
  File "/chia-blockchain/chia/harvester/harvester.py", line 162, in plot_refresh_callback
    self.plot_sync_sender.process_batch(update_result.loaded, update_result.remaining)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 285, in process_batch
    converted = _convert_plot_info_list(loaded)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 48, in _convert_plot_info_list
    compression_level=plot_info.prover.get_compression_level(),
AttributeError: 'chiapos.DiskProver' object has no attribute 'get_compression_level'

The same happens with 1.0.12b16 which, as far as we know, was the last known working version, used in the Alpha 4.3 build.

Alpha 4.3: chiapos=1.0.12b16
Alpha 4.4 (broken release): chiapos=2.0.0b1
Alpha 4.5: chiapos=2.0.0b2

It could be worth comparing what changed between chiapos 1.0.12b16 and 2.0.0b1, assuming the bug was introduced in the jump to the new major version, if chiapos is even at fault.

We also ran some tests changing to use "blspy==1.0.16", but the results were the same.

@Coyote-UK commented Aug 5, 2023

Update:

After updating my rig to 2.0.0rc3, these lines have appeared in my log file (as they have for several versions):

2023-08-05T16:05:49.900 harvester chia.harvester.harvester: WARNING Looking up qualities on N:\C5\plot-k32-c05-2023-06-03-05-06-3fb805aaed23fcbd6629df024ba4268426f3aab4053142dfd8ace9bda4827702.plot took: 60.26634669303894. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:05:57.574 harvester chia.harvester.harvester: WARNING Looking up qualities on S:\C5\plot-k32-c05-2023-06-21-04-12-648f1cf9f7deb68b697ce5e7616b357a3ae6720a27b1755d412ed631658caef0.plot took: 59.11319851875305. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:21:46.164 harvester chia.harvester.harvester: WARNING Looking up qualities on M:\C5\plot-k32-c05-2023-05-31-11-20-aaf6dd611b6ed4c9b603d1169e27e736301a7bc0c548b8427750a694b1b2c483.plot took: 55.353450536727905. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:32.977 harvester chia.harvester.harvester: WARNING Looking up qualities on U:\C5\plot-k32-c05-2023-06-23-16-14-880df202463b5b3fb4f95e5825912ab7d98d206c3a1e6e015dc8500a3b5a1412.plot took: 105.28022837638855. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:54.091 harvester chia.harvester.harvester: WARNING Looking up qualities on R:\C5\plot-k32-c05-2023-06-15-11-03-b383db6587781130d747fea28beeae30818fad89c8b2520840b5adaa69337458.plot took: 79.3622350692749. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:25:08.232 harvester chia.harvester.harvester: WARNING Looking up qualities on J:\C5\plot-k32-c05-2023-07-28-14-14-09edcffacc140e67e7410808d33299100a2388041c69903d07b2657d67ae193d.plot took: 43.46801948547363. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:27:15.948 harvester chia.harvester.harvester: WARNING Looking up qualities on K:\C5\plot-k32-c05-2023-05-26-06-18-2103493ca705752950c6f83b2f40d5897ecd1338b49f8c88a152774290f4c20c.plot took: 41.67232918739319. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:32:36.251 harvester chia.harvester.harvester: WARNING Looking up qualities on O:\C5\plot-k32-c05-2023-06-06-07-06-63316b75708e944ced5e578ba028ec763c338c6a8378cef7139004f16fe7df48.plot took: 44.853880643844604. This should be below 5 seconds to minimize risk of losing rewards.

I was watching the Windows Task Manager for most of this time & at no point did the CPU reach 100% utilisation for more than 5 seconds.

More pertinent to the overall performance impact: I have not managed to submit a single partial in <5s since the introduction of alpha 4.5, whereas I was hitting about 50% <5s for compressed plots before that release.
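For anyone wanting to quantify these slow lookups across a whole debug.log rather than eyeballing them, here is a minimal Python sketch; it assumes only the default WARNING format shown above, and the script and threshold names are illustrative:

```python
import re
import sys
from collections import Counter

# Summarize "Looking up qualities on <plot> took: <seconds>" warnings from a
# harvester debug.log, as shown in the log excerpts above.
PATTERN = re.compile(r"Looking up qualities on (?P<plot>\S+) took: (?P<secs>[\d.]+)")

def summarize(path: str, threshold: float = 5.0) -> None:
    slow = Counter()
    worst = 0.0
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = PATTERN.search(line)
            if m and float(m.group("secs")) > threshold:
                slow[m.group("plot")] += 1
                worst = max(worst, float(m.group("secs")))
    print(f"{sum(slow.values())} lookups over {threshold}s, worst {worst:.1f}s")
    for plot, count in slow.most_common(10):
        print(f"{count:>4}  {plot}")

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "debug.log")
```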

@PhillWalker (Author)

The issue persists in RC3; I still see the late full proofs, plus lots of random spikes in the qualities warnings for non-full-proof lookups.

@Voodoo-Chia

I tried rolling back to version 4.3 (from 4.5) and then upgraded to RC3.

I have seen no noticeable timing differences between those three builds. The average is consistently around 1.5 seconds, with peaks in the 5-10 second range.

I am farming 4,900 C4 + 50 C5 plots with the CPU.

System:
CPU: i3-6100
MB: Supermicro C7Z270-CG
RAM: 1x 8GB DDR4

OS: Windows 10, 22H2, 19045.3208

I thought I'd add this for comparison; maybe it can help with root-causing.

@Coyote-UK commented Aug 9, 2023

The main difference between Voodoo-Chia's environment & both mine & PhillWalker's is that Voodoo-Chia has only 50 C5 plots, so there is a statistically small chance of multiple plots passing the filter for each challenge, whilst we both have > 2,000 C5 plots, so a statistically high chance (see the sketch below).

This MAY suggest the problem lies in parallel/sequential lookups for C5+ compression.
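A rough sanity check of that hypothesis, assuming the 512 plot filter reported in the simulator output above and treating filter passes as independent (Poisson approximation):

```python
import math

def filter_stats(num_plots: int, filter_bits: int = 512) -> tuple[float, float]:
    lam = num_plots / filter_bits              # expected plots passing per challenge
    p_multi = 1 - math.exp(-lam) * (1 + lam)   # P(two or more pass at once)
    return lam, p_multi

for plots in (50, 2453):
    lam, p_multi = filter_stats(plots)
    print(f"{plots:>5} plots: ~{lam:.2f} pass per challenge, "
          f"P(>=2 at once) ~ {p_multi:.1%}")
```

With ~50 C5 plots the chance of two decompressions landing on the same challenge is well under 1%, whereas with ~2,450 it is roughly 95%, so any per-lookup regression would be far more visible on the larger C5 farms.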

@Voodoo-Chia

only 50 compressed plots,

4950 actually ;)

@Coyote-UK commented Aug 9, 2023

only 50 compressed plots,

4950 actually ;)

Sorry, I have corrected the original post. Only 50 C5 plots.

@PhillWalker (Author)

My Intel i3-2100 and i3-4130 seem OK and don't deliver lates.

It's the Intel Atom D2700s that have been affected the worst by these new releases.

I can confirm these lates aren't caused by duplicate proofs at the same time or back-to-back ones.

@PhillWalker (Author)

Issue persists in RC4

@PhillWalker (Author)

[Screenshots 2023-08-10 at 14:21:57 and 14:21:42]

Proof of the lates over the last week is attached, running RC3 and RC4.

@Coyote-UK commented Aug 11, 2023

As there has been no activity on this for a while, I bit the bullet and rolled my farm back to alpha 4.3. Same hardware, same plots, same pool, same config.yaml (with the 2 decompressor entries duplicated under the decompresser spelling), and this is the result:
[image]
Comparing with the one above, it clearly demonstrates a significant performance degradation as the versions progressed.

Average Latency - 2.6s on Alpha 4.1 & 3.4s on RC3 (an increase of 30%)
Highest Latency - 26s on Alpha 4.1 & 105s (but I have seen over 500s recorded) on RC3 (an increase of 300%)

@PhillWalker (Author)

I have rolled all 4 harvesters back to the fc.compression branch (Alpha 1, I believe). All lates have gone away, and best-case plot lookup times are 5-7 seconds quicker compared to RC4. These harvesters were getting 20% lates on RC4, evenly across them. Now they are back to no lates...
[Screenshot 2023-08-11 at 17:33:00]

@Coyote-UK

There appear to be 2 issues at play here: increased average lookup times & sporadic excessive lookup times. Both impact harvester performance, but I am not sure if one is a consequence of the other.

However, one issue is clearly the occasional unworkable lookup time. I have seen (in RC4) a lookup of over 500 seconds reported in the log. One instance of an excessively long lookup appears to have a lasting effect on the harvester, with output reduced for a period of time (estimated at up to an hour) afterwards.

Any lookup is worthless if it cannot produce a partial in 30s or less, since most pools enforce 32s limits. Why not just drop any proof lookup if it doesn't complete in, say, 20s & report the failure to the log?

This would at least minimise the impact of that element.
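A minimal sketch of that suggested cut-off, not the actual harvester code: it assumes the proof lookup is exposed as an awaitable, and the lookup_with_deadline name and 20s default are illustrative:

```python
import asyncio
import logging

log = logging.getLogger("harvester")

async def lookup_with_deadline(lookup_coro, plot_path: str, deadline: float = 20.0):
    """Drop a proof lookup that cannot finish inside the pool's partial window."""
    try:
        return await asyncio.wait_for(lookup_coro, timeout=deadline)
    except asyncio.TimeoutError:
        log.warning("Dropped proof lookup for %s after %.0fs deadline",
                    plot_path, deadline)
        return None
```

In practice the lookup runs inside native chiapos code that cannot simply be cancelled mid-call, which is presumably part of why #7788 describes this as harder than it looks.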

@emlowe (Contributor) commented Aug 11, 2023

Unfortunately, that is a rather long-standing issue (see #7788), and despite seeming relatively straightforward, it is more complicated than it appears.

But yes, this could be a part, maybe a large part, of the problem.

@Coyote-UK

I've been following my hunch that it is linked to bad plots and am making progress.

I have aggressively removed plots that meet any of the following criteria in a chia plots check with the default 30 iterations (remember, ALL of these plots pass the basic plot check); see the sketch after this list:

Plots with less than 16 Proofs found
Plots with "Proof dropped due to line point compression" more than once
Plots with "Quality doesn't match with proof" more than once
Plots that report "Looking up qualities took:" longer than 5 seconds more than once
Plots that report "Finding proof took:" longer than 30 seconds more than once

Replacing these plots (approximately 8% of my total farm after several iterations) has massively reduced the issue.

I am having to use the bladebit_cuda plotter provided as part of alpha 4.3, as plots from any later plotter cannot be checked using the chia plots check command in alpha 4.3.

I want to be able to switch back to alpha 4.3 to validate that the performance difference still exists.

What has changed in the plot format since the alpha 4.3 plotter that would cause the check to fail? Does this suggest that I may need to completely replot to this newer plot format?

Is it possible that you are unable to reproduce the error in the farming software because the problem is initiated in imperfect plots & the farmer application is not handling the problem?

Plotting in Ubuntu 23.04 with 256GB RAM, Intel CPU, Nvidia Quadro M5000 GPU (CUDA 5.2, 8GB vRAM)

@PhillWalker (Author)

The issue persists with 2.0 RC5, which uses chiapos 2.0.1. Approximately 20% of partials are late on the lower-end hardware. No lates on the fc.compression build for comparison.

@PhillWalker (Author)

The issue persists with 2.0 RC6, which uses chiapos 2.0.2. Approximately 20% of partials are late on the lower-end hardware. No lates on the fc.compression build for comparison.

@PhillWalker (Author)

The issue persists with the final 2.0 public release. This is using the official Chia docker image built by Chia, rather than a self-built one, now that it is available.

@Coyote-UK commented Aug 24, 2023

Strangely, I didn't see a single late partial submission whilst using RC6, but I have already seen 2 late partial submissions on 2.0.0.

Did the 'timeout' introduced in RC5/6 not make it into the final release?

@PhillWalker (Author)

Issue persists in 2.0.1

@PhillWalker (Author)

Issue persists in 2.1.0

@wjblanke (Contributor) commented Nov 1, 2023

What CPU are you running on the Synology? Have you actually farmed a block successfully with the alpha 1 version? We are wondering whether it is fully correct, or flawed but fast.

@PhillWalker (Author)

Intel Atom D2700

2 cores, 4 threads

@wjblanke (Contributor) commented Nov 1, 2023

Intel® SSE2, Intel® SSE3, Intel® SSSE3

No AVX. Maybe it's an instruction-set issue at compile time. Did you have luck making blocks with the alpha version? I will look at the GreenReaper compiler settings.
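For checking the instruction-set hypothesis on a Linux harvester, a quick Linux-only Python sketch that reports which SIMD extensions the CPU actually exposes; the flag list shown is just a sample:

```python
def cpu_flags() -> set[str]:
    """Return the CPU feature flags reported by the Linux kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__":
    flags = cpu_flags()
    for ext in ("sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2"):
        print(f"{ext:>7}: {'yes' if ext in flags else 'no'}")
```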

@PhillWalker (Author)

I can confirm I won multiple blocks on this version across a 1.5PiB farm, on multiple harvesters.

@PhillWalker (Author)

@Coyote-UK's CPU is an i7-4500U, for reference, with the same issue.

@Coyote-UK

Confirmed: i7-4500U, 16GB RAM, SSD for the DB, Win 10, 2,472 C5 plots (created with alpha 4), extensive pruning to remove bad plots. No blocks found between heights 3,798,055 & 4,155,045 (equivalent to 2.91x ETW), which covers ALL compressed-plot farming up to the release of 2.0.0. I accept that statistically this is possible, but it is a big coincidence.
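For what it's worth, the odds of that dry spell can be estimated directly, assuming block wins follow a Poisson process (the standard ETW assumption) and the farm size was constant over the window:

```python
import math

expected_wins = 2.91                      # 2.91 x ETW over the block range above
p_no_blocks = math.exp(-expected_wins)    # Poisson probability of zero wins
print(f"P(no blocks over {expected_wins} ETW) ~ {p_no_blocks:.1%}")  # ~5.4%
```

That is roughly a 1-in-18 chance by luck alone: possible, as noted, but also consistent with something being wrong.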

@wjblanke (Contributor) commented Nov 8, 2023

It certainly sounds like there is a performance problem if you were able to win blocks with the old version. It will be difficult for us to regression-test this since the merge was so long ago. If there is any other information you can give us, please let us know. I do appreciate the testing you have done so far.
