
[Bug] Increased CPU usage and Higher Partial Times #15836

Open
PhillWalker opened this issue Jul 22, 2023 · 39 comments · Fixed by #15845
Labels: 2.0.0 · bug (Something isn't working) · compression (Related to compressed plotting/farming)

Comments

@PhillWalker commented Jul 22, 2023

What happened?

Since the Chia compression 4.3 alpha through to the current 2.0 RC1, partial times have increased noticeably, causing lots of late partials on lower-end equipment.

LebanonJon has been able to verify that CPU usage has increased since then, and WallentX has also acknowledged this increased resource usage for non-compressed farming and wonders whether it has also impacted compressed farming.

Version

2.0 RC1

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

@PhillWalker added the bug (Something isn't working) label Jul 22, 2023
@PhillWalker (Author)

[Screenshot 2023-07-22 at 12:24:22] You can see the sudden rise in partial times.

The previous version running in the screenshot was the last compression_fc branch build.
The version where the jump happened was 2.0 Beta 5 from the main branch.

The issue still persists, based on early data from 2.0 RC1.

@Coyote-UK

I can confirm I have seen this same issue using my Windows GUI harvester.
Windows 10, 2,453 C5 K32 plots. i7-4500U processor, 16GB DDR3 memory, SH920 mSATA HDD for mainnet db.
Logfile for 2.0.0b5 attached covering startup of node/farmer/harvester & 6 hours of farming, including several lates.
debug.log

@PhillWalker (Author)

It is worth noting that this is not the config.yaml spelling issue.

These are freshly generated config.yaml files for RC1, using the same values as before.

@PhillWalker (Author)

[Screenshot 2023-07-24 at 12:36:28]

Lates have been occurring since the upgrade. These lookups were all 25-30 seconds previously, with no lates.

@Coyote-UK

I verified, in the Settings tab of the GUI, that the correct decompressor settings were being read.

@wallentx added the compression (Related to compressed plotting/farming) label Jul 24, 2023
@wallentx linked a pull request Jul 24, 2023 that will close this issue
@Coyote-UK

My setup had fully functional decompressor (previously decompresser) settings.

I do not believe this issue is resolved.

@PhillWalker (Author)

I concur; my settings are correct too, and the issue is far from resolved.

@harold-b (Contributor)

Apologies I did not see this sooner; my GitHub notifications were messed up due to a mail filter.

Thanks for the report. Any chance you can run some tests directly with the bladebit executable in simulate --power n mode, using the same --seed on both runs, from the tags where you first saw the regression?

@PhillWalker (Author)

Unfortunately, the hardware seeing the worst of the issues (with the lates) is a 2014 Synology, and it does not support the newer versions of simulate:

Fatal Error:
getrandom syscall failed with error 38.
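For context, error 38 on Linux is ENOSYS, meaning the running kernel does not implement the getrandom(2) syscall (it was added in kernel 3.17, which a 2014-era Synology kernel may well predate). A minimal Python sketch of the usual fallback check, assuming Python 3 is available on the box; the helper name is illustrative:

```python
import os
import platform

def read_entropy(n: int) -> bytes:
    """Fetch n random bytes, falling back to /dev/urandom on kernels
    that lack the getrandom(2) syscall (added in Linux 3.17)."""
    try:
        # os.getrandom() raises OSError (errno 38, ENOSYS) when the running
        # kernel does not implement the syscall.
        return os.getrandom(n)
    except (AttributeError, OSError):
        with open("/dev/urandom", "rb") as f:
            return f.read(n)

if __name__ == "__main__":
    print("kernel:", platform.release())
    print("sample:", read_entropy(16).hex())
```

If only the fallback path works, the kernel is simply too old for tools that call getrandom() unconditionally.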

@Coyote-UK

I plotted on a much more powerful Ubuntu machine than my power-efficient Windows box with its mobile CPU.

I will have to attach a suitable SSD to house a plot for a meaningful simulate, so I will get back to you as soon as I can.

@Coyote-UK

Sorry for the delay. Here are the stats.

Let me know if you need anything else.

I am using 2 threads and 2 parallel decompressors, as that was found to give the fewest stale partials on my system, so it's what I normally use for farming.

I chose --size 200TB to roughly match my farm.

bladebit_cuda alpha 4.3

bladebit_cuda-a4-3 -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot
[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count : 2
Thread per context instance : 2
Memory used : 167.0MiB ( 0.2GiB )
Proofs / Challenges : 146 / 76 ( 192.11% )
Fetches / Challenges : 92 / 76
Filter bits : 512
Effective partials : 1 ( 1.09% )
Total fetch time elapsed : 30.000 seconds
Average plot lookup time : 0.326 seconds
Worst plot lookup lookup time : 6.282 seconds
Average full proof lookup time: 6.282 seconds
Fastest full proof lookup time: 6.282 seconds

compression | plot count | size TB | size PB

C5 | 12561 | 1096 | 1.10

bladebit_cuda alpha 4.4

bladebit_cuda -t 2 simulate --no-cuda -n 100 -p 2 --power 360 --size 200TB --seed e11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd plot-k32-c05-Test.plot
[Simulator for harvester farm capacity for K32 C5 plots]
Random seed: 0xe11c75d4ef981ed097bf40452de79eec8c4ce836236f0474573aae143f2fd7fd
Simulating...

Context count : 2
Thread per context instance : 2
Memory used : 167.0MiB ( 0.2GiB )
Proofs / Challenges : 146 / 76 ( 192.11% )
Fetches / Challenges : 92 / 76
Filter bits : 512
Effective partials : 1 ( 1.09% )
Total fetch time elapsed : 30.865 seconds
Average plot lookup time : 0.335 seconds
Worst plot lookup lookup time : 5.477 seconds
Average full proof lookup time: 5.477 seconds
Fastest full proof lookup time: 5.477 seconds

compression | plot count | size TB | size PB

C5 | 12208 | 1065 | 1.07

@esaung added the 2.0.0 label Aug 1, 2023
@PhillWalker (Author)

I have tried working with empower to do some troubleshooting...

The issue persists when changing the version of chiapos used, on a clone of the 2.0 release branch.

Versions tried: chiapos 2.0.0rc1, 2.0.0b5, 2.0.0b4, 2.0.0b3

2.0.0b1 and 2.0.0b2 will not run; they display this error:

2023-08-04T08:04:07.159 harvester chia.plotting.manager: ERROR refresh callback raised: 'chiapos.DiskProver' object has no attribute 'get_compression_level' with the traceback: Traceback (most recent call last):
  File "/chia-blockchain/chia/plotting/manager.py", line 241, in refresh_task
    self._refresh_callback(PlotRefreshEvents.batch_processed, batch_result)
  File "/chia-blockchain/chia/harvester/harvester.py", line 162, in plot_refresh_callback
    self.plot_sync_sender.process_batch(update_result.loaded, update_result.remaining)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 285, in process_batch
    converted = _convert_plot_info_list(loaded)
  File "/chia-blockchain/chia/plot_sync/sender.py", line 48, in _convert_plot_info_list
    compression_level=plot_info.prover.get_compression_level(),
AttributeError: 'chiapos.DiskProver' object has no attribute 'get_compression_level'

The same happens with 1.0.12b16 which, as far as we know, was the last known working version, used in the Alpha 4.3 build.

Alpha 4.3: chiapos=1.0.12b16
Alpha 4.4 (broken release): chiapos=2.0.0b1
Alpha 4.5: chiapos=2.0.0b2

It could be worth comparing what changed between chiapos 1.0.12b16 and 2.0.0b1, assuming the bug was introduced in the jump to the new major version, if chiapos is even at fault.

We also ran some tests changing to use "blspy==1.0.16", but the results were the same.

@Coyote-UK commented Aug 5, 2023

Update:

After updating my rig to 2.0.0rc3, these lines have appeared in my log file (as they have for several versions):

2023-08-05T16:05:49.900 harvester chia.harvester.harvester: WARNING Looking up qualities on N:\C5\plot-k32-c05-2023-06-03-05-06-3fb805aaed23fcbd6629df024ba4268426f3aab4053142dfd8ace9bda4827702.plot took: 60.26634669303894. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:05:57.574 harvester chia.harvester.harvester: WARNING Looking up qualities on S:\C5\plot-k32-c05-2023-06-21-04-12-648f1cf9f7deb68b697ce5e7616b357a3ae6720a27b1755d412ed631658caef0.plot took: 59.11319851875305. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:21:46.164 harvester chia.harvester.harvester: WARNING Looking up qualities on M:\C5\plot-k32-c05-2023-05-31-11-20-aaf6dd611b6ed4c9b603d1169e27e736301a7bc0c548b8427750a694b1b2c483.plot took: 55.353450536727905. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:32.977 harvester chia.harvester.harvester: WARNING Looking up qualities on U:\C5\plot-k32-c05-2023-06-23-16-14-880df202463b5b3fb4f95e5825912ab7d98d206c3a1e6e015dc8500a3b5a1412.plot took: 105.28022837638855. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:23:54.091 harvester chia.harvester.harvester: WARNING Looking up qualities on R:\C5\plot-k32-c05-2023-06-15-11-03-b383db6587781130d747fea28beeae30818fad89c8b2520840b5adaa69337458.plot took: 79.3622350692749. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:25:08.232 harvester chia.harvester.harvester: WARNING Looking up qualities on J:\C5\plot-k32-c05-2023-07-28-14-14-09edcffacc140e67e7410808d33299100a2388041c69903d07b2657d67ae193d.plot took: 43.46801948547363. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:27:15.948 harvester chia.harvester.harvester: WARNING Looking up qualities on K:\C5\plot-k32-c05-2023-05-26-06-18-2103493ca705752950c6f83b2f40d5897ecd1338b49f8c88a152774290f4c20c.plot took: 41.67232918739319. This should be below 5 seconds to minimize risk of losing rewards.

2023-08-05T16:32:36.251 harvester chia.harvester.harvester: WARNING Looking up qualities on O:\C5\plot-k32-c05-2023-06-06-07-06-63316b75708e944ced5e578ba028ec763c338c6a8378cef7139004f16fe7df48.plot took: 44.853880643844604. This should be below 5 seconds to minimize risk of losing rewards.

I was watching the Windows Task Manager for most of this time & at no point did the CPU reach 100% utilisation for more than 5 seconds.

More pertinent to the overall performance impact: I have not managed to submit a single partial in <5s since the introduction of alpha 4.5, whereas I was hitting about 50% <5s for compressed plots before that release.
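For anyone wanting to quantify these slow lookups across a whole debug.log rather than eyeballing them, here is a minimal Python sketch; it assumes only the default WARNING format shown above, and the script and threshold names are illustrative:

```python
import re
import sys
from collections import Counter

# Summarize "Looking up qualities on <plot> took: <seconds>" warnings from a
# harvester debug.log, as shown in the log excerpts above.
PATTERN = re.compile(r"Looking up qualities on (?P<plot>\S+) took: (?P<secs>[\d.]+)")

def summarize(path: str, threshold: float = 5.0) -> None:
    slow = Counter()
    worst = 0.0
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = PATTERN.search(line)
            if m and float(m.group("secs")) > threshold:
                slow[m.group("plot")] += 1
                worst = max(worst, float(m.group("secs")))
    print(f"{sum(slow.values())} lookups over {threshold}s, worst {worst:.1f}s")
    for plot, count in slow.most_common(10):
        print(f"{count:>4}  {plot}")

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "debug.log")
```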

@PhillWalker (Author)

The issue persists in RC3; I still see the late full proofs, plus lots of random spikes in the qualities warnings for non-full-proof lookups.

@Voodoo-Chia

I tried rolling back to version 4.3 (from 4.5) and then upgraded to RC3.

I have seen no noticeable timing differences between those three builds. The average is consistently around 1.5 seconds, with peaks in the 5-10 second range.

I am farming 4,900 C4 + 50 C5 plots with the CPU.

System:
CPU: i3-6100
MB: Supermicro C7Z270-CG
RAM: 1x 8GB DDR4

OS: Windows 10, 22H2, 19045.3208

I thought I'd add this for comparison; maybe it can help with root-causing.

@Coyote-UK commented Aug 9, 2023

The main difference between Voodoo-Chia's environment & both mine & PhillWalker's is that Voodoo-Chia has only 50 C5 plots, so there is a statistically small chance of multiple plots passing the filter for each challenge, whilst we both have > 2,000 C5 plots, so a statistically high chance (see the sketch below).

This MAY suggest the problem lies in parallel/sequential lookups for C5+ compression.
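A rough sanity check of that hypothesis, assuming the 512 plot filter reported in the simulator output above and treating filter passes as independent (Poisson approximation):

```python
import math

def filter_stats(num_plots: int, filter_bits: int = 512) -> tuple[float, float]:
    lam = num_plots / filter_bits              # expected plots passing per challenge
    p_multi = 1 - math.exp(-lam) * (1 + lam)   # P(two or more pass at once)
    return lam, p_multi

for plots in (50, 2453):
    lam, p_multi = filter_stats(plots)
    print(f"{plots:>5} plots: ~{lam:.2f} pass per challenge, "
          f"P(>=2 at once) ~ {p_multi:.1%}")
```

With ~50 C5 plots the chance of two decompressions landing on the same challenge is well under 1%, whereas with ~2,450 it is roughly 95%, so any per-lookup regression would be far more visible on the larger C5 farms.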

@Voodoo-Chia

only 50 compressed plots,

4950 actually ;)

@Coyote-UK commented Aug 9, 2023

only 50 compressed plots,

4950 actually ;)

Sorry, I have corrected the original post. Only 50 C5 plots.

@PhillWalker (Author)

My Intel i3-2100 and i3-4130 seem OK and don't deliver lates.

It's the Intel Atom D2700s that have been affected the worst by these new releases.

I can confirm these lates aren't caused by duplicate proofs at the same time or back-to-back ones.

@PhillWalker (Author)

Issue persists in RC4

@PhillWalker (Author)

[Screenshots 2023-08-10 at 14:21:57 and 14:21:42]

Proof of the lates over the last week is attached, running RC3 and RC4.

@Coyote-UK commented Aug 11, 2023

As there has been no activity on this for a while, I bit the bullet and rolled my farm back to alpha 4.3. Same hardware, same plots, same pool, same config.yaml (with the 2 decompressor entries duplicated under the decompresser spelling), and this is the result:
[image]
Comparing with the one above, it clearly demonstrates a significant performance degradation as the versions progressed.

Average Latency - 2.6s on Alpha 4.1 & 3.4s on RC3 (an increase of 30%)
Highest Latency - 26s on Alpha 4.1 & 105s (but I have seen over 500s recorded) on RC3 (an increase of 300%)

@PhillWalker (Author)

I have rolled all 4 harvesters back to the fc.compression branch (Alpha 1, I believe). All lates have gone away, and best-case plot lookup times are 5-7 seconds quicker compared to RC4. These harvesters were getting 20% lates on RC4, evenly across them. Now they are back to no lates...
[Screenshot 2023-08-11 at 17:33:00]

@Coyote-UK

There appear to be 2 issues at play here: increased average lookup times & sporadic excessive lookup times. Both impact harvester performance, but I am not sure if one is a consequence of the other.

However, one issue is clearly the occasional unworkable lookup time. I have seen (in RC4) a lookup of over 500 seconds reported in the log. One instance of an excessively long lookup appears to have a lasting effect on the harvester, with output reduced for a period of time (estimated at up to an hour) afterwards.

Any lookup is worthless if it cannot produce a partial in 30s or less, since most pools enforce 32s limits. Why not just drop any proof lookup if it doesn't complete in, say, 20s & report the failure to the log?

This would at least minimise the impact of that element.
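A minimal sketch of that suggested cut-off, not the actual harvester code: it assumes the proof lookup is exposed as an awaitable, and the lookup_with_deadline name and 20s default are illustrative:

```python
import asyncio
import logging

log = logging.getLogger("harvester")

async def lookup_with_deadline(lookup_coro, plot_path: str, deadline: float = 20.0):
    """Drop a proof lookup that cannot finish inside the pool's partial window."""
    try:
        return await asyncio.wait_for(lookup_coro, timeout=deadline)
    except asyncio.TimeoutError:
        log.warning("Dropped proof lookup for %s after %.0fs deadline",
                    plot_path, deadline)
        return None
```

In practice the lookup runs inside native chiapos code that cannot simply be cancelled mid-call, which is presumably part of why #7788 describes this as harder than it looks.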

@emlowe (Contributor) commented Aug 11, 2023

Unfortunately, that is a rather long-standing issue (see #7788), and despite seeming relatively straightforward, it is more complicated than it appears.

But yes, this could be a part, maybe a large part, of the problem.

@Coyote-UK

I've been following my hunch that it is linked to bad plots and am making progress.

I have aggressively removed plots that meet any of the following criteria in a chia plots check with the default 30 iterations (remember, ALL of these plots pass the basic plot check); see the sketch after this list:

Plots with less than 16 Proofs found
Plots with "Proof dropped due to line point compression" more than once
Plots with "Quality doesn't match with proof" more than once
Plots that report "Looking up qualities took:" longer than 5 seconds more than once
Plots that report "Finding proof took:" longer than 30 seconds more than once

Replacing these plots (approximately 8% of my total farm after several iterations) has massively reduced the issue.

I am having to use the bladebit_cuda plotter provided as part of alpha 4.3, as plots from any later plotter cannot be checked using the chia plots check command in alpha 4.3.

I want to be able to switch back to alpha 4.3 to validate that the performance difference still exists.

What has changed in the plot format since the alpha 4.3 plotter that would cause the check to fail? Does this suggest that I may need to completely replot to this newer plot format?

Is it possible that you are unable to reproduce the error in the farming software because the problem is initiated in imperfect plots & the farmer application is not handling the problem?

Plotting in Ubuntu 23.04 with 256GB RAM, Intel CPU, Nvidia Quadro M5000 GPU (CUDA 5.2, 8GB vRAM)

@PhillWalker (Author)

The issue persists with 2.0 RC5, which uses chiapos 2.0.1. Approximately 20% of partials are late on the lower-end hardware. No lates on the fc.compression build for comparison.

@PhillWalker (Author)

The issue persists with 2.0 RC6, which uses chiapos 2.0.2. Approximately 20% of partials are late on the lower-end hardware. No lates on the fc.compression build for comparison.

@PhillWalker (Author)

The issue persists with the final 2.0 public release. This is using the official Chia docker image built by Chia, rather than a self-built one, now that it is available.

@Coyote-UK commented Aug 24, 2023

Strangely, I didn't see a single late partial submission whilst using RC6, but I have already seen 2 late partial submissions on 2.0.0.

Did the 'timeout' introduced in RC5/6 not make it into the final release?

@PhillWalker (Author)

Issue persists in 2.0.1

@PhillWalker (Author)

Issue persists in 2.1.0

@wjblanke (Contributor) commented Nov 1, 2023

What CPU are you running on the Synology? Have you actually farmed a block successfully with the alpha 1 version? We are wondering whether it is fully correct, or flawed but fast.

@PhillWalker (Author)

Intel Atom D2700

2 cores, 4 threads

@wjblanke (Contributor) commented Nov 1, 2023

Intel® SSE2, Intel® SSE3, Intel® SSSE3

No AVX. Maybe it's an instruction-set issue at compile time. Did you have luck making blocks with the alpha version? I will look at the GreenReaper compiler settings.
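For checking the instruction-set hypothesis on a Linux harvester, a quick Linux-only Python sketch that reports which SIMD extensions the CPU actually exposes; the flag list shown is just a sample:

```python
def cpu_flags() -> set[str]:
    """Return the CPU feature flags reported by the Linux kernel."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__":
    flags = cpu_flags()
    for ext in ("sse2", "ssse3", "sse4_1", "sse4_2", "avx", "avx2"):
        print(f"{ext:>7}: {'yes' if ext in flags else 'no'}")
```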

@PhillWalker (Author)

I can confirm I won multiple blocks on this version across a 1.5PiB farm, on multiple harvesters.

@PhillWalker (Author)

@Coyote-UK's CPU is an i7-4500U, for reference, with the same issue.

@Coyote-UK

Confirmed: i7-4500U, 16GB RAM, SSD for the DB, Win 10, 2,472 C5 plots (created with alpha 4), extensive pruning to remove bad plots. No blocks found between heights 3,798,055 & 4,155,045 (equivalent to 2.91x ETW), which covers ALL compressed-plot farming up to the release of 2.0.0. I accept that statistically this is possible, but it is a big coincidence.
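For what it's worth, the odds of that dry spell can be estimated directly, assuming block wins follow a Poisson process (the standard ETW assumption) and the farm size was constant over the window:

```python
import math

expected_wins = 2.91                      # 2.91 x ETW over the block range above
p_no_blocks = math.exp(-expected_wins)    # Poisson probability of zero wins
print(f"P(no blocks over {expected_wins} ETW) ~ {p_no_blocks:.1%}")  # ~5.4%
```

That is roughly a 1-in-18 chance by luck alone: possible, as noted, but also consistent with something being wrong.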

@wjblanke (Contributor) commented Nov 8, 2023

It certainly sounds like there is a performance problem if you were able to win blocks with the old version. It will be difficult for us to regression-test this since the merge was so long ago. If there is any other information you can give us, please let us know. I do appreciate the testing you have done so far.
