Optimizations for TimeLord

voidxno edited this page Jul 26, 2024 · 11 revisions

Running a TimeLord is optional (default is disabled). Not needed to run a fully functional node.

The blockchain, as a whole, needs at least one active timelord to move forward. A few more, spread around, are preferred for redundancy and security.

If you want to contribute by running one, check requirements below. Enable timelord in WebGUI, or set true in config/local/timelord file. Check that it is running, and its speed, in NODE / LOG / TIMELORD tab in WebGUI. Speed is probably lower than NODE / VDF Speed, unless you are the fastest timelord.

No more is needed. Standard Windows binaries and a Linux compile give good performance for a timelord. Unless you want to optimize for fastest timelord, or for the fun of it.

TL;DR

I want to run a fast timelord:

  • Use Linux (Clang15) or Windows
  • Compile with AVX baseline
  • Have a GPU/iGPU verify VDF
  • Clock CPU as high as possible

NOTE: Optimize and overclock at your own risk.

Logic

A timelord performs a very simple mathematical operation (SHA256, Secure Hash Algorithm). It is performed recursively and cannot be parallelized. Previous result is needed, before repeating, as fast as possible. Like a very specific mathematical single-threaded benchmark.
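To illustrate why the work is serial, here is a minimal sketch of a hash chain in shell. It is not the mmx-node implementation: the function name `sha256_chain` is made up for this example, and it feeds the hex text of each digest back in, which is an assumption for readability and not necessarily the byte layout the timelord uses. The point is only that round N cannot start before round N-1 is done.

```shell
# Apply SHA256 n times; each round hashes the previous round's hex digest.
# $1 = seed string, $2 = number of rounds
sha256_chain() {
  h=$1
  i=0
  while [ "$i" -lt "$2" ]; do
    # The next input is the previous output, so rounds cannot run in parallel.
    h=$(printf '%s' "$h" | sha256sum | cut -d' ' -f1)
    i=$((i + 1))
  done
  printf '%s\n' "$h"
}

sha256_chain "seed" 100
```

This runs far slower than a real timelord, since every round starts a new process. It only shows the dependency between rounds.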

Timelord runs 3x of these operations, VDF (verifiable delay function) streams, in parallel. But their individual workload cannot be parallelized more.

A third VDF stream was introduced for on-chain timelord rewards with testnet10. Can be turned off, leaving 2x VDF streams. Still possible to be fastest timelord, but no timelord rewards will be given.

Only fastest timelord, at any time, produces VDF for block being created. And can receive a timelord reward.

Requirements

CPU: Intel or AMD (w/ SHA extensions).
Model: Intel 11th-gen (Rocket Lake), AMD Zen, or later (a few exceptions).
GPU/iGPU: Any compatible (offload verify VDF).

Windows: CPU-Z (Instructions) or HWiNFO64 (Features), look for SHA.
Linux: grep -o 'sha_ni' /proc/cpuinfo, empty if not available.

You can run timelord on a CPU without SHA extensions. It will fall back to AVX2. In practice, SHA extensions are needed. SHA256 calculations are ~5-8x faster with SHA vs AVX2. About 25-40 MH/s with SHA, ~5 MH/s with AVX2.
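A small sketch to check both instruction sets on Linux in one go. The helper name `has_cpu_flag` is made up for this example. It assumes an x86 style /proc/cpuinfo with a `flags` line, and matches whole words only, so `sha` does not match `sha_ni`.

```shell
# Report whether a CPU flag is present in a cpuinfo-style file.
# $1 = flag name (e.g. sha_ni, avx2), $2 = file (defaults to /proc/cpuinfo)
has_cpu_flag() {
  grep -m1 '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -qw "$1"
}

for flag in sha_ni avx2; do
  if has_cpu_flag "$flag"; then
    echo "$flag: available"
  else
    echo "$flag: not available"
  fi
done
```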

Why not GPU/FPGA/ASIC

Timelord logic will only use CPU, not GPU.

GPU is great for parallel SHA256 calculations, beating CPU in both speed and efficiency. GPU is used for verify VDF operation on a node, if available.

For a single SHA256 calculation, CPU's SHA extensions will beat GPU on speed. As timelord SHA256 workload is not parallelizable, CPU wins the serial SHA256 race.

Feedback welcome on other contenders. As of now, nothing observed beating a high-GHz CPU with SHA extensions (optimized silicon circuits inside CPU). Too low speed (GHz) on FPGA, work not parallelizable. Prohibitive cost to produce a high-GHz ASIC that beats Intel/AMD optimized silicon.

Optimize TimeLord

Standard Windows binaries and a Linux compile give good performance for a timelord. Still important to tune surrounding environment. Either aiming for fastest timelord rewards, or just the challenge.

Information and numbers in this article might be superseded. Sections below are known info as of the publish date. Meant to help get started or give ideas. Not the absolute answer. There are probably angles not discovered yet, and other ways to go about it.

You do not need to complicate it like below. Try to run a timelord. Measure speed. Try out a tip. Measure if faster.

Discuss and share in #mmx-timelord channel on Discord. No requirement to divulge all your secrets. But a good place to get tips, or kickstart new ideas.

NOTE: Optimize and overclock at your own risk.

Test Environment

CPU: Intel 13th-gen (Raptor Lake)
GPU: Nvidia GT 1030 2GB (GP108)
OS: Windows 10 (22H2)
OS: Ubuntu 23.04 (Lunar Lobster)
mmx-node: v0.11.3 (+ AVX)

Most instructions below are transferable to AMD.

Measuring Speed

Timelord speed is measured in MH/s (million hashes per second).

Current blockchain network speed, fastest timelord, is found in NODE / VDF Speed in WebGUI. Your own timelord speed is found under NODE / LOG / TIMELORD tab in WebGUI.

To make it easier to measure own improvements, establish baseline numbers, and compare with others, it is recommended to measure MH/s speed per 0.1 GHz (MH/s/0.1GHz). Yes, absolute speed is the end goal. But timelord speed, at least as observed for now, scales linearly with CPU GHz. A 2-step process is recommended:

  1. Optimize for best possible MH/s/0.1GHz
  2. Clock 3x CPU cores as high as possible

In this case the Intel 13th-gen has been all-core locked to 5.0 GHz (could be lower, not important). Hyperthreading and E-cores disabled (more on that later). Makes optimization measurements easier and controlled.

Intel 13th-gen (v0.11.3 + AVX):

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Windows/VC++ | 35.31 MH/s | /50 (5.0 GHz) | 0.706 MH/s/0.1GHz |
| Ubuntu/Clang15 | 35.27 MH/s | /50 (5.0 GHz) | 0.705 MH/s/0.1GHz |
| Ubuntu/gcc12 | 34.97 MH/s | /50 (5.0 GHz) | 0.699 MH/s/0.1GHz |

These numbers represent absolute speed per 0.1 GHz, given the environment and tuning. Easy to compare against yourself or others. Top speed after that is dependent on how high you can clock 3x CPU cores (more on that later). In this case, 3x cores stable at 6.0 GHz, would give a timelord speed of ~42.3 MH/s.
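The arithmetic behind the per-unit number and the projected speed, as a small sketch. The helper names `mhs_per_unit` and `projected_mhs` are made up for this example. The projection assumes the linear scaling with GHz described above.

```shell
# MH/s per 0.1 GHz, from measured MH/s and the locked clock in GHz.
mhs_per_unit() {
  awk -v mhs="$1" -v ghz="$2" 'BEGIN { printf "%.3f\n", mhs / (ghz * 10) }'
}

# Projected MH/s at a target clock, assuming speed is linear in GHz.
projected_mhs() {
  awk -v unit="$1" -v ghz="$2" 'BEGIN { printf "%.1f\n", unit * ghz * 10 }'
}

mhs_per_unit 35.27 5.0    # prints 0.705
projected_mhs 0.705 6.0   # prints 42.3
```

Measure with your own locked clock, then plug in the clock you think you can hold stable on 3x cores.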

As a side note, numbers from 12th-gen Intel give the exact same performance per 0.1 GHz. Basically, no IPC (instructions per clock) uplift for SHA extensions between them (for this specific use-case). But 13th-gen has the potential to clock higher.

Testing of AMD Zen4-core in AMD 7040-series gives 0.755 MH/s/0.1GHz. Giving AMD the edge per 0.1 GHz vs. Intel who often can clock higher. In the end, an overclocking race.

We also have the E-cores on Intel 13th-gen. Even higher at 0.975 MH/s/0.1GHz. More efficient, but they clock lower than P-cores. 3x E-cores at max official boost of 4.3 GHz would give a timelord speed of ~41.9 MH/s.

Fastest timelord speeds observed on testnets (as of Apr2024):

| Continuous | Peak |
|---|---|
| ~43.2 MH/s | ~43.5 MH/s |

Windows or Linux

In previous releases, the numbers above have switched between the Windows binary and Linux being fastest. With current source code, Linux (Clang15) and Windows are very close.

Instructions shown in sections below are done on Linux. But most aspects are applicable to Windows too.

Linux distribution and kernel often have an effect on different types of workloads. When it comes to timelord logic, not much observed. Logic for creation of a VDF stream is very small. A few instructions repeated in a CPU core.

Optimize: Establish defaults

Follow default mmx-node installation for Linux (in this case Ubuntu, with default compiler gcc12). Get mmx-node up and running. Enable timelord in WebGUI, or set true in config/local/timelord file.

Let it run for a while. Check average speed in NODE / LOG / TIMELORD tab in WebGUI. With this setup:

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/gcc12 | 34.36 MH/s | /50 (5.0 GHz) | 0.687 MH/s/0.1GHz |

Optimize: Compiler

Compiler has an effect on how well source code is translated to binary objects. Default for Ubuntu 23.04 is GCC (GNU Compiler Collection), or gcc12. An alternative is Clang (LLVM), or Clang15. There are others. Here Clang15 looks to do a better job than gcc12.

One way to install, enable and compile with Clang15:

sudo apt install clang lld libomp-dev
export CC=/usr/bin/clang-15
export CPP=/usr/bin/clang-cpp-15
export CXX=/usr/bin/clang++-15
export LD=/usr/bin/ld.lld-15
./clean_all.sh
./make_devel.sh

NOTE: You need to perform export statements in terminal environment before compile, or gcc12 will be used.
NOTE: When you switch compiler, or compiler options, always run ./clean_all.sh before a new compile.
NOTE: You will get a lot of unused -fmax-errors=1 warnings. Just ignore, or remove from compiler options.
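Since a forgotten export silently falls back to gcc12, a quick check before compiling can save a wasted build. A minimal sketch. The helper name `check_compiler_env` is made up for this example, and it only looks at the CC and CXX variables from the export lines above.

```shell
# Warn if the compiler variables from the export lines are missing in this shell.
# Returns 0 when both CC and CXX are set, 1 otherwise.
check_compiler_env() {
  missing=0
  for var in CC CXX; do
    eval "val=\${$var:-}"
    if [ -z "$val" ]; then
      echo "$var is not set: default compiler will be used"
      missing=1
    else
      echo "$var=$val"
    fi
  done
  return "$missing"
}

check_compiler_env || echo "Run the export lines first, then ./clean_all.sh and ./make_devel.sh"
```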

Default Clang15 compile (./make_devel.sh):

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/Clang15 | 34.80 MH/s | /50 (5.0 GHz) | 0.696 MH/s/0.1GHz |

Small, but noticeable jump from gcc12's 0.687 MH/s/0.1GHz.

Optimize: Compiler options

Compiler options can have a big effect on how source code is transformed to a binary object. Often focus is on speed vs size. Several options have an effect on timelord logic. Much has been tried with both gcc12 and Clang15.

For now, Clang15 with default options in ./make_devel.sh gives best performance. Maybe remove -fmax-errors=1 to get rid of warnings.

Some elements to experiment with (./make_devel.sh):

  • Switch between Release and RelWithDebInfo (-DCMAKE_BUILD_TYPE)
  • Remove -fno-omit-frame-pointer (-DCMAKE_CXX_FLAGS)
  • Add -march=native (-DCMAKE_CXX_FLAGS)
  • Variants of -O optimization option (-DCMAKE_CXX_FLAGS)

There are others. Look up optimization in relevant compiler documentation.
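To keep track of which variant you are testing, it can help to write the two CMake settings out explicitly. A sketch only: the helper name `print_cmake_variant` is made up, the trailing `..` source path is an assumption, and the actual cmake line in ./make_devel.sh may carry other options that should be kept. It prints a command and executes nothing.

```shell
# Print a cmake configure command for one variant to try.
# $1 = build type (Release or RelWithDebInfo), $2 = C++ flags
print_cmake_variant() {
  printf 'cmake -DCMAKE_BUILD_TYPE=%s -DCMAKE_CXX_FLAGS="%s" ..\n' "$1" "$2"
}

print_cmake_variant Release "-O3 -march=native"
print_cmake_variant RelWithDebInfo "-O2 -fno-omit-frame-pointer"
```

Change one thing at a time, and measure MH/s/0.1GHz after each compile.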

NOTE: When you switch compiler, or compiler options, always run ./clean_all.sh before a new compile.

Optimize: Source code

One thing to be aware of is that we want to optimize a tiny part of the whole mmx-node. Even a tiny subset of the whole timelord logic: the calculation of a VDF stream. It is performed through hash_t TimeLord::compute(...) (/src/TimeLord.cpp) calling recursive_sha256_ni(...) (/src/sha256_ni_rec.cpp). We do not care about the rest, as long as this part goes as fast as possible. Unless surrounding elements have an effect. Not observed for now.

That part of source code is already written to be fast when translated to binary objects by compiler (inline, intrinsics, asm).

Several iterations were made for it to end up like that. Still, this is the place to adjust source code if you think there is a way to optimize it even more.

Optimize: Source code (AVX vs SSE4.2)

Current default compile combines the usage of SHA extensions and SSE4.2 instructions. Raising the SSE4.2 baseline to AVX instructions gives about 1% boost on 13th-gen Intel. Has to do with the compiler using the equivalent AVX versions of certain SSE4.2 instructions. On 11th-gen Intel, though, this looks to degrade performance (better with SSE4.2).

To implement AVX vs SSE4.2 baseline (./CMakeLists.txt):

Change -msse4.2 to -mavx on two lines (Linux compile part):

set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")

Add two lines (Windows compile part):

set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")

The exact locations are easier to see in the closed PR#210.
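For the Linux part, the two-line change can also be scripted. A minimal sketch, assuming the two lines currently carry `-msse4.2` next to `-msha` (inferred from the description above, so check your CMakeLists.txt first). The helper name `use_avx_baseline` is made up. It only touches lines that mention the two SHA source files, and keeps a .bak copy. The Windows lines still have to be added by hand.

```shell
# Switch the two SHA-NI source files from an SSE4.2 to an AVX baseline.
# $1 = path to CMakeLists.txt (edited in place, original saved as <file>.bak)
use_avx_baseline() {
  sed -i.bak -E '/src\/sha256_ni(_rec)?\.cpp/ s/-msse4\.2/-mavx/' "$1"
}

# Usage, from the mmx-node directory:
#   use_avx_baseline ./CMakeLists.txt
#   ./clean_all.sh && ./make_devel.sh
```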

Small, but real jump from Clang15's 0.696 MH/s/0.1GHz.

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/Clang15 | 35.27 MH/s | /50 (5.0 GHz) | 0.705 MH/s/0.1GHz |

NOTE: For now, official releases have SSE4.2 as baseline.

Optimize: CPU speed

At this stage we know what to expect for each 0.1 GHz: 0.705 MH/s. All testing observed so far has shown a linear increase with CPU GHz. Now it is time to clock CPU as high as possible.

NOTE: Optimize and overclock at your own risk.

First a boring observation. Many elements surrounding raw GHz of CPU cores have been tested:

  • RAM type/speed/latency/bandwidth
  • HyperThreading on/off
  • Virtualization (VT-d)
  • Mitigations (Spectre/Meltdown)
  • CPU cache/ring ratio
  • CPU L1/L2/L3 cache size
  • CPU core-to-core latency

Nothing looks to affect timelord speed, except CPU core clock (GHz). Remember, timelord logic for creation of VDF streams is very small. Not much outside a few instructions repeated in a CPU core.

Timelord logic has 3x process threads. Each wants 100% of 1x CPU core, to calculate a VDF stream. Goal is to create an environment that makes these 3x process threads run with high GHz continuously.

One way, and valid strategy, is to let the OS process scheduler do its job (Linux or Windows). Distribute and use resources as best possible, depending on requirements and state of system. Maybe tune some aspects of OS, together with BIOS adjustments to clock CPU as high as possible. Gives great results. All modern CPUs have logic to boost individual CPU cores in combination with OS scheduler, power management and other logic.

Another, more manual way, is to dedicate specific CPU cores to the 3x timelord process threads. Locking OS and other processes away from them. In this case an Intel CPU with 8x P-cores, numbered 0-7. Hyperthreading and E-cores disabled. Going to dedicate core 4,5,7 to timelord process threads. One way to achieve it (Linux, in this case Ubuntu):

  • Force OS process scheduler to not use core 4,5,7. Add isolcpus=4,5,7 to GRUB_CMDLINE_LINUX (/etc/default/grub), then run sudo update-grub and reboot. Easily observed through htop and CPU core utilization.
  • When timelord up and running, you should have 5x process threads with command name of 'TimeLord':
    ps -A -T -o tid,comm,pcpu | grep 'TimeLord'
    In practice, the three last are the 100% CPU creating VDF stream process threads. Can also find them with htop. Let's say they have pid(tid) 5111, 5112, 5113. Assign each of them an isolated CPU core:
    taskset -cp 4 5111
    taskset -cp 5 5112
    taskset -cp 7 5113
    Check result through htop. Should have cores 4,5,7 at 100% all the time through the 3x VDF creation streams.
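The manual pinning steps above can be sketched as a small script. The helper names `busiest_tids` and `pin_timelord_threads` are made up for this example. It assumes the three VDF threads are the three 'TimeLord' threads with highest CPU usage, and uses cores 4,5,7 as in the text. Which thread lands on which core does not matter.

```shell
# Read "tid comm pcpu" lines on stdin, print the tids of the N busiest threads.
# $1 = thread name to match, $2 = how many to pick
busiest_tids() {
  awk -v name="$1" '$2 == name { print $3, $1 }' \
    | LC_ALL=C sort -rn | head -n "$2" | awk '{ print $2 }'
}

# Pin each picked thread to one isolated core (cores 4,5,7 as in the text).
pin_timelord_threads() {
  set -- $(ps -A -T -o tid,comm,pcpu | busiest_tids TimeLord 3)
  for core in 4 5 7; do
    [ -n "${1:-}" ] || break
    taskset -cp "$core" "$1"
    shift
  done
}
```

Call pin_timelord_threads once the timelord has run for a bit, so CPU usage numbers have settled. Thread ids change on every restart of the node, so pinning has to be redone after each restart. Isolated cores can be confirmed with cat /sys/devices/system/cpu/isolated.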

Reason for disabling Hyperthreading and E-cores: no penalty observed on timelord speed, fewer complications, more overclocking potential. In the same category, if the motherboard supports it: manually clock cores 4,5,7 high (GHz), and lower for 0,1,2,3,6.

Now it is a game of getting highest possible GHz, while keeping CPU cool and stable.

It is possible to be fastest timelord, produce VDF for block being created, with only 2x VDF streams (option in SETTINGS). No timelord rewards will be given, blockchain still 100% operational.

Mentioned because newer Intel CPUs boost (GHz) 2x favored cores higher than others, if workload is optimal. Usually these 2x favored cores have higher overclock potential. By adding the third VDF stream for on-chain timelord rewards, 3x high-GHz cores are needed. You might be able to clock 2x cores higher than 3x, but then no rewards. Choices, choices.

Fastest TimeLord

First. Timelord rewards in testnets are not incentivized. Unlike block wins from testnet8, and later. Basically, no timelord rewards from testnets will transfer to mainnet.

On-chain timelord rewards were introduced with testnet10. Now part of blockchain logic. Before that, a temporary centralized solution existed.

How do you know if you are the fastest timelord? The ultimate indicator is very easy. There is a wallet address set up as 'TimeLord Reward Address' target in SETTINGS in WebGUI. Timelord rewards will show up there as 0.01 MMX of type VDF_REWARD. Not necessarily for all blocks. Depends on the farmer verifying the timelord reward, if close to the 5sec verify VDF limit.

Another indicator is looking for Broadcasting VDF for height x messages in NODE / LOG / ROUTER tab in WebGUI. This does not guarantee you are the fastest timelord. But you are close to the threshold, and broadcasting VDF.

Overtaking as fastest timelord is usually not instant. Unless current fastest timelord outright stops, or the new one is faster by a good margin. Your timelord starts behind because of network and verify VDF latency. Not easy to quantify, given internet itself and other nodes. A fast VDF verify at the start (GPU/iGPU) will not hurt, since the verified point is where your timelord starts calculating its VDF streams from. If you are faster (MH/s), you should get ahead in the end.

To illustrate. With test setup above, running +0.2 MH/s over perceived speed of fastest timelord (network VDF speed). It took a few minutes to get first timelord reward, overtaking as fastest timelord. Verify VDF (not a fast GPU) was 2.5sec at the time. Took about 36 blocks (6min) to overtake.

Feedback

Please contradict findings above, or tell of new ones. Use #mmx-timelord channel on Discord.