Magic bitboards #628
Conversation
FYI there are pure movegen tests (without any MCTS) in board_test.cc.
As there have been recent search improvements, I have merged this with current master (6a639b6) and tested it against current master with 2 threads, on an 8-core Xeon (L1 cache 32K/32K per core, L2 cache 8MB, L3 cache 11MB) running Ubuntu, using the starting position as well as your opening and ending FENs. I took the mean of 20 runs. PR628 was 1.25% faster than master on the opening FEN and 2.3% faster than master on the ending FEN.
The chessboard test is 25% faster with this PR. Thanks for the tests @gsobala. I still obtain more or less the same speedup results against current master as mentioned earlier. I should also mention sv's single-run measurements of the starting position posted on Discord (on his 2x Intel Xeon Silver 4108, 11MB L3 cache each, @ 1.8 GHz).
Before moving on, I'd like to pinpoint the root cause of the slowdown seen in some cases on my machine (cache misses?) and add possible optimizations.
I verified that cache misses are indeed the cause of the slowdown in some cases. I tried a lot of code optimizations to further reduce CPU cycles (e.g. using the faster PEXT instruction available on modern CPUs to index the attack tables, which eliminates the index calculation with magic numbers entirely), to no avail, as this doesn't solve the cache misses.
This PR adds "fancy magic bitboards" to move generation.
The following measurements all share the command line:

```
lc0 benchmark --backend=random --smart-pruning-factor=0 --nodes=1000000 --nncache=1000000
```

A 95% confidence interval for the average of the last reported nps value over 25 runs is given for the starting position, the opening position https://lichess.org/editor/rn2kb1r/pp3ppp/2p1pnb1/8/3P1N1q/4B1N1/PPPQ1PPP/R3KB1R_b_KQkq_-_5_9 and the endgame position https://lichess.org/editor/8/2R2P2/1p4K1/6R1/1r6/5rP1/1k6/2q5_w_-_-_0_55 . The measurements are performed on an Intel Core i5-5300U CPU @ 2.30GHz with 4 cores (L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 3072K).
For 1 search thread there is a substantial speedup for all positions (measured at 0af66eee):
For 2 search threads the performance drops slightly on my system for all positions (measured at 0af66eee):
I think this can be completely attributed to the fact that there are more cache misses as the 3MB L3 cache is shared between all cores on my CPU (and not that big really). Note that the rook attack tables are ~800KB and the bishop attack tables are ~40KB.
The speedup will probably be positive for 2 search threads on higher-end CPUs. Feel free to test!