
Magic bitboards #628

Closed
wants to merge 10 commits into from

Conversation


@ddobbelaere ddobbelaere commented Dec 29, 2018

This PR adds "fancy magic bitboards" to move generation.

The following measurements all share the command line `lc0 benchmark --backend=random --smart-pruning-factor=0 --nodes=1000000 --nncache=1000000`. For each position, a 95% confidence interval for the mean of the last reported nps value over 25 runs is given. The positions are the starting position, the opening position https://lichess.org/editor/rn2kb1r/pp3ppp/2p1pnb1/8/3P1N1q/4B1N1/PPPQ1PPP/R3KB1R_b_KQkq_-_5_9 and the endgame position https://lichess.org/editor/8/2R2P2/1p4K1/6R1/1r6/5rP1/1k6/2q5_w_-_-_0_55 .

The measurements were performed on an Intel Core i5-5300U CPU @ 2.30GHz with 4 cores (L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 3072K).

For 1 search thread there is a substantial speedup for all positions:

| position | master 0af66eee (nps) | PR #628 (nps) | speedup |
| --- | --- | --- | --- |
| start | 186968 ± 313 | 189317 ± 229 | +1.3% |
| opening | 141347 ± 340 | 144938 ± 354 | +2.5% |
| endgame | 154753 ± 248 | 162247 ± 208 | +4.8% |

For 2 search threads the performance drops slightly on my system for all positions:

| position | master 0af66eee (nps) | PR #628 (nps) | speedup |
| --- | --- | --- | --- |
| start | 227397 ± 565 | 226473 ± 745 | -0.4% |
| opening | 180583 ± 442 | 178087 ± 569 | -1.4% |
| endgame | 191124 ± 530 | 188983 ± 463 | -1.1% |

I think this can be completely attributed to additional cache misses: the 3MB L3 cache is shared between all cores on my CPU (and is not that big, really). Note that the rook attack tables are ~800KB and the bishop attack tables are ~40KB.

The speedup will probably be positive for 2 search threads on higher-end CPUs. Feel free to test!

@mooskagh
Member

FYI there are pure movegen tests (without any MCTS) in board_test.cc.
They don't measure time, but that can be added (it can also be measured by an external program, like bash's time).


gsobala commented Dec 30, 2018

As there have been recent search improvements, I merged this PR with current master (6a639b6) and tested it against current master with 2 threads on an 8-core Xeon (L1 cache 32K/32K ×8, L2 cache 8MB, L3 cache 11MB) running Ubuntu, using the starting position as well as your opening FEN and ending FEN. I took the mean of 20 tests.

PR628 was 1.25% faster than master on the opening FEN and 2.3% faster than master on the ending FEN.


ddobbelaere commented Dec 30, 2018

The chessboard test is 25% faster with this PR.

Thanks for the tests @gsobala. I still obtain more or less the same speedup results as mentioned earlier with current master.

I should also mention sv's single-run measurements from Discord for the starting position (on his 2× Intel Xeon Silver 4108 @ 1.8 GHz, 11M L3 cache each).

| threads | master (nps) | PR628 (nps) |
| --- | --- | --- |
| 1 | 112908 | 112711 |
| 2 | 129790 | 116351 |
| 4 | 125918 | 129308 |
| 8 | 109047 | 98344.8 |
| 16 | 103957 | 90542.6 |
| 32 | 95132 | 82001.5 |

Before moving on, I'd like to pinpoint the root cause of the slowdown observed in some cases on my machine (cache misses?) and add possible optimizations.

@ddobbelaere ddobbelaere changed the title Magic bitboards [WIP] Magic bitboards Dec 30, 2018

ddobbelaere commented Dec 30, 2018

I verified that cache misses are indeed the cause of the slowdown in some cases.

I tried a lot of code optimizations to further reduce CPU cycles (e.g. using the faster PEXT instruction present on modern CPUs to index the attack tables, effectively eliminating the index calculation with magic numbers), to no avail, as this doesn't solve the cache misses.

I think that lc0 really differs from A/B engines in this respect. In A/B engines, nps are orders of magnitude higher (several Mnps) and cache misses are reduced by the "batched movegen" behavior. lc0 does a lot of other work in between generating positions (MCTS, node cache lookups, ...) that competes for the same cache.

This explains why chessboard_test runs 25% faster using magic bitboards (even 27% with the PEXT instruction): the caches stay hot while running movegen-only code. Using magic bitboards in lc0 itself doesn't have the hoped-for effect, due to the more costly memory accesses to the rook/bishop attack tables.

@ddobbelaere ddobbelaere changed the title [WIP] Magic bitboards Magic bitboards Dec 30, 2018
@ddobbelaere
Contributor Author

Turns out that with LTO enabled, I no longer experience any slowdown (there is a speedup in all considered cases). Magic bitboards are now implemented in #640, as an additional layer on top of #638, to avoid merge conflicts with this PR.
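For anyone reproducing the LTO result with lc0's Meson build: link-time optimization can be toggled through Meson's built-in `b_lto` option (the build directory path below is illustrative):

```shell
# Enable LTO for an existing build directory, then rebuild:
meson configure -Db_lto=true build/release
ninja -C build/release
```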
