Q value in MCTS #72
Replies: 4 comments
-
According to this: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png,
-
The problem is that initializing Q to 0 does not encourage the MCTS to explore enough in a game where draws can happen often. You end up with a chicken-and-egg situation: Q not evaluated -> initialized to 0 -> considered low potential -> not explored -> not evaluated.
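To make that concrete, here is a rough comparison under a PUCT-style score Q + cpuct * P * sqrt(N_parent) / (1 + N_child); the numbers are made up purely for illustration:

```python
cpuct = 1.0
n_parent = 100  # simulations that have passed through the parent node so far

# Move A: never visited, small prior, so Q is stuck at its initial value of 0.
score_a = 0.0 + cpuct * 0.01 * (n_parent ** 0.5) / (1 + 0)    # = 0.10

# Move B: visited 80 times, mildly positive Q (occasional wins), large prior.
score_b = 0.15 + cpuct * 0.60 * (n_parent ** 0.5) / (1 + 80)  # ~= 0.22

# score_b > score_a, so move A keeps losing the selection step, never gets
# evaluated, and its Q never moves away from 0 -- the chicken-and-egg loop above.
```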
-
Keep in mind that the decision is not made entirely based on Q, but on Q+U. When Q is initialized to 0, the objective function is just proportional to P, the policy network. If you believe it's having issues with not exploring enough of the tree, the first thing I would try is increasing the number of MCTS runs per move. (I think the default is 25, which is honestly pretty small.) If you think it's exploring the tree but 'locking in' to a few branches too early, then you can increase the cPUCT parameter to get more exploration. cPUCT is actually a pretty arbitrary value, and what a good number is can depend a lot on specifics like: the typical values of your policy/value network; the typical branching factor of the game (~300 for Go vs. ~6 for Connect 4); and how many MCTS runs are done per move. (Personally, I think the PUCT algorithm is itself pretty arbitrary -- and DeepMind has never claimed that it had much mathematical basis.)
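For reference, the Q+U selection score in an alpha-zero-general-style MCTS looks roughly like the sketch below. Qsa, Nsa, Ns and Ps mirror the repo's statistics dictionaries, but treat this as an illustration rather than the exact code:

```python
import math

def puct_score(Qsa, Nsa, Ns, Ps, s, a, cpuct):
    """Q + U score used to pick the next action at state s during tree descent."""
    if (s, a) in Qsa:
        # Visited edge: exploitation term Q plus an exploration bonus that
        # shrinks as the edge's visit count Nsa grows.
        return Qsa[(s, a)] + cpuct * Ps[s][a] * math.sqrt(Ns[s]) / (1 + Nsa[(s, a)])
    # Unvisited edge: Q is implicitly 0, so the score is driven entirely by the
    # prior P -- this is the situation discussed in this thread.
    return cpuct * Ps[s][a] * math.sqrt(Ns[s] + 1e-8)

# The action played in the tree is the argmax over legal a of puct_score(...);
# raising cpuct scales the prior-driven U term relative to Q, i.e. more exploration.
```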
-
I bumped into this issue when debugging my own implementation of AlphaZero for Hive, so I thought I would add a few comments for others finding this thread. As others pointed out, Q=0 for unvisited nodes can cause potentially good options to "starve" (never be visited). Initializing Q with v (the predicted value) is probably a good approach, but it doesn't work in the early stages of training, when v is still very unreliable. I find using cPUCT to control exploration tricky, because it also controls how much emphasis is put on the predicted Ps[a] as opposed to the results collected by the MCTS, so it can starve moves that were given a bad Ps[a] (an issue early on). It's also a lever affected by too many things (per @Timeroot's comment), so increasing it to get more exploration may hurt quality. By the way, in Hive, the game I'm writing a toy implementation for, it's very easy for weak players to reach a draw, with the moves looping indefinitely. So I need to set a maximum depth for the MCTS, at which point it returns the predicted "v" as the reward. So the predicted "v" is heavily used here anyway...
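A minimal sketch of that leaf handling, assuming the repo's game.getGameEnded(board, player) and nnet.predict(board) -> (pi, v) interfaces; max_depth is a hypothetical knob, not something in the original repo:

```python
def leaf_value(canonical_board, depth, game, nnet, max_depth=200):
    """Value to back up if this node should be treated as a leaf, else None.

    Sketch only: game.getGameEnded returns a nonzero result for a finished game
    (as in alpha-zero-general); max_depth caps loop-prone games like Hive.
    """
    result = game.getGameEnded(canonical_board, 1)
    if result != 0:
        return result            # real outcome: win/loss/draw from player 1's view
    if depth >= max_depth:
        _, v = nnet.predict(canonical_board)
        return v                 # depth cap reached: fall back on the value head
    return None                  # not a leaf: keep selecting/expanding
```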
-
Hello,
In MCTS.py, the upper bound is defined as follows for a state that hasn't been explored (line 107):
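The snippet in question looks roughly like this (paraphrased from MCTS.py; Qsa, Nsa, Ns, Ps and args.cpuct are the repo's own names, and the exact formatting may differ between versions):

```python
if (s, a) in self.Qsa:
    u = self.Qsa[(s, a)] + self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s]) / (
        1 + self.Nsa[(s, a)])
else:
    u = self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s])  # Q = 0 ?
```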
As commented, Q is assumed to be 0.
I think the Q function should be evaluated using the neural network, as follows:
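Something along these lines (a sketch of the idea rather than drop-in code; getNextState, getCanonicalForm and nnet.predict are the repo's Game/NeuralNet methods, and the sign convention on v would need checking):

```python
# Instead of assuming Q = 0 for an unexplored (s, a), estimate it with the value head:
next_board, next_player = self.game.getNextState(canonicalBoard, 1, a)
next_canonical = self.game.getCanonicalForm(next_board, next_player)
_, v = self.nnet.predict(next_canonical)   # value from the next player's perspective
u = -v + self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s])
```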
In that case, the MCTS takes much longer because of the extra neural network evaluations.
I think you need to use the value network there anyway; otherwise, the value output by the neural network (v on line 78, for instance) would never be used in any part of the algorithm.
I came to that conclusion because I've trained the model for Connect 4 and it does not seem to improve: the arena always yields 0 wins / 0 losses / 40 draws, whereas it should be 20/20/0 under perfect play (because the first player can force a win). Using 0 as the state-action value could be responsible for that behavior.
Thanks!