Q value in MCTS #72
Replies: 4 comments
-
According to this: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png,
-
The problem is that initializing Q to 0 does not encourage the MCTS to explore enough in a game where draws can happen often. You end up with a chicken-and-egg situation: Q not evaluated -> initialized to 0 -> considered low potential -> not explored -> not evaluated.
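To make that concrete, here is a rough comparison under a PUCT-style score Q + cpuct * P * sqrt(N_parent) / (1 + N_child); the numbers are made up purely for illustration:

```python
cpuct = 1.0
n_parent = 100  # simulations that have passed through the parent node so far

# Move A: never visited, small prior, so Q is stuck at its initial value of 0.
score_a = 0.0 + cpuct * 0.01 * (n_parent ** 0.5) / (1 + 0)    # = 0.10

# Move B: visited 80 times, mildly positive Q (occasional wins), large prior.
score_b = 0.15 + cpuct * 0.60 * (n_parent ** 0.5) / (1 + 80)  # ~= 0.22

# score_b > score_a, so move A keeps losing the selection step, never gets
# evaluated, and its Q never moves away from 0 -- the chicken-and-egg loop above.
```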
-
Keep in mind that the decision is not made entirely based on Q, but on Q+U. When Q is initialized to 0, the objective function is just proportional to P, the policy network. If you believe it's having issues with not exploring enough of the tree, the first thing I would try is increasing the number of MCTS runs per move. (I think the default is 25, which is honestly pretty small.) If you think it's exploring the tree but 'locking in' to a few branches too early, then you can increase the cPUCT parameter to get more exploration. cPUCT is actually a pretty arbitrary value, and what a good number is can depend a lot on specifics like: the typical values of your policy/value network; the typical branching factor of the game (~300 for Go vs. ~6 for Connect 4); and how many MCTS runs are done per move. (Personally, I think the PUCT algorithm is itself pretty arbitrary -- and DeepMind has never claimed that it had much mathematical basis.)
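For reference, the Q+U selection score in an alpha-zero-general-style MCTS looks roughly like the sketch below. Qsa, Nsa, Ns and Ps mirror the repo's statistics dictionaries, but treat this as an illustration rather than the exact code:

```python
import math

def puct_score(Qsa, Nsa, Ns, Ps, s, a, cpuct):
    """Q + U score used to pick the next action at state s during tree descent."""
    if (s, a) in Qsa:
        # Visited edge: exploitation term Q plus an exploration bonus that
        # shrinks as the edge's visit count Nsa grows.
        return Qsa[(s, a)] + cpuct * Ps[s][a] * math.sqrt(Ns[s]) / (1 + Nsa[(s, a)])
    # Unvisited edge: Q is implicitly 0, so the score is driven entirely by the
    # prior P -- this is the situation discussed in this thread.
    return cpuct * Ps[s][a] * math.sqrt(Ns[s] + 1e-8)

# The action played in the tree is the argmax over legal a of puct_score(...);
# raising cpuct scales the prior-driven U term relative to Q, i.e. more exploration.
```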
-
I bumped into this issue when debugging my own implementation of AlphaZero for Hive, so I thought I would add a few comments for others finding this thread. As others pointed out, Q=0 for unvisited nodes can cause potentially good options to "starve" (never be visited). Initializing Q with v (the predicted value) is probably a good approach, but it doesn't work in the early stages of training, when v is still very unreliable. I find using cPUCT to control exploration tricky, because it also controls how much emphasis is put on the predicted Ps[a] as opposed to the results collected by the MCTS, so it can starve moves that were given a bad Ps[a] (an issue early on). It's also a lever affected by too many things (per @Timeroot's comment), so increasing it to get more exploration may hurt quality. By the way, in Hive, the game I'm writing a toy implementation for, it's very easy for weak players to reach a draw, with the moves looping indefinitely. So I need to set a maximum depth for the MCTS, at which point it returns the predicted "v" as the reward. So the predicted "v" is heavily used here anyway...
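A minimal sketch of that leaf handling, assuming the repo's game.getGameEnded(board, player) and nnet.predict(board) -> (pi, v) interfaces; max_depth is a hypothetical knob, not something in the original repo:

```python
def leaf_value(canonical_board, depth, game, nnet, max_depth=200):
    """Value to back up if this node should be treated as a leaf, else None.

    Sketch only: game.getGameEnded returns a nonzero result for a finished game
    (as in alpha-zero-general); max_depth caps loop-prone games like Hive.
    """
    result = game.getGameEnded(canonical_board, 1)
    if result != 0:
        return result            # real outcome: win/loss/draw from player 1's view
    if depth >= max_depth:
        _, v = nnet.predict(canonical_board)
        return v                 # depth cap reached: fall back on the value head
    return None                  # not a leaf: keep selecting/expanding
```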
-
Hello,
In MCTS.py, the upper bound is defined as follows for a state that hasn't been explored (line 107):
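The snippet in question looks roughly like this (paraphrased from MCTS.py; Qsa, Nsa, Ns, Ps and args.cpuct are the repo's own names, and the exact formatting may differ between versions):

```python
if (s, a) in self.Qsa:
    u = self.Qsa[(s, a)] + self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s]) / (
        1 + self.Nsa[(s, a)])
else:
    u = self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s])  # Q = 0 ?
```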
As commented, Q is assumed to be 0.
I think the Q function should be evaluated using the neural network, as follows:
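Something along these lines (a sketch of the idea rather than drop-in code; getNextState, getCanonicalForm and nnet.predict are the repo's Game/NeuralNet methods, and the sign convention on v would need checking):

```python
# Instead of assuming Q = 0 for an unexplored (s, a), estimate it with the value head:
next_board, next_player = self.game.getNextState(canonicalBoard, 1, a)
next_canonical = self.game.getCanonicalForm(next_board, next_player)
_, v = self.nnet.predict(next_canonical)   # value from the next player's perspective
u = -v + self.args.cpuct * self.Ps[s][a] * math.sqrt(self.Ns[s])
```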
In that case, the MCTS takes much longer because of the extra neural network evaluations.
I think you need to use the value network there anyway; otherwise, the value output by the neural network (v on line 78, for instance) would never be used in any part of the algorithm.
I came to that conclusion because I've trained the model for Connect 4 and it does not seem to improve: the arena always yields 0 wins / 0 losses / 40 draws, whereas it should be 20/20/0 under perfect play (because the first player can force a win). Using 0 as the state-action value could be responsible for that behavior.
Thanks!