DeepCFR bug fixes #248

Merged 2 commits into google-deepmind:master on Jun 23, 2020

Conversation

yellowbunchu
Contributor

I wasn't able to get a proper policy for Kuhn Poker before making these changes.

  1. Predicted advantages were being inserted into the advantage memory instead of the sampled regrets.
  2. The same information was inserted into the advantage memory multiple times (the insertion was in the wrong loop).
  3. Use multiple epochs to train the advantage networks and the policy network. I'm not sure whether there was a reason for doing only one pass?
  4. Reset the advantage network for the current player only (see the sketch below). This doesn't matter in two-player games but does in 3+ player games: otherwise every "other" player (except the one trained last) will always act randomly during the traversal.
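
To illustrate (4), here is roughly what I mean as a sketch (the method and attribute names below are made up for illustration, not the actual diff):

```python
# Illustrative sketch only; these attribute/method names are hypothetical.
def solve_one_iteration(self):
  for player in range(self._num_players):
    for _ in range(self._num_traversals):
      self._traverse_game_tree(self._root_node, player)
    # Reinitialize only this player's advantage net right before training it,
    # so the other players' nets keep their trained weights and do not act
    # randomly during the traversals.
    self._reinitialize_advantage_network(player)
    self._learn_advantage_network(player)
  self._iteration += 1
```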

# Store the sampled regret (not the predicted advantage) for this infostate.
# sampled_regret_arr is zero-initialized, so illegal actions keep a regret of 0.
sampled_regret_arr[action] = sampled_regret[action]
self._advantage_memories[player].add(
    AdvantageMemory(state.information_state_tensor(), self._iteration,
                    sampled_regret_arr, action))
Contributor Author

I am not so sure about this, as it sets the sampled regret (SR) for illegal actions to 0.

@EricSteinberger

Setting SR for illegal actions to 0 should be fine, I believe.

Collaborator

Thanks @EricSteinberger, that was my next question. Truly appreciate the quick responses!

@lanctot
Collaborator

lanctot commented Jun 18, 2020

Thanks! @rfaulkner can you look into this?

@rfaulkner
Collaborator

Sure I can take a look at this PR.

@lanctot self-assigned this Jun 23, 2020
@lanctot
Collaborator

lanctot commented Jun 23, 2020

Hi @aliyakamercan, given that in the example which runs Deep CFR on Kuhn poker you set:

  • policy_network_epochs to 400,
  • advantage_network_epochs to 20,

Shouldn't the defaults then be set to something other than 1? How did you choose these values (are they documented anywhere)?

Do these epochs come from the paper? What are they called in the paper? @noambrown, any advice on what sensible defaults are, or on how we should refer to these in terms of the paper?

@rfaulkner
Collaborator

+1 to Marc's comments, otherwise this looks good to me.

@lanctot
Collaborator

lanctot commented Jun 23, 2020

@aliyakamercan can you explain what these epochs are, preferably by aligning the terminology with that from the paper? From a quick look at the code, it seems to me like these epochs might correspond to iterations. Is that true?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Actually, I'm not sure they can be iterations, because how could each one have different values? Are these epochs the "number of training passes/batches per iteration"? Either way, I would really like us to (i) properly document what these are, and (ii) connect them to the part of the paper that mentions them, to allow for easier reproducibility and understanding.

I spoke briefly to Noam, and regarding hyperparameter settings he suggested we look at the implementation in the DREAM code: https://github.com/EricSteinberger/DREAM

@EricSteinberger, do you have any advice for us on these points?

@EricSteinberger

Hi guys! Epochs usually refer to the number of full training iterations of the NN on the full buffer. We don't look at it that way in either paper, so it's hard to tell what is meant here without digging deep into the codebase.

The DREAM codebase (and the isolated DeepCFR one https://github.com/EricSteinberger/Deep-CFR too) is aligned with the paper and has been used for the results in both SD-CFR and DREAM, so you might want to compare hyperparameters, as Noam suggested.

As Marc pointed out, it'd be great to align terminologies with the paper to answer any questions. If you have any specific technical questions about hyperparameters or the algorithms, I'm happy to help! Just post them below.

Cheers,
Eric

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Awesome, thanks @EricSteinberger for the quick response, the extra link, and open-sourcing all of your implementations!

As Marc pointed out, it'd be great to align terminologies with the paper to answer any questions. If you have any specific technical questions about hyperparameters or the algorithms, I'm happy to help! Just post them below.

Cool, yeah, that makes sense. By "full training iterations of the NN on the full buffer", do you literally mean the entire buffer, i.e. not a number of mini-batches sampled from it? As in, one big batch whose size is that of the buffer?

I think what we should aim for is the most basic approach that is aligned with the original Deep CFR. In that case, what do you advise we do? And are there any discussions of these implementation details in the original paper that we should cite?

@yellowbunchu
Contributor Author

  • policy_network_epochs to 400,
  • advantage_network_epochs to 20,

These are the numbers of training iterations for the NNs. I think I misused the name "epoch" here, since an epoch means a full training pass of the NN over the full buffer.

I think these are just called iterations in the paper; quoting section 5.2:

We allocate a maximum size of 40 million infosets to each player's advantage memory M_{V,p} and the strategy memory M_Π. The value model is trained from scratch each CFR iteration, starting from a random initialization. We perform 4,000 mini-batch stochastic gradient descent (SGD) iterations using a batch size of 10,000 and perform parameter updates using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001, with gradient norm clipping to 1. For HULH we use 32,000 SGD iterations and a batch size of 20,000. Figure 4 shows that training the model from [...]

I am not sure about batch sizes either. Do we train only on batch-size sampled entries, or on the whole buffer with batch size as a hyperparameter?

I am looking at @EricSteinberger's DREAM and Deep-CFR repos.
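
For what it's worth, here is how I currently read that recipe, as a rough sketch (the reset_parameters / sgd_step helpers are made up for illustration, not OpenSpiel's or Eric's API):

```python
import random

def train_value_net_for_iteration(net, advantage_memory,
                                  sgd_steps=4000, batch_size=10000):
  # Train the value/advantage net from scratch each CFR iteration: random
  # re-initialization, then a fixed number of mini-batch SGD/Adam steps with
  # batches sampled from the buffer (not full passes over it).
  net.reset_parameters()
  for _ in range(sgd_steps):
    batch = random.sample(advantage_memory,
                          min(batch_size, len(advantage_memory)))
    net.sgd_step(batch)
```

If that reading is right, batch size stays a hyperparameter and there is never a full pass over the buffer.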

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Ok, I just looked more closely at the code. I think these are most often called "training steps" (or "learning steps"). In the case of algorithms that do other things in their iteration (e.g. a tree pass), I usually call this something like training steps per iteration or learning steps per iteration.

I just checked our RCFR test and it seems like @dmorrill10 also called it epochs :) Dustin, would you say the num_epochs in rcfr_test is equivalent to what I'm referring to as "training steps per iteration" ?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

One small preference: I would prefer to avoid also calling these iterations, since that term is already used for the algorithm's iterations. I'm not strongly for or against epochs vs. training_steps_per_iteration, as long as it's clear from the docstring. So I'm happy for anyone to try to convince me one way or the other. :)

@yellowbunchu
Contributor Author

I think training_steps_per_iteration makes more sense; I will make that change. What do you think about batch sizes?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

No need to change it, I can do this on my end (as I have already imported it).

Batch sizes are fine and the most common approach, and we can let people customize them as they want; that's what they did in the text you cited, too. So it's fine to leave that as you have it now.

@lanctot
Collaborator

lanctot commented Jun 23, 2020

@aliyakamercan what are the values that you are getting when running the example with those settings? Mine are still off:

Computed player 0 value: 0.0810588105365439
Expected player 0 value: -0.05555555555555555
Computed player 1 value: -0.0810588105365439
Expected player 1 value: 0.05555555555555555

I'm using num_iterations=500 and num_traversals=40. Seems to get worse when I increase the iterations.

@yellowbunchu
Contributor Author

I tried thrice with 400/40, getting

Computed player 0 value: -0.05581075681515041 / -0.04983329590598612 / -0.04533366139742423
Expected player 0 value: -0.05555555555555555 
Computed player 1 value:  0.05581075681515041 / 0.04983329590598612 / 0.04533366139742423
Expected player 1 value: 0.05555555555555555

Are you on the correct branch?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

My bad, sorry! I was accidentally running an internal test which we have set to 2 iterations and 2 traversals (just to ensure that it runs). The real values with 500/40 are:

Computed player 0 value: -0.036085575995136177
Expected player 0 value: -0.05555555555555555
Computed player 1 value: 0.036085575995136177
Expected player 1 value: 0.05555555555555555

Does this match yours? Hmm, ok, I will try 400/40. This is internal (after importing your PR).

Your numbers look good. Do they look the same using 500/40 ?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

My three tries at 400/40:

Computed player 0 value: -0.05007666549999429 / -0.042776478582621014 / -0.054319129381460995
Expected player 0 value: -0.05555555555555555
Computed player 1 value: 0.05007666549999429 / 0.042776478582621014 / 0.054319129381460995
Expected player 1 value: 0.05555555555555555

Ok, certainly seems within noise. Happy with this!

@OpenSpiel merged commit 695fad0 into google-deepmind:master Jun 23, 2020
@lanctot
Collaborator

lanctot commented Jun 23, 2020

It's in, thanks again everyone.

@aliyakamercan, just a heads-up: I will be modifying this Deep CFR again soon. We need to remove sonnet to support Ubuntu 20.04. See details here: #249 and #166.

You can, of course, continue to use sonnet yourself, but the master branch will be removing its use until we can upgrade to TF2. We'll be replacing it with a custom TF1 implementation of linear nets that is very sonnet-like, so it shouldn't be too disruptive, I hope.

@dmorrill10
Contributor

dmorrill10 commented Jun 23, 2020

Ok, I just looked more closely at the code. I think these are most often called "training steps" (or "learning steps"). In the case of algorithms that do other things in their iteration (e.g. a tree pass), I usually call this something like training steps per iteration or learning steps per iteration.

I just checked our RCFR test and it seems like @dmorrill10 also called it epochs :) Dustin, would you say the num_epochs in rcfr_test is equivalent to what I'm referring to as "training steps per iteration" ?

@lanctot The way I was thinking about this before is: at the beginning of the iteration you have a policy evaluation / data-gathering / tree-walk step, and the data is added to (or forms) a training data buffer. You pass this buffer to the network training algorithm and it does epoch passes over the buffer. If the batch size is >= the size of the buffer, an epoch will consist of a single training step, but if the batch size is smaller, then multiple training steps will be taken to complete the epoch. These two procedures together form an "iteration" in the same sense as in tabular CFR. Specifying the number of training steps directly would be an alternative interface to the network training algorithm that probably makes more sense when you're dealing with really large buffers.
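
To make that bookkeeping concrete, the relationship is just the following (illustrative sketch only, not OpenSpiel code):

```python
import math

def training_steps(buffer_size, batch_size, num_epochs):
  # One epoch = one full pass over the buffer, i.e.
  # ceil(buffer_size / batch_size) SGD steps; if batch_size >= buffer_size,
  # an epoch is a single training step.
  return num_epochs * math.ceil(buffer_size / batch_size)

# E.g. a 40M-entry buffer with batch size 10,000 gives 4,000 steps per epoch:
print(training_steps(40_000_000, 10_000, num_epochs=1))  # -> 4000
```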

@lanctot
Collaborator

lanctot commented Jun 24, 2020

@aliyakamercan I found something quite odd tonight in the removal of sonnet.

It seems that the networks were never actually getting reinitialized. Can you comment out this line here: https://github.com/deepmind/open_spiel/blob/695fad0ac25383e7f66cb0bb30fa8a4ea07d6bb9/open_spiel/python/algorithms/deep_cfr.py#L264 and verify that the Deep CFR example works the same way as with it? That's what happens on my end when using the current code from master.

I thought I had a bug in my re-implementation of sonnet's MLP, but it turns out the code we have now is not actually running the initializers (I cannot explain why; I can't find documentation on sonnet's initializers property, nor any sign of it in the current sonnet code). When I do actually reinitialize these variables, it turns out the example becomes broken again (gives bad numbers), presumably because there are not enough training steps per iteration when training from scratch.

Anyway, I have added a reinitialize_advantage_networks flag to ensure that users can toggle this: 910f13e
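
For anyone reading along later, reinitializing a network's variables in TF1 boils down to something like the sketch below (the variable scope name is hypothetical; the actual wiring in deep_cfr.py differs):

```python
import tensorflow.compat.v1 as tf

def reinitialize_advantage_network(session, player):
  # Sketch only: assumes each player's advantage net was built under a
  # variable scope named like "advantage_net_<player>". Re-running the
  # initializers for just those variables resets that net while leaving the
  # other players' networks untouched.
  net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="advantage_net_{}".format(player))
  session.run(tf.variables_initializer(net_vars))
```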
