DeepCFR bug fixes #248

Merged 2 commits into google-deepmind:master on Jun 23, 2020

Conversation

yellowbunchu
Contributor

I wasn't able to get a proper policy for Kuhn Poker before making these changes.

  1. Predicted advantages were being inserted into the advantage memory instead of the sampled regrets.
  2. The same information was inserted into the advantage memory multiple times (the insertion was in the wrong loop).
  3. Use multiple epochs to train the advantage networks and the policy network. I'm not sure whether there was a reason for doing only one pass?
  4. Reset the advantage network for the current player only (see the sketch below). This doesn't matter in two-player games but does in 3+ player games: otherwise every "other" player (except the one trained last) will always act randomly during the traversal.
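
To illustrate (4), here is roughly what I mean as a sketch (the method and attribute names below are made up for illustration, not the actual diff):

```python
# Illustrative sketch only; these attribute/method names are hypothetical.
def solve_one_iteration(self):
  for player in range(self._num_players):
    for _ in range(self._num_traversals):
      self._traverse_game_tree(self._root_node, player)
    # Reinitialize only this player's advantage net right before training it,
    # so the other players' nets keep their trained weights and do not act
    # randomly during the traversals.
    self._reinitialize_advantage_network(player)
    self._learn_advantage_network(player)
  self._iteration += 1
```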

# Store the sampled regret (not the predicted advantage) for this infostate.
# sampled_regret_arr is zero-initialized, so illegal actions keep a regret of 0.
sampled_regret_arr[action] = sampled_regret[action]
self._advantage_memories[player].add(
    AdvantageMemory(state.information_state_tensor(), self._iteration,
                    sampled_regret_arr, action))
Contributor Author

I am not so sure about this, as it sets the sampled regret (SR) for illegal actions to 0.

@EricSteinberger

Setting SR for illegal actions to 0 should be fine, I believe.

Collaborator

Thanks @EricSteinberger, that was my next question. Truly appreciate the quick responses!

@lanctot
Collaborator

lanctot commented Jun 18, 2020

Thanks! @rfaulkner can you look into this?

@rfaulkner
Collaborator

Sure I can take a look at this PR.

@lanctot self-assigned this Jun 23, 2020
@lanctot
Collaborator

lanctot commented Jun 23, 2020

Hi @aliyakamercan, given that in the example which runs Deep CFR on Kuhn poker you set:

  • policy_network_epochs to 400,
  • advantage_network_epochs to 20,

Shouldn't the defaults then be set to something other than 1? How did you choose these values (are they documented anywhere)?

Do these epochs come from the paper? What are they called in the paper? @noambrown, any advice on what sensible defaults are, or on how we should refer to these in terms of the paper?

@rfaulkner
Collaborator

+1 to Marc's comments, otherwise this looks good to me.

@lanctot
Collaborator

lanctot commented Jun 23, 2020

@aliyakamercan can you explain what these epochs are, preferably by aligning the terminology with that from the paper? From a quick look at the code, it seems to me like these epochs might correspond to iterations. Is that true?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Actually, I'm not sure they can be iterations, because how could each one have different values? Are these epochs the "number of training passes/batches per iteration"? Either way, I would really like us to (i) properly document what these are, and (ii) connect them to the part of the paper that mentions them, to allow for easier reproducibility and understanding.

I spoke briefly to Noam, and regarding hyperparameter settings he suggested we look at the implementation in the DREAM code: https://github.com/EricSteinberger/DREAM

@EricSteinberger, do you have any advice for us on these points?

@EricSteinberger

Hi guys! Epochs usually refer to the number of full training iterations of the NN on the full buffer. We don't look at it that way in either paper, so it's hard to tell what is meant here without digging deep into the codebase.

The DREAM codebase (and the isolated DeepCFR one https://github.com/EricSteinberger/Deep-CFR too) is aligned with the paper and has been used for the results in both SD-CFR and DREAM, so you might want to compare hyperparameters, as Noam suggested.

As Marc pointed out, it'd be great to align terminologies with the paper to answer any questions. If you have any specific technical questions about hyperparameters or the algorithms, I'm happy to help! Just post them below.

Cheers,
Eric

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Awesome, thanks @EricSteinberger for the quick response, the extra link, and open-sourcing all of your implementations!

As Marc pointed out, it'd be great to align terminologies with the paper to answer any questions. If you have any specific technical questions about hyperparameters or the algorithms, I'm happy to help! Just post them below.

Cool, yeah, that makes sense. By "full training iterations of the NN on the full buffer", do you literally mean the entire buffer, i.e. not a number of mini-batches sampled from it? As in, one big batch whose size is that of the buffer?

I think what we should aim for is the most basic approach that is aligned with the original Deep CFR. In that case, what do you advise we do? And are there any discussions of these implementation details in the original paper that we should cite?

@yellowbunchu
Contributor Author

  • policy_network_epochs to 400,
  • advantage_network_epochs to 20,

These are the numbers of training iterations for the NNs. I think I misused the name "epoch" here, since an epoch means a full training pass of the NN over the full buffer.

I think these are just called iterations in the paper; quoting section 5.2:

We allocate a maximum size of 40 million infosets to each player's advantage memory M_{V,p} and the strategy memory M_Π. The value model is trained from scratch each CFR iteration, starting from a random initialization. We perform 4,000 mini-batch stochastic gradient descent (SGD) iterations using a batch size of 10,000 and perform parameter updates using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001, with gradient norm clipping to 1. For HULH we use 32,000 SGD iterations and a batch size of 20,000. Figure 4 shows that training the model from [...]

I am not sure about batch sizes either. Do we train only on batch-size sampled entries, or on the whole buffer with batch size as a hyperparameter?

I am looking at @EricSteinberger's DREAM and Deep-CFR repos.
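
For what it's worth, here is how I currently read that recipe, as a rough sketch (the reset_parameters / sgd_step helpers are made up for illustration, not OpenSpiel's or Eric's API):

```python
import random

def train_value_net_for_iteration(net, advantage_memory,
                                  sgd_steps=4000, batch_size=10000):
  # Train the value/advantage net from scratch each CFR iteration: random
  # re-initialization, then a fixed number of mini-batch SGD/Adam steps with
  # batches sampled from the buffer (not full passes over it).
  net.reset_parameters()
  for _ in range(sgd_steps):
    batch = random.sample(advantage_memory,
                          min(batch_size, len(advantage_memory)))
    net.sgd_step(batch)
```

If that reading is right, batch size stays a hyperparameter and there is never a full pass over the buffer.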

@lanctot
Collaborator

lanctot commented Jun 23, 2020

Ok, I just looked more closely at the code. I think these are most often called "training steps" (or "learning steps"). In the case of algorithms that do other things in their iteration (e.g. a tree pass), I usually call this something like training steps per iteration or learning steps per iteration.

I just checked our RCFR test and it seems like @dmorrill10 also called it epochs :) Dustin, would you say the num_epochs in rcfr_test is equivalent to what I'm referring to as "training steps per iteration" ?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

One small preference: I would prefer to avoid also calling these iterations, since that term is already used for the algorithm's iterations. I'm not strongly for or against epochs vs. training_steps_per_iteration, as long as it's clear from the docstring. So I'm happy for anyone to try to convince me one way or the other. :)

@yellowbunchu
Contributor Author

I think training_steps_per_iteration makes more sense; I will make that change. What do you think about batch sizes?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

No need to change it, I can do this on my end (as I have already imported it).

Batch sizes are fine and the most common approach, and we can let people customize them as they want; that's what they did in the text you cited, too. So it's fine to leave that as you have it now.

@lanctot
Collaborator

lanctot commented Jun 23, 2020

@aliyakamercan what are the values that you are getting when running the example with those settings? Mine are still off:

Computed player 0 value: 0.0810588105365439
Expected player 0 value: -0.05555555555555555
Computed player 1 value: -0.0810588105365439
Expected player 1 value: 0.05555555555555555

I'm using num_iterations=500 and num_traversals=40. Seems to get worse when I increase the iterations.

@yellowbunchu
Contributor Author

I tried thrice with 400/40, getting

Computed player 0 value: -0.05581075681515041 / -0.04983329590598612 / -0.04533366139742423
Expected player 0 value: -0.05555555555555555 
Computed player 1 value:  0.05581075681515041 / 0.04983329590598612 / 0.04533366139742423
Expected player 1 value: 0.05555555555555555

Are you on the correct branch?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

My bad, sorry! I was accidentally running an internal test which we have set to 2 iterations and 2 traversals (just to ensure that it runs). The real values with 500/40 are:

Computed player 0 value: -0.036085575995136177
Expected player 0 value: -0.05555555555555555
Computed player 1 value: 0.036085575995136177
Expected player 1 value: 0.05555555555555555

Does this match yours? Hmm, ok, I will try 400/40. This is internal (after importing your PR).

Your numbers look good. Do they look the same using 500/40 ?

@lanctot
Collaborator

lanctot commented Jun 23, 2020

My three tries at 400/40:

Computed player 0 value: -0.05007666549999429 / -0.042776478582621014 / -0.054319129381460995
Expected player 0 value: -0.05555555555555555
Computed player 1 value: 0.05007666549999429 / 0.042776478582621014 / 0.054319129381460995
Expected player 1 value: 0.05555555555555555

Ok, certainly seems within noise. Happy with this!

@OpenSpiel merged commit 695fad0 into google-deepmind:master Jun 23, 2020
@lanctot
Collaborator

lanctot commented Jun 23, 2020

It's in, thanks again everyone.

@aliyakamercan, just a heads-up: I will be modifying this Deep CFR again soon. We need to remove sonnet to support Ubuntu 20.04. See details here: #249 and #166.

You can, of course, continue to use sonnet yourself, but the master branch will be removing its use until we can upgrade to TF2. We'll be replacing it with a custom TF1 implementation of linear nets that is very sonnet-like, so it shouldn't be too disruptive, I hope.

@dmorrill10
Contributor

dmorrill10 commented Jun 23, 2020

Ok, I just looked more closely at the code. I think these are most often called "training steps" (or "learning steps"). In the case of algorithms that do other things in their iteration (e.g. a tree pass), I usually call this something like training steps per iteration or learning steps per iteration.

I just checked our RCFR test and it seems like @dmorrill10 also called it epochs :) Dustin, would you say the num_epochs in rcfr_test is equivalent to what I'm referring to as "training steps per iteration" ?

@lanctot The way I was thinking about this before is: at the beginning of the iteration you have a policy evaluation / data-gathering / tree-walk step, and the data is added to (or forms) a training data buffer. You pass this buffer to the network training algorithm and it does epoch passes over the buffer. If the batch size is >= the size of the buffer, an epoch will consist of a single training step, but if the batch size is smaller, then multiple training steps will be taken to complete the epoch. These two procedures together form an "iteration" in the same sense as in tabular CFR. Specifying the number of training steps directly would be an alternative interface to the network training algorithm that probably makes more sense when you're dealing with really large buffers.
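
To make that bookkeeping concrete, the relationship is just the following (illustrative sketch only, not OpenSpiel code):

```python
import math

def training_steps(buffer_size, batch_size, num_epochs):
  # One epoch = one full pass over the buffer, i.e.
  # ceil(buffer_size / batch_size) SGD steps; if batch_size >= buffer_size,
  # an epoch is a single training step.
  return num_epochs * math.ceil(buffer_size / batch_size)

# E.g. a 40M-entry buffer with batch size 10,000 gives 4,000 steps per epoch:
print(training_steps(40_000_000, 10_000, num_epochs=1))  # -> 4000
```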

@lanctot
Collaborator

lanctot commented Jun 24, 2020

@aliyakamercan I found something quite odd tonight in the removal of sonnet.

It seems that the networks were never actually getting reinitialized. Can you comment out this line here: https://github.com/deepmind/open_spiel/blob/695fad0ac25383e7f66cb0bb30fa8a4ea07d6bb9/open_spiel/python/algorithms/deep_cfr.py#L264 and verify that the Deep CFR example works the same way as with it? That's what happens on my end when using the current code from master.

I thought I had a bug in my re-implementation of sonnet's MLP, but it turns out the code we have now is not actually running the initializers (I cannot explain why; I can't find documentation on sonnet's initializers property, nor any sign of it in the current sonnet code). When I do actually reinitialize these variables, it turns out the example becomes broken again (gives bad numbers), presumably because there are not enough training steps per iteration when training from scratch.

Anyway, I have added a reinitialize_advantage_networks flag to ensure that users can toggle this: 910f13e
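
For anyone reading along later, reinitializing a network's variables in TF1 boils down to something like the sketch below (the variable scope name is hypothetical; the actual wiring in deep_cfr.py differs):

```python
import tensorflow.compat.v1 as tf

def reinitialize_advantage_network(session, player):
  # Sketch only: assumes each player's advantage net was built under a
  # variable scope named like "advantage_net_<player>". Re-running the
  # initializers for just those variables resets that net while leaving the
  # other players' networks untouched.
  net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="advantage_net_{}".format(player))
  session.run(tf.variables_initializer(net_vars))
```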
