DeepCFR bug fixes #248
Conversation
sampled_regret_arr[action] = sampled_regret[action]
self._advantage_memories[player].add(
    AdvantageMemory(state.information_state_tensor(), self._iteration,
                    sampled_regret_arr, action))
I am not so sure about this, as it sets the sampled regret (SR) for illegal actions to 0.
Setting SR for illegal actions to 0 should be fine, I believe.
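For context, here is a minimal, self-contained sketch of the behavior under discussion; the names num_actions, legal_actions, and sampled_regret are illustrative assumptions, not the PR's exact code. Sampled regret is computed per legal action, and the array entries for illegal actions are simply left at 0:

```python
# Illustrative sketch (not the PR's exact code): regrets are computed for
# legal actions only, and illegal-action entries stay at their zero default.
import numpy as np

num_actions = 3
legal_actions = [0, 2]               # action 1 is illegal in this state
sampled_regret = {0: 0.25, 2: -0.1}  # regret computed per legal action

sampled_regret_arr = np.zeros(num_actions)  # illegal actions stay at 0
for action in legal_actions:
    sampled_regret_arr[action] = sampled_regret[action]

print(sampled_regret_arr)  # -> [ 0.25  0.   -0.1 ]
```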
Thanks @EricSteinberger, that was my next question. Truly appreciate the quick responses!
Thanks! @rfaulkner, can you look into this?
Sure, I can take a look at this PR.
Hi @aliyakamercan, given that in the example which runs Deep CFR on Kuhn poker you set:
Shouldn't the defaults then be something other than 1? How did you choose these values (are they documented anywhere)? Do these epochs come from the paper? What are they called in the paper? @noambrown, any advice on what sensible defaults are, or on how we should refer to these from the paper?
+1 to Marc's comments, otherwise this looks good to me.
@aliyakamercan can you explain what these epochs are, preferably by aligning the terminology with that from the paper? From a quick look at the code, it seems to me like these epochs might correspond to iterations. Is that true?
Actually, I'm not sure they can be iterations, because how could each one have different values? Are these epochs "number of training passes/batches per iteration"? Either way, I would really like us to (i) document what these are properly, and (ii) connect them to the part of the paper that mentions them, to allow for easier reproducibility and understanding. I spoke briefly to Noam, and regarding hyperparameter settings he suggested we look at the implementation in the DREAM code: https://github.com/EricSteinberger/DREAM. @EricSteinberger, do you have any advice for us on these points?
Hi guys! Epochs usually refer to the number of full training iterations of the NN on the full buffer. We don't look at it that way in either paper, so it's hard to tell what is meant here without digging deep into the codebase. The DREAM codebase (and the isolated Deep CFR one, https://github.com/EricSteinberger/Deep-CFR, too) is aligned with the paper and has been used for the results in both SD-CFR and DREAM, so you might want to compare hyperparameters, as Noam suggested. As Marc pointed out, it'd be great to align terminologies with the paper to answer any questions. If you have any specific technical questions about hyperparameters or the algorithms, I'm happy to help! Just post them below. Cheers!
Awesome, thanks @EricSteinberger for the quick response, the extra link, and open-sourcing all of your implementations!
Cool, yeah, that makes sense. By "full training iterations of the NN on the full buffer", do you literally mean the entire buffer, e.g. not the number of mini-batches sampled from it? As in, one big batch whose size is that of the buffer? I think what we should aim for is the most basic approach that is aligned with the original Deep CFR. In that case, what do you advise we do? And are there any discussions of these implementation details in the original paper that we should cite?
These are the number of training iterations on the NNs. I think I misused the name "epoch" here, as it means full training iterations of the NN on the full buffer. I think these are just called iterations in the paper; quoting section 5.2:
I am not sure about batch sizes either. Do we train only on sampled batch-size entries, or on the whole buffer with batch size as a hyperparameter? I am looking at @EricSteinberger's DREAM and Deep-CFR repos.
Ok, I just looked more closely at the code. I think these are most often called "training steps" (or "learning steps"). In the case of algorithms that do other things in their iteration (e.g., a tree pass), I usually call this something like "training steps per iteration" or "learning steps per iteration". I just checked our RCFR test, and it seems like @dmorrill10 also called it epochs :) Dustin, would you say the num_epochs in rcfr_test is equivalent to what I'm referring to as "training steps per iteration"?
One small preference: I would prefer to avoid also calling these iterations, since that term is also used for the algorithm's iterations. I'm not strongly in favor of nor against epochs vs. training_steps_per_iteration, as long as it's clear from the doc string. So I'm happy for anyone to convince me one way or the other. :)
I think training_steps_per_iteration is clearer; I can change it.
No need to change it, I can do this on my end (as I have already imported it). Batch sizes are fine and the most common approach, and we can let people customize these as they want: that's what they did in the text you cited, too. So it's fine to leave that as you have it now.
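To make the terminology settled on above concrete, here is a minimal sketch; buffer and train_on_batch are illustrative stand-ins, not open_spiel's API. An "epoch" is one full pass over the buffer, while "training steps per iteration" is a fixed number of sampled mini-batch updates:

```python
# Illustrative sketch of "epoch" vs. "training step per iteration".
import numpy as np

buffer = [np.random.rand(4) for _ in range(1000)]  # stand-in for a memory
batch_size = 128

def train_on_batch(batch):
    """Placeholder for one gradient update on a mini-batch."""
    pass

def run_epoch():
    """One epoch: every buffer entry is seen exactly once."""
    order = np.random.permutation(len(buffer))
    for start in range(0, len(buffer), batch_size):
        train_on_batch([buffer[i] for i in order[start:start + batch_size]])

def run_training_steps(num_steps):
    """Fixed number of sampled mini-batch updates, independent of buffer size."""
    for _ in range(num_steps):
        idx = np.random.choice(len(buffer), batch_size, replace=False)
        train_on_batch([buffer[i] for i in idx])
```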
@aliyakamercan what are the values that you are getting when running the example with those settings? Mine are still off:
I'm using num_iterations=500 and num_traversals=40. Seems to get worse when I increase the iterations.
I tried three times with 400/40, getting:
Are you on the correct branch?
My bad, sorry! I was accidentally running an internal test which we have set to 2 iterations and 2 traversals (just to ensure that it runs). The real values with 500/40 are:
Your numbers look good. Do they look the same using 500/40?
My three tries at 400/40:
Ok, certainly seems within noise. Happy with this!
It's in, thanks again everyone. @aliyakamercan, just a heads-up: I will be modifying this Deep CFR again soon. We need to remove sonnet to support Ubuntu 20.04. See details here: #249 and #166. You can, of course, continue to use sonnet yourself, but the master branch will be removing its use until we can upgrade to TF2. We'll be replacing it with a custom TF1 implementation of linear nets that is very sonnet-like, so it shouldn't be too disruptive, I hope.
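As a rough idea of what a sonnet-like MLP in plain TF1 could look like (this is an assumption about the flavor of the replacement, not the actual open_spiel code):

```python
# Hypothetical sketch of a sonnet-style module in plain TF1: layers live in
# a named variable scope and the module is reusable across calls.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

class SimpleMLP(object):
    """Sonnet-style module: configured once, callable on tensors."""

    def __init__(self, output_sizes, name="mlp"):
        self._output_sizes = output_sizes
        self._name = name

    def __call__(self, inputs):
        net = inputs
        with tf.variable_scope(self._name, reuse=tf.AUTO_REUSE):
            for i, size in enumerate(self._output_sizes[:-1]):
                net = tf.layers.dense(net, size, activation=tf.nn.relu,
                                      name="dense_%d" % i)
            return tf.layers.dense(net, self._output_sizes[-1],
                                   name="dense_out")

# Example usage: mlp = SimpleMLP([64, 64, 2]); logits = mlp(some_input_tensor)
```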
@lanctot The way I was thinking about this before is: at the beginning of the iteration you have a policy evaluation/data-gathering/tree-walk step, and data is added to (or forms) a training data buffer. You pass this buffer to the network training algorithm, and it does some number of training steps on it.
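A sketch of that iteration structure, with all function and parameter names as hypothetical stand-ins rather than open_spiel's API:

```python
# Hypothetical sketch: each Deep CFR iteration gathers data via tree
# traversals, then runs a fixed number of training steps on the buffer.
def traverse_game_tree_and_fill_buffer():
    pass  # stand-in: one sampled traversal adding to the memories

def sample_minibatch_and_update_network():
    pass  # stand-in: one mini-batch gradient step on the buffer

def deep_cfr_iteration(num_traversals, training_steps_per_iteration):
    # Phase 1: policy evaluation / data gathering (the tree-walk step).
    for _ in range(num_traversals):
        traverse_game_tree_and_fill_buffer()
    # Phase 2: a fixed number of network training steps on that buffer.
    for _ in range(training_steps_per_iteration):
        sample_minibatch_and_update_network()

deep_cfr_iteration(num_traversals=40, training_steps_per_iteration=32)  # illustrative values
```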
@aliyakamercan I found something quite odd tonight in the removal of sonnet. It seems that the networks were never actually getting reinitialized. Can you comment out this line here: https://github.com/deepmind/open_spiel/blob/695fad0ac25383e7f66cb0bb30fa8a4ea07d6bb9/open_spiel/python/algorithms/deep_cfr.py#L264 and verify that the Deep CFR example works the same way as with it? That's what happens on my end when using the current code from master. I thought I had a bug in my re-implementation of sonnet's MLP, but it turns out the code we have now is not actually running the initializers (I cannot explain why; I can't find documentation on sonnet's behavior here). Anyway, I have added a reinitialize_advantage_networks flag to ensure that users can toggle this: 910f13e
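For illustration, a TF1-style sketch of explicitly reinitializing an advantage network's variables at the start of each iteration, which is what the new flag is meant to control; the scope name, variable, and session handling here are assumptions, not the PR's exact code:

```python
# Hypothetical sketch: re-run the initializers of one player's advantage
# network at the start of each Deep CFR iteration.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

with tf.variable_scope("advantage_net_player0"):  # assumed scope name
    w = tf.get_variable("w", shape=[4, 2],
                        initializer=tf.random_normal_initializer())

adv_net_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                 scope="advantage_net_player0")
reinit_op = tf.variables_initializer(adv_net_vars)

reinitialize_advantage_networks = True  # mirrors the flag added in 910f13e

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for iteration in range(3):
        if reinitialize_advantage_networks:
            # Without this, weights would carry over across iterations.
            session.run(reinit_op)
        # ... train the advantage network for this iteration ...
```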
I wasn't able to get a proper policy for Kuhn Poker before making these changes.