DQN can't find a good policy #11
Sorry, I cannot open the figures in your Dropbox. Probably you did not make them publicly accessible. If possible, please post them directly here or send them to my email qiuhua dot huang at pnnl dot gov. Is the result based on only one random seed? You may also try different random seeds; it can make a huge difference.
I went through your code and results. The input and configuration files (*.raw and *.json) and the NN structure are different from our original testing code. I would suggest changing them to match our original testing code, because we don't know the performance for other combinations/settings. In addition, at least 3 random seeds should be tried.
Thank you! I initially tried the code with your original settings (raw and json files, NN structure), and only afterwards began to change these settings. However, I will repeat the runs more carefully with the original settings.
Do you mean trying different values of np.random.seed()?
Set the 'seed' parameter of the DQN class, as described at https://stable-baselines.readthedocs.io/en/master/modules/dqn.html
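A minimal sketch of training with several explicit seeds, assuming the Stable-Baselines v2 API and an already constructed, Gym-compatible RLGC environment `env` (the save-file names are made up):

```python
from stable_baselines import DQN
from stable_baselines.deepq.policies import MlpPolicy

# Train with at least 3 different random seeds and compare the learning curves.
for seed in (0, 1, 2):
    # 'seed' is passed to DQN so that TensorFlow, numpy and the env are seeded consistently.
    model = DQN(MlpPolicy, env, verbose=1, seed=seed)
    model.learn(total_timesteps=900000)
    model.save("dqn_kundur_seed{}".format(seed))
```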
I have repeated the training of the Kundur system using the original settings. I used Stable-Baselines (DQN agent) instead of the OpenAI Baselines implementation.
However, I got the same result: for some reason the DQN agent cannot get past a mean reward of about 603. https://photos.app.goo.gl/SSJyQQsA3vDhz1nt7 I then ran your full original testing code with the OpenAI Baselines DQN model, but got the same "~603 problem" policy. An example case from the training log: Case id: 0, Fault bus id: Bus3, fault start time: 1.000000, fault duration: 0.585000.
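(As an aside, not part of the original run: a sketch of how the mean-reward curve could be logged and inspected, assuming the Stable-Baselines v2 `Monitor` wrapper and an already constructed RLGC environment `env`; the log directory name is made up.)

```python
import os
import numpy as np
from stable_baselines import DQN
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy

log_dir = "./dqn_kundur_logs/"
os.makedirs(log_dir, exist_ok=True)

env = Monitor(env, log_dir)                      # logs per-episode rewards to a monitor.csv file
model = DQN('MlpPolicy', env, verbose=1, seed=0)
model.learn(total_timesteps=900000)

# Rolling mean of episode rewards; a flat curve near the ~603 value reported above
# would reproduce the plateau shown in the linked plot.
timesteps, ep_rewards = ts2xy(load_results(log_dir), 'timesteps')
window = 100
rolling_mean = np.convolve(ep_rewards, np.ones(window) / window, mode='valid')
print(rolling_mean[-10:])
```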
Yes, I still need your help on this issue.
Hi, I believe we did not correctly commit one RL training configuration file. Please use this updated one: https://github.com/RLGC-Project/RLGC/blob/master/testData/Kundur-2area/json/kundur2area_RL_config_multiStepObsv.json The settings correspond to our paper.
I tried the DQN agent training with these settings; however, I again ran into the "~603 problem" policy.
I think that, with your simulation settings, the duration of the short circuits is not long enough to cause loss of stability (penalty = -1000). Therefore, the agent chooses a no-action policy, which is probably what the "~603 problem" corresponds to. In that case, the RL agent has no incentive to find a better policy that reduces the negative rewards.
@qhuang-pnl, I may have found a bug that explains why, during training, the agent cannot get past the reward boundary of about -602. The point is that during both training and testing in the environment (the Kundur two-area system), short circuits are not actually simulated; I checked this. In other words, the agent learns purely on the normal operating conditions of the system, and in that case the optimal policy is to never apply the dynamic brake, i.e. the action is always 0. I suspect it has something to do with the PowerDynSimEnvDef modifications: you originally used PowerDynSimEnvDef_v2, while I am working with PowerDynSimEnvDef_v7. A quick way to check this is sketched below.
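A hypothetical check, assuming the environment follows the standard Gym interface and that action 0 means "do not apply the dynamic brake" (`env` is assumed to be an already constructed RLGC environment):

```python
import numpy as np

# Roll out one episode with the no-action policy and inspect the per-step rewards.
# If a short circuit is really applied, a fault/instability penalty (e.g. -1000)
# should show up somewhere; if not, the rewards stay at the "normal operation" level.
obs = env.reset()
done = False
total_reward = 0.0
step_rewards = []
while not done:
    obs, reward, done, info = env.step(0)   # action 0 = never apply the dynamic brake
    step_rewards.append(reward)
    total_reward += reward

print("episode reward:", total_reward)
print("min step reward:", np.min(step_rewards))
```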
Following your advice, I switched to Stable-Baselines instead of OpenAI Baselines for the Kundur system training.
However, after 900,000 training steps the DQN agent still cannot find a good policy. Please see the average reward progress plot:
https://www.dropbox.com/preview/DQN_adaptivenose.png?role=personal
I used the following env settings
My suggestion is that in the baseline scenario
kunder_2area_ver30.raw
(without system loading), a short circuit might not lead to loss of stability during the simulation. Therefore, the DQN agent (perhaps) settles on a "no action" policy so as not to incur the actionPenalty = 2.0. According to the reward progress plot, during training the agent cannot find a policy better than a mean reward of 603.05, and when testing, mean_reward = 603.05 corresponds to the "no action" policy (please see the figure below): https://www.dropbox.com/preview/no%20actions%20case.png?role=personal
However, this is only my guess; I could be wrong. I was thinking of trying scenarios with increased load in order to reliably cause loss of stability during the simulation.
Originally posted by @frostyduck in #9 (comment)