
How to compute the gradient of reward w.r.t. model parameters? #8

Closed · c4cld opened this issue May 12, 2022 · 2 comments

@c4cld commented May 12, 2022

My research interest is the robustness of RL algorithms to environment parameters. I want to modify current RL algorithms so that they achieve good performance when tested in environments with unfamiliar parameters. (For example, an agent is trained in a CartPole environment with a 1 m pole; I want it to achieve good performance in a CartPole environment with a 3 m pole.)
To achieve this, I want to characterize the relationship between model parameter values and the RL algorithm's performance (reward). In other words, I want the gradient of the reward with respect to the model parameters.
I use the MuJoCo simulator in my experiments, but MuJoCo is not implemented in pure Python, so I cannot compute the gradient of the reward with respect to the model parameters there.
So my question is:
Can pytorch_kinematics compute the gradient of the reward with respect to the model parameters? If so, how should I use it to achieve this goal?

@LemonPi (Member) commented Feb 23, 2023

I'm not sure because:

  • PK is not a simulator because it only does kinematics (so no forces or velocities are considered)
  • I don't know what your reward is

If you're using RL to learn a control policy, the first point (that PK is not a simulator) may or may not matter, depending on whether your environment and problem are quasi-static. If they are quasi-static, then you could maybe use PK as the mapping from what your RL-learned model outputs to what you need in order to compute a reward function.

The second point depends on whether your reward depends on the output of kinematics. For example, the distance of a transformed link to some goal would be a reward/cost function that you could differentiate through with PK.

Specific to your cartpole environment, I think it's doable. If you formulate the pole length as a prismatic joint, then you can set requires_grad=True on the pole-length joint value, use it in forward kinematics, and compute the reward based on the output transform (e.g. the distance of the end of the cartpole link to a goal set).
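
A minimal sketch of that recipe (not from the original thread), assuming a hypothetical cartpole.urdf in which the pole length is exposed as a prismatic joint and the tip link is named pole_tip; the file name, joint ordering, and goal point are placeholders:

```python
import torch
import pytorch_kinematics as pk

# Build a serial chain from a URDF; "cartpole.urdf" and the joint/link names
# are placeholders for illustration, not files shipped with pytorch_kinematics.
chain = pk.build_serial_chain_from_urdf(open("cartpole.urdf").read(), "pole_tip")

# Joint values: [cart position, pole angle, pole length (prismatic)].
# requires_grad=True is what lets autograd reach the pole-length "model parameter".
th = torch.tensor([0.0, 0.1, 1.0], requires_grad=True)

# Forward kinematics returns the transform of the end link (a Transform3d).
tf = chain.forward_kinematics(th)
tip_pos = tf.get_matrix()[:, :3, 3]  # translation of the pole tip, shape (1, 3)

# A differentiable reward: negative distance of the tip to a goal point.
goal = torch.tensor([[0.0, 0.0, 3.0]])
reward = -torch.norm(tip_pos - goal)

# Gradient of the reward w.r.t. all joint values, including the pole length.
reward.backward()
print(th.grad)  # e.g. th.grad[2] is d(reward)/d(pole length)
```

Exposing the pole length as a joint value (rather than baking it into the link geometry) is what makes it visible to autograd, since PK differentiates through the joint values passed to forward_kinematics.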

@PeterMitrano (Contributor)
Closing since there hasn't been any follow-up.
