This small example is meant to illustrate my struggles and attempts at adapting TorchRL to multi-agent PPO with composite action spaces.

My pain points were:
- individual log-prob keys for the actions cause issues with stacking TensorDicts somewhere internally; aggregating them into one flat key works around it (first sketch after this list)
- distributions that are not natively multivariate need special handling when calculating log probs (second sketch)
- the PPO loss does not deal well with nested TensorDicts; the sample log-prob tensor has to be extracted into a flat key it can read (third sketch)
- the PettingZoo env wrapper does not properly split the action tensors when using dict action spaces (fourth sketch)
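
On the first point, here is a minimal sketch of the aggregation workaround I mean, assuming an actor that writes one log-prob entry per action component; the key names `("action", "move_log_prob")` and `("action", "target_log_prob")` are made up for illustration:

```python
import torch
from tensordict import TensorDict

# hypothetical per-action log-prob keys, as written by a composite actor
LOG_PROB_KEYS = [("action", "move_log_prob"), ("action", "target_log_prob")]

def aggregate_log_probs(td, keys=LOG_PROB_KEYS, out_key="sample_log_prob"):
    # sum the component log probs into one flat tensor and drop the
    # originals, so every tensordict ends up with the same key set
    td.set(out_key, sum(td.get(k) for k in keys))
    for k in keys:
        del td[k]
    return td

step_a = TensorDict(
    {
        ("action", "move_log_prob"): torch.randn(4),
        ("action", "target_log_prob"): torch.randn(4),
    },
    batch_size=[4],
)
step_b = step_a.clone()

# with identical flat keys, stacking no longer trips over the
# per-action log-prob entries
stacked = torch.stack([aggregate_log_probs(step_a), aggregate_log_probs(step_b)])
print(stacked["sample_log_prob"].shape)  # torch.Size([2, 4])
```

If I recall correctly, recent tensordict releases also expose a `set_composite_lp_aggregate` switch that does this aggregation for you; check whether your version has it before hand-rolling the above.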
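On the second point: a distribution like `Categorical` is not natively multivariate, so its `log_prob` returns one value per sub-action instead of one per sample, and the PPO ratio needs the latter. Wrapping it in `torch.distributions.Independent` sums over the event dims. This is plain PyTorch, with shapes chosen for illustration:

```python
import torch
from torch.distributions import Categorical, Independent

# batch of 8 samples, 3 independent sub-actions with 5 choices each
logits = torch.randn(8, 3, 5)

base = Categorical(logits=logits)
action = base.sample()              # shape [8, 3]
print(base.log_prob(action).shape)  # torch.Size([8, 3]) -- per sub-action

# Independent reinterprets the trailing batch dim as an event dim, so
# log_prob sums over the 3 sub-actions and returns one value per sample
dist = Independent(Categorical(logits=logits), reinterpreted_batch_ndims=1)
print(dist.log_prob(action).shape)  # torch.Size([8])
```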
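On the third point, a sketch of extracting the flat sample log-prob tensor before calling the loss; the key names follow the first sketch, and the `ClipPPOLoss` lines are left as comments since the actor/critic construction is omitted here:

```python
import torch
from tensordict import TensorDict

# illustrative nested log-prob keys; in a real run these come out of
# the composite actor rather than being built by hand
batch = TensorDict(
    {
        ("action", "move_log_prob"): torch.randn(32),
        ("action", "target_log_prob"): torch.randn(32),
    },
    batch_size=[32],
)

# collapse the nested entries into the single flat tensor the loss reads
batch["sample_log_prob"] = (
    batch["action", "move_log_prob"] + batch["action", "target_log_prob"]
)
print(batch["sample_log_prob"].shape)  # torch.Size([32])

# the loss can then be pointed at that key, roughly:
#   loss_module = ClipPPOLoss(actor_network=actor, critic_network=critic)
#   loss_module.set_keys(sample_log_prob="sample_log_prob")
#   loss_vals = loss_module(batch)
```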
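On the fourth point, one way around the wrapper is to split the flat action tensor by hand before stepping the env. The key path `("agents", "action")` and the component names/widths below are placeholders; the real layout has to come from the env's dict action space. The same logic could also live in a custom `Transform`, but a plain function keeps the sketch short:

```python
import torch
from tensordict import TensorDict

# hypothetical component layout of the flat action tensor
COMPONENTS = (("move", 4), ("target", 2))

def split_dict_action(td, key=("agents", "action"), components=COMPONENTS):
    # replace the flat action tensor with the per-component entries
    # that the dict action space expects
    flat = td[key]
    del td[key]
    start = 0
    for name, width in components:
        td[key + (name,)] = flat[..., start:start + width]
        start += width
    return td

td = TensorDict({("agents", "action"): torch.randn(3, 6)}, batch_size=[3])
split_dict_action(td)
print(td["agents", "action", "move"].shape)    # torch.Size([3, 4])
print(td["agents", "action", "target"].shape)  # torch.Size([3, 2])

# applied to the action tensordict just before env.step(td)
```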