
[BUG] Memory leak? #845

Closed
vmoens opened this issue Jan 19, 2023 · 5 comments

vmoens (Contributor) commented Jan 19, 2023

Describe the bug

I’m addressing a “memory leak” issue, but I’m not sure it’s a real memory leak.
With Pong, the memory on the GPU I use for data collection keeps increasing.
The strange thing is that it correlates with training performance: the better the training, the higher the memory consumption. The obvious explanation (though not the full story) is that better performance <=> longer trajectories. Hence, for some reason, longer trajectories cause the memory to increase.
A few things could explain that:

  • some transform indeed has a memory leak that gets cleared by its reset method (unlikely)
  • the dataloader has a similar leak that gets cleared when calling reset (unlikely)
  • the most likely to me: the split_trajs option causes this. We essentially pad the values to fit all the trajectories in a [B x max_T] tensordict, where max_T is the maximum length of the trajectories collected. Now imagine you have 8 workers and a batch size of 128 elements per worker. Seven workers collect trajectories that are all shorter than 10 steps for a batch of length 128 (i.e. roughly 7 x 128 / 9 ≈ 100 small trajectories), and one of them collects a single long trajectory of length 128. split_trajs will then deliver a batch with B = 101 and max_T = 128, but roughly 90% of the values will be zeros (see the sketch below).
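To make the last point concrete, here is a minimal sketch (the trajectory lengths and frame shape below are made up to match the scenario above) of how little of the padded [B x max_T] allocation actually holds data:

```python
import torch

# Illustrative numbers matching the scenario above: 8 workers x 128 frames;
# 7 workers produce ~100 short trajectories of ~9 steps, 1 worker produces
# a single 128-step trajectory.
traj_lengths = [9] * 100 + [128]                   # 101 trajectories
B, max_T = len(traj_lengths), max(traj_lengths)    # B = 101, max_T = 128

# split_trajs-style padding: every trajectory is padded to max_T
frame_shape = (4, 84, 84)                          # e.g. stacked grayscale Atari frames
padded = torch.zeros(B, max_T, *frame_shape, dtype=torch.uint8)

useful = sum(traj_lengths) * 4 * 84 * 84           # elements that actually hold data
print(f"useful fraction: {useful / padded.numel():.1%}")   # ~8%, i.e. >90% zeros
```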

Possible solutions

The main thing that worried me, and made me use this trajectory splitting, was that feeding different trajectories sequentially may break some algorithms.
From my experiments with the advantage functions (TD0, TDLambda, GAE), only TDLambda suffers from this, and it is likely because we're not using the done flag appropriately (see the sketch below).
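For reference, a minimal sketch of what "using the done flag appropriately" means here: every bootstrapped term must be masked by the done flag so that values don't leak across trajectory boundaries. The recursion below is a generic GAE/TD(lambda)-style illustration, not the TorchRL implementation:

```python
import torch

def lambda_returns(reward, value, next_value, done, gamma=0.99, lmbda=0.95):
    """Backward TD(lambda)/GAE-style recursion over a single (possibly padded) trajectory.

    All arguments are [T] tensors; done=True marks steps where bootstrapping
    must stop so nothing leaks across trajectory boundaries.
    """
    not_done = (~done).float()
    advantage = torch.zeros_like(reward)
    running = torch.zeros(())
    for t in reversed(range(reward.shape[0])):
        delta = reward[t] + gamma * next_value[t] * not_done[t] - value[t]
        running = delta + gamma * lmbda * not_done[t] * running
        advantage[t] = running
    return advantage + value    # lambda-returns used as value targets
```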

  • Using nested tensors instead of padding (cc @matteobettini)
  • Keep split_trajs but turn it off by default, and fix the value functions to make them work in this context.
  • Refactor the dataloader devices: right now we can choose on which device the env sits and on which the policy sits. I'm not sure that really makes sense: when will the policy be so big that we can't transform data in the env with it? The logic was that, by default, the collected data would sit on the device of the env, not of the policy (so that long rollouts don't fill the GPU, one can put the env on CPU and gather the data there). What we could do instead is: the policy and the env both live on device, and passing_device (or whatever we call it) is the device where the data is dumped at each iteration (see the sketch after this list).
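A hypothetical sketch of what that would look like from the user's side. SyncDataCollector and GymEnv exist in TorchRL; the storing_device keyword is only an illustrative name for "the device where each collected batch is dumped", not necessarily the argument available when this issue was filed:

```python
from torchrl.collectors import SyncDataCollector
from torchrl.envs.libs.gym import GymEnv

collector = SyncDataCollector(
    create_env_fn=lambda: GymEnv("ALE/Pong-v5"),
    policy=None,                   # a real policy module would go here
    frames_per_batch=1024,
    total_frames=1_000_000,
    device="cuda:0",               # env + policy both execute here
    storing_device="cpu",          # rollout data lands here, keeping GPU memory flat
)

for data in collector:             # each `data` tensordict lives on the CPU
    ...
```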

Some more context

I tried calling gc.collect(), but with the Pong example I was running it didn't change anything.

@albertbou92 I know you had a similar issue; I'd be interested in your perspective on this.
@ShahRutav I believe that in your case split_trajs does not have an impact, so I doubt it is the cause of the problem. I'll keep digging.

@vmoens vmoens added the bug Something isn't working label Jan 19, 2023
@vmoens vmoens self-assigned this Jan 19, 2023
matteobettini (Contributor) commented Jan 19, 2023

Yeah, there are multiple alternatives to padding in the torch world, but none of them really has all the features we need. There are, for example, torch.sparse, torch.masked and torch.nested, all leveraging sparse memory representations in some way. nested is the one that best suits this type of application, but it has a list of missing features that we recapped in #777.
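For illustration, a minimal sketch of the nested route (torch.nested is still a prototype feature, and the missing operations are exactly the gap #777 recaps):

```python
import torch

# Variable-length trajectories stored without padding to max_T.
trajs = [torch.randn(9, 4), torch.randn(128, 4), torch.randn(7, 4)]
nt = torch.nested.nested_tensor(trajs)

print(nt.is_nested)                            # True: no [B, max_T] zero block is allocated
print([t.shape[0] for t in nt.unbind()])       # [9, 128, 7] -- lengths are preserved
# The catch: many ops we rely on (advanced indexing, some reductions,
# serialization) are still unsupported on nested tensors, see #777.
```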

This refactoring of split_trajs is related to the changes proposed in #828, which aim to make split_trajs work when there is more than one batch dimension and one of them is the agent dimension. So we might consider cross-fertilizing these two issues.

vmoens (Contributor, Author) commented Jan 19, 2023

Thanks @matteobettini.
Regarding the collector device(s), do you agree that we should only consider:

  • one device for execution (env + policy)
  • one device for storage of intermediate results?

matteobettini (Contributor) commented:
Yes, that makes sense.

albertbou92 (Contributor) commented Jan 19, 2023

If split_trajs can explain the extra GPU memory use, I would turn it off by default and make the value functions work in this context. Using the done flag should be enough in most cases.

Having one device for execution and one for results also makes sense to me.

vmoens (Contributor, Author) commented Jan 31, 2023

Closing this as:

  • the memory consumption may indeed increase with split_trajs=True; we will be changing that default in the future.
  • Other sources of memory increase could be tensordicts created for initialization or testing and left uncleared in the script. Pro-tip:
    • Clear your data or contain it in dedicated functions. E.g. initialize your modules in a dedicated function to ensure that no extra data is left in the main script. Contain your test rollouts + logging in a dedicated function for the same reason (see the sketch below).
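A minimal sketch of that tip (env, build_policy, and logger below are hypothetical placeholders): anything created inside these functions dies with the function scope instead of lingering as a global in the training script.

```python
import torch

def make_modules(env):
    # `build_policy` is a hypothetical builder; the throwaway reset/forward pass
    # used to initialize lazy layers is freed when this function returns.
    policy = build_policy(env)
    policy(env.reset())
    return policy

@torch.no_grad()
def evaluate(env, policy, logger, max_steps=1_000):
    # The test rollout (and anything derived from it) stays local to this function.
    td = env.rollout(max_steps, policy)
    logger.log_scalar("eval/traj_len", td.shape[-1])
    # `td` goes out of scope here, so evaluation data does not pile up on the GPU.
```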

@vmoens vmoens closed this as completed Jan 31, 2023