
[BUG] Memory leak? #845

Closed
vmoens opened this issue Jan 19, 2023 · 5 comments

vmoens (Contributor) commented Jan 19, 2023

Describe the bug

I’m addressing a “memory leak” issue, but I’m not sure it’s a real memory leak.
With Pong, the memory on the GPU I use for data collection keeps increasing.
The strange thing is that it correlates with training performance: the better the training, the higher the memory consumption. The obvious explanation (though not the full story) is that better performance <=> longer trajectories. Hence, for some reason, longer trajectories cause the memory to increase.
A few things could explain that:

  • some transform indeed has a memory leak that gets cleared by its reset method (unlikely)
  • the dataloader has a similar leak that gets cleared when calling reset (unlikely)
  • the most likely to me: the split_trajs option causes this. We essentially pad the values to fit all the trajectories in a [B x max_T] tensordict, where max_T is the maximum length of the trajectories collected. Now imagine you have 8 workers and a batch size of 128 elements per worker. Seven workers collect trajectories that are all shorter than 10 steps for a batch of length 128 (i.e. roughly 7 x 128 / 9 ≈ 100 small trajectories), and one of them collects a single long trajectory of length 128. split_trajs will then deliver a batch with B = 101 and max_T = 128, but roughly 90% of the values will be zeros (see the sketch below).
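To make the last point concrete, here is a minimal sketch (the trajectory lengths and frame shape below are made up to match the scenario above) of how little of the padded [B x max_T] allocation actually holds data:

```python
import torch

# Illustrative numbers matching the scenario above: 8 workers x 128 frames;
# 7 workers produce ~100 short trajectories of ~9 steps, 1 worker produces
# a single 128-step trajectory.
traj_lengths = [9] * 100 + [128]                   # 101 trajectories
B, max_T = len(traj_lengths), max(traj_lengths)    # B = 101, max_T = 128

# split_trajs-style padding: every trajectory is padded to max_T
frame_shape = (4, 84, 84)                          # e.g. stacked grayscale Atari frames
padded = torch.zeros(B, max_T, *frame_shape, dtype=torch.uint8)

useful = sum(traj_lengths) * 4 * 84 * 84           # elements that actually hold data
print(f"useful fraction: {useful / padded.numel():.1%}")   # ~8%, i.e. >90% zeros
```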

Possible solutions

The main thing that worried me, and made me use this trajectory splitting, was that feeding different trajectories sequentially may break some algorithms.
From my experiments with the advantage functions (TD0, TDLambda, GAE), only TDLambda suffers from this, and it is likely because we're not using the done flag appropriately (see the sketch below).
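For reference, a minimal sketch of what "using the done flag appropriately" means here: every bootstrapped term must be masked by the done flag so that values don't leak across trajectory boundaries. The recursion below is a generic GAE/TD(lambda)-style illustration, not the TorchRL implementation:

```python
import torch

def lambda_returns(reward, value, next_value, done, gamma=0.99, lmbda=0.95):
    """Backward TD(lambda)/GAE-style recursion over a single (possibly padded) trajectory.

    All arguments are [T] tensors; done=True marks steps where bootstrapping
    must stop so nothing leaks across trajectory boundaries.
    """
    not_done = (~done).float()
    advantage = torch.zeros_like(reward)
    running = torch.zeros(())
    for t in reversed(range(reward.shape[0])):
        delta = reward[t] + gamma * next_value[t] * not_done[t] - value[t]
        running = delta + gamma * lmbda * not_done[t] * running
        advantage[t] = running
    return advantage + value    # lambda-returns used as value targets
```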

  • Using nested tensors instead of padding (cc @matteobettini)
  • Keep split_trajs but turn it off by default, and fix the value functions to make them work in this context.
  • Refactor the dataloader devices: right now we can choose on which device the env sits and on which the policy sits. I'm not sure that really makes sense: when will the policy be so big that we can't transform data in the env with it? The logic was that, by default, the collected data would sit on the device of the env, not of the policy (so that long rollouts don't fill the GPU, one can put the env on CPU and gather the data there). What we could do instead is: the policy and the env both live on device, and passing_device (or whatever we call it) is the device where the data is dumped at each iteration (see the sketch after this list).
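A hypothetical sketch of what that would look like from the user's side. SyncDataCollector and GymEnv exist in TorchRL; the storing_device keyword is only an illustrative name for "the device where each collected batch is dumped", not necessarily the argument available when this issue was filed:

```python
from torchrl.collectors import SyncDataCollector
from torchrl.envs.libs.gym import GymEnv

collector = SyncDataCollector(
    create_env_fn=lambda: GymEnv("ALE/Pong-v5"),
    policy=None,                   # a real policy module would go here
    frames_per_batch=1024,
    total_frames=1_000_000,
    device="cuda:0",               # env + policy both execute here
    storing_device="cpu",          # rollout data lands here, keeping GPU memory flat
)

for data in collector:             # each `data` tensordict lives on the CPU
    ...
```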

Some more context

I tried calling gc.collect(), but with the Pong example I was running it didn't change anything.

@albertbou92 I know you had a similar issue; I'd be interested in your perspective on this.
@ShahRutav I believe that in your case split_trajs does not have an impact, so I doubt it is the cause of the problem. I'll keep digging.

@vmoens vmoens added the bug Something isn't working label Jan 19, 2023
@vmoens vmoens self-assigned this Jan 19, 2023
matteobettini (Contributor) commented Jan 19, 2023

Yeah, there are multiple alternatives to padding in the torch world, but none of them really has all the features we need. There are, for example, torch.sparse, torch.masked and torch.nested, all leveraging sparse memory representations in some way. nested is the one that best suits this type of application, but it has a list of missing features that we recapped in #777.
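For illustration, a minimal sketch of the nested route (torch.nested is still a prototype feature, and the missing operations are exactly the gap #777 recaps):

```python
import torch

# Variable-length trajectories stored without padding to max_T.
trajs = [torch.randn(9, 4), torch.randn(128, 4), torch.randn(7, 4)]
nt = torch.nested.nested_tensor(trajs)

print(nt.is_nested)                            # True: no [B, max_T] zero block is allocated
print([t.shape[0] for t in nt.unbind()])       # [9, 128, 7] -- lengths are preserved
# The catch: many ops we rely on (advanced indexing, some reductions,
# serialization) are still unsupported on nested tensors, see #777.
```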

This refactoring of split_trajs is related to the changes proposed in #828, which aim to make split_trajs work when there is more than one batch dimension and one of them is the agent dimension. So we might consider cross-fertilizing these two issues.

vmoens (Contributor, Author) commented Jan 19, 2023

Thanks @matteobettini.
Regarding the collector device(s), do you agree that we should only consider:

  • one device for execution (env + policy)
  • one device for storage of intermediate results?

matteobettini (Contributor) commented:
Yes, that makes sense.

albertbou92 (Contributor) commented Jan 19, 2023

If split_trajs can explain the extra GPU memory use, I would turn it off by default and make the value functions work in this context. Using the done flag should be enough in most cases.

Having one device for execution and one for results also makes sense to me.

vmoens (Contributor, Author) commented Jan 31, 2023

Closing this as:

  • the memory consumption may indeed increase with split_trajs=True; we will be changing that default in the future.
  • Other sources of memory increase could be tensordicts created for initialization or testing and left uncleared in the script. Pro-tip:
    • Clear your data or contain it in dedicated functions. E.g. initialize your modules in a dedicated function to ensure that no extra data is left in the main script. Contain your test rollouts + logging in a dedicated function for the same reason (see the sketch below).
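A minimal sketch of that tip (env, build_policy, and logger below are hypothetical placeholders): anything created inside these functions dies with the function scope instead of lingering as a global in the training script.

```python
import torch

def make_modules(env):
    # `build_policy` is a hypothetical builder; the throwaway reset/forward pass
    # used to initialize lazy layers is freed when this function returns.
    policy = build_policy(env)
    policy(env.reset())
    return policy

@torch.no_grad()
def evaluate(env, policy, logger, max_steps=1_000):
    # The test rollout (and anything derived from it) stays local to this function.
    td = env.rollout(max_steps, policy)
    logger.log_scalar("eval/traj_len", td.shape[-1])
    # `td` goes out of scope here, so evaluation data does not pile up on the GPU.
```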

@vmoens vmoens closed this as completed Jan 31, 2023