Memory Leak Issue Using PyTorch Profiler with PyTorch Lightning #20595
Unanswered · SinaTavakoli asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Description:
I've run into a significant issue while using PyTorch Lightning together with the PyTorch Profiler for resource-usage logging and monitoring during training. Roughly two hours into training, throughput noticeably drops, and I want to use TensorBoard to pinpoint what is causing the slowdown.
Problem:
With the PyTorch Profiler enabled, my system's RAM fills up completely over time, which looks like a memory leak. I tried to mitigate this with a profiling schedule and by setting a row limit, but neither resolved the problem.
Environment:
Here are the versions of the relevant libraries I'm currently using:
- pytorch-lightning: 2.5.0
- torch: 2.3.0
- torch-tb-profiler: 0.4.3
- torchaudio: 2.5.1
- torchmetrics: 1.6.1
- torchvision: 0.18.0

Configuration:
Below is the configuration I'm using for the trainer and profiler in PyTorch Lightning:
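(Note: the original configuration was not captured in this post; the sketch below illustrates this kind of setup, assuming a `PyTorchProfiler` with the schedule and row-limit mitigations mentioned above. Directory paths and schedule values are illustrative assumptions.)

```python
# Illustrative sketch only: a Trainer wired to a PyTorchProfiler using the
# schedule and row-limit mitigations mentioned above. Paths and schedule
# values are assumptions, not the exact original configuration.
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.profilers import PyTorchProfiler

profiler = PyTorchProfiler(
    dirpath="tb_logs/profiler",  # trace output directory read by torch-tb-profiler
    filename="profile",
    row_limit=20,  # cap the number of rows kept in the summary table
    # Extra keyword arguments are forwarded to torch.profiler.profile,
    # so a schedule can restrict profiling to a short window per cycle:
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
)

trainer = Trainer(max_epochs=10, profiler=profiler)
```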
Code Snippet:
Below is the code snippet where I'm utilizing PyTorch Lightning CLI:
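(The original snippet was likewise not captured; below is a minimal sketch of a standard LightningCLI entry point, assuming the model, datamodule, trainer, and profiler settings are resolved from a YAML config.)

```python
# Illustrative sketch of a LightningCLI entry point; the actual model,
# datamodule, and config layout from the original post are unknown.
from pytorch_lightning.cli import LightningCLI

def main():
    # Model, datamodule, trainer, and profiler settings are all taken from
    # the YAML config passed on the command line, e.g.:
    #   python main.py fit --config config.yaml
    LightningCLI()

if __name__ == "__main__":
    main()
```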
Request for Help:
I am seeking assistance or suggestions on how to effectively resolve this memory leak issue. Any insights or potential fixes from the community would be greatly appreciated.