Memory Leak Issue Using PyTorch Profiler with PyTorch Lightning #20595
Unanswered · SinaTavakoli asked this question in Lightning Trainer API: Trainer, LightningModule, LightningDataModule
Description:
I've run into a significant issue while using PyTorch Lightning together with the PyTorch Profiler for resource-usage logging and monitoring during training. Roughly two hours into training, throughput noticeably drops, and I want to use TensorBoard to pinpoint what is causing the slowdown.
Problem:
With the PyTorch Profiler enabled, my system's RAM fills up completely over time, which looks like a memory leak. I tried to mitigate this with a profiling schedule and by setting a row limit, but neither resolved the problem.
Environment:
Here are the versions of the relevant libraries I'm currently using:
- pytorch-lightning: 2.5.0
- torch: 2.3.0
- torch-tb-profiler: 0.4.3
- torchaudio: 2.5.1
- torchmetrics: 1.6.1
- torchvision: 0.18.0

Configuration:
Below is the configuration I'm using for the trainer and profiler in PyTorch Lightning:
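(Note: the original configuration was not captured in this post; the sketch below illustrates this kind of setup, assuming a `PyTorchProfiler` with the schedule and row-limit mitigations mentioned above. Directory paths and schedule values are illustrative assumptions.)

```python
# Illustrative sketch only: a Trainer wired to a PyTorchProfiler using the
# schedule and row-limit mitigations mentioned above. Paths and schedule
# values are assumptions, not the exact original configuration.
import torch
from pytorch_lightning import Trainer
from pytorch_lightning.profilers import PyTorchProfiler

profiler = PyTorchProfiler(
    dirpath="tb_logs/profiler",  # trace output directory read by torch-tb-profiler
    filename="profile",
    row_limit=20,  # cap the number of rows kept in the summary table
    # Extra keyword arguments are forwarded to torch.profiler.profile,
    # so a schedule can restrict profiling to a short window per cycle:
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
)

trainer = Trainer(max_epochs=10, profiler=profiler)
```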
Code Snippet:
Below is the code snippet where I'm utilizing PyTorch Lightning CLI:
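(The original snippet was likewise not captured; below is a minimal sketch of a standard LightningCLI entry point, assuming the model, datamodule, trainer, and profiler settings are resolved from a YAML config.)

```python
# Illustrative sketch of a LightningCLI entry point; the actual model,
# datamodule, and config layout from the original post are unknown.
from pytorch_lightning.cli import LightningCLI

def main():
    # Model, datamodule, trainer, and profiler settings are all taken from
    # the YAML config passed on the command line, e.g.:
    #   python main.py fit --config config.yaml
    LightningCLI()

if __name__ == "__main__":
    main()
```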
Request for Help:
I am seeking assistance or suggestions on how to effectively resolve this memory leak issue. Any insights or potential fixes from the community would be greatly appreciated.