Observing GPU memory and/or CPU OS memory leaks with use_persistent_mapping enabled in gdrdrv during multi-process termination #313

Open
realarnavgoel opened this issue Jan 10, 2025 · 0 comments
@realarnavgoel
Collaborator

Impacted platform
All server-side products; first observed on a Grace Hopper based system

Impacted gdrcopy versions
2.4.1, 2.4.2, 2.4.3

Impacted gdrcopy configs
gdrdrv driver loaded with the module parameter use_persistent_mapping=1

Scenarios
When gdrcopy persistent mapping mode is enabled:

  1. If one process opens a connection to the driver (via gdr_open) and intends to explicitly share that connection (over a UNIX domain socket) with one or more other processes, then the cleanup of the driver resources (via gdr_close) may be executed by one of the non-owning processes. That request is silently ignored, leading to CPU and GPU memory leaks.

  2. If a parent process A forks one or more child processes B (using fork alone rather than Linux fork + exec), then B may attempt to close connections opened by A during an ungraceful termination of the processes via signals (SIGSEGV or SIGKILL), resulting in OS and GPU memory leaks.

By default, with persistent mode disabled, GPU resource cleanup in both scenarios is performed through an independent workflow in the CUDA driver, so dropping the request to close the connection is benign.

Irrespective of persistent mode, this bug may lead to small CPU kernel memory leaks.
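
To make the interaction between the two scenarios and the persistent mapping setting concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not the gdrdrv implementation: `ToyConnection`, `cuda_teardown`, `leaked`, the PIDs, and the byte counts are all hypothetical names and values chosen for illustration. The one rule it encodes is that a close request from a non-owning process is silently dropped.

```python
from dataclasses import dataclass


@dataclass
class ToyConnection:
    """Hypothetical stand-in for one driver connection (not the gdrdrv code)."""
    owner_pid: int
    persistent_mapping: bool
    gpu_bytes: int = 65536   # GPU memory pinned through this connection (made-up size)
    kernel_bytes: int = 256  # small CPU-side bookkeeping (made-up size)

    def close(self, caller_pid: int) -> None:
        # A close request from a non-owning process is silently ignored.
        if caller_pid != self.owner_pid:
            return
        self.gpu_bytes = 0
        self.kernel_bytes = 0

    def cuda_teardown(self) -> None:
        # Models the independent CUDA driver cleanup path, which in this toy
        # model reclaims GPU resources only when persistent mapping is off.
        if not self.persistent_mapping:
            self.gpu_bytes = 0


def leaked(persistent_mapping: bool, closer_pid: int, owner_pid: int = 100):
    """Return (gpu_bytes, kernel_bytes) still held after close + teardown."""
    conn = ToyConnection(owner_pid=owner_pid, persistent_mapping=persistent_mapping)
    conn.close(closer_pid)
    conn.cuda_teardown()
    return conn.gpu_bytes, conn.kernel_bytes
```

In this model, `leaked(True, closer_pid=200)` leaves both GPU and kernel bytes held, `leaked(False, closer_pid=200)` leaves only the small kernel allocation, and a close by the owner (`closer_pid=100`) leaves nothing behind in either mode.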

Signature of the defect

  • On coherent platforms, e.g. Grace Hopper systems, GPU memory leaks can lead to unexpected side effects. For example, turning off the nvidia-persistenced service may hang, requiring rebooting the machine.
  • On non-coherent platforms, GPU memory leaks may reduce the functionality or performance of CUDA applications.

Known mitigations
Turn off persistent mapping by setting the driver module parameter use_persistent_mapping=0 and reloading the driver.
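
As a rough sketch of what the mitigation could look like on a system where gdrdrv is loaded through modprobe, the helper below renders a standard modprobe.d `options` line and the matching reload commands. The assumption that modprobe is the loading mechanism is mine, not stated in this issue; if your installation loads the module some other way, adapt accordingly. The function names are hypothetical, and the helper only builds strings; it does not touch the system.

```python
def modprobe_options_line(module: str = "gdrdrv",
                          param: str = "use_persistent_mapping",
                          value: int = 0) -> str:
    """Render a modprobe.d 'options' line that pins the parameter value."""
    return f"options {module} {param}={value}"


def reload_commands(module: str = "gdrdrv",
                    param: str = "use_persistent_mapping",
                    value: int = 0) -> list:
    """Render (not run) the commands to unload and reload the module.

    Assumes the module is managed with modprobe; both commands need root.
    """
    return [
        f"modprobe -r {module}",
        f"modprobe {module} {param}={value}",
    ]


if __name__ == "__main__":
    print(modprobe_options_line())
    for command in reload_commands():
        print(command)
```

Unloading the module requires that no process is still using it, so applications that use gdrcopy would need to be stopped first.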

Fixed gdrcopy version
2.4.4
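
For deployment scripts that need to decide whether the mitigation applies, a minimal version check based only on the versions listed in this issue might look like the following. The helper names are hypothetical, and how you obtain the installed version string is left to the caller. Versions not listed here are not classified as affected by this sketch, since the issue names only 2.4.1, 2.4.2, and 2.4.3.

```python
# Versions taken from this issue; nothing else is assumed about other releases.
AFFECTED_VERSIONS = {(2, 4, 1), (2, 4, 2), (2, 4, 3)}
FIXED_VERSION = (2, 4, 4)


def parse_version(text: str) -> tuple:
    """Turn '2.4.3' or 'v2.4.3' into a tuple of ints, e.g. (2, 4, 3)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def is_listed_as_affected(version_text: str) -> bool:
    """True only for the versions this issue names as impacted."""
    return parse_version(version_text) in AFFECTED_VERSIONS


def has_fix(version_text: str) -> bool:
    """True for the fixed version named in this issue, or anything newer."""
    return parse_version(version_text) >= FIXED_VERSION
```

A script could, for example, apply use_persistent_mapping=0 only when `is_listed_as_affected(installed_version)` is true.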

@realarnavgoel realarnavgoel self-assigned this Jan 10, 2025
@realarnavgoel realarnavgoel added bug and removed bug labels Jan 10, 2025
@realarnavgoel realarnavgoel added this to the v2.4 milestone Jan 10, 2025